Knowledge graphs + LLMs for Vietnamese enterprises: handling language nuance at scale

Vietnamese tone marks. Compound-noun word boundaries. Company-name conventions (Công ty Cổ phần / TNHH / JSC). Administrative restructuring of districts and wards. Code-switching with English. Regional vocabulary. Six realities that break off-the-shelf retrieval — and how a knowledge-graph layer handles them.

Admin, Founder & Engineering Lead · May 19, 2026 · 7 min read

Building knowledge AI for a Vietnamese enterprise looks like building it for an English-speaking one — until you start to ship. Then six language-layer realities surface, and each of them breaks something that worked in English. Tone marks change meaning. Vietnamese compound nouns blur word boundaries. Company-name conventions defeat off-the-shelf entity extraction. Vietnamese addresses keep getting administratively restructured. Tech-adjacent content code-switches mid-sentence. And regional vocabulary varies. Knowledge graphs are the structural layer that lets Vietnamese enterprise RAG scale past these realities.

The six Vietnamese-language realities that bite

  • Tone marks are semantic. "ma" (ghost), "má" (mother, or cheek), "mã" (code), and "mạ" (rice seedling) are different words, yet users often type queries without diacritics. Retrieval has to match diacritic-stripped queries against corpus content written with full tone marks.
  • Word segmentation matters. Vietnamese writes a space between every syllable, including the syllables inside a single multi-syllable word ("khách hàng", customer, is one word written as two tokens), so whitespace does not mark word boundaries the way it does in English. Finding word boundaries is an explicit preprocessing step, not an implicit one.
  • Company-name conventions are distinct. "Công ty Cổ phần X", "X JSC", and "X Joint Stock Company" are three surface forms of one entity, and the limited-liability forms "Công ty TNHH X" and "Công ty TNHH MTV X" (single-member) bring their own variants. Entity extractors trained on English business data often miss these patterns.
  • Address hierarchy is deeper and unstable. The chain runs country → province/city → district → ward → street, and Vietnam periodically redraws district and ward boundaries. An address written as "Quận 2, TP.HCM" before 2021 and one written as "TP. Thủ Đức" afterwards can refer to the same physical place, because Quận 2 was merged with Quận 9 and Quận Thủ Đức to form TP. Thủ Đức.
  • Tech content code-switches. Vietnamese with embedded English terms (UI labels, brand names, technical jargon) is the dominant style in enterprise tech communication. Splitting it by language loses meaning.
  • Regional vocabulary varies. Northern, Central, and Southern Vietnamese use different words for the same concept in casual content — common in customer support transcripts and internal communication.
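To make the first point concrete, here is a minimal sketch of diacritic folding using only Python's standard library. It assumes a recall-first design: index every term under its diacritic-free key, then rank exact tone-mark matches first. The names fold_diacritics, folded_index, and candidates are illustrative and do not come from any particular retrieval library.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip Vietnamese tone and vowel marks and lowercase the result."""
    # "đ"/"Đ" are standalone letters with no Unicode decomposition,
    # so they have to be mapped by hand before NFD decomposition.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining tone marks, circumflex, breve and horn.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.lower()


def nfc_lower(text: str) -> str:
    """Canonical composed form, lowercased, for exact tone-mark comparison."""
    return unicodedata.normalize("NFC", text).lower()


# Toy corpus vocabulary: the four words from the example above.
corpus_terms = [nfc_lower(t) for t in ["ma", "má", "mã", "mạ"]]

# Folded key -> every corpus term that collapses to it.
folded_index: dict[str, list[str]] = {}
for term in corpus_terms:
    folded_index.setdefault(fold_diacritics(term), []).append(term)


def candidates(query: str) -> list[str]:
    """Return all terms sharing the query's folded key, exact matches first."""
    hits = folded_index.get(fold_diacritics(query), [])
    exact = nfc_lower(query)
    # sorted() is stable, so non-exact hits keep their corpus order.
    return sorted(hits, key=lambda term: term != exact)


print(candidates("ma"))  # a diacritic-free query reaches all four words
print(candidates("mã"))  # a query with a tone mark ranks its exact form first
```

Folding both sides trades precision for recall, which is why the exact-match ranking matters: a user who did type the tone mark should not see the other three words ahead of the one they asked for.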

Why knowledge graphs help

For a Vietnamese enterprise corpus, the structural layer the graph provides solves the most expensive disambiguation problems. Once you know that "Công ty Cổ phần X", "X JSC", and "X Joint Stock Company" refer to the same entity, semantic search across them stops being confused. Once an address chain resolves to a canonical district + ward even after the ward changed names, retrieval is consistent. The graph doesn't fix Vietnamese-language nuance — it captures the decisions you made about how to resolve it, and keeps those decisions stable as the corpus grows.

Practical patterns that work in production

  • Vietnamese-tuned entity extraction. Vietnamese-specific NER (PhoBERT and ViT5 fine-tunes, or commercial Vietnamese NER APIs) materially outperforms generic multilingual models on Vietnamese name patterns. PhoBERT expects word-segmented input, so the word-segmentation step has to run first.
  • Hybrid canonical-entity resolution. Regex catches the formal patterns (Công ty Cổ phần, TNHH, JSC, MTV); ML resolves the messier cases (misspellings, mixed-form references, partial names).
  • Address normalizer with historical mapping. Maintain a "historical → current" lookup so older documents resolve to the correct entity after administrative restructuring. Critical for legal and government-facing corpora.
  • Code-mix-aware embeddings. Multilingual embedding models that handle Vietnamese / English mixed content (the mE5 family, BGE-M3) outperform Vietnamese-only or English-only embeddings on enterprise tech content.
  • LLM reasoning in Vietnamese with English-term retention. Don’t auto-translate technical terms unless the user did. Keep "API", brand names, and product nomenclature in their original form.
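The hybrid resolution pattern above can be sketched roughly as follows. Regex handles the formal legal-form designators, and a fuzzy string match from the standard library stands in for the ML resolver, which a real system would replace with a trained model. The company name "Minh Họa", the entity IDs, the legal-form codes, and the 0.85 cutoff are all made-up placeholders.

```python
import re
import unicodedata
from difflib import get_close_matches
from typing import Optional


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# (pattern, legal-form code). Order matters: "tnhh mtv" must be tried before "tnhh".
FORM_PATTERNS = [
    (r"công ty cổ phần|joint stock company|jsc", "JSC"),
    (r"công ty tnhh mtv", "LLC-1M"),
    (r"công ty tnhh", "LLC"),
]
FORM_RES = [(re.compile(rf"\b(?:{nfc(p)})\b"), code) for p, code in FORM_PATTERNS]

# Hypothetical registry: (legal form, core name) -> canonical entity ID.
REGISTRY = {
    ("JSC", nfc("minh họa")): "org:minh-hoa-jsc",
    ("LLC", nfc("minh họa logistics")): "org:minh-hoa-logistics-llc",
}


def parse_mention(mention: str) -> tuple:
    """Split a mention into (legal-form code or None, normalized core name)."""
    text = nfc(mention).lower()
    form = None
    for pattern, code in FORM_RES:
        if pattern.search(text):
            form = code
            text = pattern.sub(" ", text)
            break
    core = " ".join(re.sub(r"[^\w\s]", " ", text).split())
    return form, core


def resolve(mention: str, cutoff: float = 0.85) -> Optional[str]:
    form, core = parse_mention(mention)
    # Step 1: the regex-normalized form hits the registry exactly.
    if (form, core) in REGISTRY:
        return REGISTRY[(form, core)]
    # Step 2: fuzzy fallback for misspellings, restricted to the same legal
    # form when one was detected, so a TNHH never resolves to a JSC.
    pool = {c: eid for (f, c), eid in REGISTRY.items() if form is None or f == form}
    match = get_close_matches(core, list(pool), n=1, cutoff=cutoff)
    return pool[match[0]] if match else None


for mention in ["Công ty Cổ phần Minh Họa", "Minh Họa JSC",
                "Công ty TNHH Minh Họa Logistic", "Công ty TNHH Minh Họa"]:
    print(mention, "->", resolve(mention))
```

Keeping the legal form as part of the key is a deliberate choice in this sketch: a joint-stock company and a limited-liability company that share a trading name are different legal entities, so the last mention is left unresolved instead of being merged into the JSC.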

What goes in the graph

For a typical Vietnamese enterprise knowledge base — contracts, customers, partners, internal teams, products — the graph holds: organizations (with every Vietnamese surface form as an alias), people (with both Vietnamese-order and Western-order name variants), locations (with historical → current administrative mapping), products (with English and Vietnamese names), and the relationships between them. Documents reference graph entities; queries traverse the graph; the LLM gets richer Vietnamese-aware context that off-the-shelf retrieval can't construct.
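As a rough illustration of that shape, the sketch below models entities with aliases and typed relationships in memory. It is not a recommendation for a particular graph store, and the Entity and Graph classes, the IDs, the person, and the company are all invented for the example.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Optional


def norm(text: str) -> str:
    """Normalize for alias comparison: NFC form, case-insensitive."""
    return unicodedata.normalize("NFC", text).casefold()


@dataclass
class Entity:
    id: str
    kind: str  # "organization", "person", "location" or "product"
    name: str  # canonical display name
    aliases: set = field(default_factory=set)


@dataclass
class Graph:
    entities: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source id, relation, target id)
    alias_index: dict = field(default_factory=dict)

    def add(self, entity: Entity) -> None:
        self.entities[entity.id] = entity
        # Every surface form, including the canonical name, points at one ID.
        for surface in {entity.name, *entity.aliases}:
            self.alias_index[norm(surface)] = entity.id

    def relate(self, source: str, relation: str, target: str) -> None:
        self.edges.append((source, relation, target))

    def find(self, mention: str) -> Optional[Entity]:
        entity_id = self.alias_index.get(norm(mention))
        return self.entities[entity_id] if entity_id else None

    def context(self, entity_id: str) -> list:
        """One-hop facts about an entity, phrased for an LLM prompt."""
        return [
            f"{self.entities[s].name} {rel} {self.entities[t].name}"
            for s, rel, t in self.edges
            if entity_id in (s, t)
        ]


g = Graph()
g.add(Entity("org:1", "organization", "Công ty Cổ phần Minh Họa",
             {"Minh Họa JSC", "Minh Họa Joint Stock Company"}))
g.add(Entity("per:1", "person", "Nguyễn Văn An",
             {"Nguyen Van An", "An Van Nguyen"}))  # Vietnamese and Western order
g.add(Entity("loc:1", "location", "TP. Thủ Đức", {"Quận 2, TP.HCM"}))
g.relate("per:1", "is legal representative of", "org:1")
g.relate("org:1", "is registered in", "loc:1")

org = g.find("minh họa jsc")
print(g.context(org.id))
```

Whichever surface form a query uses, find returns the same entity, and context gives the LLM the same one-hop facts.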

A starting recipe

Don't boil the ocean. The pattern that works on first deployment:

  • Pick one domain — contracts, customer records, or regulatory filings — and limit the initial graph to that domain.
  • Build an entity registry covering the full Vietnamese surface-form variants you encounter in that domain.
  • Wire the registry into your existing retrieval as an entity-extraction pass before embedding.
  • Measure: does query accuracy improve on questions that require cross-referencing entities? If yes, expand to a second domain. If no, vanilla RAG is fine for that domain — don’t add complexity it doesn’t need.
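The third step, the entity-extraction pass before embedding, might look something like the sketch below. It assumes the registry is a flat alias-to-ID mapping and that your vector store accepts per-chunk metadata. ALIAS_TO_ID, annotate, and the entity IDs are hypothetical names for illustration.

```python
import re
import unicodedata

# Hypothetical registry for one domain: every surface form -> canonical ID.
ALIAS_TO_ID = {
    "Công ty Cổ phần Minh Họa": "org:minh-hoa-jsc",
    "Minh Họa JSC": "org:minh-hoa-jsc",
    "Minh Họa Joint Stock Company": "org:minh-hoa-jsc",
    "Công ty TNHH Minh Họa Logistics": "org:minh-hoa-logistics-llc",
}


def norm(text: str) -> str:
    return unicodedata.normalize("NFC", text).lower()


_INDEX = {norm(alias): entity_id for alias, entity_id in ALIAS_TO_ID.items()}

# One alternation over all aliases, longest first, so a full legal name
# wins over any shorter alias it happens to contain.
_ALIAS_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(a) for a in sorted(_INDEX, key=len, reverse=True))
    + r")(?!\w)"
)


def annotate(chunk_text: str) -> dict:
    """Attach canonical entity IDs to a chunk before it is embedded."""
    found = {_INDEX[m.group(0)] for m in _ALIAS_RE.finditer(norm(chunk_text))}
    return {"text": chunk_text, "entity_ids": sorted(found)}


chunk = "Hợp đồng giữa Minh Họa JSC và Công ty TNHH Minh Họa Logistics."
print(annotate(chunk))
```

The text is embedded unchanged. The entity_ids list travels alongside it as metadata, so retrieval can filter or boost by entity without touching the embedding model, which keeps the step cheap to remove if the measurement in the last bullet shows no gain.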

IDS AI Solutions is based in HCMC and builds Vietnamese-language enterprise knowledge AI as a focused practice. The Enterprise RAG & Knowledge AI solution includes Vietnamese-tuned retrieval, entity-graph construction, and dialect-aware evaluation built into every engagement. Talk to our team to scope a pilot.

Frequently asked questions

Why not just use a strong multilingual embedding and skip the knowledge-graph layer?

Multilingual embeddings handle the language layer well — but they don’t resolve entity ambiguity. "Công ty Cổ phần X", "X JSC", and "X Joint Stock Company" embed to similar but distinct vectors. The graph captures the equivalence explicitly, which gives the retrieval and the LLM consistent context across all surface forms. Both layers are useful; the graph adds the disambiguation that embedding similarity alone misses.

How do we handle Vietnamese administrative restructuring (district / ward changes)?

Maintain a historical-to-current address map in the graph. When older documents reference, say, "Quận 2, TP.HCM", the resolver maps them to the current unit ("TP. Thủ Đức" post-2021, which also absorbed Quận 9 and Quận Thủ Đức). Without this, retrieval treats chunks written before and after the restructuring as if they referred to different places. The map itself can be built from official government data and kept current with an annual review.
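A minimal sketch of such a resolver is shown below, assuming the map is keyed by (district, province) exactly as written in the source documents. Only the 2021 Thủ Đức merger is filled in. A real table would be loaded from official government data and would also need to normalize spelling variants such as "TP. HCM" and "Thành phố Hồ Chí Minh". The function and table names are illustrative.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# (historical district, province/city) -> current district-level unit.
HISTORICAL_TO_CURRENT = {
    (nfc("Quận 2"), nfc("TP.HCM")): nfc("TP. Thủ Đức"),
    (nfc("Quận 9"), nfc("TP.HCM")): nfc("TP. Thủ Đức"),
    (nfc("Quận Thủ Đức"), nfc("TP.HCM")): nfc("TP. Thủ Đức"),
}


def canonical_district(district: str, province: str) -> str:
    """Follow the historical -> current chain until it stops changing."""
    current, province = nfc(district), nfc(province)
    seen = set()  # guards against an accidental cycle in the table
    while (current, province) in HISTORICAL_TO_CURRENT and current not in seen:
        seen.add(current)
        current = HISTORICAL_TO_CURRENT[(current, province)]
    return current


def normalize_address(district: str, province: str) -> dict:
    """Keep the original wording and add the canonical form next to it."""
    canonical = canonical_district(district, province)
    return {
        "original": f"{district}, {province}",
        "canonical": f"{canonical}, {nfc(province)}",
        "restructured": canonical != nfc(district),
    }


print(normalize_address("Quận 2", "TP.HCM"))
print(normalize_address("Quận 1", "TP.HCM"))
```

Storing the original string alongside the canonical one matters for legal corpora, where the wording on the document is itself evidence. The loop lets a unit that was renamed more than once resolve in a single call.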

Which Vietnamese-specific models should we evaluate?

For word segmentation: VnCoreNLP or underthesea. For NER: PhoBERT or ViT5 fine-tunes, and Vietnamese-specific commercial NER services. For multilingual embeddings handling Vietnamese/English code-switched content: the mE5 family, BGE-M3, and similar. For LLM generation: most frontier models are now competent in Vietnamese but must be evaluated on your domain. Don’t rely on public benchmarks alone, since they often over-weight Northern Vietnamese broadcast text and underweight code-mixed enterprise content.
