Building knowledge AI for a Vietnamese enterprise looks like building it for an English-speaking one — until you start to ship. Then six language-layer realities surface, and each of them breaks something that worked in English. Tone marks change meaning. Vietnamese compound nouns blur word boundaries. Company-name conventions defeat off-the-shelf entity extraction. Vietnamese addresses keep getting administratively restructured. Tech-adjacent content code-switches mid-sentence. And regional vocabulary varies. Knowledge graphs are the structural layer that lets Vietnamese enterprise RAG scale past these realities.
The six Vietnamese-language realities that bite
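- Tone marks are semantic. Users often type queries without diacritics, yet "ma" (ghost), "má" (mother, or cheek), "mã" (code), and "mạ" (rice seedling) are four different words that all collapse to "ma" once the marks are dropped. Retrieval has to handle both diacritic-stripped queries and full-tone-mark corpus content.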
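- Word segmentation matters. Written Vietnamese puts a space between every syllable, including the syllables inside a single compound word, so whitespace marks syllable boundaries rather than word boundaries. "Công ty" (company) is one word written as two space-separated syllables. Word boundaries are an explicit preprocessing step, not implicit.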
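- Company-name conventions are distinct. "Công ty Cổ phần X", "X JSC", and "X Joint Stock Company" are three surface forms of one joint-stock company, while "Công ty TNHH X" (limited liability company) and "Công ty TNHH MTV X" (single-member limited liability company) add further legal-form prefixes with their own abbreviations. Entity extractors trained on English business data routinely miss these patterns.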
- Address hierarchy is deeper and unstable. Country → province/city → district → ward → street, and Vietnam restructures district and ward boundaries every few years. "Quận 2, TP.HCM" (before 2021) and "TP. Thủ Đức" (from 2021, when Quận 2, Quận 9, and Quận Thủ Đức were merged into Thủ Đức City) can refer to the same physical place.
- Tech content code-switches. Vietnamese with embedded English terms (UI labels, brand names, technical jargon) is the dominant style in enterprise tech communication. Splitting it by language loses meaning.
- Regional vocabulary varies. Northern, Central, and Southern Vietnamese use different words for the same concept in casual content — common in customer support transcripts and internal communication.
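The tone-mark point is the easiest one to make concrete. Below is a minimal sketch, assuming you index a diacritic-folded key alongside the original text; `fold_diacritics` and `build_index` are illustrative names, not part of any library. Note that "đ" has no Unicode decomposition, so it needs an explicit mapping.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return a lowercase, diacritic-free key for matching."""
    # "đ"/"Đ" are separate letters that NFD does not decompose, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()


def build_index(terms):
    """Map each folded key to every original form that folds to it."""
    index = {}
    for term in terms:
        index.setdefault(fold_diacritics(term), []).append(term)
    return index


index = build_index(["mã", "má", "ma", "mạ", "Thủ Đức"])
# A query typed without tone marks still reaches all four candidate words.
candidates = index[fold_diacritics("ma")]
```

Keeping the original form next to the folded key matters: a query that does carry tone marks can then rank the exact-tone match above the other candidates instead of treating all four as equal.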
Why knowledge graphs help
For a Vietnamese enterprise corpus, the structural layer the graph provides solves the most expensive disambiguation problems. Once you know that "Công ty Cổ phần X", "X JSC", and "X Joint Stock Company" refer to the same entity, semantic search stops treating them as three different companies. Once an address chain resolves to a canonical district and ward even after the ward changed names, retrieval stays consistent. The graph doesn't fix Vietnamese-language nuance. It captures the decisions you made about how to resolve it, and keeps those decisions stable as the corpus grows.
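To show what "the same entity" means in practice, here is a rough sketch of the rule-based half of company-name resolution. It covers only the legal forms named in this article; the `LEGAL_FORMS` table, the form codes ("JSC", "LLC", "LLC-1M"), and the company name "Ví Dụ" ("example") are all assumptions made for illustration.

```python
import re
import unicodedata


def fold(text: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace."""
    text = text.replace("đ", "d").replace("Đ", "D")
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).strip().lower()


# (pattern on folded text, normalized legal-form code).
# Order matters: "cong ty tnhh mtv" must be tried before "cong ty tnhh".
LEGAL_FORMS = [
    (r"\bcong ty co phan\b", "JSC"),
    (r"\bjoint stock company\b", "JSC"),
    (r"\bjsc\b", "JSC"),
    (r"\bcong ty tnhh mtv\b", "LLC-1M"),
    (r"\bcong ty tnhh\b", "LLC"),
]


def canonical_key(surface: str):
    """Return (core name, legal-form code) for a company surface form."""
    folded = fold(surface)
    for pattern, code in LEGAL_FORMS:
        core, hits = re.subn(pattern, " ", folded)
        if hits:
            return (re.sub(r"\s+", " ", core).strip(), code)
    # No known legal form found: hand the name to the fuzzier ML stage.
    return (folded, None)


keys = {
    canonical_key("Công ty Cổ phần Ví Dụ"),
    canonical_key("Ví Dụ JSC"),
    canonical_key("Ví Dụ Joint Stock Company"),
}
```

The legal form stays in the key on purpose. Dropping it would merge a joint-stock company with a limited liability company that happens to share a name, which is exactly the kind of resolution decision the graph should record explicitly rather than make silently.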
Practical patterns that work in production
- Vietnamese-tuned entity extraction. Vietnamese-specific NER (PhoBERT and ViT5 fine-tunes, or commercial Vietnamese NER APIs) materially outperforms generic multilingual models on Vietnamese name patterns.
- Hybrid canonical-entity resolution. Regex catches the formal patterns (Công ty Cổ phần, TNHH, JSC, MTV); ML resolves the messier cases (misspellings, mixed-form references, partial names).
- Address normalizer with historical mapping. Maintain a "historical → current" lookup so older documents resolve to the correct entity after administrative restructuring. Critical for legal and government-facing corpora.
- Code-mix-aware embeddings. Multilingual embedding models that handle Vietnamese / English mixed content (the mE5 family, BGE-M3) outperform Vietnamese-only or English-only embeddings on enterprise tech content.
- LLM reasoning in Vietnamese with English-term retention. Don’t auto-translate technical terms unless the user did. Keep "API", brand names, and product nomenclature in their original form.
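The address normalizer is mostly a lookup table with a forgiving key. The sketch below assumes addresses have already been split into district and city fields. The two table rows reflect the 2021 Thủ Đức merger mentioned earlier and are illustrative only; a real table would be loaded from an authoritative gazetteer. `AdminUnit`, `_key`, and `normalize_address` are hypothetical names.

```python
import re
import unicodedata
from typing import NamedTuple


class AdminUnit(NamedTuple):
    district: str
    city: str


def _key(text: str) -> str:
    """Fold diacritics, case, spacing, and punctuation so variants compare equal."""
    text = text.replace("đ", "d").replace("Đ", "D")
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "", text.lower())


# Historical (district, city) -> current unit. Illustrative rows, not a full table.
HISTORICAL_TO_CURRENT = {
    (_key("Quận 2"), _key("TP.HCM")): AdminUnit("TP. Thủ Đức", "TP.HCM"),
    (_key("Quận 9"), _key("TP.HCM")): AdminUnit("TP. Thủ Đức", "TP.HCM"),
}


def normalize_address(district: str, city: str) -> AdminUnit:
    """Return the current administrative unit, or the input if no remap is known."""
    current = HISTORICAL_TO_CURRENT.get((_key(district), _key(city)))
    return current or AdminUnit(district, city)


old_doc = normalize_address("quan 2", "TP. HCM")   # typed without diacritics
unchanged = normalize_address("Quận 1", "TP.HCM")  # no remap recorded
```

This sketch does not expand abbreviations such as "Q.2" or long forms such as "TP. Hồ Chí Minh"; those would need their own alias rows. It is also worth storing the original address text next to the normalized one, since legal documents are usually cited as written.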
What goes in the graph
For a typical Vietnamese enterprise knowledge base — contracts, customers, partners, internal teams, products — the graph holds: organizations (with every Vietnamese surface form as an alias), people (with both Vietnamese-order and Western-order name variants), locations (with historical → current administrative mapping), products (with English and Vietnamese names), and the relationships between them. Documents reference graph entities; queries traverse the graph; the LLM gets richer Vietnamese-aware context that off-the-shelf retrieval can't construct.
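As a rough picture of that structure, here is a small in-memory sketch. It is not a recommendation for any particular graph database; the class names, relation labels, and sample entities are all hypothetical.

```python
import unicodedata
from dataclasses import dataclass, field


def _norm(text: str) -> str:
    """Normalize Unicode form and case so alias lookups are stable."""
    return unicodedata.normalize("NFC", text).casefold()


@dataclass
class Entity:
    id: str
    kind: str  # "organization", "person", "location", or "product"
    name: str
    aliases: set = field(default_factory=set)


class KnowledgeGraph:
    def __init__(self):
        self.entities = {}
        self.alias_index = {}  # normalized surface form -> entity id
        self.edges = {}        # source id -> {(relation, target id)}

    def add(self, entity: Entity) -> None:
        self.entities[entity.id] = entity
        for form in {entity.name, *entity.aliases}:
            self.alias_index[_norm(form)] = entity.id

    def relate(self, src: str, relation: str, dst: str) -> None:
        self.edges.setdefault(src, set()).add((relation, dst))

    def resolve(self, surface: str):
        """Map any known surface form to its entity id, or None."""
        return self.alias_index.get(_norm(surface))

    def documents_mentioning(self, entity_id: str):
        return sorted(src for src, outs in self.edges.items()
                      if ("mentions", entity_id) in outs)


graph = KnowledgeGraph()
graph.add(Entity("org:vi-du", "organization", "Công ty Cổ phần Ví Dụ",
                 {"Ví Dụ JSC", "Ví Dụ Joint Stock Company"}))
graph.add(Entity("person:an", "person", "Nguyễn Văn An", {"An Nguyen"}))
graph.relate("doc:contract-001", "mentions", "org:vi-du")
graph.relate("doc:contract-001", "signed_by", "person:an")

# A query using the English short form reaches the contract filed under the Vietnamese name.
docs = graph.documents_mentioning(graph.resolve("ví dụ jsc"))
```

The alias index is what does the work here: every surface form resolves to one id, and documents attach to the id rather than to whichever spelling they happened to use.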
A starting recipe
Don't boil the ocean. The pattern that works on first deployment:
- Pick one domain — contracts, customer records, or regulatory filings — and limit the initial graph to that domain.
- Build an entity registry covering the full Vietnamese surface-form variants you encounter in that domain.
- Wire the registry into your existing retrieval as an entity-extraction pass before embedding.
- Measure: does query accuracy improve on questions that require cross-referencing entities? If yes, expand to a second domain. If no, vanilla RAG is fine for that domain — don’t add complexity it doesn’t need.
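The measurement step can be as simple as a labeled question set and a hit-rate comparison. The sketch below assumes each retriever is a function from a question to a list of document ids; `EvalItem`, `accuracy`, and both stub retrievers are placeholders, and the toy data is there only to make the example run, not to report any result.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    question: str
    expected_doc: str
    cross_entity: bool  # True if answering requires linking two or more entities


def accuracy(items, retrieve, only_cross_entity=True):
    """Share of questions whose expected document appears in the retrieved list."""
    subset = [i for i in items if i.cross_entity or not only_cross_entity]
    if not subset:
        return 0.0
    hits = sum(1 for i in subset if i.expected_doc in retrieve(i.question))
    return hits / len(subset)


# Toy evaluation set with hypothetical questions and document ids.
items = [
    EvalItem("Which contracts name Ví Dụ JSC?", "doc:contract-001", True),
    EvalItem("Who signed for the Thủ Đức branch?", "doc:contract-002", True),
    EvalItem("What is the notice period?", "doc:policy-007", False),
]

# Stub retrievers standing in for vanilla RAG and registry-augmented RAG.
BASELINE = {items[0].question: ["doc:contract-001"],
            items[2].question: ["doc:policy-007"]}
WITH_REGISTRY = {**BASELINE, items[1].question: ["doc:contract-002"]}

baseline_score = accuracy(items, lambda q: BASELINE.get(q, []))
registry_score = accuracy(items, lambda q: WITH_REGISTRY.get(q, []))
```

Scoring the cross-entity subset separately is the point of the exercise. An average over all questions can hide the fact that the registry helps only where entities have to be linked, which is the signal that decides whether to expand to a second domain.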
IDS AI Solutions is based in HCMC and builds Vietnamese-language enterprise knowledge AI as a focused practice. The Enterprise RAG & Knowledge AI solution includes Vietnamese-tuned retrieval, entity-graph construction, and dialect-aware evaluation built into every engagement. Talk to our team to scope a pilot.
