Voice AI works well in English. For Vietnamese customer service, off-the-shelf voice stacks miss three things consistently — regional dialect variation, English/Vietnamese code-switching mid-call, and brand-appropriate Vietnamese register. Each one is a quality problem that surfaces fast in production and damages CSAT before the engineering team notices.
The dialect problem
Vietnamese has three major dialect groups: Northern (Hanoi-centric), Central (Hue / Da Nang), and Southern (Ho Chi Minh City). Off-the-shelf ASR (automatic speech recognition) models are often trained heavily on Northern Vietnamese broadcast audio, with thinner coverage of Central dialects and inconsistent results on Southern speech. For a customer-service deployment serving HCMC-based customers, a Northern-biased ASR misrecognizes common Southern pronunciations, such as merged syllable-final consonants and merged tones, at unexpectedly high rates. The fix is regionally tuned ASR, either a model fine-tuned on Southern-accented data or one that is explicitly multi-dialect, rather than a single national model.
Code-switching
Tech-adjacent Vietnamese customers code-switch constantly. "Em check email rồi gửi link cho anh nha." mixes the English words "check", "email", and "link" into an otherwise Vietnamese sentence. Off-the-shelf voice pipelines often segment audio language-by-language and lose context across the boundaries; the downstream LLM then sees a transcript split mid-sentence and produces a less coherent response. The fix is a code-switch-aware ASR that handles mixed-language utterances natively, plus an LLM tuned on Vietnamese conversational data that knows when to keep an English term verbatim in the response.
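One way to make code-switching measurable is to flag which utterances contain English terms, so they can be evaluated as their own slice. The sketch below is a minimal illustration, not a production detector: the glossary `ENGLISH_TERMS` and both function names are hypothetical, and a real glossary would be mined from your own call transcripts.

```python
import re
import unicodedata

# Hypothetical glossary of English terms customers use mid-sentence.
# In practice this would be built from your own call transcripts.
ENGLISH_TERMS = {"check", "email", "link", "app", "update", "login"}


def english_terms_in(utterance, glossary=ENGLISH_TERMS):
    """Return the glossary terms found in the utterance, in order of appearance."""
    # Normalize to NFC so Vietnamese diacritics stay attached to their letters
    # and do not split a syllable into spurious tokens.
    text = unicodedata.normalize("NFC", utterance).lower()
    tokens = re.findall(r"\w+", text)
    return [token for token in tokens if token in glossary]


def is_code_switched(utterance, glossary=ENGLISH_TERMS):
    """True if the utterance contains at least one glossary English term."""
    return bool(english_terms_in(utterance, glossary))


print(english_terms_in("Em check email rồi gửi link cho anh nha."))
# → ['check', 'email', 'link']
```

A glossary match is deliberately crude: it will miss English words outside the list, so treat the flag as a way to build an evaluation slice rather than as language identification.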
Brand voice
Vietnamese has a richer register system than English: formal vs. informal pronouns, age-appropriate address terms (em, anh, chú, bác), and respectful particles. A bank’s voice agent that addresses a customer with the wrong pronoun isn’t just impolite; it sounds amateur. Brand voice has to be tuned deliberately: which pronouns the agent uses for the customer (usually defaulting to neutral-respectful), which particles it includes ("ạ" for politeness in service contexts), and how it handles tone in escalation moments. Generic Vietnamese TTS often sounds too casual or too stiff; a tuned voice matches the brand.
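These register decisions can be written down as explicit configuration instead of being left to the model's defaults. The following is a sketch under assumptions: `RegisterPolicy`, `customer_term`, and `with_particle` are hypothetical names, and the default values ("em" for the agent, "anh/chị" for the customer) are illustrative choices, not a recommendation for any particular brand.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegisterPolicy:
    agent_pronoun: str = "em"               # how the agent refers to itself (assumed)
    default_customer_term: str = "anh/chị"  # neutral-respectful fallback (assumed)
    polite_particle: str = "ạ"              # politeness particle for service contexts


def customer_term(policy: RegisterPolicy, known_term: Optional[str] = None) -> str:
    """Use the customer's known address term, else the neutral-respectful default."""
    return known_term or policy.default_customer_term


def with_particle(sentence: str, policy: RegisterPolicy) -> str:
    """Append the politeness particle unless the sentence already ends with it."""
    text = unicodedata.normalize("NFC", sentence).rstrip()
    stripped = text.rstrip(".!?")
    punct = text[len(stripped):]          # trailing punctuation, if any
    body = stripped.rstrip()
    if not body:
        return sentence
    particle = unicodedata.normalize("NFC", policy.polite_particle)
    if body.split()[-1] == particle:
        return body + punct
    return f"{body} {particle}{punct}"


policy = RegisterPolicy()
print(with_particle("Em cảm ơn anh.", policy))
```

Keeping the policy in one place means a brand can review and change its register without touching prompts scattered across the system.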
A working stack
The pieces that work in production for Vietnamese voice customer service:
- Vietnamese-tuned ASR with code-switch handling (Whisper fine-tunes or commercial Vietnamese-specific ASR)
- A frontier LLM with Vietnamese-strong reasoning, evaluated rigorously on your domain
- An LLM-based intent classifier trained on your actual call transcripts, not generic Vietnamese
- Brand-voice-tuned Vietnamese TTS — multiple voice options matched to your brand register
- A conversation policy that handles register switching and graceful escalation to human agents
Each layer can be swapped independently as better models ship, provided the interfaces between layers stay stable. That integration discipline is what keeps the stack maintainable.
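The swappable-layer idea can be sketched with narrow interfaces between the stages. Everything here is an assumption made for illustration: the interface names (`ASR`, `IntentClassifier`, `Responder`, `TTS`), the `VoicePipeline` class, the intent labels, and the trivial stand-in classes, which exist only so the sketch runs end to end.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Protocol, Tuple


class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class IntentClassifier(Protocol):
    def classify(self, transcript: str) -> str: ...


class Responder(Protocol):
    def reply(self, transcript: str, intent: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    asr: ASR
    intents: IntentClassifier
    responder: Responder
    tts: TTS
    # Intents that go to a human agent instead of an automated reply (assumed labels).
    escalation_intents: FrozenSet[str] = frozenset({"unknown", "complaint"})

    def handle(self, audio: bytes) -> Tuple[str, Optional[bytes]]:
        transcript = self.asr.transcribe(audio)
        intent = self.intents.classify(transcript)
        if intent in self.escalation_intents:
            return ("escalate", None)
        text = self.responder.reply(transcript, intent)
        return ("respond", self.tts.synthesize(text))


# Trivial stand-ins; each would be replaced by a real model behind the same interface.
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class KeywordIntents:
    def classify(self, transcript: str) -> str:
        return "billing" if "hóa đơn" in transcript else "unknown"


class TemplateResponder:
    def reply(self, transcript: str, intent: str) -> str:
        return "Dạ, em hỗ trợ anh/chị ngay ạ."


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(EchoASR(), KeywordIntents(), TemplateResponder(), BytesTTS())
```

Because `VoicePipeline` depends only on the method signatures, replacing one layer with a better model means writing one adapter class and leaving the rest untouched.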
KPIs to actually track
Track each of these against the human baseline:
- Word error rate, stratified by dialect and by whether the utterance contains code-switching
- Intent-match rate against ground-truth tagged calls
- CSAT measured in Vietnamese (not auto-translated)
- Escalation rate. The right rate is non-zero; a zero escalation rate is a sign of either a very narrow scope or under-reporting
- First-call resolution
The goal is parity with a trained human agent on routine calls and graceful escalation on the rest.
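Stratified word error rate is simple to compute once each evaluation sample carries a dialect label. The sketch below assumes a hypothetical sample format (dicts with `dialect`, `reference`, and `hypothesis` keys) and made-up example transcripts. Because Vietnamese is written with spaces between syllables, whitespace tokenization makes this effectively a syllable-level error rate.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def stratified_wer(samples):
    """Return {dialect: total errors / total reference tokens} over the samples."""
    errors = defaultdict(int)
    tokens = defaultdict(int)
    for sample in samples:
        ref = sample["reference"].lower().split()
        hyp = sample["hypothesis"].lower().split()
        errors[sample["dialect"]] += edit_distance(ref, hyp)
        tokens[sample["dialect"]] += len(ref)
    return {d: errors[d] / tokens[d] for d in tokens if tokens[d] > 0}


# Made-up samples for illustration only.
samples = [
    {"dialect": "southern", "reference": "em gửi link cho anh",
     "hypothesis": "em gửi linh cho anh"},
    {"dialect": "southern", "reference": "anh check email chưa",
     "hypothesis": "anh check email chưa"},
    {"dialect": "northern", "reference": "cảm ơn anh",
     "hypothesis": "cảm ơn anh"},
]
print(stratified_wer(samples))
```

Pooling errors and reference tokens per dialect before dividing weights long calls more heavily than short ones. Adding a second key for code-switched utterances gives the other stratification with the same function shape.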
IDS AI Solutions is based in HCMC and runs Vietnamese voice customer-service deployments as a focused practice. The Vietnamese Voice AI Audit Sprint includes a dialect-stratified ASR evaluation against your real call recordings and a brand-voice rubric for your specific brand. Talk to our team.
