
Speech to text (STT): codemix, accents, and the streaming game

Indian customers code-switch mid-sentence. Most STT engines do not. The provider you pick shapes how forgiving your agent feels.

2026-05-14 · 2 min read

Speech to text is the part of the pipeline that decides what your agent thinks the customer said. Pick the wrong engine and the rest of the agent is reasoning over a bad transcript. Every other component downstream is at the mercy of this one.

Codemix is the hard problem in India

A real customer call in India often switches between Hindi and English inside a single sentence. "Sir, mera plan ka renewal kab hai?" is a normal opening line. STT engines trained on monolingual data either pick one language and mistranscribe the other, or segment the audio poorly. An engine that natively handles codemix is not a nice-to-have for Indian voice products; it is the baseline.

What we wire up

  • Sarvam (Saaras) — Indian languages with native codemix and Hinglish-auto support across 15 profiles.
  • Deepgram (Nova family) — strong English-first transcription when the use case is English-only.
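
The split above amounts to a routing decision per call. Here is a minimal sketch of that decision, assuming you know the expected languages before the call starts. The provider labels, `CallProfile`, and `pick_stt_provider` are hypothetical names for illustration, not identifiers from either vendor's SDK.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical labels for the two engines; these are not real SDK identifiers.
SARVAM = "sarvam-saaras"
DEEPGRAM = "deepgram-nova"


@dataclass(frozen=True)
class CallProfile:
    # Language codes expected on the call, e.g. {"hi", "en"} for Hinglish.
    languages: FrozenSet[str]


def pick_stt_provider(profile: CallProfile) -> str:
    """Send English-only calls to the English-first engine, everything else to the codemix-capable one."""
    if profile.languages == frozenset({"en"}):
        return DEEPGRAM
    return SARVAM


english_only = pick_stt_provider(CallProfile(frozenset({"en"})))
hinglish = pick_stt_provider(CallProfile(frozenset({"hi", "en"})))
```

Defaulting to the codemix engine whenever anything other than pure English is expected reflects the point above: on Indian calls, assuming a single language is the riskier mistake.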

Streaming beats batch

A voice agent that waits for the caller to finish before transcribing is an agent that feels slow. Streaming STT emits interim transcripts as the caller speaks, which lets the rest of the pipeline start preparing. Combined with end-of-turn detection, this is how you get an agent that responds within the perceptual budget instead of a beat after it.
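
To make the interim-versus-final distinction concrete, here is a small simulation. It does not call any real STT service: `fake_stream` stands in for a streaming connection, and `TranscriptEvent` and `handle_turn` are hypothetical names. The assumption is only that the engine emits growing interim transcripts followed by one final transcript when the turn ends.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class TranscriptEvent:
    text: str
    is_final: bool  # True once the turn is considered finished


def fake_stream(words: List[str]) -> Iterator[TranscriptEvent]:
    """Stand-in for a streaming STT connection: growing interims, then one final."""
    for i in range(1, len(words)):
        yield TranscriptEvent(" ".join(words[:i]), is_final=False)
    yield TranscriptEvent(" ".join(words), is_final=True)


def handle_turn(events: Iterable[TranscriptEvent]) -> dict:
    """Start preparing on the first interim; commit only on the final transcript."""
    prepared_on: Optional[str] = None
    interims = 0
    for event in events:
        if event.is_final:
            return {"final": event.text, "prepared_on": prepared_on, "interims": interims}
        interims += 1
        if prepared_on is None:
            # A real pipeline might warm up retrieval or the language model here.
            prepared_on = event.text
    # Stream ended without a final transcript (e.g. the call dropped).
    return {"final": None, "prepared_on": prepared_on, "interims": interims}


result = handle_turn(fake_stream("Sir mera plan ka renewal kab hai".split()))
```

In this sketch the pipeline has something to work with from the first interim event, well before the final transcript arrives. A batch engine would produce only the last event.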

Things that quietly break transcription

  • Background noise on the caller's side — traffic, fans, other people talking.
  • Phone codecs that compress audio aggressively, especially on mobile networks.
  • Numbers, dates, and brand names — most engines hallucinate around these.
  • Caller accents the model did not see enough of during training.