How a voice AI agent actually works

Before you pick providers or tune prompts, it helps to know what is happening inside a voice agent on every turn. This is the picture every team should have in their head.

2026-05-28 · 3 min read

When a customer talks to a voice agent, the conversation is not one model doing everything. It is a small pipeline of components, each with its own job. Knowing the pipeline is the difference between debugging a voice product blind and being able to actually fix it.

The pipeline on every turn

A single back-and-forth between a caller and an agent runs through four stages. They run one after another, within milliseconds, and each stage is worth understanding on its own because it has its own provider, its own latency, and its own failure modes.

  • Speech to text (STT) — turns the caller's voice into text as they speak.
  • Reasoning (LLM) — decides what the agent should say or do based on the transcript and the agent's instructions.
  • Text to speech (TTS) — turns the model's reply back into a natural-sounding voice.
  • Telephony — carries audio between the customer's phone and the pipeline.

Each stage is also a tradeoff. STT can favor accuracy or speed. The LLM can favor reasoning or latency. TTS can favor expressive voices or low-latency ones. Picking the right stack for your product is half the work of running a good voice agent.
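
To make the shape of a turn concrete, here is a minimal sketch in Python. It is an illustration only: `run_turn`, `Turn`, and the `fake_*` functions are hypothetical names invented for this example, not any provider's API. Telephony is represented only by the audio bytes going in and coming out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    """What one pass through the pipeline produced."""
    transcript: str
    reply_text: str
    reply_audio: bytes


def run_turn(
    caller_audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Turn:
    """Run one caller turn through STT, then the LLM, then TTS.

    caller_audio is assumed to arrive from the telephony layer, and
    reply_audio is assumed to be handed back to it for playback.
    """
    transcript = stt(caller_audio)   # speech to text
    reply_text = llm(transcript)     # reasoning
    reply_audio = tts(reply_text)    # text to speech
    return Turn(transcript, reply_text, reply_audio)


# Toy stand-ins so the sketch runs without any real provider.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> str:
    return f"You said: {transcript}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = run_turn(b"where is my order", fake_stt, fake_llm, fake_tts)
print(turn.reply_text)
```

Passing each stage in as a plain function mirrors the point above: every stage can be swapped for a different provider without touching the other three.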

Why latency is the silent killer

Customers do not notice 200 ms. They feel 800 ms. Voice conversations have a perceptual budget — the gap between the customer finishing a sentence and the agent starting to speak. Past about 700 ms it feels off. Past 1.2 s it feels broken.

This budget is the sum of every stage in the pipeline plus the network. That is why teams care about whether STT can stream interim results, whether the LLM can stream its response as it is generated rather than only once it is complete, and whether TTS can begin synthesising the first word while the rest is still arriving. Streaming lets the stages overlap, so the caller waits for each stage's first output rather than its full result.
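
The arithmetic is simple enough to sketch. The thresholds below are the ones quoted above (about 700 ms and 1.2 s). The per-stage numbers are made up purely for illustration and are not measurements of any provider. The function names are hypothetical.

```python
from typing import Mapping

# Perceptual thresholds from the text above.
FEELS_OFF_MS = 700
FEELS_BROKEN_MS = 1200


def response_gap_ms(stage_ms: Mapping[str, int]) -> int:
    """The gap the caller hears is the sum of every contribution."""
    return sum(stage_ms.values())


def classify_gap(gap_ms: int) -> str:
    """Map a gap to how it is likely to feel, using the thresholds above."""
    if gap_ms > FEELS_BROKEN_MS:
        return "broken"
    if gap_ms > FEELS_OFF_MS:
        return "off"
    return "ok"


# Illustrative numbers only: each stage waits for the previous one to finish.
blocking = {"stt": 300, "llm": 600, "tts": 300, "network": 100}

# Illustrative numbers only: each stage contributes just its time to first output.
streaming = {
    "stt_final": 100,
    "llm_first_token": 250,
    "tts_first_audio": 150,
    "network": 100,
}

for name, stages in (("blocking", blocking), ("streaming", streaming)):
    gap = response_gap_ms(stages)
    print(name, gap, classify_gap(gap))
```

With these assumed numbers the blocking pipeline lands at 1300 ms and the streaming one at 600 ms, which is the whole argument for streaming at every stage.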

Conversation state, not single replies

A good voice agent is not a function that replies to one message. It is a stateful loop — it remembers what the caller said, what it has already answered, what it has already tried, and what is still pending. Knowledge bases, tools, transfer rules, and after-call analysis all hang off this state.
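
One way to picture that state is as a small record the agent updates on every turn. The sketch below is an assumption about how such a record might look, not a description of any particular product. `ConversationState` and its field and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConversationState:
    """What the agent carries from one turn to the next."""
    heard: List[str] = field(default_factory=list)      # what the caller said
    answered: List[str] = field(default_factory=list)   # topics already answered
    attempted: List[str] = field(default_factory=list)  # actions already tried
    pending: List[str] = field(default_factory=list)    # topics still open

    def hear(self, utterance: str) -> None:
        """Record a caller utterance."""
        self.heard.append(utterance)

    def record_attempt(self, action: str) -> None:
        """Record an action (for example a tool call) so it is not retried blindly."""
        self.attempted.append(action)

    def resolve(self, topic: str) -> None:
        """Move a topic from pending to answered."""
        if topic in self.pending:
            self.pending.remove(topic)
        self.answered.append(topic)


state = ConversationState(pending=["verify identity", "order status"])
state.hear("Hi, I want to check my order")
state.record_attempt("lookup_order")
state.resolve("order status")
print(state.pending)
```

Anything that needs context, such as a transfer rule or an after-call summary, can then read from this one record instead of re-deriving it from the raw transcript.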