Overview
LiveKit Agents supports two main voice architectures, plus a hybrid that combines them:
- STT-LLM-TTS pipeline: Three specialized models for speech recognition, language understanding, and speech synthesis.
- Realtime model: A single speech-to-speech model that consumes and produces audio directly.
- Half-cascade: A realtime model for input understanding paired with a separate TTS for output.
For most production agents, an STT-LLM-TTS pipeline is the right default. The sections below cover each option and when to choose it.
At a glance
The following table compares the three options across the key dimensions that most often drive architecture selection. The sections below go into more depth on each one.
| Dimension | STT-LLM-TTS pipeline | Realtime model | Half-cascade |
|---|---|---|---|
| End-to-end latency | Moderate | Fastest | Moderate |
| Tool calling | Mature | Less mature | Less mature |
| Realtime transcription | Yes | Delayed | Delayed |
| Scripted speech (`say()`) | Yes | No | Yes |
| Prosody-aware comprehension | No | Yes | Yes |
| Expressive speech output | Depends on TTS | Built-in | Depends on TTS |
| Auditability | Full text trail | Limited | Output text only |
STT-LLM-TTS pipeline
A pipeline (also called a sequential or cascaded pipeline) strings together three specialized models. Audio flows through them in sequence: speech-to-text (STT) transcribes the user's speech, a large language model (LLM) generates a text response, and text-to-speech (TTS) speaks the response back. Each stage has a clean interface, so you can swap any component independently or change models partway through a session.
Choose this for most production agents. It gives you full control over each stage and is the easiest path to debug and audit.
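The clean-interface property can be sketched with toy stage protocols. This is a minimal illustration of the cascaded pattern, not the LiveKit Agents API; all class and method names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical stage interfaces for illustration — not the LiveKit API.
class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

@dataclass
class CascadedPipeline:
    stt: STT
    llm: LLM
    tts: TTS

    def respond(self, user_audio: bytes) -> bytes:
        transcript = self.stt.transcribe(user_audio)  # speech -> text
        reply_text = self.llm.complete(transcript)    # text -> text
        return self.tts.synthesize(reply_text)        # text -> speech

# Minimal fake stages so the sketch runs end to end.
class FakeSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()

class EchoLLM:
    def complete(self, prompt: str) -> str:
        return f"You said: {prompt}"

class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()

pipeline = CascadedPipeline(stt=FakeSTT(), llm=EchoLLM(), tts=FakeTTS())
print(pipeline.respond(b"hello"))  # b'You said: hello'
```

Because each stage only depends on the protocol, swapping `FakeTTS` for a different implementation touches nothing else — the same property that lets you mix providers per stage in a real pipeline.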
Strengths
- Modularity: You can mix and match providers for STT, LLM, and TTS, and replace any stage without modifying the others.
- Observability: Every stage produces text or audio you can inspect, log, and audit.
- Mature tool calling: Text-based LLM tool calling is more mature and predictable than audio-native alternatives.
- Realtime transcription: STT produces interim transcripts you can stream to your frontend or store as a record of the conversation.
- Scripted speech: TTS reads exact text, so methods like `say()` produce predictable output.
Considerations
- Higher total latency: Each stage adds latency. Streaming overlaps the stages and keeps total latency low, but a pipeline still adds more end-to-end delay than a realtime model.
- Loss of vocal nuance: Because STT produces text, prosody and emotional cues in the user's speech don't reach the LLM.
Realtime models
A realtime model is a single speech-to-speech model that consumes and produces audio directly. There's no transcription step on the way in and no separate TTS on the way out.
Choose a realtime model when latency or expressive output matter more than fine-grained control.
Strengths
- Lower end-to-end latency: Combining stages into one model removes the inter-stage hand-offs.
- Expressive output: Generated speech can carry emotion, emphasis, and other prosodic features that text-to-speech models don't capture.
- Richer input understanding: The model hears prosody, tone, and other verbal cues that get lost in transcription.
- Simpler setup: A single model with one provider, instead of three.
Considerations
- Delayed transcripts: Realtime models don't produce interim transcripts. User transcriptions can lag the agent's response. If you need live captions or transcription-driven logic, add a separate STT plugin.
- No scripted speech: The model follows instructions but doesn't read an exact script, so methods like
say()aren't supported the same way they are in a pipeline. - Less provider flexibility: You're committed to a single provider's model for the full speech-to-speech path.
- Harder to audit: Without a text trail at every stage, debugging and compliance review take more work.
For the full list of limitations, see Considerations and limitations on the realtime models page.
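The transcript mitigation above — adding a separate STT plugin — amounts to teeing the user's audio to a transcriber that runs alongside the realtime model, so captions don't wait on the model's spoken response. A toy sketch under that assumption; names here are hypothetical, not the LiveKit Agents API:

```python
from typing import Callable

# Side-channel STT: fans incoming audio out to a transcriber alongside
# the realtime model. All names are illustrative, not the LiveKit API.
class SideChannelSTT:
    def __init__(self, on_transcript: Callable[[str], None]):
        self.on_transcript = on_transcript

    def push(self, audio: bytes) -> None:
        # A real STT would emit interim transcripts as audio streams in.
        self.on_transcript(audio.decode())

class FakeRealtimeModel:
    def push(self, audio: bytes) -> bytes:
        # A real model would return generated speech; we return a stub.
        return b"spoken reply"

captions: list[str] = []
stt = SideChannelSTT(on_transcript=captions.append)
model = FakeRealtimeModel()

def on_user_audio(frame: bytes) -> bytes:
    stt.push(frame)           # captions update independently of the model
    return model.push(frame)  # speech-to-speech path is unchanged

reply = on_user_audio(b"hello agent")
print(captions)  # ['hello agent']
```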
Half-cascade architecture
A half-cascade architecture pairs a realtime model with a separate TTS. The realtime model handles speech understanding only and returns a text response, and a TTS plugin speaks that response. This combines the input-side strengths of realtime with the output-side strengths of a pipeline. For configuration details, see Separate TTS configuration on the realtime models page.
Choose a half-cascade when you want both realtime speech understanding and full control over what your agent says.
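The wiring can be sketched as follows: the realtime stage consumes audio and returns text, and a separate TTS speaks that text, which is also what makes scripted speech work again. A toy illustration with hypothetical names, not the LiveKit Agents API:

```python
# Toy half-cascade: audio in, text out of the "realtime" stage, then a
# separate TTS for speech output. Names are illustrative only.
class TextOnlyRealtimeModel:
    def respond(self, user_audio: bytes) -> str:
        # A real model also hears prosody and tone; here we echo words.
        return f"Reply to: {user_audio.decode()}"

class TTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()

class HalfCascadeAgent:
    def __init__(self, model: TextOnlyRealtimeModel, tts: TTS):
        self.model = model
        self.tts = tts

    def respond(self, user_audio: bytes) -> tuple[str, bytes]:
        text = self.model.respond(user_audio)    # audio in, text out
        return text, self.tts.synthesize(text)   # text trail on output

    def say(self, script: str) -> bytes:
        # Scripted speech works because output goes through TTS.
        return self.tts.synthesize(script)

agent = HalfCascadeAgent(TextOnlyRealtimeModel(), TTS())
text, audio = agent.respond(b"hi")
print(text)  # Reply to: hi
```

Note that the output side always produces text before speech, which is why this pattern recovers the output-side auditability and `say()` behavior of a full pipeline.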
Strengths
- Realtime speech comprehension: You keep the realtime model's ability to hear prosody and emotional cues.
- Output control: A separate TTS lets you read exact scripts, choose voices, and apply the same control you'd have in a pipeline.
- Stable speech output: A dedicated TTS avoids realtime-specific output quirks, like some realtime models defaulting to text-only output after loading long conversation histories.
Considerations
- Two models to manage: You configure and operate both a realtime model and a TTS, so the setup is closer to a pipeline than a pure realtime agent.
- Provider support varies: Not all realtime models support a text-only response modality. Check the relevant provider page before adopting this pattern.
Latency
Voice conversations feel natural when end-to-end response latency stays under one second. Each architecture has a different latency profile. Pipelines accumulate latency across stages but reduce it through streaming, while realtime models combine stages for a lower baseline. For a per-stage breakdown of voice agent latency, see Sequential pipeline architecture for voice agents.
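The streaming-overlap point can be made concrete with rough arithmetic. With made-up per-stage numbers (not benchmarks): if each stage waited for the previous one to finish, total delay would be the sum of full stage durations; with streaming, downstream stages start on the first chunk, so perceived latency is closer to the sum of each stage's time-to-first-output.

```python
# Illustrative latency arithmetic (made-up numbers, not benchmarks).
# Sequential: each stage waits for the previous stage to finish.
stt_total, llm_total, tts_total = 0.8, 1.5, 1.2  # seconds, full duration
sequential = stt_total + llm_total + tts_total

# Streaming: downstream stages start on the first chunk, so latency to
# first audio is roughly the sum of per-stage time-to-first-output.
stt_first, llm_first_token, tts_first_audio = 0.2, 0.3, 0.15
streaming = stt_first + llm_first_token + tts_first_audio

print(f"sequential: {sequential:.2f}s")              # 3.50s
print(f"streaming to first audio: {streaming:.2f}s") # 0.65s
```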
Additional resources
Models overview
All models supported by LiveKit Agents, including STT, LLM, TTS, realtime, and avatar.
Realtime models
Configuration, considerations, and provider plugins for realtime models.
Sessions
Configure your AgentSession to use a pipeline, a realtime model, or a half-cascade.
Sequential pipeline architecture for voice agents
Detailed analysis of the cascaded pipeline pattern, including per-stage latency budgets.