Pipeline types

Compare the voice pipeline types supported by LiveKit Agents and pick the right one for your agent.

Overview

LiveKit Agents supports two main voice pipeline types, plus a hybrid that combines them:

  • STT-LLM-TTS pipeline: Three specialized models for speech recognition, language understanding, and speech synthesis.
  • Realtime model: A single speech-to-speech model that consumes and produces audio directly.
  • Half-cascade: A realtime model for input understanding paired with a separate TTS for output.

For most production agents, an STT-LLM-TTS pipeline is the right default. The sections below cover each option and when to choose it.

At a glance

The following table compares the three options across the key dimensions that most often drive architecture selection. The sections below go into more depth on each one.

| Dimension                   | STT-LLM-TTS pipeline | Realtime model | Half-cascade     |
| --------------------------- | -------------------- | -------------- | ---------------- |
| End-to-end latency          | Moderate             | Fastest        | Moderate         |
| Tool calling                | Mature               | Less mature    | Less mature      |
| Realtime transcription      | Yes                  | Delayed        | Delayed          |
| Scripted speech (say())     | Yes                  | No             | Yes              |
| Prosody-aware comprehension | No                   | Yes            | Yes              |
| Expressive speech output    | Depends on TTS       | Built-in       | Depends on TTS   |
| Auditability                | Full text trail      | Limited        | Output text only |

STT-LLM-TTS pipeline

A pipeline (also called a sequential or cascaded pipeline) strings together three specialized models. Audio flows through them in sequence: speech-to-text (STT) transcribes the user's speech, a large language model (LLM) generates a text response, and text-to-speech (TTS) speaks the response back. Each stage has a clean interface, so you can swap any component independently or change models partway through a session.

Choose this for most production agents. It gives you full control over each stage and is the easiest path to debug and audit.
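As a concrete sketch, a pipeline agent wires the three stages into a single AgentSession. The plugin providers and the instructions text below are illustrative assumptions; any supported STT, LLM, or TTS provider can be swapped in for each stage.

```python
# Sketch of an STT-LLM-TTS AgentSession. Plugin choices are illustrative;
# each stage can be replaced independently without touching the others.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        stt=deepgram.STT(),     # stage 1: transcribe user speech to text
        llm=openai.LLM(),       # stage 2: generate a text response
        tts=cartesia.TTS(),     # stage 3: speak the response
        vad=silero.VAD.load(),  # voice activity detection for turn-taking
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )
```

Because every stage has a clean text or audio interface, you can log each hand-off for auditing or swap one provider mid-session.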

Strengths

  • Modularity: You can mix and match providers for STT, LLM, and TTS, and replace any stage without modifying the others.
  • Observability: Every stage produces text or audio you can inspect, log, and audit.
  • Mature tool calling: Text-based LLM tool calling is more mature and predictable than audio-native alternatives.
  • Realtime transcription: STT produces interim transcripts you can stream to your frontend or store as a record of the conversation.
  • Scripted speech: TTS reads exact text, so methods like say() produce predictable output.
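To illustrate the scripted-speech point, a single call speaks an exact script through the pipeline's TTS (assuming `session` is an already-started AgentSession; the greeting text is illustrative):

```python
# TTS reads the text verbatim, so the spoken output is fully predictable.
# `session` is assumed to be a started AgentSession in a pipeline setup.
session.say("Thanks for calling! How can I help you today?")
```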

Considerations

  • Higher total latency: Each stage adds latency. Streaming overlaps the stages and keeps total latency low, but a pipeline still adds more end-to-end delay than a realtime model.
  • Loss of vocal nuance: Because STT produces text, prosody and emotional cues in the user's speech don't reach the LLM.

Realtime models

A single realtime model consumes and produces speech directly. There's no transcription step on the way in and no separate TTS on the way out.

Choose a realtime model when latency or expressive output matters more than fine-grained control.
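The setup collapses to a single model. The provider and voice below are illustrative assumptions; any supported realtime model can take the `llm` slot.

```python
# Sketch of a realtime-model AgentSession. Provider and voice are
# illustrative; one speech-to-speech model replaces STT, LLM, and TTS.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="coral"),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )
```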

Strengths

  • Lower end-to-end latency: Combining stages into one model removes the inter-stage hand-offs.
  • Expressive output: Generated speech can carry emotion, emphasis, and other prosodic features that text-to-speech models don't capture.
  • Richer input understanding: The model hears prosody, tone, and other verbal cues that get lost in transcription.
  • Simpler setup: A single model with one provider, instead of three.

Considerations

  • Delayed transcripts: Realtime models don't produce interim transcripts. User transcriptions can lag the agent's response. If you need live captions or transcription-driven logic, add a separate STT plugin.
  • No scripted speech: The model follows instructions but doesn't read an exact script, so methods like say() aren't supported the same way they are in a pipeline.
  • Less provider flexibility: You're committed to a single provider's model for the full speech-to-speech path.
  • Harder to audit: Without a text trail at every stage, debugging and compliance review take more work.

For the full list of limitations, see Considerations and limitations on the realtime models page.
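If you need live captions alongside a realtime model, one option is to pair it with a dedicated STT plugin that handles user transcription only. This is a sketch under assumed plugin names; the realtime model still hears the raw audio, while the STT supplies interim transcripts.

```python
# Sketch: realtime model plus a separate STT plugin so the session still
# produces interim user transcripts. Plugin names are illustrative.
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),  # handles the full audio path
    stt=deepgram.STT(),                   # used only for live transcription
)
```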

Half-cascade architecture

A half-cascade architecture pairs a realtime model with a separate TTS. The realtime model handles speech understanding only and returns a text response, and a TTS plugin speaks that response. This combines the input-side strengths of realtime with the output-side strengths of a pipeline. For configuration details, see Separate TTS configuration on the realtime models page.

Choose a half-cascade when you want both realtime speech understanding and full control over what your agent says.
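A half-cascade configuration might look like the following sketch. The provider names, the `modalities` option, and the greeting text are assumptions; as noted below, not every realtime model supports a text-only response modality, so check your provider's documentation.

```python
# Sketch of a half-cascade: a realtime model restricted to text output,
# paired with a separate TTS plugin. Provider support varies; the plugin
# names and options here are illustrative.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, google


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        # The realtime model handles speech understanding and returns text...
        llm=google.beta.realtime.RealtimeModel(modalities=["TEXT"]),
        # ...while a dedicated TTS speaks the response.
        tts=cartesia.TTS(),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )
    # With a separate TTS, scripted speech works as in a full pipeline:
    session.say("Hi there! How can I help today?")
```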

Strengths

  • Realtime speech comprehension: You keep the realtime model's ability to hear prosody and emotional cues.
  • Output control: A separate TTS lets you read exact scripts, choose voices, and apply the same control you'd have in a pipeline.
  • Stable speech output: A dedicated TTS avoids realtime-specific output quirks, like some realtime models defaulting to text-only output after loading long conversation histories.

Considerations

  • Two models to manage: You configure and operate both a realtime model and a TTS, so the setup is closer to a pipeline than a pure realtime agent.
  • Provider support varies: Not all realtime models support a text-only response modality. Check the relevant provider page before adopting this pattern.

Latency

Voice conversations feel natural when end-to-end response latency stays under one second. Each architecture has a different latency profile. Pipelines accumulate latency across stages but reduce it through streaming, while realtime models combine stages for a lower baseline. For a per-stage breakdown of voice agent latency, see Sequential pipeline architecture for voice agents.
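A back-of-envelope budget shows why streaming keeps a pipeline viable. With streaming, each stage contributes only its time to first output rather than its full processing time. Every millisecond figure below is an illustrative assumption, not a measurement:

```python
# Back-of-envelope latency budget for a streamed STT-LLM-TTS pipeline.
# All millisecond figures are illustrative assumptions, not measurements.
stt_final_transcript_ms = 200     # STT endpointing + final transcript
llm_time_to_first_token_ms = 350  # LLM time to first token
tts_time_to_first_byte_ms = 150   # TTS time to first audio byte

total_ms = (
    stt_final_transcript_ms
    + llm_time_to_first_token_ms
    + tts_time_to_first_byte_ms
)
print(total_ms)  # 700 ms: under the ~1 s threshold for natural conversation
```

A non-streaming pipeline would instead pay each stage's full processing time in sequence, which is why streaming every stage matters in practice.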

Additional resources