# Pipeline types

> Compare the voice pipeline types supported by LiveKit Agents and pick the right one for your agent.

## Overview

LiveKit Agents supports two main voice pipeline types, plus a hybrid that combines them:

- **STT-LLM-TTS pipeline**: Three specialized models for speech recognition, language understanding, and speech synthesis.
- **Realtime model**: A single speech-to-speech model that consumes and produces audio directly.
- **Half-cascade**: A realtime model for input understanding paired with a separate TTS for output.

For most production agents, an STT-LLM-TTS pipeline is the right default. The sections below cover each option and when to choose it.

## At a glance

The following table compares the three options across the dimensions that most often drive the choice of architecture.

| Dimension | STT-LLM-TTS pipeline | Realtime model | Half-cascade |
| --- | --- | --- | --- |
| End-to-end latency | Moderate | Fastest | Moderate |
| Tool calling | Mature | Less mature | Less mature |
| Realtime transcription | Yes | Delayed | Delayed |
| Scripted speech (`say()`) | Yes | No | Yes |
| Prosody-aware comprehension | No | Yes | Yes |
| Expressive speech output | Depends on TTS | Built-in | Depends on TTS |
| Auditability | Full text trail | Limited | Output text only |
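
As a rough illustration of how the table can drive a decision, the sketch below encodes a few of its rows as a small helper. The `Requirements` fields and the `suggest_pipeline_type` function are hypothetical names invented for this example and are not part of LiveKit Agents. The precedence between conflicting requirements is an assumption, not official guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hypothetical requirement flags, loosely mirroring rows of the table above."""
    needs_scripted_speech: bool = False    # exact-text output, such as say()
    needs_live_transcripts: bool = False   # interim transcripts for captions or logic
    needs_prosody_awareness: bool = False  # model should hear tone and emotion
    needs_lowest_latency: bool = False


def suggest_pipeline_type(req: Requirements) -> str:
    # Realtime transcription is only marked "Yes" for the STT-LLM-TTS pipeline,
    # so this sketch lets that requirement win over everything else.
    if req.needs_live_transcripts:
        return "STT-LLM-TTS pipeline"
    if req.needs_prosody_awareness:
        # Both realtime-based options hear prosody, but only the half-cascade
        # keeps scripted speech through its separate TTS.
        return "Half-cascade" if req.needs_scripted_speech else "Realtime model"
    if req.needs_lowest_latency and not req.needs_scripted_speech:
        return "Realtime model"
    # Default recommended in the overview for most production agents.
    return "STT-LLM-TTS pipeline"


print(suggest_pipeline_type(Requirements(needs_prosody_awareness=True,
                                         needs_scripted_speech=True)))
```

When requirements conflict, this sketch falls back to the pipeline default. A real decision would also weigh tool calling maturity and auditability, which the helper leaves out.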

## STT-LLM-TTS pipeline

A pipeline (also called a sequential or cascaded pipeline) strings together three specialized models. Audio flows through them in sequence: speech-to-text (STT) transcribes the user's speech, a large language model (LLM) generates a text response, and text-to-speech (TTS) speaks the response back. Each stage has a clean interface, so you can swap any component independently or change models partway through a session.

Choose this for most production agents. It gives you full control over each stage and is the easiest path to debug and audit.
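
The following is a minimal conceptual sketch of the cascaded flow, not the LiveKit Agents API. The `STT`, `LLM`, and `TTS` protocols, the stub classes, and `run_turn` are all hypothetical stand-ins, and real stages stream audio and text rather than handling whole turns. It shows why a clean interface per stage makes components swappable and leaves a text trail.

```python
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def respond(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


# Toy stand-ins so the example runs without any provider or network access.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UppercaseLLM:
    def respond(self, text: str) -> str:
        return text.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


def run_turn(stt: STT, llm: LLM, tts: TTS, audio: bytes, trail: list[str]) -> bytes:
    """Run one conversational turn and record the text produced at each stage."""
    transcript = stt.transcribe(audio)   # speech -> text
    trail.append(f"user: {transcript}")
    reply = llm.respond(transcript)      # text -> text
    trail.append(f"agent: {reply}")
    return tts.synthesize(reply)         # text -> speech


trail: list[str] = []
audio_out = run_turn(EchoSTT(), UppercaseLLM(), BytesTTS(), b"hello", trail)
print(audio_out, trail)
```

Because `run_turn` depends only on the three interfaces, replacing `UppercaseLLM` with another object that has a `respond` method changes one stage without touching the others.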

### Strengths

- **Modularity**: You can mix and match providers for STT, LLM, and TTS, and replace any stage without modifying the others.
- **Observability**: Every stage produces text or audio you can inspect, log, and audit.
- **Mature tool calling**: Text-based LLM tool calling is more mature and predictable than audio-native alternatives.
- **Realtime transcription**: STT produces interim transcripts you can stream to your frontend or store as a record of the conversation.
- **Scripted speech**: TTS reads exact text, so methods like `say()` produce predictable output.

### Considerations

- **Higher total latency**: Each stage adds latency. Streaming overlaps the stages and keeps total latency low, but a pipeline still adds more end-to-end delay than a realtime model.
- **Loss of vocal nuance**: Because STT produces text, prosody and emotional cues in the user's speech don't reach the LLM.
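
To make the effect of streaming concrete, here is a back-of-the-envelope sketch. The `Stage` type and every timing value are hypothetical, not measurements. It also assumes a simplified model in which each stage can start as soon as the upstream stage emits its first chunk, which real systems only approximate (a TTS often waits for a full sentence, for example).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # time until the stage emits its first chunk
    total_ms: int         # time until the stage finishes its whole output


def batch_latency_ms(stages: list[Stage]) -> int:
    """No streaming: each stage waits for the previous one to finish completely."""
    return sum(s.total_ms for s in stages)


def streamed_latency_ms(stages: list[Stage]) -> int:
    """Streaming: each stage starts on the first chunk from the stage before it."""
    return sum(s.first_output_ms for s in stages)


# Hypothetical timings, chosen only to illustrate the arithmetic.
stages = [
    Stage("stt", first_output_ms=150, total_ms=300),
    Stage("llm", first_output_ms=250, total_ms=1200),
    Stage("tts", first_output_ms=120, total_ms=900),
]

print(batch_latency_ms(stages))     # 2400
print(streamed_latency_ms(stages))  # 520
```

Under these assumptions, streaming shrinks the time before the agent's audio can start, but the result is still a sum over three stages, which is the hand-off cost a single realtime model avoids.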

## Realtime models

A [realtime model](https://docs.livekit.io/agents/models/realtime.md) consumes and produces speech directly, in a single model. There's no transcription step on the way in and no separate TTS on the way out.

Choose a realtime model when latency or expressive output matters more than fine-grained control.

### Strengths

- **Lower end-to-end latency**: Combining stages into one model removes the inter-stage hand-offs.
- **Expressive output**: Generated speech can carry emotion, emphasis, and other prosodic features that many text-to-speech models don't capture.
- **Richer input understanding**: The model hears prosody, tone, and other vocal cues that get lost in transcription.
- **Simpler setup**: A single model with one provider, instead of three.

### Considerations

- **Delayed transcripts**: Realtime models don't produce interim transcripts, and user transcriptions can lag behind the agent's response. If you need live captions or transcription-driven logic, add a separate STT plugin.
- **No scripted speech**: The model follows instructions but doesn't read an exact script, so methods like `say()` aren't supported the same way they are in a pipeline.
- **Less provider flexibility**: You're committed to a single provider's model for the full speech-to-speech path.
- **Harder to audit**: Without a text trail at every stage, debugging and compliance review take more work.

For the full list of limitations, see [Considerations and limitations](https://docs.livekit.io/agents/models/realtime.md#considerations-and-limitations) on the realtime models page.

## Half-cascade architecture

A half-cascade architecture pairs a realtime model with a separate TTS. The realtime model handles speech understanding only and returns a text response, and a TTS plugin speaks that response. This combines the input-side strengths of realtime with the output-side strengths of a pipeline. For configuration details, see [Separate TTS configuration](https://docs.livekit.io/agents/models/realtime.md#separate-tts) on the realtime models page.

Choose a half-cascade when you want both realtime speech understanding and full control over what your agent says.
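
The data flow described above can be sketched conceptually as follows. `HalfCascadeAgent` and its methods are hypothetical names for illustration only and don't reflect how LiveKit Agents implements this. The two callables stand in for a realtime model returning text and a separate TTS.

```python
from typing import Callable


class HalfCascadeAgent:
    """Toy model of a half-cascade: audio in, text in the middle, TTS audio out."""

    def __init__(
        self,
        understand: Callable[[bytes], str],   # stand-in for a realtime model with text-only output
        synthesize: Callable[[str], bytes],   # stand-in for a separate TTS
    ) -> None:
        self._understand = understand
        self._synthesize = synthesize
        self.output_log: list[str] = []       # text of everything the agent speaks

    def reply(self, user_audio: bytes) -> bytes:
        # The model consumes audio directly, so there is no input transcript here.
        return self._speak(self._understand(user_audio))

    def say(self, script: str) -> bytes:
        # Scripted speech bypasses the model and goes straight to the TTS.
        return self._speak(script)

    def _speak(self, text: str) -> bytes:
        self.output_log.append(text)
        return self._synthesize(text)


agent = HalfCascadeAgent(
    understand=lambda audio: f"heard {len(audio)} bytes",
    synthesize=lambda text: text.encode("utf-8"),
)
agent.say("Welcome.")
agent.reply(b"\x00\x01")
print(agent.output_log)
```

The log holds only what the agent spoke, which matches the "Output text only" auditability entry in the comparison table. The user's side stays as audio unless you add transcription separately.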

### Strengths

- **Realtime speech comprehension**: You keep the realtime model's ability to hear prosody and emotional cues.
- **Output control**: A separate TTS lets you read exact scripts, choose voices, and apply the same control you'd have in a pipeline.
- **Stable speech output**: A dedicated TTS avoids realtime-specific output quirks, like some realtime models defaulting to text-only output after loading long conversation histories.

### Considerations

- **Two models to manage**: You configure and operate both a realtime model and a TTS, so the setup is closer to a pipeline than a pure realtime agent.
- **Provider support varies**: Not all realtime models support a text-only response modality. Check the relevant provider page before adopting this pattern.

## Latency

Voice conversations feel natural when end-to-end response latency stays under one second. Each architecture has a different latency profile. Pipelines accumulate latency across stages but reduce it through streaming, while realtime models combine stages for a lower baseline. For a per-stage breakdown of voice agent latency, see [Sequential pipeline architecture for voice agents](https://livekit.com/blog/sequential-pipeline-architecture-voice-agents).
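
One way to check an agent against the one-second guideline is to compute response latency from per-turn timestamps. The sketch below assumes you already record when the user stopped speaking and when the agent's audio started. The `Turn` type, its field names, and the sample values are hypothetical.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    user_stopped_speaking_s: float   # timestamp when the user's speech ended
    agent_started_speaking_s: float  # timestamp when the agent's audio began


def response_latency_s(turn: Turn) -> float:
    return turn.agent_started_speaking_s - turn.user_stopped_speaking_s


def summarize(turns: list[Turn], budget_s: float = 1.0) -> dict[str, float]:
    latencies = [response_latency_s(t) for t in turns]
    return {
        "median_s": statistics.median(latencies),
        "worst_s": max(latencies),
        # The target is to stay under the budget, so hitting it exactly counts as over.
        "over_budget": sum(1 for latency in latencies if latency >= budget_s),
    }


# Hypothetical timestamps in seconds since the session started.
turns = [Turn(10.0, 10.5), Turn(20.0, 20.75), Turn(30.0, 31.25)]
print(summarize(turns))
```

Tracking the worst case alongside the median matters because a few slow turns can make a conversation feel unnatural even when the typical turn is fast.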

## Additional resources

- **[Models overview](https://docs.livekit.io/agents/models.md)**: All models supported by LiveKit Agents, including STT, LLM, TTS, realtime, and avatar.

- **[Realtime models](https://docs.livekit.io/agents/models/realtime.md)**: Configuration, considerations, and provider plugins for realtime models.

- **[Sessions](https://docs.livekit.io/agents/logic/sessions.md)**: Configure your AgentSession to use a pipeline, a realtime model, or a half-cascade.

- **[Sequential pipeline architecture for voice agents](https://livekit.com/blog/sequential-pipeline-architecture-voice-agents)**: Detailed analysis of the cascaded pipeline pattern, including per-stage latency budgets.

---

This document was rendered at 2026-06-07T11:33:40.031Z.
For the latest version of this document, see [https://docs.livekit.io/agents/models/pipelines.md](https://docs.livekit.io/agents/models/pipelines.md).

To explore all LiveKit documentation, see [llms.txt](https://docs.livekit.io/llms.txt).