Pipeline types

Compare the voice pipeline types supported by LiveKit Agents and pick the right one for your agent.

Overview

LiveKit Agents supports two main voice pipeline types, plus a hybrid that combines them:

  • STT-LLM-TTS pipeline: Three specialized models for speech recognition, language understanding, and speech synthesis.
  • Realtime model: A single speech-to-speech model that consumes and produces audio directly.
  • Half-cascade: A realtime model for input understanding paired with a separate TTS for output.

For most production agents, an STT-LLM-TTS pipeline is the right default. The sections below cover each option and when to choose it.

At a glance

The following table compares the three options across the key dimensions that most often drive architecture selection. The sections below go into more depth on each one.

| Dimension                   | STT-LLM-TTS pipeline | Realtime model | Half-cascade     |
| --------------------------- | -------------------- | -------------- | ---------------- |
| End-to-end latency          | Moderate             | Fastest        | Moderate         |
| Tool calling                | Mature               | Less mature    | Less mature      |
| Realtime transcription      | Yes                  | Delayed        | Delayed          |
| Scripted speech (say())     | Yes                  | No             | Yes              |
| Prosody-aware comprehension | No                   | Yes            | Yes              |
| Expressive speech output    | Depends on TTS       | Built-in       | Depends on TTS   |
| Auditability                | Full text trail      | Limited        | Output text only |

STT-LLM-TTS pipeline

A pipeline (also called a sequential or cascaded pipeline) strings together three specialized models. Audio flows through them in sequence: speech-to-text (STT) transcribes the user's speech, a large language model (LLM) generates a text response, and text-to-speech (TTS) speaks the response back. Each stage has a clean interface, so you can swap any component independently or change models partway through a session.

Choose this for most production agents. It gives you full control over each stage and is the easiest path to debug and audit.
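As a concrete sketch, a pipeline agent wires the three stages into a single AgentSession. The plugin providers and the instructions text below are illustrative assumptions; any supported STT, LLM, or TTS provider can be swapped in for each stage.

```python
# Sketch of an STT-LLM-TTS AgentSession. Plugin choices are illustrative;
# each stage can be replaced independently without touching the others.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        stt=deepgram.STT(),     # stage 1: transcribe user speech to text
        llm=openai.LLM(),       # stage 2: generate a text response
        tts=cartesia.TTS(),     # stage 3: speak the response
        vad=silero.VAD.load(),  # voice activity detection for turn-taking
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )
```

Because every stage has a clean text or audio interface, you can log each hand-off for auditing or swap one provider mid-session.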

Strengths

  • Modularity: You can mix and match providers for STT, LLM, and TTS, and replace any stage without modifying the others.
  • Observability: Every stage produces text or audio you can inspect, log, and audit.
  • Mature tool calling: Text-based LLM tool calling is more mature and predictable than audio-native alternatives.
  • Realtime transcription: STT produces interim transcripts you can stream to your frontend or store as a record of the conversation.
  • Scripted speech: TTS reads exact text, so methods like say() produce predictable output.
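To illustrate the scripted-speech point, a single call speaks an exact script through the pipeline's TTS (assuming `session` is an already-started AgentSession; the greeting text is illustrative):

```python
# TTS reads the text verbatim, so the spoken output is fully predictable.
# `session` is assumed to be a started AgentSession in a pipeline setup.
session.say("Thanks for calling! How can I help you today?")
```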

Considerations

  • Higher total latency: Each stage adds latency. Streaming overlaps the stages and keeps total latency low, but a pipeline still adds more end-to-end delay than a realtime model.
  • Loss of vocal nuance: Because STT produces text, prosody and emotional cues in the user's speech don't reach the LLM.

Realtime models

A single realtime model consumes and produces speech directly. There's no transcription step on the way in and no separate TTS on the way out.

Choose a realtime model when latency or expressive output matters more than fine-grained control.
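The setup collapses to a single model. The provider and voice below are illustrative assumptions; any supported realtime model can take the `llm` slot.

```python
# Sketch of a realtime-model AgentSession. Provider and voice are
# illustrative; one speech-to-speech model replaces STT, LLM, and TTS.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="coral"),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )
```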

Strengths

  • Lower end-to-end latency: Combining stages into one model removes the inter-stage hand-offs.
  • Expressive output: Generated speech can carry emotion, emphasis, and other prosodic features that text-to-speech models don't capture.
  • Richer input understanding: The model hears prosody, tone, and other verbal cues that get lost in transcription.
  • Simpler setup: A single model with one provider, instead of three.

Considerations

  • Delayed transcripts: Realtime models don't produce interim transcripts. User transcriptions can lag the agent's response. If you need live captions or transcription-driven logic, add a separate STT plugin.
  • No scripted speech: The model follows instructions but doesn't read an exact script, so methods like say() aren't supported the same way they are in a pipeline.
  • Less provider flexibility: You're committed to a single provider's model for the full speech-to-speech path.
  • Harder to audit: Without a text trail at every stage, debugging and compliance review take more work.

For the full list of limitations, see Considerations and limitations on the realtime models page.
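If you need live captions alongside a realtime model, one option is to pair it with a dedicated STT plugin that handles user transcription only. This is a sketch under assumed plugin names; the realtime model still hears the raw audio, while the STT supplies interim transcripts.

```python
# Sketch: realtime model plus a separate STT plugin so the session still
# produces interim user transcripts. Plugin names are illustrative.
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),  # handles the full audio path
    stt=deepgram.STT(),                   # used only for live transcription
)
```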

Half-cascade architecture

A half-cascade architecture pairs a realtime model with a separate TTS. The realtime model handles speech understanding only and returns a text response, and a TTS plugin speaks that response. This combines the input-side strengths of realtime with the output-side strengths of a pipeline. For configuration details, see Separate TTS configuration on the realtime models page.

Choose a half-cascade when you want both realtime speech understanding and full control over what your agent says.
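A half-cascade configuration might look like the following sketch. The provider names, the `modalities` option, and the greeting text are assumptions; as noted below, not every realtime model supports a text-only response modality, so check your provider's documentation.

```python
# Sketch of a half-cascade: a realtime model restricted to text output,
# paired with a separate TTS plugin. Provider support varies; the plugin
# names and options here are illustrative.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, google


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        # The realtime model handles speech understanding and returns text...
        llm=google.beta.realtime.RealtimeModel(modalities=["TEXT"]),
        # ...while a dedicated TTS speaks the response.
        tts=cartesia.TTS(),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )
    # With a separate TTS, scripted speech works as in a full pipeline:
    session.say("Hi there! How can I help today?")
```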

Strengths

  • Realtime speech comprehension: You keep the realtime model's ability to hear prosody and emotional cues.
  • Output control: A separate TTS lets you read exact scripts, choose voices, and apply the same control you'd have in a pipeline.
  • Stable speech output: A dedicated TTS avoids realtime-specific output quirks, like some realtime models defaulting to text-only output after loading long conversation histories.

Considerations

  • Two models to manage: You configure and operate both a realtime model and a TTS, so the setup is closer to a pipeline than a pure realtime agent.
  • Provider support varies: Not all realtime models support a text-only response modality. Check the relevant provider page before adopting this pattern.

Latency

Voice conversations feel natural when end-to-end response latency stays under one second. Each architecture has a different latency profile. Pipelines accumulate latency across stages but reduce it through streaming, while realtime models combine stages for a lower baseline. For a per-stage breakdown of voice agent latency, see Sequential pipeline architecture for voice agents.
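A back-of-envelope budget shows why streaming keeps a pipeline viable. With streaming, each stage contributes only its time to first output rather than its full processing time. Every millisecond figure below is an illustrative assumption, not a measurement:

```python
# Back-of-envelope latency budget for a streamed STT-LLM-TTS pipeline.
# All millisecond figures are illustrative assumptions, not measurements.
stt_final_transcript_ms = 200     # STT endpointing + final transcript
llm_time_to_first_token_ms = 350  # LLM time to first token
tts_time_to_first_byte_ms = 150   # TTS time to first audio byte

total_ms = (
    stt_final_transcript_ms
    + llm_time_to_first_token_ms
    + tts_time_to_first_byte_ms
)
print(total_ms)  # 700 ms: under the ~1 s threshold for natural conversation
```

A non-streaming pipeline would instead pay each stage's full processing time in sequence, which is why streaming every stage matters in practice.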

Additional resources