Realtime model integrations

Guides for adding realtime model integrations to your agents.

Overview

Realtime models are capable of consuming and producing speech directly, bypassing the need for a voice pipeline with speech-to-text and text-to-speech components. They can be better at understanding the emotional context of input speech, as well as other verbal cues that may not translate well to text transcription. Additionally, the generated speech can include similar emotional aspects and other improvements over what a text-to-speech model can produce.

The agents framework includes plugins for popular realtime models out of the box. This is a new area in voice AI, and LiveKit aims to support new providers as they emerge.

LiveKit is open source and welcomes new plugin contributions.

How to use

Each realtime model plugin includes a RealtimeModel class whose constructor creates a model instance. Pass this instance directly to an AgentSession or Agent in its constructor, in place of an LLM plugin.

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
)
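
The same RealtimeModel instance can instead be passed to an individual Agent, which is useful when only some agents in a session should use a realtime model. A minimal sketch:

from livekit.agents import Agent
from livekit.plugins import openai

agent = Agent(
    instructions="You are a helpful voice assistant.",
    llm=openai.realtime.RealtimeModel(),
)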

For additional information about installing and using plugins, see the plugins overview.

Available providers

Realtime model plugins are available for several providers, including OpenAI's Realtime API and Google's Gemini Live API.

Considerations and limitations

Realtime models offer clear benefits: a wider range of audio understanding and more expressive output. However, they also come with some limitations and considerations to keep in mind.

Turn detection and VAD

In general, LiveKit recommends using the built-in turn detection capabilities of the realtime model whenever possible. The LiveKit turn detector model relies on both VAD and context gained from realtime speech-to-text which, as discussed in the following section, realtime models don't provide. If you need to use the LiveKit turn detector model, you must also add a separate STT plugin to supply the necessary interim transcripts, as shown in the sketch below.
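
A minimal sketch of that setup, assuming the deepgram plugin as the STT provider and the multilingual LiveKit turn detector model:

from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
    stt=deepgram.STT(),  # supplies the interim transcripts the turn detector needs
    turn_detection=MultilingualModel(),
)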

Delayed transcription

Realtime models don't provide interim transcription results, and user input transcriptions can be considerably delayed, often arriving after the agent's response. If you need realtime transcriptions, consider an STT-LLM-TTS pipeline or add a separate STT plugin to your session.
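
As a sketch of the second approach, the session below pairs a realtime model with a separate STT plugin, assuming the user_input_transcribed event and its transcript and is_final fields behave as in a standard pipeline session:

from livekit.agents import AgentSession, UserInputTranscribedEvent
from livekit.plugins import deepgram, openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
    stt=deepgram.STT(),  # produces interim transcripts in realtime
)

@session.on("user_input_transcribed")
def on_user_input_transcribed(event: UserInputTranscribedEvent):
    # interim results arrive with is_final=False, followed by a final transcript
    print(f"transcript: {event.transcript} (final: {event.is_final})")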

Scripted speech output

Realtime models don't offer a way to generate speech directly from a text script, as the say method does. You can produce a response with generate_reply(instructions='...'), but the output isn't guaranteed to follow any provided script precisely. If you must use a specific script, add a separate TTS plugin to your AgentSession for use with the say method. For the most seamless experience, use a TTS plugin with the same provider and voice configuration as your realtime model.
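
For example, a minimal sketch combining an OpenAI realtime model with an OpenAI TTS plugin, assuming a shared voice setting named "alloy":

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
    tts=openai.TTS(voice="alloy"),  # same provider and voice for a consistent sound
)

# later, inside agent code:
await session.say("Thanks for calling! How can I help you today?")  # exact script via TTS
await session.generate_reply(instructions="Answer the caller's question.")  # realtime model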