Realtime model integrations

Guides for adding realtime model integrations to your agents.

Overview

Realtime models are capable of consuming and producing speech directly, bypassing the need for a voice pipeline with speech-to-text and text-to-speech components. They can be better at understanding the emotional context of input speech, as well as other verbal cues that may not translate well to text transcription. Additionally, the generated speech can include similar emotional aspects and other improvements over what a text-to-speech model can produce.

The agents framework includes plugins for popular realtime models out of the box. This is a new area in voice AI, and LiveKit aims to support new providers as they emerge.

LiveKit is open source and welcomes new plugin contributions.

How to use

Each realtime model plugin includes a RealtimeModel class whose constructor creates a model instance. Pass this instance directly to an AgentSession or Agent in its constructor, in place of an LLM plugin.

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
)
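
The same RealtimeModel instance can instead be passed to an individual Agent, which is useful when only some agents in a session should use a realtime model. A minimal sketch:

from livekit.agents import Agent
from livekit.plugins import openai

agent = Agent(
    instructions="You are a helpful voice assistant.",
    llm=openai.realtime.RealtimeModel(),
)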

For additional information about installing and using plugins, see the plugins overview.

Available providers

Realtime model plugins are available for several providers, including OpenAI's Realtime API and Google's Gemini Live API.

Considerations and limitations

Realtime models offer clear benefits: a wider range of audio understanding and more expressive output. However, they also come with some limitations and considerations to keep in mind.

Turn detection and VAD

In general, LiveKit recommends using the built-in turn detection capabilities of the realtime model whenever possible. The LiveKit turn detector model relies on both VAD and context gained from realtime speech-to-text which, as discussed in the following section, realtime models don't provide. If you need to use the LiveKit turn detector model, you must also add a separate STT plugin to supply the necessary interim transcripts, as shown in the sketch below.
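
A minimal sketch of that setup, assuming the deepgram plugin as the STT provider and the multilingual LiveKit turn detector model:

from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
    stt=deepgram.STT(),  # supplies the interim transcripts the turn detector needs
    turn_detection=MultilingualModel(),
)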

Delayed transcription

Realtime models don't provide interim transcription results, and user input transcriptions can be considerably delayed, often arriving after the agent's response. If you need realtime transcriptions, consider an STT-LLM-TTS pipeline or add a separate STT plugin to your session.
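
As a sketch of the second approach, the session below pairs a realtime model with a separate STT plugin, assuming the user_input_transcribed event and its transcript and is_final fields behave as in a standard pipeline session:

from livekit.agents import AgentSession, UserInputTranscribedEvent
from livekit.plugins import deepgram, openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
    stt=deepgram.STT(),  # produces interim transcripts in realtime
)

@session.on("user_input_transcribed")
def on_user_input_transcribed(event: UserInputTranscribedEvent):
    # interim results arrive with is_final=False, followed by a final transcript
    print(f"transcript: {event.transcript} (final: {event.is_final})")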

Scripted speech output

Realtime models don't offer a way to generate speech directly from a text script, as the say method does. You can produce a response with generate_reply(instructions='...'), but the output isn't guaranteed to follow any provided script precisely. If you must use a specific script, add a separate TTS plugin to your AgentSession for use with the say method. For the most seamless experience, use a TTS plugin with the same provider and voice configuration as your realtime model.
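
For example, a minimal sketch combining an OpenAI realtime model with an OpenAI TTS plugin, assuming a shared voice setting named "alloy":

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
    tts=openai.TTS(voice="alloy"),  # same provider and voice for a consistent sound
)

# later, inside agent code:
await session.say("Thanks for calling! How can I help you today?")  # exact script via TTS
await session.generate_reply(instructions="Answer the caller's question.")  # realtime model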