Overview
Realtime models consume and produce speech directly, bypassing the need for a voice pipeline with separate speech-to-text and text-to-speech components. They can be better at understanding the emotional context of input speech, along with other verbal cues that don't translate well to text transcription. The generated speech can likewise carry emotional nuance and other qualities beyond what a text-to-speech model can produce.
The agents framework includes plugins for popular realtime models out of the box. This is a new area in voice AI, and LiveKit aims to support new providers as they emerge.
LiveKit is open source and welcomes new plugin contributions.
How to use
Realtime model plugins have a constructor method to create a `RealtimeModel` instance. This instance can be passed directly to an `AgentSession` or `Agent` in its constructor, in place of an LLM plugin.
```python
from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(llm=openai.realtime.RealtimeModel())
```
For additional information about installing and using plugins, see the plugins overview.
Available providers
The following table lists the available realtime model providers.
Considerations and limitations
Realtime models bring real benefits through their broader audio understanding and more expressive output. However, they also come with some limitations and considerations to keep in mind.
Turn detection and VAD
In general, LiveKit recommends using the realtime model's built-in turn detection whenever possible. Accurate turn detection relies on both VAD and the context gained from realtime speech-to-text; as discussed in the following section, realtime models don't provide such realtime transcripts. If you need to use the LiveKit turn detector model, you must also add a separate STT plugin to provide the necessary interim transcripts, as in the sketch below.
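As a minimal sketch of that setup, assuming the Deepgram STT plugin and the multilingual turn detector plugin are installed (any STT plugin that produces interim transcripts works in place of Deepgram):

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai
from livekit.plugins.turn_detector.multilingual import MultilingualModel

# The realtime model still handles speech in and out; the STT plugin exists
# only to supply the interim transcripts the turn detector model consumes.
session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
    stt=deepgram.STT(),
    turn_detection=MultilingualModel(),
)
```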
Delayed transcription
Realtime models don't provide interim transcription results, and the final user input transcriptions can be considerably delayed, often arriving after the agent's response. If you need realtime transcriptions, consider an STT-LLM-TTS pipeline instead, or add a separate STT plugin for realtime transcription, as shown below.
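For instance, here is a sketch of adding a dedicated STT plugin and observing its transcripts. It assumes the Deepgram plugin and the `user_input_transcribed` session event documented for recent livekit-agents releases; check your installed version:

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
    stt=deepgram.STT(),  # supplies transcripts as the user speaks
)

@session.on("user_input_transcribed")
def on_transcript(event):
    # With a separate STT plugin, interim results stream in while the user
    # is still speaking; is_final marks the completed utterance.
    print(event.transcript, event.is_final)
```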
Scripted speech output
Realtime models don't offer a way to generate speech directly from a text script, as the `say` method does. You can produce a response with `generate_reply(instructions='...')` and include specific instructions, but the output isn't guaranteed to follow any provided script precisely. If you must use a specific script, add a separate TTS plugin to your `AgentSession` for use with the `say` method. For the most seamless experience, use a TTS plugin with the same provider and voice configuration as your realtime model.
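A minimal sketch of that arrangement, assuming the OpenAI plugin for both the realtime model and the TTS fallback (the voice name is illustrative, and the two calls would run inside your agent's async code after the session starts):

```python
from livekit.agents import AgentSession
from livekit.plugins import openai

# Matching the TTS voice to the realtime model's voice keeps scripted lines
# from sounding like a different speaker mid-conversation.
session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
    tts=openai.TTS(voice="alloy"),
)

async def greet() -> None:
    # Exact wording, rendered by the TTS plugin:
    await session.say("Thanks for calling. How can I help you today?")
    # Best-effort phrasing, generated by the realtime model:
    session.generate_reply(instructions="Offer to help with the caller's request.")
```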