Overview
Realtime models are capable of consuming and producing speech directly, bypassing the need for a voice pipeline with speech-to-text and text-to-speech components. They can be better at understanding the emotional context of input speech, as well as other verbal cues that may not translate well to text transcription. Additionally, the generated speech can include similar emotional aspects and other improvements over what a text-to-speech model can produce.
The agents framework includes plugins for popular realtime models out of the box. This is a new area in voice AI, and LiveKit aims to support new providers as they emerge.
LiveKit is open source and welcomes new plugin contributions.
How to use
Realtime model plugins have a constructor method to create a RealtimeModel instance. This instance can be passed directly to an AgentSession or Agent in its constructor, in place of an LLM plugin.
from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(llm=openai.realtime.RealtimeModel())
For additional information about installing and using plugins, see the plugins overview.
Scripted speech with TTS
One drawback to the realtime model architecture is the lack of a way to directly generate speech from a text script, such as with the say method in LiveKit Agents.
To work around this, you can add a TTS plugin to your AgentSession, which the say method will use. For the most seamless experience, use a TTS plugin with the same provider and voice configuration as your realtime model.
Available providers
The following table lists the available realtime model providers.