Overview
Realtime models are capable of consuming and producing speech directly, bypassing the need for a voice pipeline with speech-to-text and text-to-speech components. They can be better at understanding the emotional context of input speech, as well as other verbal cues that may not translate well to text transcription. Additionally, the generated speech can include similar emotional aspects and other improvements over what a text-to-speech model can produce.
The agents framework includes plugins for popular realtime models out of the box. This is a new area in voice AI, and LiveKit aims to support new providers as they emerge.
LiveKit is open source and welcomes new plugin contributions.
How to use
Realtime model plugins have a constructor method to create a RealtimeModel instance. This instance can be passed directly to an AgentSession or Agent in its constructor, in place of an LLM plugin.
from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(llm=openai.realtime.RealtimeModel())
For additional information about installing and using plugins, see the plugins overview.
Scripted speech with TTS
One drawback to the realtime model architecture is the lack of a way to directly generate speech from a text script, such as with the say method in LiveKit Agents.
To work around this, you can add a TTS plugin to your AgentSession, which the say method will use. For the most seamless experience, use a TTS plugin with the same provider and voice configuration as your realtime model.
Available providers
The following table lists the available realtime model providers.