Working with the MultimodalAgent class

Build natural-sounding voice assistants with the MultimodalAgent class.

The MultimodalAgent class is an abstraction for building AI agents using OpenAI’s Realtime API with multimodal models. These models accept audio directly, enabling them to 'hear' your voice and capture nuances like emotion that are often lost in speech-to-text conversion.

MultimodalAgent class

[Diagram: MultimodalAgent architecture]

Unlike VoicePipelineAgent, the MultimodalAgent class uses a single primary model for the entire conversation flow. The model processes both audio and text input and generates audio responses directly.

MultimodalAgent is responsible for managing the conversation state, including buffering responses from the model and sending them to the user in realtime. It also handles interruptions, signaling to OpenAI's Realtime API the point at which the model was interrupted.
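The interruption point is also surfaced to your own code through the agent_speech_interrupted event (see Emitted events below). As a minimal sketch, assuming agent is a started MultimodalAgent instance and that the handler receives the partially spoken reply as an llm.ChatMessage (mirroring the committed-speech events), you could log where the agent was cut off:

@agent.on("agent_speech_interrupted")
def on_agent_speech_interrupted(msg: llm.ChatMessage):
    # assumption: msg carries the part of the reply spoken before the interruption
    logger.info("agent interrupted after: %s", msg.content)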

Usage

from __future__ import annotations

import logging

from livekit import rtc
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.multimodal import MultimodalAgent
from livekit.plugins import openai

logger = logging.getLogger("myagent")
logger.setLevel(logging.INFO)


async def entrypoint(ctx: JobContext):
    logger.info("starting entrypoint")

    # connect to the room and wait for the first participant to join
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    participant = await ctx.wait_for_participant()

    # configure the realtime model that powers the agent
    model = openai.realtime.RealtimeModel(
        instructions="You are a helpful assistant and you love kittens",
        voice="shimmer",
        temperature=0.8,
        modalities=["audio", "text"],
    )
    assistant = MultimodalAgent(model=model)
    assistant.start(ctx.room)

    logger.info("starting agent")

    # ask the model to greet the user as soon as the session starts
    session = model.sessions[0]
    session.conversation.item.create(
        llm.ChatMessage(
            role="assistant",
            content="Please begin the interaction with the user in a manner consistent with your instructions.",
        )
    )
    session.response.create()


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
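To run the example, start it with the worker CLI that cli.run_app sets up. In development this is typically python myagent.py dev (assuming the file is saved as myagent.py), with your LiveKit server URL and API credentials available in the environment.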

Advantages of speech-to-speech agents

Speech-to-speech agents offer several advantages over pipeline-based agents:

  • Natural Interactions: Callers can speak and hear responses with extremely low latency, mimicking human-to-human conversations.
  • Voice and Tone: Speech-to-speech agents are able to dynamically change the intonation and tone of their responses based on the emotions of the caller, making interactions more engaging.

Emitted events

An agent emits the following events:

Event                      Description
user_started_speaking      User started speaking.
user_stopped_speaking      User stopped speaking.
agent_started_speaking     Agent started speaking.
agent_stopped_speaking     Agent stopped speaking.
user_speech_committed      User's speech was committed to the chat context.
agent_speech_committed     Agent's speech was committed to the chat context.
agent_speech_interrupted   Agent was interrupted while speaking.

Events example

When user speech is committed to the chat context, save it to a queue:

@agent.on("user_speech_committed")
def on_user_speech_committed(msg: llm.ChatMessage):
    # content may be a list of strings and images; flatten it into a single
    # string and replace any images with a placeholder
    if isinstance(msg.content, list):
        msg.content = "\n".join(
            "[image]" if isinstance(x, llm.ChatImage) else x for x in msg.content
        )
    log_queue.put_nowait(f"[{datetime.now()}] USER:\n{msg.content}\n\n")
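
The snippet above assumes a log_queue and a datetime import that are not shown. A minimal sketch of those pieces, with a hypothetical write_transcript coroutine that drains the queue into a local file (the queue type and file name are assumptions, not part of the MultimodalAgent API):

import asyncio
from datetime import datetime

log_queue: asyncio.Queue[str] = asyncio.Queue()

async def write_transcript(path: str = "transcript.log") -> None:
    # append each queued entry to a local transcript file
    with open(path, "a") as f:
        while True:
            entry = await log_queue.get()
            f.write(entry)
            f.flush()

You could start the consumer with asyncio.create_task(write_transcript()) inside the entrypoint so it runs alongside the agent.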