The MultimodalAgent class is an abstraction for building AI agents using OpenAI's Realtime API with multimodal models. These models accept audio directly, enabling them to 'hear' your voice and capture nuances, such as emotion, that are often lost in speech-to-text conversion.
MultimodalAgent class
Unlike VoicePipelineAgent, the MultimodalAgent class uses a single primary model to drive the conversation. The model processes both audio and text inputs and generates audio responses.
MultimodalAgent manages the conversation state, buffering responses from the model and streaming them to the user in real time. It also handles interruptions, signaling to OpenAI's Realtime API the point at which the model was interrupted.
Usage
```python
from __future__ import annotations

import logging

from livekit import rtc
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.multimodal import MultimodalAgent
from livekit.plugins import openai

logger = logging.getLogger("myagent")
logger.setLevel(logging.INFO)


async def entrypoint(ctx: JobContext):
    logger.info("starting entrypoint")

    # connect to the room, subscribing only to audio tracks
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    participant = await ctx.wait_for_participant()

    model = openai.realtime.RealtimeModel(
        instructions="You are a helpful assistant and you love kittens",
        voice="shimmer",
        temperature=0.8,
        modalities=["audio", "text"],
    )
    assistant = MultimodalAgent(model=model)
    assistant.start(ctx.room)

    logger.info("starting agent")

    # seed the conversation with an initial assistant turn and request a response
    session = model.sessions[0]
    session.conversation.item.create(
        llm.ChatMessage(
            role="assistant",
            content="Please begin the interaction with the user in a manner consistent with your instructions.",
        )
    )
    session.response.create()


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
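In this example, the entrypoint connects to the room (subscribing to audio only), waits for a participant, starts the agent, and then seeds the session with an initial assistant message followed by `session.response.create()` so the model speaks first. The worker itself is launched through `cli.run_app`; recent livekit-agents releases expose subcommands such as `dev` for local development (for example, `python myagent.py dev`), though the exact CLI options depend on the installed version.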
Advantages of speech-to-speech agents
Speech-to-speech agents offer several advantages over pipeline-based agents:
- Natural Interactions: Callers can speak and hear responses with extremely low latency, mimicking human-to-human conversations.
- Voice and Tone: Speech-to-speech agents can dynamically adapt the intonation and tone of their responses to the caller's emotions, making interactions more engaging.
Emitted events
An agent emits the following events:
| Event | Description |
|---|---|
| user_started_speaking | User started speaking. |
| user_stopped_speaking | User stopped speaking. |
| agent_started_speaking | Agent started speaking. |
| agent_stopped_speaking | Agent stopped speaking. |
| user_speech_committed | User's speech was committed to the chat context. |
| agent_speech_committed | Agent's speech was committed to the chat context. |
| agent_speech_interrupted | Agent was interrupted while speaking. |
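Handlers are registered directly on the agent instance with `on`. For the events that carry no payload, a minimal sketch might look like the following (the `assistant` variable is the one created in the Usage example; the exact callback signature can vary between releases, so these handlers accept any arguments):

```python
@assistant.on("user_started_speaking")
def on_user_started_speaking(*_args):
    logger.info("user started speaking")


@assistant.on("agent_stopped_speaking")
def on_agent_stopped_speaking(*_args):
    logger.info("agent stopped speaking")
```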
Events example
When user speech is committed to the chat context, save it to a queue:
```python
@agent.on("user_speech_committed")
def on_user_speech_committed(msg: llm.ChatMessage):
    # convert string lists to strings, drop images
    if isinstance(msg.content, list):
        msg.content = "\n".join(
            "[image]" if isinstance(x, llm.ChatImage) else x for x in msg.content
        )
    log_queue.put_nowait(f"[{datetime.now()}] USER:\n{msg.content}\n\n")
```
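The handler above assumes a `log_queue` and a consumer for it, which are not part of the MultimodalAgent API. A minimal sketch of those surrounding pieces, with an illustrative file name and helper name, might look like this:

```python
import asyncio
from datetime import datetime

# queue shared between the event handler above and the writer task below
log_queue = asyncio.Queue()


async def write_transcription():
    # drain the queue and append each entry to a transcript file;
    # a None sentinel tells the task to stop
    with open("transcript.log", "a") as f:
        while True:
            msg = await log_queue.get()
            if msg is None:
                break
            f.write(msg)
            f.flush()


# inside the entrypoint, after starting the agent:
# write_task = asyncio.create_task(write_transcription())
# and on shutdown: await log_queue.put(None); await write_task
```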