The MultimodalAgent class is an abstraction for building AI agents using OpenAI's Realtime API with multimodal models. These models accept audio directly, enabling them to "hear" your voice and capture nuances such as emotion that are often lost in speech-to-text conversion.
MultimodalAgent class
Unlike VoicePipelineAgent, the MultimodalAgent class uses a single primary model for the conversation flow. The model processes both audio and text inputs and generates audio responses.
MultimodalAgent is responsible for managing conversation state, including buffering responses from the model and sending them to the user in realtime. It also handles interruptions, indicating to OpenAI's Realtime API the point at which the model was interrupted.
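This behavior can be observed through the agent's event emitter. Below is a minimal sketch; the event names and callback signatures used here (`agent_started_speaking`, and `agent_speech_interrupted` delivering an `llm.ChatMessage`) are assumptions based on the agents framework's event pattern, so verify them against the version of livekit-agents you are using.

```python
import logging

from livekit.agents import llm
from livekit.agents.multimodal import MultimodalAgent

logger = logging.getLogger("myagent")


def register_speech_events(assistant: MultimodalAgent) -> None:
    # Assumed event names; check the livekit-agents docs for the exact set.

    @assistant.on("agent_started_speaking")
    def _on_started() -> None:
        logger.info("agent started speaking")

    @assistant.on("agent_speech_interrupted")
    def _on_interrupted(msg: llm.ChatMessage) -> None:
        # When the user barges in, the agent truncates its buffered response
        # and reports the cutoff point to the Realtime API, so the stored
        # conversation history matches what the user actually heard.
        logger.info("agent interrupted; transcript so far: %s", msg.content)
```

A handler registered this way could be called on the `assistant` instance from the Usage example below, after `assistant.start(ctx.room)`.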
Usage
```python
from __future__ import annotations

import logging

from livekit import rtc
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.multimodal import MultimodalAgent
from livekit.plugins import openai

logger = logging.getLogger("myagent")
logger.setLevel(logging.INFO)


async def entrypoint(ctx: JobContext):
    logger.info("starting entrypoint")

    # Connect to the room, subscribing only to audio tracks
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # Wait for a participant to join before starting the agent
    participant = await ctx.wait_for_participant()

    # Configure the Realtime API model that drives the conversation
    model = openai.realtime.RealtimeModel(
        instructions="You are a helpful assistant and you love kittens",
        voice="shimmer",
        temperature=0.8,
        modalities=["audio", "text"],
    )
    assistant = MultimodalAgent(model=model)
    assistant.start(ctx.room)

    logger.info("starting agent")

    # Seed the conversation and ask the model to produce the first response
    session = model.sessions[0]
    session.conversation.item.create(
        llm.ChatMessage(
            role="user",
            content="Please begin the interaction with the user in a manner consistent with your instructions.",
        )
    )
    session.response.create()


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
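`cli.run_app` wraps the entrypoint in a worker CLI, so during development the agent is typically launched with a command such as `python myagent.py dev`; the worker registers with your LiveKit server and runs `entrypoint` for each job it is assigned.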
Advantages of speech-to-speech agents
Speech-to-speech agents offer several advantages over pipeline-based agents:
- Natural Interactions: Callers can speak and hear responses with extremely low latency, approximating the cadence of human-to-human conversation.
- Voice and Tone: Speech-to-speech agents can dynamically adjust the intonation and tone of their responses based on the caller's emotions, making interactions more engaging.