Overview
Building a great voice AI app requires careful orchestration of multiple components. LiveKit Agents is built on top of the Realtime SDK to provide dedicated abstractions that simplify development while giving you full control over the underlying code.
Agent sessions
The AgentSession is the main orchestrator for your voice AI app. The session is responsible for collecting user input, managing the voice pipeline, invoking the LLM, and sending the output back to the user.
Each session requires at least one Agent to orchestrate. The agent defines the core AI logic of your app: its instructions, tools, and so on. The framework also supports custom workflows that orchestrate handoff and delegation between multiple agents (a handoff sketch follows the example below).
The following example shows how to begin a simple single-agent session:
```python
from livekit.agents import AgentSession, Agent, RoomInputOptions
from livekit.plugins import openai, cartesia, deepgram, noise_cancellation, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

# The session wires together the STT-LLM-TTS pipeline, plus VAD and
# turn detection to decide when the user has finished speaking.
session = AgentSession(
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=cartesia.TTS(),
    vad=silero.VAD.load(),
    turn_detection=MultilingualModel(),
)

# Start the session in the job's room, with noise cancellation on the input.
await session.start(
    room=ctx.room,
    agent=Agent(instructions="You are a helpful voice AI assistant."),
    room_input_options=RoomInputOptions(
        noise_cancellation=noise_cancellation.BVC(),
    ),
)
```
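To sketch the multi-agent handoff mentioned above: one common pattern is to return another Agent from a function tool, which hands the session off to that agent. The following is a minimal illustration of that pattern; the agent classes and tool name here are hypothetical.

```python
from livekit.agents import Agent, function_tool

class SupportAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You handle technical support questions.")

class IntakeAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Greet the user and route their request.")

    @function_tool
    async def transfer_to_support(self):
        """Use this when the user needs technical support."""
        # Returning another Agent hands the session off to that agent.
        return SupportAgent()
```

Passing agent=IntakeAgent() to session.start() would then make intake the entry point, with the support agent reached by delegation.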
RoomIO
Communication between agent and user participants happens using media streams, also known as tracks. For voice AI apps, this is primarily audio, but can include vision. By default, track management is handled by RoomIO, a utility class that serves as a bridge between the agent session and the LiveKit room. When an AgentSession is initiated, it automatically creates a RoomIO object that enables all room participants to subscribe to available audio tracks.
To learn more about publishing audio and video, see the following topics:
Agent speech and audio
Add speech, audio, and background audio to your agent.
Vision
Give your agent the ability to see images and live video.
Text and transcription
Send and receive text messages and transcription to and from your agent.
Realtime media
Tracks are a core LiveKit concept. Learn more about publishing and subscribing to media.
Camera and microphone
Use the LiveKit SDKs to publish audio and video tracks from your user's device.
Custom RoomIO
For greater control over media sharing in a room, you can create a custom RoomIO object. For example, you might want to manually control which input and output devices are used, or to control which participants an agent listens to or responds to.
To replace the default RoomIO created by AgentSession, construct a RoomIO object in your entrypoint function and pass the AgentSession instance to its constructor (see the sketch after this list). For complete examples, see the following in the GitHub repository:
Toggling audio
Create a push-to-talk interface to toggle audio input and output.
Toggling input and output
Toggle both audio and text input and output.
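As a rough sketch of the pattern, based on the push-to-talk example (exact constructor arguments and import paths may differ between versions, and the participant identity string is illustrative):

```python
from livekit.agents import Agent, AgentSession
from livekit.agents.voice.room_io import RoomIO

session = AgentSession(
    # ... stt, llm, tts, and vad as in the example above ...
    turn_detection="manual",  # e.g. for a push-to-talk interface
)

# Create RoomIO manually instead of letting the session create its default.
room_io = RoomIO(session, room=ctx.room)
await room_io.start()

await session.start(agent=Agent(instructions="You are a helpful voice AI assistant."))

# Direct access to RoomIO lets you control, for example, which
# participant the agent listens to.
room_io.set_participant("user-identity")
```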
Voice AI providers
You can choose from a variety of providers for each part of the voice pipeline to fit your needs. The framework supports both high-performance STT-LLM-TTS pipelines and speech-to-speech models. In either case, it automatically manages interruptions, transcription forwarding, turn detection, and more.
You may add these components to the AgentSession, where they act as global defaults within the app, or to each individual Agent if needed.
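For instance, a speech-to-speech model can take the place of the entire STT-LLM-TTS pipeline. A minimal sketch, assuming the OpenAI Realtime plugin and its RealtimeModel class:

```python
from livekit.agents import AgentSession
from livekit.plugins import openai

# A single realtime (speech-to-speech) model replaces the separate
# STT, LLM, and TTS components of the pipeline.
session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="coral"),
)
```

As noted above, these session-level components can also be overridden on each individual Agent.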
TTS
Text-to-speech integrations
STT
Speech-to-text integrations
LLM
Language model integrations
Multimodal
Realtime multimodal APIs
Capabilities
The following guides, in addition to others in this section, cover the core capabilities of the AgentSession and how to leverage them in your app.