Building voice agents

In-depth guide to voice AI with LiveKit Agents.

Overview

Building a great voice AI app requires careful orchestration of multiple components. LiveKit Agents is built on top of the Realtime SDK to provide dedicated abstractions that simplify development while giving you full control over the underlying code.

Agent sessions

The AgentSession is the main orchestrator for your voice AI app. The session is responsible for collecting user input, managing the voice pipeline, invoking the LLM, and sending the output back to the user.

Each session requires at least one Agent to orchestrate. The agent defines the core AI logic of your app, such as its instructions and tools. The framework also supports custom workflows that orchestrate handoff and delegation between multiple agents (see the handoff sketch after the example below).

The following example shows how to begin a simple single-agent session:

from livekit.agents import AgentSession, Agent, RoomInputOptions
from livekit.plugins import openai, cartesia, deepgram, noise_cancellation, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    stt=deepgram.STT(),        # speech-to-text
    llm=openai.LLM(),          # language model
    tts=cartesia.TTS(),        # text-to-speech
    vad=silero.VAD.load(),     # voice activity detection
    turn_detection=MultilingualModel(),  # semantic turn detection
)

await session.start(
    room=ctx.room,
    agent=Agent(instructions="You are a helpful voice AI assistant."),
    room_input_options=RoomInputOptions(
        noise_cancellation=noise_cancellation.BVC(),
    ),
)
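As a sketch of multi-agent handoff: the agent classes and tool below are hypothetical, and assume the pattern where a function tool may return another Agent to transfer the session to it.

from livekit.agents import Agent, function_tool

class IntakeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="Greet the caller and find out why they are calling."
        )

    @function_tool()
    async def transfer_to_support(self):
        """Use this when the caller needs technical support."""
        # Returning another Agent hands the session off to it.
        return SupportAgent()

class SupportAgent(Agent):
    def __init__(self):
        super().__init__(instructions="Help the caller resolve technical issues.")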

RoomIO

Communication between agent and user participants happens using media streams, also known as tracks. For voice AI apps, this is primarily audio, but can also include video for vision-enabled agents. By default, track management is handled by RoomIO, a utility class that serves as a bridge between the agent session and the LiveKit room. When an AgentSession starts, it automatically creates a RoomIO object that enables all room participants to subscribe to available audio tracks.

To learn more about publishing audio and video, see the related guides in this section.

Custom RoomIO

For greater control over media sharing in a room, you can create a custom RoomIO object. For example, you might want to manually control which input and output devices are used, or to control which participants an agent listens to or responds to.

To replace the default RoomIO created by AgentSession, construct your own in your entrypoint function and pass the AgentSession instance to its constructor. For complete examples, see the LiveKit Agents GitHub repository.
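A minimal sketch of this pattern follows; the import path and constructor signature are assumptions based on current repository examples and may vary by version:

from livekit.agents import Agent, AgentSession
from livekit.agents.voice.room_io import RoomIO
from livekit.plugins import openai, cartesia, deepgram, silero

async def entrypoint(ctx):
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
    )

    # Create RoomIO manually instead of letting session.start() create one,
    # e.g. to control which participant the agent listens to.
    room_io = RoomIO(session, room=ctx.room)
    await room_io.start()

    # RoomIO has already wired up audio I/O, so no room argument is needed here.
    await session.start(agent=Agent(instructions="You are a helpful voice AI assistant."))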

Voice AI providers

You can choose from a variety of providers for each part of the voice pipeline to fit your needs. The framework supports both high-performance STT-LLM-TTS pipelines and speech-to-speech models. In either case, it automatically manages interruptions, transcription forwarding, turn detection, and more.
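For example, a speech-to-speech session replaces the separate STT, LLM, and TTS components with a single realtime model. This sketch assumes the OpenAI Realtime API plugin; the voice name is illustrative:

from livekit.agents import AgentSession
from livekit.plugins import openai

# A realtime (speech-to-speech) model handles listening, reasoning,
# and speaking in one component, so no separate STT or TTS is needed.
session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="coral"),
)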

You may add these components to the AgentSession, where they act as global defaults within the app, or to each individual Agent if needed.
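As a sketch of a per-agent override, assuming the Agent constructor accepts the same component parameters as AgentSession (the voice ID below is a placeholder):

from livekit.agents import Agent
from livekit.plugins import cartesia

# This agent overrides the session-level TTS with its own voice;
# unset components fall back to the AgentSession defaults.
storyteller = Agent(
    instructions="Narrate the story in a warm, expressive tone.",
    tts=cartesia.TTS(voice="your-voice-id"),
)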

Capabilities

The following guides, in addition to others in this section, cover the core capabilities of the AgentSession and how to leverage them in your app.