Building voice agents | LiveKit Docs

Overview

Building a great voice AI app requires careful orchestration of multiple components. In addition, the voice AI end-user experience is particularly sensitive to latency and responsiveness. For these reasons, LiveKit Agents includes a dedicated set of abstractions to make building your own custom voice AI app simple, while giving you full control over the underlying code.

Agent sessions

The AgentSession is the main orchestrator for your voice AI app. The session is responsible for collecting user input, managing the voice pipeline, invoking the LLM, and sending the output back to the user.
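To make the session's responsibilities concrete, here is a minimal, library-free sketch of one conversational turn passing through an STT, LLM, and TTS stage. It is not LiveKit's implementation or API: `ToySession`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins, and plain strings stand in for audio.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Tuple

# Hypothetical stage type: each stage takes a string and returns a string.
Stage = Callable[[str], Awaitable[str]]


async def fake_stt(audio: str) -> str:
    # Pretend to transcribe: strip the "audio:" marker from the input.
    return audio.replace("audio:", "", 1)


async def fake_llm(text: str) -> str:
    # Pretend to generate a reply from the user's text.
    return f"You said: {text}"


async def fake_tts(text: str) -> str:
    # Pretend to synthesize speech: add a "speech:" marker.
    return f"speech:{text}"


@dataclass
class ToySession:
    """Toy orchestrator (an assumption for illustration, not AgentSession)."""

    stt: Stage
    llm: Stage
    tts: Stage
    transcript: List[str] = field(default_factory=list)

    async def handle_turn(self, audio: str) -> str:
        # 1. Collect user input and transcribe it.
        user_text = await self.stt(audio)
        self.transcript.append(f"user: {user_text}")
        # 2. Invoke the language model.
        reply = await self.llm(user_text)
        self.transcript.append(f"agent: {reply}")
        # 3. Synthesize the reply and send it back to the user.
        return await self.tts(reply)


def run_demo() -> Tuple[str, List[str]]:
    session = ToySession(stt=fake_stt, llm=fake_llm, tts=fake_tts)
    output = asyncio.run(session.handle_turn("audio:hello"))
    return output, session.transcript
```

A real session also has to deal with streaming audio, interruptions, and turn detection, which this sketch leaves out.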

Each session requires at least one Agent to orchestrate. The agent defines the core AI logic of your app, such as its instructions and tools. The framework also supports custom workflows that orchestrate handoff and delegation between multiple agents.

The following example shows how to begin a simple single-agent session:

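from livekit.agents import JobContext, WorkerOptions, cli
from livekit.agents.voice import AgentSession, Agent, room_io
from livekit.plugins import openai, cartesia, deepgram, noise_cancellation, silero, turn_detector


async def entrypoint(ctx: JobContext):
    # Connect to the room assigned to this job.
    await ctx.connect()

    # Components set here act as defaults for the whole session.
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
        turn_detection=turn_detector.EOUModel(),
    )

    # Start the session in the room with a single agent.
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice AI assistant."),
        room_input_options=room_io.RoomInputOptions(
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))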

Voice AI providers

You can choose among many providers for each component of the voice pipeline to suit your needs. The framework supports both a high-performance STT-LLM-TTS pipeline and lifelike multimodal models. In either case, it automatically handles interruptions, transcription forwarding, turn detection, and more.

You may add these components to the AgentSession, where they act as global defaults within the app, or to each individual Agent if needed.
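The relationship between session-level defaults and per-agent components can be pictured with a small, library-free sketch. It assumes that a component set on an agent takes precedence over the session default. `ToyComponents` and `resolve` are hypothetical names used only for illustration, and the provider strings are placeholders rather than real plugin objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyComponents:
    """Hypothetical holder for pipeline components; None means 'not set'."""

    stt: Optional[str] = None
    llm: Optional[str] = None
    tts: Optional[str] = None


def _pick(override: Optional[str], default: Optional[str]) -> Optional[str]:
    # Assumption: an agent-level value wins whenever it is set.
    return override if override is not None else default


def resolve(session_defaults: ToyComponents, agent: ToyComponents) -> ToyComponents:
    # Merge the two levels field by field.
    return ToyComponents(
        stt=_pick(agent.stt, session_defaults.stt),
        llm=_pick(agent.llm, session_defaults.llm),
        tts=_pick(agent.tts, session_defaults.tts),
    )


# Session-wide defaults, with one agent that swaps only the TTS component.
defaults = ToyComponents(stt="deepgram", llm="openai", tts="cartesia")
resolved = resolve(defaults, ToyComponents(tts="other-tts"))
```

Here `resolved` keeps the session's STT and LLM and uses the agent's TTS, so each agent only needs to declare what differs from the defaults.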

TTS

Text-to-speech plugins

STT

Speech-to-text plugins

LLM

Language model plugins

Multimodal

Realtime multimodal APIs

Capabilities

The following guides, in addition to others in this section, cover the core capabilities of the AgentSession and how to leverage them in your app.

Workflows

Orchestrate complex tasks among multiple agents.

Tool definition & use

Use tools to call external services, inject custom logic, and more.

Pipeline nodes

Add custom behavior to any component of the voice pipeline.