AI voice agents | LiveKit Docs

Companies like OpenAI, Character.ai, Retell, and Speak have built their conversational AI products on the LiveKit platform. AI voice agents are one of the primary use cases for LiveKit's Agents framework.

Features

Programmable conversation flows
Integrated LLM function calls
Provide context to the conversation via RAG
Leverage connectors from an open-source plugin ecosystem
Send synchronized transcriptions to your frontend

Multimodal or voice pipeline

LiveKit offers two types of voice agents: MultimodalAgent and VoicePipelineAgent.

MultimodalAgent uses OpenAI’s multimodal model and realtime API to directly process user audio and generate audio responses, similar to OpenAI’s advanced voice mode, producing more natural-sounding speech.
VoicePipelineAgent uses a pipeline of STT, LLM, and TTS models, providing greater control over the conversation flow by allowing applications to modify the text returned by the LLM.

	Multimodal	Voice pipeline
Python	✅	✅
Node.JS	✅	✅
Model type	single multimodal	STT, LLM, TTS
Function calling	✅	✅
RAG	via function calling	✅
Natural speech	more natural
Modify LLM response		✅
Model vendors	OpenAI	various
Turn detection	VAD	VAD and turn detection model

Handling background noise

While humans can easily ignore background noise, AI models often struggle, leading to misinterpretations or unnecessary pauses when detecting non-speech sounds. Although WebRTC includes built-in noise suppression, it often falls short in real-world environments.

To address this, LiveKit has partnered with Krisp to bring best-in-class noise suppression technology to AI agents. For instructions on enabling Krisp, see Krisp integration guide

Turn detection

Endpointing is the process of detecting the start and end of speech in an audio stream. This is crucial for conversational AI agents to understand when a user has finished speaking and when to start responding to user input.

Determining the end of a turn is particularly challenging for AI agents. Humans rely on multiple cues, such as pauses, speech tone, and content, to recognize when someone has finished speaking.

LiveKit employs two primary strategies to approximate how humans determine turn boundaries:

Voice activity detection (VAD)

LiveKit Agents uses VAD to detect when the user has finished speaking. The agent waits for a minimum duration of silence before considering the turn complete.

Both VoicePipelineAgent and MultimodalAgent use VAD for turn detection.
For OpenAI Multimodal configuration, refer to the MultimodalAgent turn detection docs.
VoicePipelineAgent uses Silero VAD to detect end of speech. The min_endpointing_delay parameter in the agent constructor specifies the minimum silence duration to consider the end of a turn.

Turn detection model

While VAD provides a simple approximation of turn completion, it lacks contextual awareness. In natural conversations, pauses often occur as people think or formulate responses.

To address this, LiveKit has developed a custom, open-weights language model to incorporate conversational context as an additional signal to VAD. The turn-detector plugin uses this model to predict whether a user is done speaking.

When the model predicts that the user is not done with their turn, the agent will wait for a significantly longer period of silence before responding. This helps to prevent unwanted interruptions during natural pauses in speech.

Here's a demo of the model in action.

Benchmarks

In our testing, the turn detector model demonstrated the following performance:

85% true positive rate: avoids early interruptions by correctly identifying when the user is not done speaking.
97% true negative rate: accurately determines the end of a turn when the user has finished speaking.

Using turn detector

Currently, this model is supported for VoicePipelineAgent in Python. To use it, install the livekit-plugins-turn-detector package.

Then, initialize the agent with the turn detector:

from livekit.plugins import turn_detector

agent = VoicePipelineAgent(
    ...
    turn_detector=turn_detector.EOUModel(),
)

Before running the agent for the first time, download the model weights:

python my_agent.py download-files

Agent state

Voice agents automatically publish their current state to your frontend, making it easy to build UI that reflects the agent’s status.

The state is passed to your frontend as a participant attribute on the agent participant. Components like useVoiceAssistant expose the following states:

disconnected: either agent or user is disconnected
connecting: agent is being connected with the user
initializing: agent is connected, but not yet ready
listening: agent is listening for user input
thinking: agent is performing inference on user input
speaking: agent is playing out a response

Transcriptions

LiveKit provides realtime transcriptions for both the agent and the user, which are sent to your frontend via the transcription protocol.

User speech transcriptions are delivered as soon as they are processed by STT. Since the agent’s text response is available before speech synthesis, we manually synchronize the text transcription with audio playback.