Companies like OpenAI, Character.ai, Retell, and Speak have built their conversational AI products on the LiveKit platform. AI voice agents are one of the primary use cases for LiveKit's Agents framework.
Features
- Programmable conversation flows
- Integrated LLM function calls (see the sketch after this list)
- Provide context to the conversation via RAG
- Leverage connectors from an open-source plugin ecosystem
- Send synchronized transcriptions to your frontend
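To make the function-calling item above concrete, here is a minimal sketch using the Python framework's llm.FunctionContext / ai_callable interface. The get_weather tool and its return value are hypothetical placeholders, not part of the framework:

```python
from typing import Annotated

from livekit.agents import llm


class AssistantFnc(llm.FunctionContext):
    # Methods decorated with ai_callable are exposed to the LLM as callable functions.
    @llm.ai_callable(description="Look up the current weather for a city")
    async def get_weather(
        self,
        city: Annotated[str, llm.TypeInfo(description="The city to look up")],
    ) -> str:
        # Hypothetical implementation; a real agent would call a weather API here.
        return f"It is sunny in {city} today."
```

The function context instance is then handed to the agent, typically via its fnc_ctx constructor argument, so the LLM can invoke get_weather mid-conversation.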
Multimodal or voice pipeline
LiveKit offers two types of voice agents: MultimodalAgent and VoicePipelineAgent.
- MultimodalAgent uses OpenAI’s multimodal model and Realtime API to directly process user audio and generate audio responses, similar to OpenAI’s advanced voice mode, producing more natural-sounding speech.
- VoicePipelineAgent uses a pipeline of STT, LLM, and TTS models, providing greater control over the conversation flow by allowing applications to modify the text returned by the LLM.
| | Multimodal | Voice pipeline |
| --- | --- | --- |
| Python | ✅ | ✅ |
| Node.js | ✅ | ✅ |
| Model type | single multimodal model | STT, LLM, TTS |
| Function calling | ✅ | ✅ |
| RAG | via function calling | ✅ |
| Natural speech | more natural | |
| Modify LLM response | | ✅ |
| Model vendors | OpenAI | various |
| Turn detection | VAD | VAD and turn detection model |
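As a rough sketch of the pipeline approach, the snippet below assembles a VoicePipelineAgent from individual plugins inside an agent worker. The Deepgram, OpenAI, and Silero choices are illustrative; any supported STT, LLM, or TTS vendor can be swapped in:

```python
from livekit.agents import JobContext, WorkerOptions, cli
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),   # detects when the user starts and stops speaking
        stt=deepgram.STT(),      # transcribes user audio to text
        llm=openai.LLM(),        # generates the response text
        tts=openai.TTS(),        # synthesizes the agent's spoken reply
    )
    agent.start(ctx.room)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```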
Handling background noise
While humans can easily ignore background noise, AI models often struggle, leading to misinterpretations or unnecessary pauses when detecting non-speech sounds. Although WebRTC includes built-in noise suppression, it often falls short in real-world environments.
To address this, LiveKit has partnered with Krisp.ai to bring best-in-class noise suppression technology to AI agents. For instructions on enabling Krisp, see the Krisp integration guide.
Turn detection
Endpointing is the process of detecting the start and end of speech in an audio stream. This is crucial for conversational AI agents: they need to know when a user has finished speaking in order to respond.
Determining the end of a turn is particularly challenging for AI agents. Humans rely on multiple cues, such as pauses, speech tone, and content, to recognize when someone has finished speaking.
LiveKit employs two primary strategies to approximate how humans determine turn boundaries:
Voice activity detection (VAD)
LiveKit Agents uses VAD to detect when the user has finished speaking. The agent waits for a minimum duration of silence before considering the turn complete.
Both VoicePipelineAgent and MultimodalAgent use VAD for turn detection.
For OpenAI Multimodal configuration, refer to the MultimodalAgent turn detection docs.
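As an illustration of what that configuration covers, the sketch below tunes the Realtime API's server-side VAD through the OpenAI plugin. The ServerVadOptions name and its fields are assumptions here, mirroring the Realtime API's turn_detection settings, and the values are illustrative; defer to the MultimodalAgent turn detection docs for the exact interface:

```python
from livekit.agents.multimodal import MultimodalAgent
from livekit.plugins import openai

# Assumed parameter names; check the MultimodalAgent turn detection docs.
model = openai.realtime.RealtimeModel(
    turn_detection=openai.realtime.ServerVadOptions(
        threshold=0.5,            # speech probability required to count as voice activity
        prefix_padding_ms=300,    # audio kept from just before speech was detected
        silence_duration_ms=500,  # silence required before the turn is considered complete
    ),
)
agent = MultimodalAgent(model=model)
```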
VoicePipelineAgent uses Silero VAD to detect the end of speech. The min_endpointing_delay parameter in the agent constructor specifies the minimum silence duration required before the turn is considered complete.
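Continuing the pipeline sketch above, the delay is set in the constructor; the 1.0-second value below is purely illustrative, not a recommendation:

```python
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero

agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    min_endpointing_delay=1.0,  # wait at least 1 second of silence before ending the user's turn
)
```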
Turn detection model
While VAD provides a simple approximation of turn completion, it lacks contextual awareness. In natural conversations, pauses often occur as people think or formulate responses.
To address this, LiveKit has developed a custom, open-weights language model to incorporate conversational context as an additional signal to VAD. The turn-detector plugin uses this model to predict whether a user is done speaking.
When the model predicts that the user is not done with their turn, the agent will wait for a significantly longer period of silence before responding. This helps to prevent unwanted interruptions during natural pauses in speech.
Here's a demo of the model in action.
Benchmarks
In our testing, the turn detector model demonstrated the following performance:
- 85% true positive rate: avoids early interruptions by correctly identifying when the user is not done speaking.
- 97% true negative rate: accurately determines the end of a turn when the user has finished speaking.
Using turn detector
Currently, this model is supported for VoicePipelineAgent in Python. To use it, install the livekit-plugins-turn-detector package.
Then, initialize the agent with the turn detector:
```python
from livekit.plugins import turn_detector

agent = VoicePipelineAgent(
    ...,
    turn_detector=turn_detector.EOUModel(),
)
```
Before running the agent for the first time, download the model weights:
```bash
python my_agent.py download-files
```
Agent state
Voice agents automatically publish their current state to your frontend, making it easy to build UI that reflects the agent’s status.
The state is passed to your frontend as a participant attribute on the agent participant. Components like useVoiceAssistant expose the following states:
- disconnected: either agent or user is disconnected
- connecting: agent is being connected with the user
- initializing: agent is connected, but not yet ready
- listening: agent is listening for user input
- thinking: agent is performing inference on user input
- speaking: agent is playing out a response
Transcriptions
LiveKit provides realtime transcriptions for both the agent and the user, which are sent to your frontend via the transcription protocol.
User speech transcriptions are delivered as soon as they are processed by STT. Since the agent’s text response is available before speech synthesis, we manually synchronize the text transcription with audio playback.