Companies like OpenAI, Character.ai, Retell, and Speak have built their conversational AI products on the LiveKit platform. AI voice agents are one of the primary use cases for LiveKit's Agents framework.
Features
- Programmable conversation flows
- Integrated LLM function calls
- Provide context to the conversation via RAG
- Leverage connectors from our open-source plugin ecosystem
- Send synchronized transcriptions to clients
Multimodal or Voice Pipeline
LiveKit offers two types of voice agents: `MultimodalAgent` and `VoicePipelineAgent`.
- `MultimodalAgent` uses OpenAI’s multimodal model and Realtime API to process user audio and generate audio responses directly, similar to OpenAI’s advanced voice mode, producing more natural-sounding speech.
- `VoicePipelineAgent` uses a pipeline of STT, LLM, and TTS models, providing greater control over the conversation flow by allowing applications to modify the text returned by the LLM.
| | Multimodal | Voice Pipeline |
| --- | --- | --- |
| Python | ✅ | ✅ |
| Node.js | ✅ | coming soon |
| Model type | single multimodal model | STT, LLM, TTS |
| Function calling | ✅ | ✅ |
| Natural speech | more natural | |
| Modify LLM response | | ✅ |
| Model vendors | OpenAI | various |
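The sketch below shows how each agent type might be constructed in Python. The plugin choices (Deepgram STT, OpenAI LLM/TTS/Realtime, Silero VAD) and the exact constructor arguments are illustrative; check the plugin documentation for the versions you have installed.

```python
from livekit.agents.multimodal import MultimodalAgent
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero

# MultimodalAgent: a single realtime model consumes user audio and
# produces audio responses directly.
multimodal_agent = MultimodalAgent(
    model=openai.realtime.RealtimeModel(
        instructions="You are a helpful voice assistant.",
        voice="alloy",
    ),
)

# VoicePipelineAgent: separate STT, LLM, and TTS stages, so the application
# can inspect or rewrite the LLM output before it is synthesized.
pipeline_agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
)

# Inside an agent job, either agent is started against the room and the
# user participant, e.g. agent.start(ctx.room, participant).
```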
Handling background noise
While humans can easily ignore background noise, AI models often struggle, leading to misinterpretations or unnecessary pauses when detecting non-speech sounds. Although WebRTC includes built-in noise suppression, it often falls short in real-world environments.
To address this, LiveKit has partnered with Krisp.ai to bring best-in-class noise suppression technology to AI agents. For instructions on enabling Krisp, see the Krisp integration guide.
Turn detection
Endpointing is the process of detecting the start and end of speech in an audio stream. This is crucial for conversational AI agents: the agent needs to know when the user has finished speaking before it processes the input.
Both `VoicePipelineAgent` and `MultimodalAgent` use Voice Activity Detection (VAD) to detect the end of a turn. For OpenAI multimodal configuration, refer to the `MultimodalAgent` turn detection docs.
`VoicePipelineAgent` uses Silero VAD to detect the end of speech. The `min_endpointing_delay` parameter in the agent constructor specifies the minimum silence duration required before the turn is considered complete.
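A minimal sketch of tuning this parameter, assuming the same illustrative plugins as above (the 0.5 second value is only an example):

```python
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero

agent = VoicePipelineAgent(
    vad=silero.VAD.load(),  # Silero VAD classifies speech vs. silence frames
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    # Require at least 0.5s of trailing silence before ending the user's turn.
    min_endpointing_delay=0.5,
)
```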
In future versions, we plan to use custom models trained on speech corpora to better detect when a user has finished speaking. This will work alongside VAD for more accurate endpointing.
Agent state
Voice agents automatically publish their current state to clients, making it easy to build UI that reflects the agent’s status.
The state is passed to clients as a participant attribute on the agent participant. Client components like `useVoiceAssistant` expose the following states:
- `disconnected`: either the agent or the user is disconnected
- `connecting`: an agent is being connected to the user
- `initializing`: the agent is connected, but not yet ready
- `listening`: the agent is listening for user input
- `thinking`: the agent is performing inference on user input
- `speaking`: the agent is playing out its response
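As a rough Python sketch of reacting to these state changes on a client, assuming the state is published under the `lk.agent.state` attribute key and that the SDK exposes a `participant_attributes_changed` event (verify both against the client SDK docs):

```python
from livekit import rtc

AGENT_STATE_ATTR = "lk.agent.state"  # assumed key for the agent state attribute

def watch_agent_state(room: rtc.Room) -> None:
    # Fires when a participant's attributes change; the agent participant updates
    # its state attribute as it moves between listening, thinking, and speaking.
    @room.on("participant_attributes_changed")
    def _on_attributes_changed(changed_attributes, participant):
        state = changed_attributes.get(AGENT_STATE_ATTR)
        if state is not None:
            print(f"agent {participant.identity} is now {state}")
```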
Transcriptions
LiveKit provides realtime transcriptions for both the agent and the user, which are sent to clients via the transcription protocol.
User speech transcriptions are delivered as soon as they are processed by STT. Since the agent’s text response is available before speech synthesis, we synchronize its text transcription with audio playback so the displayed text stays aligned with the agent’s speech.
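A hedged Python sketch of receiving these transcriptions on a client, assuming the SDK exposes a `transcription_received` event that delivers `TranscriptionSegment` objects (names to verify against your SDK version):

```python
from livekit import rtc

def watch_transcriptions(room: rtc.Room) -> None:
    # Segments arrive incrementally; a segment marked final will not change again.
    @room.on("transcription_received")
    def _on_transcription(segments, participant, publication):
        for seg in segments:  # each seg is an rtc.TranscriptionSegment
            who = participant.identity if participant else "unknown"
            status = "final" if seg.final else "interim"
            print(f"[{who}] ({status}) {seg.text}")
```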