AI Voice Agents

Guide to building AI voice assistants

Companies like OpenAI, Character.ai, Retell, and Speak have built their conversational AI products on the LiveKit platform. AI voice agents are one of the primary use cases for LiveKit's Agents framework.

Features

  • Programmable conversation flows
  • Integrated LLM function calls
  • Provide context to the conversation via RAG
  • Leverage connectors from our open-source plugin ecosystem
  • Send synchronized transcriptions to clients

Multimodal or Voice Pipeline

LiveKit offers two types of voice agents: MultimodalAgent and VoicePipelineAgent.

  • MultimodalAgent uses OpenAI’s multimodal model and realtime API to directly process user audio and generate audio responses, similar to OpenAI’s advanced voice mode, producing more natural-sounding speech.
  • VoicePipelineAgent uses a pipeline of STT, LLM, and TTS models, providing greater control over the conversation flow by allowing applications to modify the text returned by the LLM.
                         Multimodal            Voice Pipeline
  Python                 ✓                     ✓
  Node.js                ✓                     coming soon
  Model type             single multimodal     stt, llm, tts
  Function calling       ✓                     ✓
  Natural speech         more natural
  Modify LLM response                          ✓
  Model vendors          OpenAI                various
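
As a rough sketch, here is how each agent type might be constructed with the Python framework. The plugin choices (Deepgram, OpenAI, Silero) and the job entrypoint wiring are illustrative assumptions; exact constructor parameters depend on the framework and plugin versions you have installed.

```python
from livekit.agents import JobContext
from livekit.agents.multimodal import MultimodalAgent
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
    await ctx.connect()
    participant = await ctx.wait_for_participant()

    # MultimodalAgent: a single realtime model consumes user audio and
    # produces audio responses directly (OpenAI Realtime API).
    multimodal_agent = MultimodalAgent(
        model=openai.realtime.RealtimeModel(
            instructions="You are a helpful voice assistant.",
            voice="alloy",
        )
    )

    # VoicePipelineAgent: separate STT, LLM, and TTS stages, giving the
    # application a chance to inspect or rewrite the LLM's text output
    # before it is synthesized.
    pipeline_agent = VoicePipelineAgent(
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
    )

    # Either agent is started the same way; pick one per session.
    multimodal_agent.start(ctx.room, participant)
```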

Handling background noise

While humans can easily ignore background noise, AI models often struggle, leading to misinterpretations or unnecessary pauses when detecting non-speech sounds. Although WebRTC includes built-in noise suppression, it often falls short in real-world environments.

To address this, LiveKit has partnered with Krisp.ai to bring best-in-class noise suppression technology to AI agents. For instructions on enabling Krisp, see the Krisp integration guide.

Turn detection

Endpointing is the process of detecting the start and end of speech in an audio stream. It is crucial for conversational AI agents, which need to know when a user has finished speaking before processing the input.

Both VoicePipelineAgent and MultimodalAgent use Voice Activity Detection (VAD) to detect the end of a turn. For OpenAI Multimodal configuration, refer to the MultimodalAgent turn detection docs.

VoicePipelineAgent uses Silero VAD to detect the end of speech. The min_endpointing_delay parameter in the agent constructor specifies the minimum duration of silence required before the agent considers a turn to have ended.
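
For example, a minimal sketch of adjusting this parameter when constructing the agent; the STT/LLM/TTS plugin choices are illustrative, and the default value in your installed version may differ.

```python
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero

# Silero VAD classifies incoming audio frames as speech or silence; the agent
# then waits for min_endpointing_delay seconds of trailing silence before it
# treats the user's turn as finished and sends the transcript to the LLM.
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    # longer values avoid cutting off slow speakers; shorter values reduce latency
    min_endpointing_delay=1.0,
)
```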

In future versions, we plan to use custom models trained on speech corpora to better detect when a user has finished speaking. This will work alongside VAD for more accurate endpointing.

Agent state

Voice agents automatically publish their current state to clients, making it easy to build UIs that reflect the agent’s status.

The state is passed to clients as a participant attribute on the agent participant. Client components like useVoiceAssistant expose the following states:

  • disconnected: either agent or user is disconnected
  • connecting: an agent is being connected to the user
  • initializing: agent is connected, but not yet ready
  • listening: agent is listening for user input
  • thinking: agent is performing inference on user input
  • speaking: agent is playing out a response
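
Outside of the prebuilt client components, a headless client could read the same attribute with the Python rtc SDK. Below is a minimal sketch, assuming the framework publishes state under the `lk.agent.state` attribute key; verify the key against the framework version you are running.

```python
from livekit import rtc


def watch_agent_state(room: rtc.Room) -> None:
    # The agent publishes its state as a participant attribute, and clients are
    # notified whenever attributes change. "lk.agent.state" is an assumption
    # about the attribute key used by the framework.
    @room.on("participant_attributes_changed")
    def on_attributes_changed(
        changed: dict[str, str], participant: rtc.Participant
    ) -> None:
        if "lk.agent.state" in changed:
            print(f"agent state changed to: {changed['lk.agent.state']}")
```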

Transcriptions

LiveKit provides realtime transcriptions for both the agent and the user, which are sent to clients via the transcription protocol.

User speech transcriptions are delivered as soon as they are processed by STT. Since the agent’s text response is available before speech synthesis, we manually synchronize the text transcription with audio playback.
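
For a receiving client, here is a sketch of consuming these transcription segments with the Python rtc SDK. The event name and callback signature shown are assumptions mirroring the JS SDK's RoomEvent.TranscriptionReceived; check them against the SDK version you are using.

```python
from livekit import rtc


def watch_transcriptions(room: rtc.Room) -> None:
    # Assumption: the Python rtc SDK surfaces transcription segments through a
    # "transcription_received" room event, analogous to the JS SDK's
    # RoomEvent.TranscriptionReceived.
    @room.on("transcription_received")
    def on_transcription(
        segments: list[rtc.TranscriptionSegment],
        participant: rtc.Participant,
        publication: rtc.TrackPublication,
    ) -> None:
        for segment in segments:
            # segments arrive incrementally; `final` marks a completed utterance
            marker = "final" if segment.final else "interim"
            print(f"[{marker}] {participant.identity}: {segment.text}")
```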