Overview
Effective turn detection and interruption management are essential for creating natural conversational experiences with AI agents. By accurately identifying when to respond and when to pause, agents can facilitate natural and engaging interactions with users.
Turn detection
Endpointing is the process of detecting the start and end of speech in an audio stream. This is crucial for conversational AI agents to understand when to start responding to user input.
Determining the end of a turn is particularly challenging for AI agents. Humans rely on multiple cues, such as pauses, speech tone, and content, to recognize when someone has finished speaking.
The Agents framework offers several strategies for detecting turn boundaries:
- VAD (voice activity detection)
- Turn detection model
- Realtime LLM
- STT
- Manual
VAD
VAD is used to detect when the user has started and stopped speaking.
- Within `AgentSession`, the default VAD option is Silero VAD.
- In Node.js, the default for `VoicePipelineAgent` is Silero VAD, while `MultimodalAgent` uses the OpenAI Realtime API server VAD.
For VAD configuration options, see Configuring turn detection and user interruptions.
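As a minimal sketch, the Silero VAD can be tuned when constructing the session. The `min_silence_duration` and `activation_threshold` options shown here assume the Python Silero plugin; the values are illustrative, not recommendations:

```python
from livekit.agents import AgentSession
from livekit.plugins import silero

# Assumed silero plugin options; values shown are illustrative.
session = AgentSession(
    vad=silero.VAD.load(
        min_silence_duration=0.55,  # seconds of silence before speech is considered ended
        activation_threshold=0.5,   # higher values require clearer speech to trigger detection
    ),
)
```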
Turn detection model
VAD is effective at detecting when a user is actively speaking, but it lacks the contextual awareness to determine if the user has finished their thought. People often pause during conversations to think or formulate their words.
To address this, LiveKit has developed a custom, open-weights language model to incorporate conversational context as an additional signal to VAD. The turn-detector plugin uses this model to predict whether a user is done speaking.
When the model predicts that the user is not done with their turn, the agent waits for a significantly longer period of silence before responding. This helps to prevent unwanted interruptions during natural pauses in speech.
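The length of this extra wait can be tuned on the session itself. A minimal sketch, assuming the `min_endpointing_delay` and `max_endpointing_delay` parameters of `AgentSession` in livekit-agents 1.x:

```python
from livekit.agents import AgentSession

# Assumed livekit-agents 1.x parameters; values are illustrative.
session = AgentSession(
    min_endpointing_delay=0.5,  # baseline silence before ending the turn
    max_endpointing_delay=6.0,  # extended wait when the model predicts more speech is coming
)
```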
Benchmarks
In our testing, the turn detector model demonstrated the following performance:
- 95% true positive rate: avoids early interruptions by correctly identifying when the user is not done speaking.
- 96% true negative rate: accurately determines the end of a turn when the user has finished speaking.
Using the turn detector
To use the turn detector, install the plugin and initialize the agent with it.
Install the `livekit-plugins-turn-detector` package:

```shell
pip install "livekit-agents[turn-detector]~=1.0rc"
```
Initialize the agent with the turn detector:

```python
from livekit.plugins import turn_detector

session = AgentSession(
    ...
    turn_detection=turn_detector.EOUModel(),
)
```
The turn detection model also works with speech-to-speech models like the Realtime API. However, since it operates in the text domain, you still need to provide a separate STT plugin for it to function.
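For example, here is a sketch pairing the Realtime API with an STT plugin so the text-based turn model has transcripts to score. The Deepgram plugin is an assumption for illustration; any STT plugin should work:

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai, turn_detector

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
    stt=deepgram.STT(),  # required: the turn model operates on transcripts, not audio
    turn_detection=turn_detector.EOUModel(),
)
```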
Before running the agent for the first time, download the model weights:
```shell
python my_agent.py download-files
```
Realtime LLM
Realtime LLMs can directly take in speech and output speech, giving them enough context to predict end-of-turn events. OpenAI’s Realtime API includes native turn detection with two available modes:
- Server VAD - similar to performing VAD within the agent, using silence to detect the end of a turn
- Semantic VAD - uses a classifier to estimate, from the words spoken, whether the user has finished their turn
Below is an example of using Semantic VAD with the Realtime API:
```python
from livekit.plugins.openai import realtime
from openai.types.beta.realtime.session import TurnDetection

session = AgentSession(
    ...
    turn_detection="realtime_llm",
    llm=realtime.RealtimeModel(
        turn_detection=TurnDetection(
            type="semantic_vad",
            eagerness="medium",
            create_response=True,
            interrupt_response=True,
        )
    ),
)
```
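Server VAD mode is configured through the same `TurnDetection` type. A hedged sketch; the threshold and timing values below are illustrative starting points, not recommendations:

```python
from openai.types.beta.realtime.session import TurnDetection

# Server VAD: silence-based endpointing performed by OpenAI's servers.
server_vad = TurnDetection(
    type="server_vad",
    threshold=0.5,            # speech probability needed to count as speech
    prefix_padding_ms=300,    # audio retained before detected speech starts
    silence_duration_ms=500,  # silence required before the turn ends
    create_response=True,
    interrupt_response=True,
)
```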
STT
Turn detection can also be handled by an STT model. Since STT models process audio, they can use speech patterns—such as tone and pauses—to infer when the user has finished speaking. In this mode, the agent treats the final STT transcript as the end of the turn.
```python
session = AgentSession(
    ...
    stt=myprovider.STT(),
    turn_detection="stt",
)
```
Manual
You can also take full control over when a turn starts and ends. This is useful for applications like push-to-talk, where the user presses a button before speaking.
Here's an example of a push-to-talk agent using RPC calls to manually start and end turns:
```python
session = AgentSession(
    ...
    turn_detection="manual",
)

@ctx.room.local_participant.register_rpc_method("start_turn")
async def start_turn(data: rtc.RpcInvocationData):
    session.interrupt()
    # listen to the caller if multi-user
    room_io.set_participant(data.caller_identity)
    session.input.set_audio_enabled(True)

@ctx.room.local_participant.register_rpc_method("end_turn")
async def end_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)
    session.generate_reply()
```
The complete example is available here.
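On the client side, the push-to-talk button would invoke those RPC methods. A minimal sketch using the Python rtc SDK, assuming the caller knows the agent's participant identity:

```python
from livekit import rtc

async def on_button_pressed(room: rtc.Room, agent_identity: str) -> None:
    # Button down: tell the agent to start listening to this participant.
    await room.local_participant.perform_rpc(
        destination_identity=agent_identity,
        method="start_turn",
        payload="",
    )

async def on_button_released(room: rtc.Room, agent_identity: str) -> None:
    # Button up: close the turn and let the agent respond.
    await room.local_participant.perform_rpc(
        destination_identity=agent_identity,
        method="end_turn",
        payload="",
    )
```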
Handling interruptions
When a user interrupts, the agent stops speaking and switches to listening mode, storing the position of the speech played so far in its `ChatContext`. There are a number of parameters that control the interruption behavior for an AI voice agent. To learn more, see Configuring turn detection and user interruptions.
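As a sketch, two of those parameters can be set directly on the session, assuming the livekit-agents 1.x names `allow_interruptions` and `min_interruption_duration`:

```python
from livekit.agents import AgentSession

# Assumed livekit-agents 1.x parameters; values are illustrative.
session = AgentSession(
    allow_interruptions=True,       # users may cut the agent off mid-speech
    min_interruption_duration=0.5,  # seconds of user speech required to register an interruption
)
```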
Manual interruptions
Manually interrupt the agent session with the `session.interrupt()` method. Any active agent speech is immediately ended, and the context is truncated to contain only the speech that the user actually heard before the interruption.
This method is currently only available in Python.
```python
# Interrupt the agent's current response whenever someone joins the room
@ctx.room.on("participant_connected")
def on_participant_connected(participant: rtc.RemoteParticipant):
    session.interrupt()
```