Overview
Turn detection is the process of determining when a user begins or ends their "turn" in a conversation. This lets the agent know when to start listening and when to respond.
Most turn detection techniques rely on voice activity detection (VAD) to detect periods of silence in user input. The agent applies heuristics to the VAD data to perform phrase endpointing, which determines the end of a sentence or thought. The agent can use endpoints alone or apply more contextual analysis to determine when a turn is complete.
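The silence-based endpointing heuristic described above can be sketched roughly as follows. This is a minimal illustration and not LiveKit's implementation: the `find_endpoints` function, the `Endpoint` type, and the `min_silence_frames` threshold are hypothetical names, and the input is assumed to be one VAD speech/non-speech decision per audio frame.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Index of the frame at which a phrase was judged complete (hypothetical type)."""
    frame_index: int


def find_endpoints(speech_flags: list[bool], min_silence_frames: int) -> list[Endpoint]:
    """Emit an endpoint each time speech is followed by enough consecutive silence.

    speech_flags: one VAD decision per audio frame (True = speech detected).
    min_silence_frames: assumed threshold; shorter silences are treated as
    pauses inside the same phrase.
    """
    endpoints: list[Endpoint] = []
    in_phrase = False
    silence_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            # Speech (re)starts: any short pause so far belongs to this phrase.
            in_phrase = True
            silence_run = 0
        elif in_phrase:
            silence_run += 1
            if silence_run >= min_silence_frames:
                endpoints.append(Endpoint(frame_index=i))
                in_phrase = False
                silence_run = 0
    return endpoints
```

A single silent frame between two speech frames does not end the phrase here; only a run of `min_silence_frames` silent frames does. Context-aware detection would add a further check on top of these endpoints before declaring the turn complete.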
Effective turn detection and interruption management are essential to great voice AI experiences.
Turn detection
Turn detection determines when the user has finished speaking (so the agent can respond) and when the user starts speaking mid-response (so the agent can yield). LiveKit supports multiple detection strategies and optional features that work together to make turn-taking feel natural:
Detection modes: Choose how the session determines when a user turn is complete. Options range from silence-based rules (VAD or STT endpointing) to context-aware models that use transcript and pause patterns. The AgentSession supports the turn detection modes listed in the following table:

| Mode | Description |
| --- | --- |
| Turn detector model | A custom, open-weights model for context-aware turn detection on top of VAD or STT endpoint data. |
| Realtime models | Use server-side detection from a realtime LLM (for example, the OpenAI Realtime API or Gemini Live API). |
| VAD only | Detect end of turn from speech and silence data alone using VAD start and stop cues. |
| STT endpointing | Use phrase endpoints returned in realtime STT data from your chosen provider. |
| Manual turn control | Disable automatic turn detection entirely and control turn boundaries explicitly. |

Supporting features: Regardless of detection mode, you can tune behavior with additional turn handling options. The following features are available in addition to turn detection modes to make turn-taking feel natural:

| Feature | Description |
| --- | --- |
| Endpointing delay | Controls how long the agent waits after speech (or after an STT end-of-utterance signal) before treating the turn as complete. Use fixed min_delay and max_delay, or dynamic endpointing (Python only) to adapt the delay based on session pause statistics. |
| Adaptive interruption handling | Controls how the agent detects and reacts when the user speaks while the agent is talking. Adaptive interruption handling can distinguish true interruptions from conversational backchanneling. |
| VAD | Use VAD in addition to turn detection modes to improve end-of-turn timing and interruption responsiveness. |
| Noise cancellation | Enhanced noise cancellation improves the quality of turn detection and speech-to-text (STT) for voice AI apps by reducing background noise. |
Turn detector model
To achieve the recommended behavior of an agent that listens while the user speaks and replies after they finish their thought, use the following plugins in an STT-LLM-TTS pipeline:
Turn detection model: Open-weights model for contextually-aware turn detection.
Silero VAD: Silero VAD model for voice activity detection.
The following example uses the LiveKit turn detector model for turn detection:
Python:

from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from livekit.plugins import silero

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
    ),
    vad=silero.VAD.load(),
    # ... stt, tts, llm, etc.
)

Node.js:

import { voice } from '@livekit/agents';
import * as livekit from '@livekit/agents-plugin-livekit';
import * as silero from '@livekit/agents-plugin-silero';

const session = new voice.AgentSession({
  vad: await silero.VAD.load(),
  turnHandling: {
    turnDetection: new livekit.turnDetector.MultilingualModel(),
  },
  // ... stt, tts, llm, etc.
});
See the Voice AI quickstart for a complete example.
For a realtime model, LiveKit recommends using the built-in turn detection capabilities of the chosen model provider. This is the most cost-effective option, since the custom turn detection model requires realtime speech-to-text (STT) that would need to run separately.
Realtime models
Realtime models include built-in turn detection options based on VAD and other techniques. Set the turn_detection parameter to "realtime_llm" and configure the realtime model's turn detection options directly.
To use the LiveKit turn detector model with a realtime model, you must also provide an STT plugin. The turn detector model operates on STT output.
OpenAI Realtime API turn detection: Turn detection options for the OpenAI Realtime API.
Gemini Live API turn detection: Turn detection options for the Gemini Live API.
VAD only
In some cases, VAD is the best option for turn detection. For example, VAD works with any spoken language. To use VAD alone, use the Silero VAD plugin and set turn_detection="vad".
Python:

from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import silero

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="vad",
    ),
    vad=silero.VAD.load(),
    # ... stt, tts, llm, etc.
)

Node.js:

import { voice } from '@livekit/agents';
import * as silero from '@livekit/agents-plugin-silero';

const session = new voice.AgentSession({
  vad: await silero.VAD.load(),
  turnHandling: {
    turnDetection: 'vad',
  },
  // ... stt, tts, llm, etc.
});
STT endpointing
You can also use your STT model's built-in phrase endpointing features for turn detection. Some providers, including AssemblyAI, include sophisticated semantic turn detection models.
You should still provide a VAD plugin for responsive interruption handling. When you use STT endpointing only, your agent is less responsive to user interruptions.
To use STT endpointing, set turn_detection="stt" and provide an STT plugin.
Python:

from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import assemblyai, silero

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
    ),
    stt=assemblyai.STT(),  # AssemblyAI is the recommended STT plugin for STT-based endpointing
    vad=silero.VAD.load(),  # Recommended for responsive interruption handling
    # ... tts, llm, etc.
)

Node.js:

import { voice } from '@livekit/agents';
import * as assemblyai from '@livekit/agents-plugin-assemblyai';
import * as silero from '@livekit/agents-plugin-silero';

const session = new voice.AgentSession({
  stt: new assemblyai.STT(), // AssemblyAI is the recommended STT plugin for STT-based endpointing
  vad: await silero.VAD.load(), // Recommended for responsive interruption handling
  turnHandling: {
    turnDetection: 'stt',
  },
  // ... tts, llm, etc.
});
Additional endpointing configuration options
You can configure additional endpointing behavior using the endpointing key in the turn handling options. By default, the agent uses fixed endpointing and always uses the configured min_delay and max_delay. With dynamic endpointing, the agent adapts the delay within that range based on session pause statistics, so turn-taking can feel more responsive over time.
To learn more, see the EndpointingOptions reference.
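To make the difference between fixed and dynamic endpointing concrete, here is a rough sketch of one way a delay could adapt to pause statistics while staying inside the configured range. This is an assumption-laden illustration, not the framework's algorithm: the `DynamicDelay` class, the exponential moving average, and the `smoothing` parameter are all hypothetical. Only `min_delay` and `max_delay` correspond to options named above.

```python
from typing import Optional


class DynamicDelay:
    """Hypothetical adaptive endpointing delay, clamped to [min_delay, max_delay]."""

    def __init__(self, min_delay: float, max_delay: float, smoothing: float = 0.2):
        if not 0 < min_delay <= max_delay:
            raise ValueError("expected 0 < min_delay <= max_delay")
        if not 0 < smoothing <= 1:
            raise ValueError("smoothing must be in (0, 1]")
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.smoothing = smoothing
        self._avg_pause: Optional[float] = None  # running average of observed pauses

    def observe_pause(self, seconds: float) -> None:
        """Fold one observed mid-turn pause (in seconds) into the running average."""
        if self._avg_pause is None:
            self._avg_pause = seconds
        else:
            self._avg_pause = (1 - self.smoothing) * self._avg_pause + self.smoothing * seconds

    def current_delay(self) -> float:
        """Return the delay to wait before treating the turn as complete."""
        if self._avg_pause is None:
            # No statistics yet: fall back to the lower bound (an arbitrary choice).
            return self.min_delay
        return min(max(self._avg_pause, self.min_delay), self.max_delay)
```

In this sketch a user who pauses briefly keeps the delay near `min_delay`, while a user who takes long pauses pushes it toward `max_delay` without ever exceeding it.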
Manual turn control
Disable automatic turn detection entirely by setting turn_detection="manual" in the turn handling options for the AgentSession.
You can control the user's turn with session.interrupt(), session.clear_user_turn(), and session.commit_user_turn() methods.
This is different from toggling audio input/output for text-only sessions.
For instance, you can use this to implement a push-to-talk interface. Here is a simple example using RPC methods that the frontend can call:
Python:

from livekit import rtc
from livekit.agents import AgentSession, TurnHandlingOptions

# `ctx` is the JobContext passed to the agent entrypoint.
session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="manual",
    ),
    # ... stt, tts, llm, etc.
)

# Disable audio input at the start
session.input.set_audio_enabled(False)

# When user starts speaking
@ctx.room.local_participant.register_rpc_method("start_turn")
async def start_turn(data: rtc.RpcInvocationData):
    session.interrupt()  # Stop any current agent speech
    session.clear_user_turn()  # Clear any previous input
    session.input.set_audio_enabled(True)  # Start listening

# When user finishes speaking
@ctx.room.local_participant.register_rpc_method("end_turn")
async def end_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)  # Stop listening
    session.commit_user_turn()  # Process the input and generate response

# When user cancels their turn
@ctx.room.local_participant.register_rpc_method("cancel_turn")
async def cancel_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)  # Stop listening
    session.clear_user_turn()  # Discard the input

Node.js:

import { voice } from '@livekit/agents';

const session = new voice.AgentSession({
  turnHandling: {
    turnDetection: 'manual',
  },
  // ... stt, tts, llm, etc.
});

// Disable audio input at the start
session.input.setAudioEnabled(false);

// When user starts speaking
ctx.room.localParticipant.registerRpcMethod('start_turn', async (data) => {
  session.interrupt(); // Stop any current agent speech
  session.clearUserTurn(); // Clear any previous input
  session.input.setAudioEnabled(true); // Start listening
  return 'ok';
});

// When user finishes speaking
ctx.room.localParticipant.registerRpcMethod('end_turn', async (data) => {
  session.input.setAudioEnabled(false); // Stop listening
  session.commitUserTurn(); // Process the input and generate response
  return 'ok';
});

// When user cancels their turn
ctx.room.localParticipant.registerRpcMethod('cancel_turn', async (data) => {
  session.input.setAudioEnabled(false); // Stop listening
  session.clearUserTurn(); // Discard the input
  return 'ok';
});
A more complete example is available here:
Push-to-Talk Agent
Reducing background noise
Enhanced noise cancellation is available in LiveKit Cloud and improves the quality of turn detection and speech-to-text (STT) for voice AI apps. You can add background noise and voice cancellation to your agent by adding it to the room options when you start your agent session. To learn how to enable it, see the Voice AI quickstart.
Interruptions
The framework pauses the agent's speech whenever it detects user speech in the input audio, ensuring the agent feels responsive. The user can interrupt the agent at any time, either by speaking (with automatic turn detection) or via the session.interrupt() method. When interrupted, the agent stops speaking and automatically truncates its conversation history to include only the portion of the speech that the user heard before interruption.
You can disable user interruptions when scheduling speech using the say() or generate_reply() methods by setting turn_handling.interruption.enabled to false. To learn more, see Interruption mode.
To explicitly interrupt the agent, call the interrupt() method on the handle or session at any time. This can be performed even when interruption is disabled in the turn handling options.
Python:

handle = session.say("Hello world")
handle.interrupt()

# or from the session
session.interrupt()

Node.js:

const handle = session.say('Hello world');
handle.interrupt();

// or from the session
session.interrupt();
See the section on tool interruptions for more information on handling interruptions during long-running tool calls.
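The history truncation mentioned in this section (keeping only the portion of speech the user heard) can be approximated with a short sketch. This is not how the framework does it internally: the `truncate_to_heard` function is hypothetical, and it assumes words are spread evenly across the audio duration, which real speech only roughly satisfies.

```python
def truncate_to_heard(text: str, played_seconds: float, total_seconds: float) -> str:
    """Keep the leading words of `text` in proportion to the audio already played.

    Assumes (hypothetically) that words are distributed uniformly over the
    utterance, so playing half of the audio means half of the words were heard.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    words = text.split()
    # Clamp so over- or under-reported playback positions stay in range.
    fraction = min(max(played_seconds / total_seconds, 0.0), 1.0)
    heard_count = int(len(words) * fraction)
    return " ".join(words[:heard_count])
```

A production system would more likely rely on word-level timestamps aligned with the synthesized audio, since speaking rate varies within an utterance.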
Interruption mode
The interruption options control whether the agent can be interrupted and how interruptions are detected. Key settings:
- enabled: When True, the agent can be interrupted by user speech; when False, the agent cannot be interrupted.
- mode: Determines how the framework detects interruptions. Only applies when enabled is True. The following modes are available:
  - "adaptive": Adaptive interruption handling. This is the default mode for agents deployed to LiveKit Cloud, when used with most STT providers. To learn more, see Adaptive interruption handling.
  - "vad": Use VAD for interruption detection. Interruption detection is based on speech start and stop cues.
To learn more, see the InterruptionOptions reference.
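As a small illustration of how the two settings relate, the following sketch builds an interruption configuration in dict form. The `interruption_config` helper and `VALID_MODES` are hypothetical, and whether the framework accepts `enabled` and `mode` as dict keys in exactly this shape is an assumption; check the InterruptionOptions reference before relying on it.

```python
VALID_MODES = {"adaptive", "vad"}  # the two modes named in this section


def interruption_config(enabled: bool, mode: str = "adaptive") -> dict:
    """Build a hypothetical turn handling dict for the interruption options."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown interruption mode: {mode!r}")
    config: dict = {"enabled": enabled}
    if enabled:
        # mode only applies when interruptions are enabled
        config["mode"] = mode
    return {"interruption": config}
```

Omitting `mode` when `enabled` is False mirrors the rule that the mode only applies to interruptible agents.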
Adaptive interruption handling
Adaptive interruption handling enables your agent to intelligently detect when users interrupt mid-response. Rather than using fixed thresholds, adaptive interruption handling analyzes the audio signals to determine whether an interruption is intentional.
Adaptive interruption handling: Use adaptive interruption handling to distinguish between true interruptions and conversational backchanneling.
False interruptions
In some cases, the framework detects human speech audio and interrupts the agent, but the transcription comes up empty as no actual words are spoken. In these cases, the VAD-based interruption is considered a false positive. By default, the agent resumes speaking from where it left off after a false interruption. You can configure this behavior using the resume_false_interruption and false_interruption_timeout parameters.
- false_interruption_timeout: If an interruption is detected, but the user is silent, this is the duration of silence to wait after an interruption before emitting an agent_false_interruption event. Python uses seconds (for example, 2.0); Node.js uses milliseconds (for example, 2000).
- resume_false_interruption: Whether to resume the agent's speech after a false interruption is detected. If True, the agent continues speaking from where it left off after the false_interruption_timeout period has passed with no user transcription.
Set these parameters in the interruption key of the turn handling options. For example, the following configuration resumes the agent's speech when an interruption is followed by 2 seconds of silence with no user transcription. Pass it to the turn_handling parameter of AgentSession:
Python:

turn_handling = {
    "interruption": {
        "false_interruption_timeout": 2.0,
        "resume_false_interruption": True,
        # ... other interruption parameters
    },
}

Node.js:

const session = new voice.AgentSession({
  turnHandling: {
    interruption: {
      falseInterruptionTimeout: 2000,
      resumeFalseInterruption: true,
      // ... other interruption parameters
    },
  },
  // ... other parameters
});
For more information on these parameters, see the InterruptionOptions reference.
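The decision described in this section can be summarized in a small sketch. It is a simplified model rather than the framework's logic: the `resolve_interruption` function and its return labels are hypothetical, and only the two parameter names come from the options above. The `ms_to_seconds` helper reflects the unit difference between Python (seconds) and Node.js (milliseconds).

```python
def ms_to_seconds(milliseconds: float) -> float:
    """Convert a Node.js-style millisecond timeout to Python-style seconds."""
    return milliseconds / 1000.0


def resolve_interruption(
    transcript: str,
    silence_seconds: float,
    false_interruption_timeout: float,
    resume_false_interruption: bool,
) -> str:
    """Classify an interruption after the agent's speech has been paused.

    Returns one of these hypothetical labels:
      "yield"        - the user said actual words, so it is a true interruption
      "wait"         - no words yet, and the timeout has not elapsed
      "resume"       - false interruption; continue from where speech stopped
      "stay_stopped" - false interruption, but resuming is disabled
    """
    if transcript.strip():
        return "yield"
    if silence_seconds < false_interruption_timeout:
        return "wait"
    return "resume" if resume_false_interruption else "stay_stopped"
```

With the example values from this section, an empty transcript and 2 seconds of silence lead to "resume".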
Additional configuration options
For a complete list of interruption options, see the InterruptionOptions reference.
The following additional parameters are available in the InterruptionOptions object:
- discard_audio_if_uninterruptible: When True, drop buffered audio if the agent is speaking and cannot be interrupted.
- min_duration: Minimum duration of speech (in seconds) to register as an interruption.
- min_words: Minimum number of words to be considered as an interruption. Only used if STT is enabled. Set to a value greater than 0 to require actual speech content before triggering interruptions.
To learn more about these parameters, see the InterruptionOptions reference.
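To show how min_duration and min_words might combine, here is a hedged sketch of an interruption filter. The `counts_as_interruption` function and its `stt_enabled` flag are hypothetical, and the exact order in which the framework applies these checks is an assumption.

```python
def counts_as_interruption(
    speech_seconds: float,
    transcript: str,
    min_duration: float,
    min_words: int,
    stt_enabled: bool = True,
) -> bool:
    """Decide whether detected user speech should interrupt the agent."""
    # Speech shorter than min_duration never registers as an interruption.
    if speech_seconds < min_duration:
        return False
    # The word count is only checked when STT is available and min_words > 0.
    if stt_enabled and min_words > 0:
        return len(transcript.split()) >= min_words
    return True
```

Setting `min_words` to 0 makes duration the only criterion, which matches the note that a value greater than 0 is needed to require actual speech content.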
Session events
The AgentSession emits events for turn handling. For a list of all available events, see the Events reference.
Interruption events
The AgentSession exposes interruption events to monitor the flow of a conversation:
Python:

@session.on("user_interruption_detected")
def on_interruption(ev):
    print(f"User interrupted at: {ev.timestamp}")
    print(f"Interruption probability: {ev.probability}")

@session.on("agent_false_interruption")
def on_false_interruption(ev):
    print("False interruption detected, resuming speech")

Node.js:

session.on('user_interruption_detected', (ev) => {
  console.log(`User interrupted at: ${ev.timestamp}`);
  console.log(`Interruption probability: ${ev.probability}`);
});

session.on('agent_false_interruption', () => {
  console.log('False interruption detected, resuming speech');
});
Turn-taking events
The AgentSession exposes user and agent state events to monitor the flow of a conversation:
Python:

from livekit.agents import UserStateChangedEvent, AgentStateChangedEvent

@session.on("user_state_changed")
def on_user_state_changed(ev: UserStateChangedEvent):
    if ev.new_state == "speaking":
        print("User started speaking")
    elif ev.new_state == "listening":
        print("User stopped speaking")
    elif ev.new_state == "away":
        print("User is not present (e.g. disconnected)")

@session.on("agent_state_changed")
def on_agent_state_changed(ev: AgentStateChangedEvent):
    if ev.new_state == "initializing":
        print("Agent is starting up")
    elif ev.new_state == "idle":
        print("Agent is ready but not processing")
    elif ev.new_state == "listening":
        print("Agent is listening for user input")
    elif ev.new_state == "thinking":
        print("Agent is processing user input and generating a response")
    elif ev.new_state == "speaking":
        print("Agent started speaking")

Node.js:

import { voice } from '@livekit/agents';

session.on(voice.AgentSessionEventTypes.UserStateChanged, (ev) => {
  if (ev.newState === 'speaking') {
    console.log('User started speaking');
  } else if (ev.newState === 'listening') {
    console.log('User stopped speaking');
  } else if (ev.newState === 'away') {
    console.log('User is not present (e.g. disconnected)');
  }
});

session.on(voice.AgentSessionEventTypes.AgentStateChanged, (ev) => {
  if (ev.newState === 'initializing') {
    console.log('Agent is starting up');
  } else if (ev.newState === 'idle') {
    console.log('Agent is ready but not processing');
  } else if (ev.newState === 'listening') {
    console.log('Agent is listening for user input');
  } else if (ev.newState === 'thinking') {
    console.log('Agent is processing user input and generating a response');
  } else if (ev.newState === 'speaking') {
    console.log('Agent started speaking');
  }
});
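One possible use of these state events is measuring how long the agent takes to start replying after the user stops speaking. The sketch below is framework-independent and purely illustrative: the `ResponseLatencyTracker` class is hypothetical, timestamps are passed in explicitly rather than read from the event objects, and only the state names ("speaking", "listening") come from the handlers above.

```python
from typing import Optional


class ResponseLatencyTracker:
    """Hypothetical helper that records the user-stop to agent-speak latency."""

    def __init__(self) -> None:
        self._user_stopped_at: Optional[float] = None
        self.latencies: list[float] = []

    def on_user_state(self, new_state: str, now: float) -> None:
        if new_state == "listening":
            # User stopped speaking: start timing the agent's response.
            self._user_stopped_at = now
        elif new_state == "speaking":
            # User resumed before the agent replied: discard the pending measurement.
            self._user_stopped_at = None

    def on_agent_state(self, new_state: str, now: float) -> None:
        if new_state == "speaking" and self._user_stopped_at is not None:
            self.latencies.append(now - self._user_stopped_at)
            self._user_stopped_at = None
```

In a real agent, the two methods would be called from the user_state_changed and agent_state_changed handlers with a monotonic clock value such as `time.monotonic()`.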