
Turns overview

Guide to managing conversation turns in voice AI.

Overview

Turn detection is the process of determining when a user begins or ends their "turn" in a conversation. This lets the agent know when to start listening and when to respond.

Most turn detection techniques rely on voice activity detection (VAD) to detect periods of silence in user input. The agent applies heuristics to the VAD data to perform phrase endpointing, which determines the end of a sentence or thought. The agent can use endpoints alone or apply more contextual analysis to determine when a turn is complete.
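
To illustrate the idea, the following is a simplified sketch (not the framework's implementation) of silence-based phrase endpointing: a turn is treated as complete once enough consecutive silent VAD frames follow detected speech.

```python
def detect_endpoint(frames, silence_threshold=15):
    """Return the index of the frame where the turn ends, or None.

    `frames` is a sequence of booleans from a VAD: True = speech, False = silence.
    The turn ends once `silence_threshold` consecutive silent frames follow speech.
    """
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_threshold:
                return i  # end of turn: enough trailing silence
    return None  # user may still be mid-thought
```

Real endpointing heuristics work on audio energy and timing rather than clean booleans, and context-aware models add transcript analysis on top, but the core loop is the same: accumulate silence after speech until a threshold is crossed.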

Effective turn detection and interruption management are essential to a great voice AI experience.

Turn detection

Turn detection determines when the user has finished speaking (so the agent can respond) and when the user starts speaking mid-response (so the agent can yield). LiveKit supports multiple detection strategies and optional features that work together to make turn-taking feel natural:

  • Detection modes: Choose how the session determines when a user turn is complete. Options range from silence-based rules (VAD or STT endpointing) to context-aware models that use transcript and pause patterns. The AgentSession supports the following turn detection modes:

    • Turn detector model: A custom, open-weights model for context-aware turn detection on top of VAD or STT endpoint data.
    • Realtime models: Use server-side detection from a realtime LLM (for example, the OpenAI Realtime API or Gemini Live API).
    • VAD only: Detect end of turn from speech and silence data alone using VAD start and stop cues.
    • STT endpointing: Use phrase endpoints returned in realtime STT data from your chosen provider.
    • Manual turn control: Disable automatic turn detection entirely and control turn boundaries explicitly.
  • Supporting features: Regardless of detection mode, you can tune behavior with the following additional turn handling options:

    • Endpointing delay: Controls how long the agent waits after speech (or after an STT end-of-utterance signal) before treating the turn as complete. Use fixed min_delay and max_delay, or dynamic endpointing (Python only) to adapt the delay based on session pause statistics.
    • Adaptive interruption handling: Controls how the agent detects and reacts when the user speaks while the agent is talking. Adaptive interruption handling can distinguish true interruptions from conversational backchanneling.
    • VAD: Use VAD in addition to a turn detection mode to improve end-of-turn timing and interruption responsiveness.
    • Noise cancellation: Enhanced noise cancellation reduces background noise, improving the quality of turn detection and speech-to-text (STT) for voice AI apps.

Turn detector model

To achieve the recommended behavior of an agent that listens while the user speaks and replies after they finish their thought, use the LiveKit turn detector model alongside a VAD plugin in an STT-LLM-TTS pipeline.

The following example uses the LiveKit turn detector model for turn detection:

Python:

```python
from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
    ),
    vad=silero.VAD.load(),
    # ... stt, tts, llm, etc.
)
```

Node.js:

```typescript
import { voice } from '@livekit/agents';
import * as livekit from '@livekit/agents-plugin-livekit';
import * as silero from '@livekit/agents-plugin-silero';

const session = new voice.AgentSession({
  vad: await silero.VAD.load(),
  turnHandling: {
    turnDetection: new livekit.turnDetector.MultilingualModel(),
  },
  // ... stt, tts, llm, etc.
});
```

See the Voice AI quickstart for a complete example.

Realtime model turn detection

For a realtime model, LiveKit recommends using the built-in turn detection capabilities of your chosen model provider. This is the most cost-effective option, since the custom turn detector model requires realtime speech-to-text (STT), which would need to run separately.

Realtime models

Realtime models include built-in turn detection options based on VAD and other techniques. Set the turn_detection parameter to "realtime_llm" and configure the realtime model's turn detection options directly.
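
For example, a sketch of this configuration (the OpenAI realtime plugin is an assumption here, and the turn_detection fields follow the OpenAI Realtime API; your provider's options may differ):

```python
from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import openai

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="realtime_llm",  # defer turn detection to the realtime model
    ),
    # Configure the model's own turn detection directly on the realtime LLM.
    llm=openai.realtime.RealtimeModel(
        turn_detection={"type": "server_vad", "silence_duration_ms": 500},
    ),
)
```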

To use the LiveKit turn detector model with a realtime model, you must also provide an STT plugin. The turn detector model operates on STT output.

VAD only

In some cases, VAD is the best option for turn detection. For example, VAD works with any spoken language. To use VAD alone, use the Silero VAD plugin and set turn_detection="vad".

Python:

```python
from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import silero

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="vad",
    ),
    vad=silero.VAD.load(),
    # ... stt, tts, llm, etc.
)
```

Node.js:

```typescript
import { voice } from '@livekit/agents';
import * as silero from '@livekit/agents-plugin-silero';

const session = new voice.AgentSession({
  vad: await silero.VAD.load(),
  turnHandling: {
    turnDetection: 'vad',
  },
  // ... stt, tts, llm, etc.
});
```

STT endpointing

You can also use your STT model's built-in phrase endpointing features for turn detection. Some providers, including AssemblyAI, include sophisticated semantic turn detection models.

You should still provide a VAD plugin for responsive interruption handling; with STT endpointing alone, your agent responds more slowly to user interruptions.

To use STT endpointing, set turn_detection="stt" and provide an STT plugin.

Python:

```python
from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import assemblyai, silero

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
    ),
    stt=assemblyai.STT(),  # AssemblyAI is the recommended STT plugin for STT-based endpointing
    vad=silero.VAD.load(),  # Recommended for responsive interruption handling
    # ... tts, llm, etc.
)
```

Node.js:

```typescript
import { voice } from '@livekit/agents';
import * as assemblyai from '@livekit/agents-plugin-assemblyai';
import * as silero from '@livekit/agents-plugin-silero';

const session = new voice.AgentSession({
  stt: new assemblyai.STT(), // AssemblyAI is the recommended STT plugin for STT-based endpointing
  vad: await silero.VAD.load(), // Recommended for responsive interruption handling
  turnHandling: {
    turnDetection: 'stt',
  },
  // ... tts, llm, etc.
});
```

Additional endpointing configuration options

You can configure additional endpointing behavior using the endpointing key in the turn handling options. By default, the agent uses fixed endpointing and always uses the configured min_delay and max_delay. With dynamic endpointing, the agent adapts the delay within that range based on session pause statistics, so turn-taking can feel more responsive over time.
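
For example, a sketch of the endpointing options in dict form (min_delay and max_delay are the parameters named above; the surrounding key structure is an assumption, so check the EndpointingOptions reference for the exact fields):

```python
turn_handling = {
    "endpointing": {
        "min_delay": 0.5,  # wait at least 0.5 s of silence before ending the turn
        "max_delay": 3.0,  # never wait longer than 3 s
    },
}
```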

To learn more, see the EndpointingOptions reference.

Manual turn control

Disable automatic turn detection entirely by setting turn_detection="manual" in the turn handling options for the AgentSession.

You can control the user's turn with session.interrupt(), session.clear_user_turn(), and session.commit_user_turn() methods.

Tip

This is different from toggling audio input/output for text-only sessions.

For instance, you can use this to implement a push-to-talk interface. Here is a simple example using RPC methods that the frontend can call:

Python:

```python
from livekit import rtc
from livekit.agents import AgentSession, TurnHandlingOptions

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="manual",
    ),
    # ... stt, tts, llm, etc.
)

# Disable audio input at the start
session.input.set_audio_enabled(False)

# `ctx` is the JobContext passed to your agent's entrypoint

# When the user starts speaking
@ctx.room.local_participant.register_rpc_method("start_turn")
async def start_turn(data: rtc.RpcInvocationData):
    session.interrupt()  # Stop any current agent speech
    session.clear_user_turn()  # Clear any previous input
    session.input.set_audio_enabled(True)  # Start listening

# When the user finishes speaking
@ctx.room.local_participant.register_rpc_method("end_turn")
async def end_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)  # Stop listening
    session.commit_user_turn()  # Process the input and generate a response

# When the user cancels their turn
@ctx.room.local_participant.register_rpc_method("cancel_turn")
async def cancel_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)  # Stop listening
    session.clear_user_turn()  # Discard the input
```

Node.js:

```typescript
import { voice } from '@livekit/agents';

const session = new voice.AgentSession({
  turnHandling: {
    turnDetection: 'manual',
  },
  // ... stt, tts, llm, etc.
});

// Disable audio input at the start
session.input.setAudioEnabled(false);

// ctx is the JobContext passed to your agent's entrypoint

// When the user starts speaking
ctx.room.localParticipant.registerRpcMethod('start_turn', async (data) => {
  session.interrupt(); // Stop any current agent speech
  session.clearUserTurn(); // Clear any previous input
  session.input.setAudioEnabled(true); // Start listening
  return 'ok';
});

// When the user finishes speaking
ctx.room.localParticipant.registerRpcMethod('end_turn', async (data) => {
  session.input.setAudioEnabled(false); // Stop listening
  session.commitUserTurn(); // Process the input and generate a response
  return 'ok';
});

// When the user cancels their turn
ctx.room.localParticipant.registerRpcMethod('cancel_turn', async (data) => {
  session.input.setAudioEnabled(false); // Stop listening
  session.clearUserTurn(); // Discard the input
  return 'ok';
});
```

A more complete example is available here:

Push-to-Talk Agent

A voice AI agent that uses push-to-talk for controlled multi-participant conversations, only enabling audio input when explicitly triggered.

Reducing background noise

Enhanced noise cancellation is available in LiveKit Cloud and improves the quality of turn detection and speech-to-text (STT) for voice AI apps. To add background noise and voice cancellation to your agent, include it in the room input options when you start the agent session. To learn how to enable it, see the Voice AI quickstart.
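
As a sketch, enabling noise cancellation when starting the session might look like this (this assumes the livekit-plugins-noise-cancellation package and the BVC model, and a JobContext named ctx; see the Voice AI quickstart for the authoritative setup):

```python
from livekit.agents import RoomInputOptions
from livekit.plugins import noise_cancellation

await session.start(
    agent=agent,  # your Agent instance
    room=ctx.room,
    room_input_options=RoomInputOptions(
        # Apply background voice cancellation to incoming audio before
        # it reaches VAD, STT, and turn detection.
        noise_cancellation=noise_cancellation.BVC(),
    ),
)
```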

Interruptions

The framework pauses the agent's speech whenever it detects user speech in the input audio, ensuring the agent feels responsive. The user can interrupt the agent at any time, either by speaking (with automatic turn detection) or via the session.interrupt() method. When interrupted, the agent stops speaking and automatically truncates its conversation history to include only the portion of the speech that the user heard before interruption.

Disabling interruptions

You can disable user interruptions when scheduling speech using the say() or generate_reply() methods by setting turn_handling.interruption.enabled to false. To learn more, see Interruption mode.
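
For example, a sketch of an uninterruptible utterance (the exact per-call option shape is an assumption based on the turn_handling.interruption.enabled setting described above; see Interruption mode below for the documented fields):

```python
# Speak a message the user cannot cut off by talking over it.
handle = session.say(
    "Please listen carefully to the following safety information.",
    turn_handling={"interruption": {"enabled": False}},
)
```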

To explicitly interrupt the agent, call the interrupt() method on the handle or session at any time. This can be performed even when interruption is disabled in the turn handling options.

Python:

```python
handle = session.say("Hello world")
handle.interrupt()

# or from the session
session.interrupt()
```

Node.js:

```typescript
const handle = session.say('Hello world');
handle.interrupt();

// or from the session
session.interrupt();
```

Long-running tool calls

See the section on tool interruptions for more information on handling interruptions during long-running tool calls.

Interruption mode

The interruption options control whether the agent can be interrupted and how interruptions are detected. Key settings:

  • enabled: When True, the agent can be interrupted by user speech; when False, the agent cannot be interrupted.
  • mode: Determines how the framework detects interruptions. Only applies when enabled is True. The following modes are available:
    • "adaptive": Adaptive interruption handling. This is the default mode for agents deployed to LiveKit Cloud, when used with most STT providers. To learn more see Adaptive interruption handling.
    • "vad": Use VAD for interruption detection. Interruption detection is based on speech start and stop cues.

To learn more, see the InterruptionOptions reference.

Adaptive interruption handling

Adaptive interruption handling enables your agent to intelligently detect when users interrupt mid-response. Rather than using fixed thresholds, it analyzes audio signals to determine whether an interruption is intentional, distinguishing true interruptions from conversational backchanneling.

False interruptions

In some cases, the framework detects human speech audio and interrupts the agent, but the transcription comes up empty as no actual words are spoken. In these cases, the VAD-based interruption is considered a false positive. By default, the agent resumes speaking from where it left off after a false interruption. You can configure this behavior using the resume_false_interruption and false_interruption_timeout parameters.

  • false_interruption_timeout: The duration of silence to wait after a detected interruption, with no user transcription, before emitting an agent_false_interruption event. Python uses seconds (for example, 2.0); Node.js uses milliseconds (for example, 2000).
  • resume_false_interruption: Whether to resume the agent's speech after a false interruption is detected. If True, the agent continues speaking from where it left off after the false_interruption_timeout period has passed with no user transcription.

Set these parameters in the interruption key of the turn handling options, and pass them to the turn_handling parameter of AgentSession. For example, the following configuration resumes the agent's speech when a detected interruption is followed by 2 seconds of silence with no transcription:

Python:

```python
turn_handling = {
    "interruption": {
        "false_interruption_timeout": 2.0,
        "resume_false_interruption": True,
        # ... other interruption parameters
    },
}
```

Node.js:

```typescript
const session = new voice.AgentSession({
  turnHandling: {
    interruption: {
      falseInterruptionTimeout: 2000,
      resumeFalseInterruption: true,
      // ... other interruption parameters
    },
  },
  // ... other parameters
});
```

For more information on these parameters, see the InterruptionOptions reference.

Additional configuration options

The following additional parameters are available in the InterruptionOptions object:

  • discard_audio_if_uninterruptible: When True, drop buffered audio if the agent is speaking and cannot be interrupted.
  • min_duration: Minimum duration of speech (in seconds) to register as an interruption.
  • min_words: Minimum number of words to be considered as an interruption. Only used if STT is enabled. Set to a value greater than 0 to require actual speech content before triggering interruptions.
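
For example, to require half a second of speech and at least two transcribed words before registering an interruption, in the same dict form as the false interruption example above:

```python
turn_handling = {
    "interruption": {
        "min_duration": 0.5,  # seconds of speech required to count as an interruption
        "min_words": 2,       # require transcribed words, not just sound
    },
}
```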

To learn more about these parameters, see the InterruptionOptions reference.

Session events

The AgentSession emits events for turn handling. For a list of all available events, see the Events reference.

Interruption events

The AgentSession exposes interruption events to monitor the flow of a conversation:

@session.on("user_interruption_detected")
def on_interruption(ev):
print(f"User interrupted at: {ev.timestamp}")
print(f"Interruption probability: {ev.probability}")
@session.on("agent_false_interruption")
def on_false_interruption(ev):
print("False interruption detected, resuming speech")
session.on('user_interruption_detected', (ev) => {
console.log(`User interrupted at: ${ev.timestamp}`);
console.log(`Interruption probability: ${ev.probability}`);
});
session.on('agent_false_interruption', () => {
console.log('False interruption detected, resuming speech');
});

Turn-taking events

The AgentSession exposes user and agent state events to monitor the flow of a conversation:

Python:

```python
from livekit.agents import UserStateChangedEvent, AgentStateChangedEvent

@session.on("user_state_changed")
def on_user_state_changed(ev: UserStateChangedEvent):
    if ev.new_state == "speaking":
        print("User started speaking")
    elif ev.new_state == "listening":
        print("User stopped speaking")
    elif ev.new_state == "away":
        print("User is not present (e.g. disconnected)")

@session.on("agent_state_changed")
def on_agent_state_changed(ev: AgentStateChangedEvent):
    if ev.new_state == "initializing":
        print("Agent is starting up")
    elif ev.new_state == "idle":
        print("Agent is ready but not processing")
    elif ev.new_state == "listening":
        print("Agent is listening for user input")
    elif ev.new_state == "thinking":
        print("Agent is processing user input and generating a response")
    elif ev.new_state == "speaking":
        print("Agent started speaking")
```

Node.js:

```typescript
import { voice } from '@livekit/agents';

session.on(voice.AgentSessionEventTypes.UserStateChanged, (ev) => {
  if (ev.newState === 'speaking') {
    console.log('User started speaking');
  } else if (ev.newState === 'listening') {
    console.log('User stopped speaking');
  } else if (ev.newState === 'away') {
    console.log('User is not present (e.g. disconnected)');
  }
});

session.on(voice.AgentSessionEventTypes.AgentStateChanged, (ev) => {
  if (ev.newState === 'initializing') {
    console.log('Agent is starting up');
  } else if (ev.newState === 'idle') {
    console.log('Agent is ready but not processing');
  } else if (ev.newState === 'listening') {
    console.log('Agent is listening for user input');
  } else if (ev.newState === 'thinking') {
    console.log('Agent is processing user input and generating a response');
  } else if (ev.newState === 'speaking') {
    console.log('Agent started speaking');
  }
});
```

Additional resources