Turn detection and interruptions

Guide to managing conversation turns in voice AI.

Overview

Turn detection is the process of determining when a user begins or ends their "turn" in a conversation. This lets the agent know when to start listening and when to respond.

Most turn detection techniques rely on voice activity detection (VAD) to detect periods of silence in user input. The agent applies heuristics to the VAD data to perform phrase endpointing, which determines the end of a sentence or thought. The agent can use endpoints alone or apply more contextual analysis to determine when a turn is complete.
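To make this concrete, here is a minimal, hypothetical endpointing heuristic over VAD output. The VADFrame type is invented for illustration, and production systems layer contextual models on top of logic like this:

from dataclasses import dataclass

@dataclass
class VADFrame:
    # Hypothetical VAD output: a speech/silence flag per audio frame
    is_speech: bool
    duration: float  # seconds

def detect_endpoint(frames: list[VADFrame], silence_threshold: float = 0.5) -> bool:
    """Return True once trailing silence exceeds the threshold (a naive endpoint)."""
    trailing_silence = 0.0
    for frame in reversed(frames):
        if frame.is_speech:
            break
        trailing_silence += frame.duration
    return trailing_silence >= silence_threshold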

Effective turn detection and interruption management is essential to great voice AI experiences.

Turn detection

The AgentSession supports the following turn detection modes. Manual turn control remains available in every mode.

  • Turn detector model: A custom, open-weights model for context-aware turn detection on top of VAD or STT endpoint data.
  • Realtime models: Support for the built-in turn detection or VAD in realtime models like the OpenAI Realtime API.
  • VAD only: Detect end of turn from speech and silence data alone.
  • STT endpointing: Use phrase endpoints returned in realtime STT data from your chosen STT provider in place of VAD.
  • Manual turn control: Disable automatic turn detection entirely.

Turn detector model

To achieve the recommended behavior of an agent that listens while the user speaks and replies after they finish their thought, use the following plugins in an STT-LLM-TTS pipeline:

from livekit.agents import AgentSession
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    turn_detection=MultilingualModel(),  # or EnglishModel()
    vad=silero.VAD.load(),
    # ... stt, tts, llm, etc.
)
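The turn detector plugin runs an open-weights model locally, so its files generally need to be downloaded before the first run. With the LiveKit Agents CLI this is typically done with the download-files command (for example, python my_agent.py download-files, where my_agent.py is a placeholder for your agent entrypoint); see the plugin documentation for details.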

See the Voice AI quickstart for a complete example.

Realtime models

Realtime models such as the OpenAI Realtime API include built-in turn detection options based on VAD and other techniques. For a realtime model, LiveKit recommends using these built-in capabilities: this is the most cost-effective option, since the custom turn detector model requires realtime speech-to-text (STT) that would otherwise need to run separately. Leave the turn_detection parameter unset and configure the realtime model's turn detection options directly.

To use the LiveKit turn detector model with a realtime model, you must also provide an STT plugin. The turn detector model operates on STT output.
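For example, here is a minimal sketch for the OpenAI Realtime API, assuming the livekit-plugins-openai package; the TurnDetection type and its fields come from the OpenAI SDK, and other providers expose different options:

from livekit.agents import AgentSession
from livekit.plugins import openai
from openai.types.beta.realtime.session import TurnDetection

session = AgentSession(
    llm=openai.realtime.RealtimeModel(
        # Turn detection is configured on the model itself;
        # AgentSession's turn_detection parameter stays unset.
        turn_detection=TurnDetection(
            type="server_vad",
            threshold=0.5,            # speech probability threshold
            silence_duration_ms=500,  # trailing silence that ends the turn
        ),
    ),
)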

VAD only

In some cases, VAD is the best option for turn detection. For example, VAD works with any spoken language. To use VAD alone, use the Silero VAD plugin and set turn_detection="vad".

from livekit.agents import AgentSession
from livekit.plugins import silero

session = AgentSession(
    turn_detection="vad",
    vad=silero.VAD.load(),
    # ... stt, tts, llm, etc.
)

STT endpointing

You can also rely on your STT model for turn detection, since STT models process audio and perform phrase endpointing to construct speech fragments. In this mode, the AgentSession treats each final STT transcript as a turn boundary.

Note that STT endpointing is less responsive to interruptions than VAD.

from livekit.agents import AgentSession
from livekit.plugins import deepgram

session = AgentSession(
    turn_detection="stt",
    stt=deepgram.STT(),
    # ... tts, llm, etc.
)

Manual turn control

Disable automatic turn detection entirely by setting turn_detection="manual" in the AgentSession constructor.

You can then control the user's turn with the session.interrupt(), session.clear_user_turn(), and session.commit_user_turn() methods.

For instance, you can use this to implement a push-to-talk interface. Here is a simple example using RPC methods that the frontend can call:

session = AgentSession(
    turn_detection="manual",
    # ... stt, tts, llm, etc.
)

# Disable audio input at the start
session.input.set_audio_enabled(False)

# When the user starts speaking
@ctx.room.local_participant.register_rpc_method("start_turn")
async def start_turn(data: rtc.RpcInvocationData):
    session.interrupt()  # Stop any current agent speech
    session.clear_user_turn()  # Clear any previous input
    session.input.set_audio_enabled(True)  # Start listening

# When the user finishes speaking
@ctx.room.local_participant.register_rpc_method("end_turn")
async def end_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)  # Stop listening
    session.commit_user_turn()  # Process the input and generate a response

# When the user cancels their turn
@ctx.room.local_participant.register_rpc_method("cancel_turn")
async def cancel_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)  # Stop listening
    session.clear_user_turn()  # Discard the input

A more complete example is available here:

Push-to-Talk Agent

A voice AI agent that uses push-to-talk for controlled multi-participant conversations, only enabling audio input when explicitly triggered.

Reducing background noise

Enhanced noise cancellation is available in LiveKit Cloud and improves the quality of turn detection and speech-to-text (STT) for voice AI apps. You can add background voice and noise cancellation to your agent by including it in the room_input_options when you start your agent session. To learn how to enable it, see the Voice AI quickstart.
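As a rough sketch, assuming the livekit-plugins-noise-cancellation package on LiveKit Cloud (BVC is background voice cancellation), enabling it looks like this:

from livekit.agents import AgentSession, RoomInputOptions
from livekit.plugins import noise_cancellation

await session.start(
    room=ctx.room,
    agent=agent,  # your Agent instance
    room_input_options=RoomInputOptions(
        noise_cancellation=noise_cancellation.BVC(),
    ),
)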

Interruptions

The user can interrupt the agent at any time, either by speaking (when automatic turn detection is enabled) or explicitly via the session.interrupt() method. When an interruption occurs, the agent stops speaking and automatically truncates its conversation history to reflect only the speech the user actually heard before the interruption.
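Interruptions can also be disabled for a single utterance; assuming the say() method's allow_interruptions override (available in recent livekit-agents releases), a disclaimer could be played uninterrupted like this:

# Hedged sketch: play one utterance the user cannot interrupt
await session.say(
    "This call may be recorded for quality purposes.",
    allow_interruptions=False,
)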

Session configuration

The following parameters related to turn detection and interruptions are available on the AgentSession constructor:

allow_interruptions (bool, optional, default: True)

Whether to allow the user to interrupt the agent mid-turn. Ignored when using a realtime model with built-in turn detection.

min_interruption_duration (float, optional, default: 0.5)

Minimum detected speech duration, in seconds, before triggering an interruption.

min_endpointing_delay (float, optional, default: 0.5)

The number of seconds to wait before considering the turn complete. The session uses this delay when no turn detector model is present, or when the model indicates a likely turn boundary.

max_endpointing_delay (float, optional, default: 6.0)

The maximum time, in seconds, to wait for the user to speak after the turn detector model indicates the user is likely to continue speaking. This parameter has no effect without the turn detector model.
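Putting these together, a session tuned with these parameters might look like the following sketch (the values shown are the defaults):

session = AgentSession(
    allow_interruptions=True,
    min_interruption_duration=0.5,  # seconds of user speech to count as an interruption
    min_endpointing_delay=0.5,      # seconds to wait before ending the turn
    max_endpointing_delay=6.0,      # upper bound when the model expects more speech
    # ... stt, tts, llm, vad, turn_detection, etc.
)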

Turn-taking events

The AgentSession exposes user and agent state events to monitor the flow of a conversation:

from livekit.agents import UserStateChangedEvent, AgentStateChangedEvent

@session.on("user_state_changed")
def on_user_state_changed(ev: UserStateChangedEvent):
    if ev.new_state == "speaking":
        print("User started speaking")
    elif ev.new_state == "listening":
        print("User stopped speaking")
    elif ev.new_state == "away":
        print("User is not present (e.g. disconnected)")

@session.on("agent_state_changed")
def on_agent_state_changed(ev: AgentStateChangedEvent):
    if ev.new_state == "initializing":
        print("Agent is starting up")
    elif ev.new_state == "idle":
        print("Agent is ready but not processing")
    elif ev.new_state == "listening":
        print("Agent is listening for user input")
    elif ev.new_state == "thinking":
        print("Agent is processing user input and generating a response")
    elif ev.new_state == "speaking":
        print("Agent started speaking")

Further reading