Overview
Turn-taking in voice AI involves several stages of the agent pipeline:
- User activity detection decides when the user has finished a turn so the agent can reply. Options include turn detection mode, endpointing delays, and endpointing mode.
- Interruption handling decides when the user can cut the agent off mid-response. Options include enable/disable, detection mode, interruption thresholds, and false-interruption recovery.
- Preemptive generation lets the LLM (and optionally TTS) start work before the user's turn is fully confirmed. Options include enable/disable, preemptive TTS, max speech duration, and max retries.
- Audio pre-processing (noise cancellation, automatic gain control) cleans the input before any of these stages run. Options include voice isolation and background noise suppression.
- Agent speech scheduling controls the cadence of the agent's own utterances. Options include the minimum gap between agent utterances (Python only).
The defaults are reasonable for most apps, but tuning matters when you're chasing low latency, working in noisy environments, or seeing specific symptoms like the agent cutting users off. This page gives a recommended starting config, a full reference of the options that affect each stage, and a troubleshooting table mapping common symptoms to the options that fix them.
For a deeper reference on each parameter, see TurnHandlingOptions.
Configuration
The next two sections cover a recommended starting config and a full options reference.
Recommended starting config
A starting point for a voice agent that needs to respond quickly in environments with background noise or other speakers. See All options for what each parameter does.
Python:

    from livekit.agents import AgentSession, TurnHandlingOptions, room_io
    from livekit.plugins import ai_coustics, silero
    from livekit.plugins.turn_detector.multilingual import MultilingualModel

    session = AgentSession(
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),
            endpointing={
                "mode": "fixed",
                "min_delay": 0.5,
                "max_delay": 3.0,
            },
            interruption={
                "mode": "adaptive",
                "min_duration": 0.5,
                "min_words": 0,
            },
            # preemptive_generation is enabled by default. Opt into preemptive TTS
            # for lower latency at the cost of wasted compute on cancellations.
            preemptive_generation={
                "preemptive_tts": False,
            },
        ),
        vad=silero.VAD.load(),
        # ... stt, tts, llm, etc.
    )

    await session.start(
        # ...,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=ai_coustics.audio_enhancement(
                    model=ai_coustics.EnhancerModel.QUAIL_VF_L,
                ),
            ),
        ),
    )
Node.js:

    import { voice } from '@livekit/agents';
    import * as livekit from '@livekit/agents-plugin-livekit';
    import * as silero from '@livekit/agents-plugin-silero';
    import * as aiCoustics from '@livekit/plugins-ai-coustics';

    const session = new voice.AgentSession({
      vad: await silero.VAD.load(),
      turnHandling: {
        turnDetection: new livekit.turnDetector.MultilingualModel(),
        endpointing: {
          minDelay: 500,
          maxDelay: 3000,
        },
        interruption: {
          mode: 'adaptive',
          minDuration: 500,
          minWords: 0,
        },
        // preemptiveGeneration is enabled by default. Opt into preemptive TTS
        // for lower latency at the cost of wasted compute on cancellations.
        preemptiveGeneration: {
          preemptiveTts: false,
        },
      },
      // ... stt, tts, llm, etc.
    });

    await session.start({
      // ...,
      inputOptions: {
        noiseCancellation: aiCoustics.audioEnhancement({ model: 'quailVfL' }),
      },
    });
For quieter environments, drop the noise cancellation argument from session.start(). The rest of the config still applies.
For SIP participants, swap voice isolation for the telephony-tuned Krisp model: noise_cancellation.BVCTelephony() (Python) or TelephonyBackgroundVoiceCancellation() (Node.js). For multi-speaker rooms, use background noise suppression instead of voice isolation.
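The guidance in the last two paragraphs can be summarized as a small decision helper. This is only an illustrative sketch: `Scenario` and `recommend_preprocessing` are hypothetical names, not part of the LiveKit SDK, and the returned strings are descriptions rather than API objects.

```python
from enum import Enum


class Scenario(Enum):
    """Hypothetical deployment scenarios, named for this example only."""
    QUIET = "quiet"
    NOISY_SINGLE_SPEAKER = "noisy_single_speaker"
    SIP = "sip"
    MULTI_SPEAKER = "multi_speaker"


def recommend_preprocessing(scenario: Scenario) -> str:
    """Return the audio pre-processing choice this page suggests."""
    if scenario is Scenario.QUIET:
        # Quiet environments: drop the noise cancellation argument.
        return "none"
    if scenario is Scenario.SIP:
        # Phone callers: use the telephony-tuned Krisp model.
        return "telephony voice isolation (Krisp BVCTelephony)"
    if scenario is Scenario.MULTI_SPEAKER:
        # Several legitimate speakers: do not isolate a single voice.
        return "background noise suppression"
    # Noisy room with one intended speaker.
    return "voice isolation (for example, ai-coustics QUAIL_VF_L)"
```

For example, `recommend_preprocessing(Scenario.SIP)` returns the telephony recommendation, while `Scenario.QUIET` returns `"none"`.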
All options
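The following tables list the options that affect turn-taking, grouped by pipeline stage. The first table uses the Python option names, with durations in seconds. The second uses the Node.js names, with durations in milliseconds.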
| Option | Stage | What it controls | Default |
|---|---|---|---|
| turn_detection mode | User activity detection | How the session decides the user is done speaking. Options: turn detector model, VAD, STT endpointing, realtime LLM, manual. | Auto-selected |
| endpointing.min_delay | User activity detection | Minimum time after detected silence before the turn closes. In VAD mode this is max(VAD silence, min_delay). In STT mode it adds to the provider's endpoint signal. | 0.5 seconds |
| endpointing.max_delay | User activity detection | Maximum time the agent waits before forcing the turn closed. | 3.0 seconds |
| endpointing.mode | User activity detection | "fixed" always uses the configured delays. "dynamic" adapts within the range based on session pause statistics. | "fixed" |
| interruption.enabled | Interruption handling | Master on/off toggle for interruptions. Set to False to make the agent uninterruptible. | True |
| interruption.mode | Interruption handling | "adaptive" (recommended) uses an audio model to distinguish real interruptions from backchannel acknowledgments. "vad" triggers on any detected speech. | "adaptive" if available, otherwise "vad" |
| interruption.min_duration | Interruption handling | Minimum speech duration to register as an interruption. | 0.5 seconds |
| interruption.min_words | Interruption handling | Minimum word count to register as an interruption. Requires STT. | 0 |
| interruption.false_interruption_timeout | Interruption handling | Silence window after a detected interruption before it's classified as false. After this elapses with no transcript, the agent can resume (see resume_false_interruption). | 2.0 seconds |
| interruption.resume_false_interruption | Interruption handling | Whether to resume the interrupted speech after the false-interruption timeout passes. | True |
| preemptive_generation.enabled | Preemptive generation | Whether to start LLM generation as soon as a final transcript arrives, before the turn is confirmed. | True |
| preemptive_generation.preemptive_tts | Preemptive generation | Also start TTS preemptively. Cuts more latency at the cost of wasted compute on cancellations. | False |
| preemptive_generation.max_speech_duration | Preemptive generation | Skip preemptive generation for utterances longer than this. Long turns are more likely to mutate. | 10.0 seconds |
| preemptive_generation.max_retries | Preemptive generation | Cap on preemptive attempts per turn. Resets when the turn completes. | 3 |
| Voice isolation | Audio pre-processing | Suppresses competing voices in the input so STT, VAD, and the turn detector see clean audio. Models include ai-coustics QUAIL_VF_L, Krisp BVC, and Krisp BVCTelephony. | Off |
| Background noise suppression | Audio pre-processing | Suppresses non-speech noise. Use when the main challenge is environmental noise rather than competing speakers. | Off |
| min_consecutive_speech_delay | Agent speech scheduling | Minimum gap between consecutive agent utterances. Does not affect user-side turn detection. | 0.0 seconds |

| Option | Stage | What it controls | Default |
|---|---|---|---|
| turnDetection mode | User activity detection | How the session decides the user is done speaking. Options: turn detector model, VAD, STT endpointing, realtime LLM, manual. | Auto-selected |
| endpointing.minDelay | User activity detection | Minimum time after detected silence before the turn closes. In VAD mode this is max(VAD silence, minDelay). In STT mode it adds to the provider's endpoint signal. | 500 ms |
| endpointing.maxDelay | User activity detection | Maximum time the agent waits before forcing the turn closed. | 3000 ms |
| interruption.enabled | Interruption handling | Master on/off toggle for interruptions. Set to false to make the agent uninterruptible. | true |
| interruption.mode | Interruption handling | "adaptive" (recommended) uses an audio model to distinguish real interruptions from backchannel acknowledgments. "vad" triggers on any detected speech. | "adaptive" if available, otherwise "vad" |
| interruption.minDuration | Interruption handling | Minimum speech duration to register as an interruption. | 500 ms |
| interruption.minWords | Interruption handling | Minimum word count to register as an interruption. Requires STT. | 0 |
| interruption.falseInterruptionTimeout | Interruption handling | Silence window after a detected interruption before it's classified as false. After this elapses with no transcript, the agent can resume (see resumeFalseInterruption). | 2000 ms |
| interruption.resumeFalseInterruption | Interruption handling | Whether to resume the interrupted speech after the false-interruption timeout passes. | true |
| preemptiveGeneration.enabled | Preemptive generation | Whether to start LLM generation as soon as a final transcript arrives, before the turn is confirmed. | true |
| preemptiveGeneration.preemptiveTts | Preemptive generation | Also start TTS preemptively. Cuts more latency at the cost of wasted compute on cancellations. | false |
| preemptiveGeneration.maxSpeechDuration | Preemptive generation | Skip preemptive generation for utterances longer than this. Long turns are more likely to mutate. | 10000 ms |
| preemptiveGeneration.maxRetries | Preemptive generation | Cap on preemptive attempts per turn. Resets when the turn completes. | 3 |
| Voice isolation | Audio pre-processing | Suppresses competing voices in the input so STT, VAD, and the turn detector see clean audio. Models include ai-coustics QUAIL_VF_L, Krisp BVC, and Krisp BVCTelephony. | Off |
| Background noise suppression | Audio pre-processing | Suppresses non-speech noise. Use when the main challenge is environmental noise rather than competing speakers. | Off |
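To make the endpointing and preemptive generation rows more concrete, the sketch below restates their rules as plain Python functions. It is a toy model of the behavior described in the tables, not LiveKit's implementation. The names `vad_mode_delay`, `stt_mode_delay`, `PreemptiveLimits`, and `should_preempt` are hypothetical, and the model ignores `max_delay` and the turn detector model.

```python
from dataclasses import dataclass


def vad_mode_delay(vad_silence: float, min_delay: float) -> float:
    """VAD mode: the turn closes after max(VAD silence, min_delay) seconds."""
    return max(vad_silence, min_delay)


def stt_mode_delay(provider_endpoint_delay: float, min_delay: float) -> float:
    """STT mode: min_delay is added on top of the provider's endpoint signal."""
    return provider_endpoint_delay + min_delay


@dataclass
class PreemptiveLimits:
    """Defaults copied from the Python table above (seconds)."""
    enabled: bool = True
    max_speech_duration: float = 10.0
    max_retries: int = 3


def should_preempt(
    limits: PreemptiveLimits,
    speech_duration: float,
    attempts_this_turn: int,
) -> bool:
    """Decide whether another preemptive attempt is allowed in this turn."""
    if not limits.enabled:
        return False
    # Utterances longer than the limit skip preemptive generation.
    if speech_duration > limits.max_speech_duration:
        return False
    # The retry cap applies per turn; the caller resets the counter
    # when the turn completes.
    return attempts_this_turn < limits.max_retries
```

Under this model, raising `min_delay` in VAD mode changes nothing until it exceeds the VAD's own silence window, whereas in STT mode every increase is added directly to the response time.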
Troubleshooting
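The following table maps common turn-taking complaints to the options that affect them. Option names are shown in their Python form. In Node.js, use the camelCase equivalents (for example, endpointing.minDelay).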
| Symptom | Likely options |
|---|---|
| Agent cuts users off mid-thought. | Switch turn_detection to the turn detector model. Raise endpointing.min_delay. Switch interruption.mode to "adaptive" if it isn't already. Add voice isolation if cross-talk or noise is causing false speech detection. |
| Agent is interrupted by short acknowledgments ("uh-huh," "okay"). | Switch interruption.mode to "adaptive". Raise interruption.min_words (requires STT) or interruption.min_duration. Confirm false_interruption_timeout and resume_false_interruption are at their defaults so the agent resumes after silent false positives. |
| Agent feels too slow to respond. | Confirm preemptive_generation is enabled (it is by default). Consider enabling preemptive_tts to start TTS early. Lower endpointing.min_delay. In Python, switch endpointing.mode to "dynamic" to adapt to actual pause patterns. |
| Agent reads a partial transcript and replies based on incomplete input. | The preemptive response should be canceled when the final transcript changes. Confirm by checking that you aren't returning early from on_user_turn_completed. Lower preemptive_generation.max_speech_duration so long utterances skip preemptive responses entirely. Lower max_retries to avoid repeated retries on jittery transcripts. |
| Audio quality is fine but turn detection still misfires in noisy rooms. | Add voice isolation for single-speaker scenarios or background noise suppression for multi-speaker. Both run before VAD and STT, so they improve every downstream turn-taking signal. |
| Agent runs back-to-back utterances together with no breath (for example, a say() followed by a tool-driven generate_reply()). | Set min_consecutive_speech_delay to a small value like 0.2–0.4 seconds (Python only). |
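For the short-acknowledgment symptom, it can help to see how the interruption thresholds interact. The following is a simplified sketch, not the SDK's logic: `is_interruption` and `should_resume` are hypothetical helpers, and combining the duration and word-count thresholds with a logical AND is an assumption made for illustration.

```python
def is_interruption(
    speech_duration: float,
    word_count: int,
    *,
    enabled: bool = True,
    min_duration: float = 0.5,
    min_words: int = 0,
) -> bool:
    """Toy gate: does detected user speech count as an interruption?"""
    if not enabled:
        # interruption.enabled = False makes the agent uninterruptible.
        return False
    # Assumption: both thresholds must be met.
    return speech_duration >= min_duration and word_count >= min_words


def should_resume(
    silence_after_interruption: float,
    *,
    transcript_received: bool,
    false_interruption_timeout: float = 2.0,
    resume_false_interruption: bool = True,
) -> bool:
    """Toy rule: resume the agent's speech after a false interruption."""
    if not resume_false_interruption or transcript_received:
        return False
    return silence_after_interruption >= false_interruption_timeout
```

With the defaults, a 0.3-second "uh-huh" does not pass the gate because it is shorter than `min_duration`. A 0.8-second "okay" does pass unless `min_words` is raised to 2 or more.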
If you're tuning by feel, use agent observability to confirm changes actually move the metrics you care about. Preemptive generation in particular doesn't always reduce latency, and the metrics tell you whether your changes are pulling their weight.
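One way to check whether a change helps is to compare latency distributions before and after it. The sketch below assumes you can export per-turn response latencies in milliseconds from your observability tooling. The `summarize` and `compare` helpers are hypothetical, and the sample lists are placeholder values for illustration, not measurements.

```python
from statistics import median, quantiles


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Return the median (p50) and 95th percentile (p95) of the samples."""
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(latencies_ms, n=20, method="inclusive")[-1]
    return {"p50": median(latencies_ms), "p95": p95}


def compare(before: list[float], after: list[float]) -> dict[str, float]:
    """Return after-minus-before deltas; negative values mean faster."""
    b, a = summarize(before), summarize(after)
    return {key: a[key] - b[key] for key in b}


# Placeholder samples only; substitute values from your own sessions.
baseline = [820, 790, 910, 1040, 860]
tuned = [640, 700, 980, 610, 690]
delta = compare(baseline, tuned)
```

Looking at both p50 and p95 matters here: a setting such as preemptive TTS may lower the typical latency while leaving the slowest turns unchanged.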