Turn-taking tuning

Tune turn detection, endpointing, interruption, and preemptive generation for natural, low-latency conversations.

Overview

Turn-taking in voice AI involves several stages of the agent pipeline:

  • User activity detection decides when the user has finished a turn so the agent can reply. Options include turn detection mode, endpointing delays, and endpointing mode.
  • Interruption handling decides when the user can cut the agent off mid-response. Options include enable/disable, detection mode, interruption thresholds, and false-interruption recovery.
  • Preemptive generation lets the LLM (and optionally TTS) start work before the user's turn is fully confirmed. Options include enable/disable, preemptive TTS, max speech duration, and max retries.
  • Audio pre-processing (noise cancellation, automatic gain control) cleans the input before any of these stages run. Options include voice isolation and background noise suppression.
  • Agent speech scheduling controls the cadence of the agent's own utterances. Options include the minimum gap between agent utterances (Python only).

The defaults are reasonable for most apps, but tuning matters when you're chasing low latency, working in noisy environments, or seeing specific symptoms like the agent cutting users off. This page gives a recommended starting config, a full reference of the options that affect each stage, and a troubleshooting table mapping common symptoms to the options that fix them.

For a deeper reference on each parameter, see TurnHandlingOptions.

Configuration

The next two sections cover a recommended starting config and a full options reference.

The following configuration is a starting point for a voice agent that needs to respond quickly in environments with background noise or other speakers. See All options for what each parameter does.

Python:

    from livekit.agents import AgentSession, TurnHandlingOptions, room_io
    from livekit.plugins import ai_coustics, silero
    from livekit.plugins.turn_detector.multilingual import MultilingualModel

    session = AgentSession(
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),
            endpointing={
                "mode": "fixed",
                "min_delay": 0.5,
                "max_delay": 3.0,
            },
            interruption={
                "mode": "adaptive",
                "min_duration": 0.5,
                "min_words": 0,
            },
            # preemptive_generation is enabled by default. Opt into preemptive TTS
            # for lower latency at the cost of wasted compute on cancellations.
            preemptive_generation={
                "preemptive_tts": False,
            },
        ),
        vad=silero.VAD.load(),
        # ... stt, tts, llm, etc.
    )

    await session.start(
        # ...,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=ai_coustics.audio_enhancement(
                    model=ai_coustics.EnhancerModel.QUAIL_VF_L,
                ),
            ),
        ),
    )

Node.js:

    import { voice } from '@livekit/agents';
    import * as livekit from '@livekit/agents-plugin-livekit';
    import * as silero from '@livekit/agents-plugin-silero';
    import * as aiCoustics from '@livekit/plugins-ai-coustics';

    const session = new voice.AgentSession({
      vad: await silero.VAD.load(),
      turnHandling: {
        turnDetection: new livekit.turnDetector.MultilingualModel(),
        endpointing: {
          minDelay: 500,
          maxDelay: 3000,
        },
        interruption: {
          mode: 'adaptive',
          minDuration: 500,
          minWords: 0,
        },
        // preemptiveGeneration is enabled by default. Opt into preemptive TTS
        // for lower latency at the cost of wasted compute on cancellations.
        preemptiveGeneration: {
          preemptiveTts: false,
        },
      },
      // ... stt, tts, llm, etc.
    });

    await session.start({
      // ...,
      inputOptions: {
        noiseCancellation: aiCoustics.audioEnhancement({ model: 'quailVfL' }),
      },
    });

For quieter environments, drop the noise cancellation argument from session.start(). The rest of the config still applies.

For SIP participants, swap voice isolation for the telephony-tuned Krisp model: noise_cancellation.BVCTelephony() (Python) or TelephonyBackgroundVoiceCancellation() (Node.js). For multi-speaker rooms, use background noise suppression instead of voice isolation.

All options

The following table lists the options that affect turn-taking, grouped by pipeline stage.

Python:

| Option | Stage | What it controls | Default |
| --- | --- | --- | --- |
| turn_detection mode | User activity detection | How the session decides the user is done speaking. Options: turn detector model, VAD, STT endpointing, realtime LLM, manual. | Auto-selected |
| endpointing.min_delay | User activity detection | Minimum time after detected silence before the turn closes. In VAD mode this is max(VAD silence, min_delay). In STT mode it adds to the provider's endpoint signal. | 0.5 seconds |
| endpointing.max_delay | User activity detection | Maximum time the agent waits before forcing the turn closed. | 3.0 seconds |
| endpointing.mode | User activity detection | "fixed" always uses the configured delays. "dynamic" adapts within the range based on session pause statistics. | "fixed" |
| interruption.enabled | Interruption handling | Master on/off toggle for interruptions. Set to False to make the agent uninterruptible. | True |
| interruption.mode | Interruption handling | "adaptive" (recommended) uses an audio model to distinguish real interruptions from backchannel acknowledgments. "vad" triggers on any detected speech. | "adaptive" if available, otherwise "vad" |
| interruption.min_duration | Interruption handling | Minimum speech duration to register as an interruption. | 0.5 seconds |
| interruption.min_words | Interruption handling | Minimum word count to register as an interruption. Requires STT. | 0 |
| interruption.false_interruption_timeout | Interruption handling | Silence window after a detected interruption before it's classified as false. After this elapses with no transcript, the agent can resume (see resume_false_interruption). | 2.0 seconds |
| interruption.resume_false_interruption | Interruption handling | Whether to resume the interrupted speech after the false-interruption timeout passes. | True |
| preemptive_generation.enabled | Preemptive generation | Whether to start LLM generation as soon as a final transcript arrives, before the turn is confirmed. | True |
| preemptive_generation.preemptive_tts | Preemptive generation | Also start TTS preemptively. Cuts more latency at the cost of wasted compute on cancellations. | False |
| preemptive_generation.max_speech_duration | Preemptive generation | Skip preemptive generation for utterances longer than this. Long turns are more likely to mutate. | 10.0 seconds |
| preemptive_generation.max_retries | Preemptive generation | Cap on preemptive attempts per turn. Resets when the turn completes. | 3 |
| Voice isolation | Audio pre-processing | Suppresses competing voices in the input so STT, VAD, and the turn detector see clean audio. Models include ai-coustics QUAIL_VF_L, Krisp BVC, and Krisp BVCTelephony. | Off |
| Background noise suppression | Audio pre-processing | Suppresses non-speech noise. Use when the main challenge is environmental noise rather than competing speakers. | Off |
| min_consecutive_speech_delay | Agent speech scheduling | Minimum gap between consecutive agent utterances. Does not affect user-side turn detection. | 0.0 seconds |
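
The way endpointing.min_delay combines with the other end-of-speech signal depends on the turn detection mode. The sketch below restates the table's description as arithmetic. It is an illustration only, not the library's implementation: the function name and the stt_endpoint_delay parameter (how long after speech ends the STT provider emits its endpoint signal) are hypothetical.

```python
def effective_min_delay(
    mode: str,
    min_delay: float,
    vad_silence: float = 0.0,
    stt_endpoint_delay: float = 0.0,
) -> float:
    """Approximate seconds of silence before a turn can close.

    Hypothetical helper that mirrors the table's wording:
    - "vad": the larger of the VAD silence window and min_delay.
    - "stt": min_delay is added on top of the provider's endpoint signal.
    """
    if mode == "vad":
        return max(vad_silence, min_delay)
    if mode == "stt":
        return stt_endpoint_delay + min_delay
    raise ValueError(f"unsupported mode for this sketch: {mode!r}")


# In VAD mode a long VAD silence window dominates a short min_delay.
print(effective_min_delay("vad", min_delay=0.5, vad_silence=0.75))
# In STT mode the two delays stack.
print(effective_min_delay("stt", min_delay=0.5, stt_endpoint_delay=0.25))
```

One practical consequence, under these assumptions: lowering min_delay in VAD mode has no effect once it drops below the VAD's own silence window, whereas in STT mode every reduction shortens the wait.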

Node.js:

| Option | Stage | What it controls | Default |
| --- | --- | --- | --- |
| turnDetection mode | User activity detection | How the session decides the user is done speaking. Options: turn detector model, VAD, STT endpointing, realtime LLM, manual. | Auto-selected |
| endpointing.minDelay | User activity detection | Minimum time after detected silence before the turn closes. In VAD mode this is max(VAD silence, minDelay). In STT mode it adds to the provider's endpoint signal. | 500 ms |
| endpointing.maxDelay | User activity detection | Maximum time the agent waits before forcing the turn closed. | 3000 ms |
| interruption.enabled | Interruption handling | Master on/off toggle for interruptions. Set to false to make the agent uninterruptible. | true |
| interruption.mode | Interruption handling | "adaptive" (recommended) uses an audio model to distinguish real interruptions from backchannel acknowledgments. "vad" triggers on any detected speech. | "adaptive" if available, otherwise "vad" |
| interruption.minDuration | Interruption handling | Minimum speech duration to register as an interruption. | 500 ms |
| interruption.minWords | Interruption handling | Minimum word count to register as an interruption. Requires STT. | 0 |
| interruption.falseInterruptionTimeout | Interruption handling | Silence window after a detected interruption before it's classified as false. After this elapses with no transcript, the agent can resume (see resumeFalseInterruption). | 2000 ms |
| interruption.resumeFalseInterruption | Interruption handling | Whether to resume the interrupted speech after the false-interruption timeout passes. | true |
| preemptiveGeneration.enabled | Preemptive generation | Whether to start LLM generation as soon as a final transcript arrives, before the turn is confirmed. | true |
| preemptiveGeneration.preemptiveTts | Preemptive generation | Also start TTS preemptively. Cuts more latency at the cost of wasted compute on cancellations. | false |
| preemptiveGeneration.maxSpeechDuration | Preemptive generation | Skip preemptive generation for utterances longer than this. Long turns are more likely to mutate. | 10000 ms |
| preemptiveGeneration.maxRetries | Preemptive generation | Cap on preemptive attempts per turn. Resets when the turn completes. | 3 |
| Voice isolation | Audio pre-processing | Suppresses competing voices in the input so STT, VAD, and the turn detector see clean audio. Models include ai-coustics QUAIL_VF_L, Krisp BVC, and Krisp BVCTelephony. | Off |
| Background noise suppression | Audio pre-processing | Suppresses non-speech noise. Use when the main challenge is environmental noise rather than competing speakers. | Off |
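
The two preemptive generation limits (max speech duration and max retries) interact per turn. The following is a rough mental model in Python, using the Python defaults in seconds. It is not the library's code: the PreemptiveGate class and its method names are hypothetical, and the sketch assumes the retry cap counts every preemptive attempt in a turn and that "longer than" is a strict comparison.

```python
from dataclasses import dataclass


@dataclass
class PreemptiveGate:
    """Hypothetical model of when a preemptive attempt is allowed."""

    max_speech_duration: float = 10.0  # seconds, default from the table
    max_retries: int = 3               # attempts per turn, default from the table
    attempts: int = 0                  # attempts used in the current turn

    def should_attempt(self, speech_duration: float) -> bool:
        # Long utterances skip preemptive generation entirely.
        if speech_duration > self.max_speech_duration:
            return False
        # Stop once the per-turn cap is reached.
        if self.attempts >= self.max_retries:
            return False
        self.attempts += 1
        return True

    def on_turn_completed(self) -> None:
        # The cap resets when the turn completes.
        self.attempts = 0


gate = PreemptiveGate()
# Four final transcripts arrive within one short turn; only three trigger work.
print([gate.should_attempt(2.0) for _ in range(4)])
gate.on_turn_completed()
print(gate.should_attempt(12.0))  # too long, skipped
```

Seen this way, lowering max_retries bounds wasted LLM calls when transcripts keep changing, while lowering max_speech_duration removes preemptive work from the turns most likely to be revised.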

Troubleshooting

The following table maps common turn-taking complaints to the options that affect them.

| Symptom | Likely options |
| --- | --- |
| Agent cuts users off mid-thought. | Switch turn_detection to the turn detector model. Raise endpointing.min_delay. Switch interruption.mode to "adaptive" if it isn't already. Add voice isolation if cross-talk or noise is causing false speech detection. |
| Agent is interrupted by short acknowledgments ("uh-huh," "okay"). | Switch interruption.mode to "adaptive". Raise interruption.min_words (requires STT) or interruption.min_duration. Confirm false_interruption_timeout and resume_false_interruption are at their defaults so the agent resumes after silent false positives. |
| Agent feels too slow to respond. | Confirm preemptive_generation is enabled (it is by default). Consider preemptive_tts: true to start TTS early. Lower endpointing.min_delay. In Python, switch endpointing.mode to "dynamic" to adapt to actual pause patterns. |
| Agent reads a partial transcript and replies based on incomplete input. | The preemptive response should be canceled when the final transcript changes. Confirm by checking that you aren't returning early from on_user_turn_completed. Lower preemptive_generation.max_speech_duration so long utterances skip preemptive responses entirely. Lower max_retries to avoid repeated retries on jittery transcripts. |
| Audio quality is fine but turn detection still misfires in noisy rooms. | Add voice isolation for single-speaker scenarios or background noise suppression for multi-speaker. Both run before VAD and STT, so they improve every downstream turn-taking signal. |
| Agent runs back-to-back utterances together with no breath (for example, a say() followed by a tool-driven generate_reply()). | Set min_consecutive_speech_delay to a small value like 0.2 to 0.4 seconds (Python only). |
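
One way to keep tuning changes reviewable is to express each fix as a small override on top of the starting config. The sketch below uses plain dictionaries with the option keys from this page. The apply_overrides helper, the symptom labels, and the specific override values (0.8, 0.3, 2) are illustrative assumptions, not recommendations from the library.

```python
from copy import deepcopy

# Starting values taken from the recommended config on this page.
STARTING_CONFIG = {
    "endpointing": {"mode": "fixed", "min_delay": 0.5, "max_delay": 3.0},
    "interruption": {"mode": "adaptive", "min_duration": 0.5, "min_words": 0},
    "preemptive_generation": {"preemptive_tts": False},
}

# Hypothetical symptom labels mapped to example overrides (values are illustrative).
SYMPTOM_OVERRIDES = {
    "cuts_users_off": {"endpointing": {"min_delay": 0.8}},
    "interrupted_by_backchannel": {"interruption": {"mode": "adaptive", "min_words": 2}},
    "too_slow": {
        "endpointing": {"min_delay": 0.3},
        "preemptive_generation": {"preemptive_tts": True},
    },
}


def apply_overrides(config: dict, symptom: str) -> dict:
    """Return a copy of config with the overrides for one symptom applied."""
    tuned = deepcopy(config)  # leave the baseline untouched for comparison
    for section, values in SYMPTOM_OVERRIDES[symptom].items():
        tuned.setdefault(section, {}).update(values)
    return tuned


tuned = apply_overrides(STARTING_CONFIG, "too_slow")
print(tuned["endpointing"])
print(tuned["preemptive_generation"])
```

Each section of the result has the same shape as the dictionaries passed to TurnHandlingOptions in the starting config, so a tuned variant can be compared against the baseline one symptom at a time.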

If you're tuning by feel, use agent observability to confirm changes actually move the metrics you care about. Preemptive generation in particular doesn't always reduce latency, and the metrics tell you whether your changes are pulling their weight.

Additional resources