LiveKit turn detector

Audio-based end-of-turn detection for voice AI.

Overview

The LiveKit turn detector improves end-of-turn detection in voice AI apps by adding signals on top of voice activity detection (VAD).

Traditional VAD models are effective at determining the presence or absence of speech, but without understanding the meaning of speech they can provide a poor user experience. For instance, a user might say "I need to think about that for a moment" and then take a long pause. The user has more to say but a VAD-only system interrupts them anyway. A turn detector model can predict that they have more to say and wait for them to finish before responding.

Audio turn detector

LiveKit's TurnDetector is an audio model that encodes user audio directly, capturing both what is said and how it's said. By combining semantic understanding with acoustic cues like intonation, pitch, and rhythm, it reaches state-of-the-art end-of-turn accuracy without relying on a transcript.

The following capture shows two sessions running side by side on the same audio. The text model is tricked by the mid-turn pauses and commits the turn early, while the audio model waits for the true end of turn:

[Interactive capture, 7.31 s long: audio waveform and transcript tracks comparing Turn Detector 1.0 with Turn Detector 0.4.1.]

The detector comes in two versions:

  • v1: the full model, served on LiveKit Inference. Highest accuracy. Available at no cost to agents deployed to LiveKit Cloud.
  • v1-mini: a lightweight version that runs locally on CPU and is free to use in any context. Recommended for agents not deployed to LiveKit Cloud.

If the full model is unavailable, the session automatically falls back to v1-mini for the rest of the session. See Fallback to the mini model.

Installation

The audio turn detector is built into the Agents SDK: livekit-agents 1.6.1 or later for Python, and @livekit/agents 1.4.7 or later for Node.js. No separate plugin or extra is required.

Usage

Initialize your AgentSession with TurnDetector. AgentSession provides the required VAD automatically, and the model is selected based on your environment (see Default model selection).

Python:
    from livekit.agents import AgentSession, TurnHandlingOptions, inference

    session = AgentSession(
        turn_handling=TurnHandlingOptions(
            turn_detection=inference.TurnDetector(),
        ),
        # ... stt, tts, llm, etc.
    )

Node.js:
    import { inference, voice } from '@livekit/agents';

    const session = new voice.AgentSession({
      turnHandling: {
        turnDetection: new inference.TurnDetector(),
      },
      // ... stt, tts, llm, etc.
    });

To pin a specific version instead of relying on auto-selection, pass version:

Python:
    turn_detection=inference.TurnDetector(version="v1-mini")

Node.js:
    turnDetection: new inference.TurnDetector({ version: 'v1-mini' }),

See the Voice AI quickstart for a complete example.

Note

Some STT models, such as Deepgram Flux, include built-in end-of-turn detection. No special configuration is needed to use them alongside the audio turn detector: the configured turn detector takes precedence, and the session uses end-of-turn signals from the STT only when you set turn_detection="stt".

Parameters

The following parameters are available on the TurnDetector constructor:

version: Literal['v1', 'v1-mini']

Selects the model version. v1 is the full model served on LiveKit Inference while v1-mini runs locally on CPU. When omitted, the version is selected automatically based on your environment. See Default model selection.

unlikely_threshold: float | dict[str, float]

Override the model's confidence threshold for ending a turn. Accepts a scalar (applied to every language) or a dict keyed by language code. Unmapped languages keep the calibrated default for the active model. See Custom thresholds. In Node.js, this parameter is called unlikelyThreshold.

Supported languages

The audio turn detector supports 14 languages: English, Arabic, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Dutch, Portuguese, Turkish, and Chinese.

When STT is enabled, the detector uses the language it reports to apply the right per-language threshold. To force a specific language, configure the STT model with that language. The language parameter accepts any format supported by LanguageCode. For example, to set Spanish:

Python:
    session = AgentSession(
        turn_handling=TurnHandlingOptions(
            turn_detection=inference.TurnDetector(),
        ),
        stt=inference.STT(language="es"),
        # ... tts, llm, etc.
    )

Node.js:
    const session = new voice.AgentSession({
      turnHandling: {
        turnDetection: new inference.TurnDetector(),
      },
      stt: new inference.STT({ language: 'es' }),
      // ... tts, llm, etc.
    });

When no STT is configured (for example, with a realtime model), the detector defaults to English thresholds.

Custom thresholds

Each language has a calibrated unlikely_threshold that determines how confident the model must be before considering the user's turn complete. Lower values make the detector more eager to respond while higher values make it more patient.

Override the threshold globally with a scalar:

Python:
    inference.TurnDetector(unlikely_threshold=0.5)

Node.js:
    new inference.TurnDetector({ unlikelyThreshold: 0.5 });

Or override per language (unmapped languages keep the default):

Python:
    inference.TurnDetector(
        unlikely_threshold={
            "en": 0.5,
            "ja": 0.6,
        }
    )

Node.js:
    new inference.TurnDetector({
      unlikelyThreshold: {
        en: 0.5,
        ja: 0.6,
      },
    });
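The override rules can be sketched in plain Python. This is a minimal illustration, not the SDK's implementation: `CALIBRATED_DEFAULTS` and its values, `resolve_threshold`, and `turn_is_complete` are hypothetical names and numbers invented for the example. Falling back to the English default for an unlisted language and comparing the probability with `>=` are also assumptions.

```python
from collections.abc import Mapping
from typing import Optional, Union

# Hypothetical calibrated defaults. The real per-language values are not
# listed on this page and differ between v1 and v1-mini.
CALIBRATED_DEFAULTS = {"en": 0.40, "es": 0.45, "ja": 0.55}

Override = Union[float, Mapping[str, float], None]


def resolve_threshold(language: Optional[str], override: Override = None) -> float:
    """Return the threshold to apply for the language reported by STT."""
    # With no STT configured there is no reported language: use English.
    lang = language or "en"
    # Assumption: languages missing from the table use the English default.
    default = CALIBRATED_DEFAULTS.get(lang, CALIBRATED_DEFAULTS["en"])
    if override is None:
        return default
    if isinstance(override, Mapping):
        # Per-language override: unmapped languages keep the calibrated default.
        return override.get(lang, default)
    # Scalar override: applied to every language.
    return float(override)


def turn_is_complete(probability: float, threshold: float) -> bool:
    """Assumed decision rule: a lower threshold ends turns more eagerly."""
    return probability >= threshold
```

With the per-language override from the example above, `resolve_threshold("ja", {"en": 0.5, "ja": 0.6})` uses the Japanese override, while `resolve_threshold("es", {"en": 0.5, "ja": 0.6})` keeps the (hypothetical) Spanish default.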

The two models ship with different calibrated defaults. When the session falls back from v1 to v1-mini mid-session, your override is rescaled to preserve its relationship to the calibrated defaults of the active model.
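This page doesn't give the rescaling formula. As one possible reading of "preserve its relationship to the calibrated defaults", the sketch below keeps the override at the same relative position on either side of the default. The function name, the piecewise-linear rule, and the default values in the usage note are all assumptions made for illustration, not the SDK's behavior.

```python
def rescale_threshold(override: float, old_default: float, new_default: float) -> float:
    """Map an override from one model's scale to another's.

    Assumed rule: an override sitting a given fraction of the way between 0
    and the old default (or between the old default and 1) lands at the same
    fraction relative to the new default.
    """
    if not 0.0 < old_default < 1.0 or not 0.0 < new_default < 1.0:
        raise ValueError("defaults must lie strictly between 0 and 1")
    if not 0.0 <= override <= 1.0:
        raise ValueError("override must lie between 0 and 1")
    if override <= old_default:
        # Same fraction of the way from 0 up to the default.
        return override / old_default * new_default
    # Same fraction of the way from the default up to 1.
    fraction = (override - old_default) / (1.0 - old_default)
    return new_default + fraction * (1.0 - new_default)
```

Under this rule, an override equal to the old default maps exactly onto the new default. For example, with hypothetical defaults of 0.5 (full model) and 0.25 (mini model), an override of 0.75 sits halfway between the old default and 1, so it maps to 0.625.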

Realtime model usage

Because the audio turn detector doesn't depend on a transcript, you can use it with a realtime model without adding an STT plugin. You still need to disable the realtime model's built-in turn detection so the two systems don't conflict.

Python:
    from livekit.agents import AgentSession, TurnHandlingOptions, inference
    from livekit.plugins import openai

    session = AgentSession(
        turn_handling=TurnHandlingOptions(
            turn_detection=inference.TurnDetector(),
        ),
        llm=openai.realtime.RealtimeModel(
            voice="alloy",
            # Disable the model's built-in turn detection to use
            # the LiveKit audio turn detector instead
            turn_detection=None,
        ),
    )

Node.js:
    import { inference, voice } from '@livekit/agents';
    import * as openai from '@livekit/agents-plugin-openai';

    const session = new voice.AgentSession({
      turnHandling: {
        turnDetection: new inference.TurnDetector(),
      },
      llm: new openai.realtime.RealtimeModel({
        voice: 'alloy',
        // Disable the model's built-in turn detection to use
        // the LiveKit audio turn detector instead
        turnDetection: null,
      }),
    });

Default model selection

When you don't set the version parameter, the detector picks a version based on your environment:

Environment | Default version
--- | ---
Agent deployed to LiveKit Cloud | v1 (full model)
Local development (dev mode) with LiveKit Cloud credentials | v1 (full model)
Agent deployed to another environment (start command) | v1-mini (runs locally)

Local development includes a free monthly allowance of the full model on every plan. When the allowance is exhausted, the session falls back to v1-mini automatically. To learn more, see Quotas and limits. To pin one version in every environment, set version explicitly.
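The selection rules can be summarized as a small decision function. This is only a sketch of the documented behavior: the `Environment` enum and `select_version` are hypothetical names, and how the SDK actually detects its environment isn't shown here.

```python
from enum import Enum
from typing import Optional


class Environment(Enum):
    """Hypothetical labels for the three documented environments."""

    LIVEKIT_CLOUD = "agent deployed to LiveKit Cloud"
    LOCAL_DEV_WITH_CLOUD_CREDENTIALS = "dev mode with LiveKit Cloud credentials"
    OTHER_DEPLOYMENT = "agent deployed elsewhere (start command)"


def select_version(environment: Environment, pinned: Optional[str] = None) -> str:
    """Return the model version a session would start with."""
    if pinned is not None:
        # An explicit version wins in every environment.
        if pinned not in ("v1", "v1-mini"):
            raise ValueError(f"unknown version: {pinned!r}")
        return pinned
    if environment is Environment.OTHER_DEPLOYMENT:
        # Outside LiveKit Cloud the lightweight model runs locally on CPU.
        return "v1-mini"
    # LiveKit Cloud deployments and local development default to the full model.
    return "v1"
```

The function only covers the starting version. A session that starts on `v1` can still fall back to `v1-mini` later, for example when the local development allowance runs out.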

If you don't pass turn_detection to AgentSession at all, the session uses a TurnDetector by default, unless your LLM is a realtime model that provides its own server-side turn detection.

When v1-mini runs (in production outside LiveKit Cloud, or after a fallback), it executes in a shared CPU process. Use compute-optimized instances (such as AWS c6i or c7i) rather than burstable instances (such as AWS t3 or t4g) to avoid inference timeouts from CPU credit limits.

Fallback to the mini model

When the full v1 model is active, the detector monitors for connection failures and prediction timeouts. A connection failure includes the case where the agent can't reach LiveKit Inference at all (for example, the free allowance is exhausted). If either occurs, the session does the following:

  1. Logs a warning (once per session).
  2. Emits a default probability (1.0) for any in-flight prediction so the current turn isn't blocked.
  3. Swaps to v1-mini for the rest of the session.
  4. Rescales any custom unlikely_threshold to preserve its relationship to the mini model's calibrated defaults.

The fallback is sticky for the lifetime of the session. The next session starts fresh and attempts the full model again.
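The first three steps can be pictured as a small per-session state object. This is an illustrative sketch, not SDK code: `FallbackState`, `on_full_model_failure`, and the logger name are hypothetical, and the threshold rescaling in step 4 is left out.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("turn_detector_sketch")  # hypothetical logger name

# Probability emitted for an in-flight prediction so the turn isn't blocked.
DEFAULT_PROBABILITY = 1.0


@dataclass
class FallbackState:
    """Tracks which model a single session is using."""

    active_version: str = "v1"
    warned: bool = False

    def on_full_model_failure(self, reason: str) -> float:
        """Handle a connection failure or prediction timeout from the full model."""
        if not self.warned:
            # Step 1: warn once per session.
            logger.warning("full model unavailable (%s); using v1-mini", reason)
            self.warned = True
        # Step 3: the swap is sticky for the rest of this session.
        self.active_version = "v1-mini"
        # Step 2: unblock the current turn with the default probability.
        return DEFAULT_PROBABILITY
```

Because the state lives on the session, creating a new `FallbackState()` for the next session starts on `v1` again, which mirrors the "starts fresh" behavior described above.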

If v1-mini isn't available (for example, the binary failed to load), the detector emits the default probability for each turn and retries on the next turn. The full model is also retried on each new session.

Benchmarks

LiveKit evaluated the audio turn detector against other end-of-turn models using an open source evaluation harness that simulates the live turn-taking decisions a production voice agent makes. The test set consists of natural, task-oriented human-assistant conversations, and audio-based models are scored 200 ms after speech ends.

The following table compares the models. AUC measures how well each model's raw score separates true end-of-turn moments from mid-turn pauses, while precision, recall, and F1 are reported at each model's published default threshold:

Model | AUC | Precision | Recall | F1
--- | --- | --- | --- | ---
LiveKit Turn Detector (v1/audio) | 0.96 | 0.91 | 0.92 | 0.91
Deepgram Flux | 0.92 | 0.91 | 0.74 | 0.82
ultraVAD | 0.88 | 0.76 | 0.95 | 0.84
smart-turn-v3.2 | 0.83 | 0.84 | 0.64 | 0.73
LiveKit Turn Detector (v0.4.1/text) | 0.74 | 0.82 | 0.61 | 0.70

The audio model also achieves the highest AUC in every supported language. For the full methodology, latency-versus-interruption analysis, and multilingual results, see the LiveKit blog.
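As a reading aid, F1 is the harmonic mean of precision and recall, so the F1 column can be recomputed from the other two. The sketch below does that for the rows of the benchmark table. It uses the standard F1 definition and is not part of the evaluation harness.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the benchmark table.
rows = {
    "LiveKit Turn Detector (v1/audio)": (0.91, 0.92, 0.91),
    "Deepgram Flux": (0.91, 0.74, 0.82),
    "ultraVAD": (0.76, 0.95, 0.84),
    "smart-turn-v3.2": (0.84, 0.64, 0.73),
    "LiveKit Turn Detector (v0.4.1/text)": (0.82, 0.61, 0.70),
}

for model, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{model}: computed F1 {computed:.3f}, reported {reported:.2f}")
```

Because the published precision and recall are already rounded to two decimals, the recomputed values only agree with the reported F1 to within rounding. The comparison also makes the trade-offs visible: ultraVAD's high recall comes with lower precision, while Deepgram Flux shows the opposite pattern.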

Text turn detector

Deprecated

The text turn detector is deprecated and slated for removal in version 2.0 of the LiveKit Agents SDK. Use the audio turn detector for new agents. The text turn detector remains available for cases where you can't use LiveKit Inference and need a fully open-weights, self-contained option, but it no longer receives feature work.

The text turn detector is an open-weights language model that adds conversational context as an additional signal to VAD using transcripts from your STT pipeline.

For more general information about the model, read about it on the LiveKit blog.

Requirements

The text turn detector is designed for use inside an AgentSession and also requires an STT model. If you're using a realtime model, you must include a separate STT model to use this detector.

LiveKit recommends also using the Silero VAD plugin for maximum performance, but you can rely on your STT plugin's endpointing instead if you prefer.

The model is deployed globally on LiveKit Cloud, and agents deployed there automatically use this optimized inference service.

For custom agent deployments, the model runs locally on the CPU in a shared process and requires less than 500 MB of RAM. Use compute-optimized instances (such as AWS c6i or c7i) rather than burstable instances (such as AWS t3 or t4g) to avoid inference timeouts due to CPU credit limits.

Installation

Install the plugin.

Install the plugin from PyPI:

uv add "livekit-agents[turn-detector]~=1.5"

Install the plugin from npm:

pnpm install @livekit/agents-plugin-livekit

Download model weights

You must download the model weights before running your agent for the first time:

Python:
    uv run --module livekit.agents download-files

Node.js:
    npx livekit-agents download-files

For more information, see Download plugin assets on the Builds and Dockerfiles page.

Usage

Initialize your AgentSession with the MultilingualModel and an STT model. These examples use LiveKit Inference for STT, but more options are available.

Python:
    from livekit.plugins.turn_detector.multilingual import MultilingualModel
    from livekit.agents import AgentSession, inference, TurnHandlingOptions

    session = AgentSession(
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),
        ),
        stt=inference.STT(language="multi"),
        # ... vad, tts, llm, etc.
    )

Node.js:
    import { voice, inference } from '@livekit/agents';
    import * as livekit from '@livekit/agents-plugin-livekit';

    const session = new voice.AgentSession({
      stt: new inference.STT({ language: 'multi' }),
      turnHandling: {
        turnDetection: new livekit.turnDetector.MultilingualModel(),
      },
      // ... vad, tts, llm, etc.
    });

Parameters

The text turn detector itself has no configuration, but you can configure the following endpointing parameters in the turn handling options passed to the AgentSession. To learn more, see EndpointingOptions.

mode: Literal['dynamic', 'fixed'] (default: fixed)

Endpointing timing behavior. The endpointing delay is the time the agent waits before terminating the user's turn.

  • "fixed" - Use the configured min_delay and max_delay values to determine the endpointing delay.

  • "dynamic" (available in Python only) - Adapt the delay within the min_delay and max_delay range based on session pause statistics (exponential moving average of between-utterance and between-turn pauses). Suits most conversations.

min_delay: float (default: 0.5 seconds)

Minimum time (in seconds) to wait since the last detected speech to declare the user's turn to be complete.

With dynamic endpointing (Python only), this is the lower bound. The agent might use a longer effective delay when session pause statistics suggest slower turn-taking.

  • In VAD mode, this effectively behaves like max(VAD silence, min_delay).
  • In STT mode, this is applied after the STT end-of-speech signal, and therefore in addition to the STT provider's endpointing delay.

max_delay: float (default: 3.0 seconds)

Maximum time (in seconds) the agent waits before terminating the turn. This prevents the agent from waiting indefinitely for the user to continue speaking.

With dynamic endpointing (Python only), this is the upper bound. The agent might use a shorter effective delay when session pause statistics suggest faster turn-taking.

Time units

In Node.js, min_delay and max_delay are in milliseconds (for example, 500 and 3000). Python uses seconds (for example, 0.5 and 3.0).
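The delay rules above reduce to a few lines of arithmetic. The sketch below is a simplified model rather than the SDK's code: all four helper names are hypothetical, and the smoothing factor `alpha` is an assumed value, since this page only says that dynamic mode uses an exponential moving average of observed pauses.

```python
def vad_mode_wait(vad_silence: float, min_delay: float = 0.5) -> float:
    """In VAD mode the wait behaves like max(VAD silence, min_delay), in seconds."""
    return max(vad_silence, min_delay)


def update_pause_average(average: float, observed_pause: float, alpha: float = 0.2) -> float:
    """One exponential-moving-average step over observed pauses, in seconds.

    alpha is an assumed smoothing factor chosen for illustration.
    """
    return alpha * observed_pause + (1.0 - alpha) * average


def clamp_delay(candidate: float, min_delay: float = 0.5, max_delay: float = 3.0) -> float:
    """Keep an adapted delay inside the configured lower and upper bounds."""
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    return max(min_delay, min(candidate, max_delay))


def seconds_to_ms(seconds: float) -> int:
    """Convert a Python-style value (seconds) to a Node.js-style value (milliseconds)."""
    return round(seconds * 1000)
```

For example, `clamp_delay(5.0)` returns the 3.0 second upper bound, and `seconds_to_ms(0.5)` gives the 500 that the equivalent Node.js configuration expects.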

Supported languages

The MultilingualModel supports English and 13 other languages. The model relies on your STT model to report the language of the user's speech. To set the language to a fixed value, configure the STT model with a specific language. The language parameter accepts any format supported by LanguageCode. For example, to force the model to use Spanish:

Python:
    session = AgentSession(
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),
        ),
        stt=inference.STT(language="es"),
        # ... vad, tts, llm, etc.
    )

Node.js:
    import { voice, inference } from '@livekit/agents';
    import * as livekit from '@livekit/agents-plugin-livekit';

    const session = new voice.AgentSession({
      stt: new inference.STT({ language: 'es' }),
      turnHandling: {
        turnDetection: new livekit.turnDetector.MultilingualModel(),
      },
      // ... vad, tts, llm, etc.
    });

The model currently supports English, Spanish, French, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Indonesian, Turkish, Russian, and Hindi.

Realtime model usage

Realtime models like the OpenAI Realtime API produce user transcripts after the end of the turn, rather than incrementally while the user speaks. The text turn detector requires live STT results to operate, so you must provide an STT plugin to the AgentSession to use it with a realtime model. This incurs extra cost for the STT model.

You must also disable the realtime model's built-in turn detection so it doesn't conflict with the LiveKit turn detector. The following example demonstrates how to do this with the OpenAI Realtime API:

Python:
    from livekit.agents import AgentSession, TurnHandlingOptions
    from livekit.plugins.turn_detector.multilingual import MultilingualModel
    from livekit.plugins import deepgram, openai, silero

    session = AgentSession(
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),
        ),
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        # OpenAI Realtime API
        llm=openai.realtime.RealtimeModel(
            voice="alloy",
            # Disable the model's built-in turn detection to use
            # the LiveKit turn detector instead
            turn_detection=None,
            input_audio_transcription=None,  # use Deepgram STT instead
        ),
    )

Node.js:
    import { voice } from '@livekit/agents';
    import * as deepgram from '@livekit/agents-plugin-deepgram';
    import * as livekit from '@livekit/agents-plugin-livekit';
    import * as openai from '@livekit/agents-plugin-openai';
    import * as silero from '@livekit/agents-plugin-silero';

    const session = new voice.AgentSession({
      turnHandling: {
        turnDetection: new livekit.turnDetector.MultilingualModel(),
      },
      vad: await silero.VAD.load(),
      stt: new deepgram.STT(),
      // OpenAI Realtime API
      llm: new openai.realtime.RealtimeModel({
        voice: 'alloy',
        // Disable the model's built-in turn detection to use
        // the LiveKit turn detector instead
        turnDetection: null,
        inputAudioTranscription: null, // use Deepgram STT instead
      }),
    });

Benchmarks

The following data shows the expected performance of the text turn detector model.

Runtime performance

The size on disk and typical CPU inference time for the text turn detector model are as follows:

Model | Base model | Size on disk | Per-turn latency
--- | --- | --- | ---
Multilingual | Qwen2.5-0.5B-Instruct | 396 MB | ~50-160 ms

Detection accuracy

The following table shows accuracy metrics for the text turn detector model in each supported language.

  • True positive means the model correctly identifies the user has finished speaking.
  • True negative means the model correctly identifies the user will continue speaking.

Language | True positive rate | True negative rate
--- | --- | ---
Hindi | 99.4% | 96.30%
Korean | 99.3% | 94.50%
French | 99.3% | 88.90%
Portuguese | 99.4% | 87.40%
Indonesian | 99.3% | 89.40%
Russian | 99.3% | 88.00%
English | 99.3% | 87.00%
Chinese | 99.3% | 86.60%
Japanese | 99.3% | 88.80%
Italian | 99.3% | 85.10%
Spanish | 99.3% | 86.00%
German | 99.3% | 87.80%
Turkish | 99.3% | 87.30%
Dutch | 99.3% | 88.10%
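To make the two rates concrete, the sketch below computes them from confusion counts using the standard definitions. The counts in the usage note are invented for illustration and only chosen to reproduce the English row. They are not the evaluation data.

```python
def true_positive_rate(true_positives: int, false_negatives: int) -> float:
    """Share of finished turns that the model correctly marks as finished."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no finished-turn examples")
    return true_positives / total


def true_negative_rate(true_negatives: int, false_positives: int) -> float:
    """Share of unfinished turns that the model correctly marks as continuing."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("no unfinished-turn examples")
    return true_negatives / total
```

With made-up counts of 993 correct and 7 missed end-of-turn examples, `true_positive_rate(993, 7)` is 0.993, and with 870 correct and 130 incorrect mid-turn examples, `true_negative_rate(870, 130)` is 0.87. The complement of the true negative rate is the share of mid-turn pauses the model would mistake for an end of turn.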

Resources

The following resources cover the open-weights text turn detector:

Additional resources