Overview
The LiveKit turn detector improves end-of-turn detection in voice AI apps by adding signals on top of voice activity detection (VAD).
Traditional VAD models are effective at determining the presence or absence of speech, but without understanding the meaning of speech they can provide a poor user experience. For instance, a user might say "I need to think about that for a moment" and then take a long pause. The user has more to say but a VAD-only system interrupts them anyway. A turn detector model can predict that they have more to say and wait for them to finish before responding.
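The difference can be illustrated with a toy decision rule. This is a minimal sketch, not LiveKit's implementation: the function names, the 0.5 threshold, and the timing values are hypothetical, and the end-of-turn probability is assumed to come from some turn detector model.

```python
def vad_only_should_respond(silence_s: float, silence_cutoff_s: float = 0.5) -> bool:
    """VAD-only rule: respond as soon as the silence is long enough."""
    return silence_s >= silence_cutoff_s


def with_turn_detector_should_respond(
    silence_s: float,
    end_of_turn_probability: float,
    silence_cutoff_s: float = 0.5,
    max_wait_s: float = 3.0,
    threshold: float = 0.5,
) -> bool:
    """Respond after a short silence only if the model thinks the turn is over.

    If the model thinks the user has more to say, keep waiting, but never
    longer than max_wait_s.
    """
    if silence_s >= max_wait_s:
        return True
    return silence_s >= silence_cutoff_s and end_of_turn_probability >= threshold


# "I need to think about that for a moment", followed by a 1.2 second pause.
pause_s = 1.2
low_probability = 0.1  # hypothetical model output: the user is likely not done

print(vad_only_should_respond(pause_s))  # True: the agent interrupts
print(with_turn_detector_should_respond(pause_s, low_probability))  # False: the agent waits
```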
Audio turn detector
LiveKit's TurnDetector is an audio model that encodes user audio directly, capturing both what is said and how it's said. By combining semantic understanding with acoustic cues like intonation, pitch, and rhythm, it reaches state-of-the-art end-of-turn accuracy without relying on a transcript.
The following capture shows two sessions running side by side on the same audio. The text model is tricked by the mid-turn pauses and commits the turn early, while the audio model waits for the true end of turn:
The detector comes in two versions:
- v1: the full model, served on LiveKit Inference. Highest accuracy. Available at no cost to agents deployed to LiveKit Cloud.
- v1-mini: a lightweight version that runs locally on CPU and is free to use in any context. Recommended for agents not deployed to LiveKit Cloud.
If the full model is unavailable, the session automatically falls back to v1-mini for the rest of the session. See Fallback to the mini model.
Installation
The audio turn detector is built into the Agents SDK: livekit-agents 1.6.1 or later for Python, and @livekit/agents 1.4.7 or later for Node.js. No separate plugin or extra is required.
Usage
Initialize your AgentSession with TurnDetector. AgentSession provides the required VAD automatically, and the model is selected based on your environment (see Default model selection).
```python
from livekit.agents import AgentSession, TurnHandlingOptions, inference

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=inference.TurnDetector(),
    ),
    # ... stt, tts, llm, etc.
)
```

```typescript
import { inference, voice } from '@livekit/agents';

const session = new voice.AgentSession({
  turnHandling: {
    turnDetection: new inference.TurnDetector(),
  },
  // ... stt, tts, llm, etc.
});
```
To pin a specific version instead of relying on auto-selection, pass version:
```python
turn_detection=inference.TurnDetector(version="v1-mini")
```

```typescript
turnDetection: new inference.TurnDetector({ version: 'v1-mini' }),
```
See the Voice AI quickstart for a complete example.
Some STT models, such as Deepgram Flux, include built-in end-of-turn detection. No special configuration is needed to use them alongside the audio turn detector: the configured turn detector takes precedence, and the session uses end-of-turn signals from the STT only when you set turn_detection="stt".
Parameters
The following parameters are available on the TurnDetector constructor:
- version (Literal['v1', 'v1-mini']): Selects the model version. v1 is the full model served on LiveKit Inference, while v1-mini runs locally on CPU. When omitted, the version is selected automatically based on your environment. See Default model selection.
- unlikely_threshold (float | dict[str, float]): Overrides the model's confidence threshold for ending a turn. Accepts a scalar (applied to every language) or a dict keyed by language code. Unmapped languages keep the calibrated default for the active model. See Custom thresholds. In Node.js, this parameter is called unlikelyThreshold.
Supported languages
The audio turn detector supports 14 languages: English, Arabic, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Dutch, Portuguese, Turkish, and Chinese.
When STT is enabled, the detector uses the language it reports to apply the right per-language threshold. To force a specific language, configure the STT model with that language. The language parameter accepts any format supported by LanguageCode. For example, to set Spanish:
```python
session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=inference.TurnDetector(),
    ),
    stt=inference.STT(language="es"),
    # ... tts, llm, etc.
)
```

```typescript
const session = new voice.AgentSession({
  turnHandling: {
    turnDetection: new inference.TurnDetector(),
  },
  stt: new inference.STT({ language: 'es' }),
  // ... tts, llm, etc.
});
```
When no STT is configured (for example, with a realtime model), the detector defaults to English thresholds.
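The language-based lookup described above can be sketched as follows. This is an illustration only: the threshold values are hypothetical (the calibrated defaults are not listed here), the helper name is invented, and reducing a tag such as "es-ES" to its primary subtag and treating unknown languages as English are both assumptions of this sketch rather than documented behavior.

```python
from typing import Optional

# Hypothetical per-language thresholds; the real calibrated values are not
# listed in this document.
HYPOTHETICAL_THRESHOLDS = {"en": 0.50, "es": 0.45, "ja": 0.60}


def threshold_for(stt_language: Optional[str]) -> float:
    """Pick the threshold for the language reported by STT.

    Uses the English threshold when no STT is configured (stt_language is
    None). Falling back to English for languages missing from the table is
    an assumption of this sketch.
    """
    if stt_language is None:
        return HYPOTHETICAL_THRESHOLDS["en"]
    # Assumption: reduce tags like "es-ES" or "es_ES" to the primary subtag.
    primary = stt_language.replace("_", "-").split("-")[0].lower()
    return HYPOTHETICAL_THRESHOLDS.get(primary, HYPOTHETICAL_THRESHOLDS["en"])


print(threshold_for("es"))     # Spanish threshold
print(threshold_for("es-ES"))  # same Spanish threshold
print(threshold_for(None))     # English threshold: no STT configured
```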
Custom thresholds
Each language has a calibrated unlikely_threshold that determines how confident the model must be before considering the user's turn complete. Lower values make the detector more eager to respond while higher values make it more patient.
Override the threshold globally with a scalar:
```python
inference.TurnDetector(unlikely_threshold=0.5)
```

```typescript
new inference.TurnDetector({ unlikelyThreshold: 0.5 });
```
Or override per language (unmapped languages keep the default):
```python
inference.TurnDetector(
    unlikely_threshold={
        "en": 0.5,
        "ja": 0.6,
    }
)
```

```typescript
new inference.TurnDetector({
  unlikelyThreshold: {
    en: 0.5,
    ja: 0.6,
  },
});
```
The two models ship with different calibrated defaults. When the session falls back from v1 to v1-mini mid-session, your override is rescaled to preserve its relationship to the calibrated defaults of the active model.
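To make the override and rescaling behavior concrete, here is a minimal sketch. The default values and helper names are hypothetical, and the rescaling formula is not documented here: keeping the ratio between the override and the calibrated default constant is only one plausible interpretation of "preserve its relationship".

```python
from typing import Dict, Optional, Union

Override = Optional[Union[float, Dict[str, float]]]

# Hypothetical calibrated defaults per model version and language.
DEFAULTS = {
    "v1": {"en": 0.50, "ja": 0.60, "es": 0.40},
    "v1-mini": {"en": 0.25, "ja": 0.30, "es": 0.20},
}


def resolve_threshold(override: Override, version: str, language: str) -> float:
    """Apply a scalar or per-language override on top of the defaults."""
    default = DEFAULTS[version][language]
    if override is None:
        return default
    if isinstance(override, dict):
        # Unmapped languages keep the calibrated default of the active model.
        return override.get(language, default)
    return float(override)  # a scalar applies to every language


def rescale_for_fallback(value: float, language: str) -> float:
    """ASSUMPTION: keep the ratio override / default constant across models."""
    ratio = value / DEFAULTS["v1"][language]
    return min(1.0, ratio * DEFAULTS["v1-mini"][language])


print(resolve_threshold({"en": 0.5, "ja": 0.6}, "v1", "es"))  # default for "es"
print(resolve_threshold(0.5, "v1", "es"))                     # scalar override
print(rescale_for_fallback(0.75, "en"))                       # rescaled for v1-mini
```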
Realtime model usage
Because the audio turn detector doesn't depend on a transcript, you can use it with a realtime model without adding an STT plugin. You still need to disable the realtime model's built-in turn detection so the two systems don't conflict.
```python
from livekit.agents import AgentSession, TurnHandlingOptions, inference
from livekit.plugins import openai

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=inference.TurnDetector(),
    ),
    llm=openai.realtime.RealtimeModel(
        voice="alloy",
        # Disable the model's built-in turn detection to use
        # the LiveKit audio turn detector instead
        turn_detection=None,
    ),
)
```

```typescript
import { inference, voice } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';

const session = new voice.AgentSession({
  turnHandling: {
    turnDetection: new inference.TurnDetector(),
  },
  llm: new openai.realtime.RealtimeModel({
    voice: 'alloy',
    // Disable the model's built-in turn detection to use
    // the LiveKit audio turn detector instead
    turnDetection: null,
  }),
});
```
Default model selection
When you don't set the version parameter, the detector picks a version based on your environment:
| Environment | Default version |
|---|---|
| Agent deployed to LiveKit Cloud | v1 (full model) |
| Local development (dev mode) with LiveKit Cloud credentials | v1 (full model) |
| Agent deployed to another environment (start command) | v1-mini (runs locally) |
Local development includes a free monthly allowance of the full model on every plan. When the allowance is exhausted, the session falls back to v1-mini automatically. To learn more, see Quotas and limits. To pin one version in every environment, set version explicitly.
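The selection rules in the table can be summarized as a small decision function. This is a sketch for illustration: the function and its boolean arguments are hypothetical, not part of the SDK, and returning v1-mini for dev mode without LiveKit Cloud credentials is an assumption, since the table does not cover that case.

```python
from typing import Optional


def default_version(
    deployed_to_livekit_cloud: bool,
    dev_mode: bool,
    has_cloud_credentials: bool,
    pinned: Optional[str] = None,
) -> str:
    """Mirror the default model selection table (hypothetical helper)."""
    if pinned is not None:
        return pinned  # an explicit version wins in every environment
    if deployed_to_livekit_cloud:
        return "v1"
    if dev_mode and has_cloud_credentials:
        return "v1"
    # Other deployments (start command). Assumption: also dev mode
    # without LiveKit Cloud credentials.
    return "v1-mini"


print(default_version(True, False, True))                     # v1
print(default_version(False, True, True))                     # v1
print(default_version(False, False, False))                   # v1-mini
print(default_version(True, False, True, pinned="v1-mini"))   # v1-mini
```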
If you don't pass turn_detection to AgentSession at all, the session uses a TurnDetector by default, unless your LLM is a realtime model that provides its own server-side turn detection.
When v1-mini runs (in production outside LiveKit Cloud, or after a fallback), it executes in a shared CPU process. Use compute-optimized instances (such as AWS c6i or c7i) rather than burstable instances (such as AWS t3 or t4g) to avoid inference timeouts from CPU credit limits.
Fallback to the mini model
When the full v1 model is active, the detector monitors for connection failures and prediction timeouts. A connection failure includes the case where the agent can't reach LiveKit Inference at all (for example, the free allowance is exhausted). If either occurs, the session does the following:
- Logs a warning (once per session).
- Emits a default probability (1.0) for any in-flight prediction so the current turn isn't blocked.
- Swaps to v1-mini for the rest of the session.
- Rescales any custom unlikely_threshold to preserve its relationship to the mini model's calibrated defaults.
The fallback is sticky for the lifetime of the session. The next session starts fresh and attempts the full model again.
If v1-mini isn't available (for example, the binary failed to load), the detector emits the default probability for each turn and retries on the next turn. The full model is also retried on each new session.
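The fallback behavior can be modeled as a small state machine. The following is a minimal sketch under stated assumptions, not the SDK's implementation: the class name, the callables standing in for the two models, and the exception types are hypothetical, and threshold rescaling is left out.

```python
import logging
from typing import Callable

logger = logging.getLogger("turn_detector_sketch")

DEFAULT_PROBABILITY = 1.0  # emitted so the current turn isn't blocked


class FallbackSession:
    """Hypothetical per-session wrapper around a full and a mini predictor."""

    def __init__(self, full: Callable[[bytes], float], mini: Callable[[bytes], float]):
        self._full = full
        self._mini = mini
        self.active_version = "v1"  # every new session starts on the full model

    def predict(self, audio: bytes) -> float:
        if self.active_version == "v1":
            try:
                return self._full(audio)
            except (ConnectionError, TimeoutError):
                # The switch is sticky, so this warning is logged once per session.
                logger.warning("full model unavailable; falling back to v1-mini")
                self.active_version = "v1-mini"
                return DEFAULT_PROBABILITY
        try:
            return self._mini(audio)
        except Exception:
            # v1-mini unavailable: emit the default and retry on the next turn.
            return DEFAULT_PROBABILITY


def failing_full_model(audio: bytes) -> float:
    raise TimeoutError("prediction timed out")


def mini_model(audio: bytes) -> float:
    return 0.8  # made-up probability


session = FallbackSession(failing_full_model, mini_model)
first = session.predict(b"")   # default probability; session switches to v1-mini
second = session.predict(b"")  # served by the mini model
print(first, second, session.active_version)
```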
Benchmarks
LiveKit evaluated the audio turn detector against other end-of-turn models using an open source evaluation harness that simulates the live turn-taking decisions a production voice agent makes. The test set consists of natural, task-oriented human-assistant conversations, and audio-based models are scored 200 ms after speech ends.
The following table shows how well each model's raw score separates true end-of-turn moments from mid-turn pauses, reported at each model's published default threshold:
| Model | AUC | Precision | Recall | F1 |
|---|---|---|---|---|
| LiveKit Turn Detector (v1/audio) | 0.96 | 0.91 | 0.92 | 0.91 |
| Deepgram Flux | 0.92 | 0.91 | 0.74 | 0.82 |
| ultraVAD | 0.88 | 0.76 | 0.95 | 0.84 |
| smart-turn-v3.2 | 0.83 | 0.84 | 0.64 | 0.73 |
| LiveKit Turn Detector (v0.4.1/text) | 0.74 | 0.82 | 0.61 | 0.70 |
The audio model also achieves the highest AUC in every supported language. For the full methodology, latency-versus-interruption analysis, and multilingual results, see the LiveKit blog.
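As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values in the table above; because the inputs are already rounded, a recomputed value could in principle differ from a published one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, published F1), copied from the table above.
rows = {
    "LiveKit Turn Detector (v1/audio)": (0.91, 0.92, 0.91),
    "Deepgram Flux": (0.91, 0.74, 0.82),
    "ultraVAD": (0.76, 0.95, 0.84),
    "smart-turn-v3.2": (0.84, 0.64, 0.73),
    "LiveKit Turn Detector (v0.4.1/text)": (0.82, 0.61, 0.70),
}

for model, (precision, recall, published) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{model}: computed {computed:.2f}, published {published:.2f}")
```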
Text turn detector
The text turn detector is deprecated and slated for removal in version 2.0 of the LiveKit Agents SDK. Use the audio turn detector for new agents. It remains available for cases where you can't use LiveKit Inference and need a fully open-weights, self-contained option, but no longer receives feature work.
The text turn detector is an open-weights language model that adds conversational context as an additional signal to VAD using transcripts from your STT pipeline.
For more general information about the model, read about it on the LiveKit blog.
Requirements
The text turn detector is designed for use inside an AgentSession and also requires an STT model. If you're using a realtime model, you must include a separate STT model to use this detector.
LiveKit recommends also using the Silero VAD plugin for maximum performance, but you can rely on your STT plugin's endpointing instead if you prefer.
The model is deployed globally on LiveKit Cloud, and agents deployed there automatically use this optimized inference service.
For custom agent deployments, the model runs locally on the CPU in a shared process and requires less than 500 MB of RAM. Use compute-optimized instances (such as AWS c6i or c7i) rather than burstable instances (such as AWS t3 or t4g) to avoid inference timeouts due to CPU credit limits.
Installation
Install the plugin.
Install the plugin from PyPI:

```shell
uv add "livekit-agents[turn-detector]~=1.5"
```

Install the plugin from npm:

```shell
pnpm install @livekit/agents-plugin-livekit
```
Download model weights
You must download the model weights before running your agent for the first time:
```shell
uv run --module livekit.agents download-files
```

```shell
npx livekit-agents download-files
```
For more information, see Download plugin assets on the Builds and Dockerfiles page.
Usage
Initialize your AgentSession with the MultilingualModel and an STT model. These examples use LiveKit Inference for STT, but more options are available.
```python
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from livekit.agents import AgentSession, inference, TurnHandlingOptions

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
    ),
    stt=inference.STT(language="multi"),
    # ... vad, tts, llm, etc.
)
```

```typescript
import { voice, inference } from '@livekit/agents';
import * as livekit from '@livekit/agents-plugin-livekit';

const session = new voice.AgentSession({
  stt: new inference.STT({ language: 'multi' }),
  turnHandling: {
    turnDetection: new livekit.turnDetector.MultilingualModel(),
  },
  // ... vad, tts, llm, etc.
});
```
Parameters
The text turn detector itself has no configuration, but you can configure the following endpointing parameters in the turn handling options passed to the AgentSession. To learn more, see EndpointingOptions.
- mode (Literal['dynamic', 'fixed'], default: fixed): Endpointing timing behavior. The endpointing delay is the time the agent waits before terminating the user's turn.
  - "fixed": Use the configured min_delay and max_delay values to determine the endpointing delay.
  - "dynamic": Adapt the delay within the min_delay and max_delay range based on session pause statistics (exponential moving average of between-utterance and between-turn pauses). Suits most conversations. Dynamic endpointing is available only in Python.
- min_delay (float, default: 0.5 seconds): Minimum time (in seconds) to wait after the last detected speech before declaring the user's turn complete. With dynamic endpointing (Python only), this is the lower bound. The agent might use a longer effective delay when session pause statistics suggest slower turn-taking.
  - In VAD mode, this effectively behaves like max(VAD silence, min_delay).
  - In STT mode, this is applied after the STT end-of-speech signal, and therefore in addition to the STT provider's endpointing delay.
- max_delay (float, default: 3.0 seconds): Maximum time (in seconds) the agent waits before terminating the turn. This prevents the agent from waiting indefinitely for the user to continue speaking. With dynamic endpointing (Python only), this is the upper bound. The agent might use a shorter effective delay when session pause statistics suggest faster turn-taking.
In Node.js, min_delay and max_delay are in milliseconds (for example, 500 and 3000). Python uses seconds (for example, 0.5 and 3.0).
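The idea behind dynamic endpointing, and the unit difference between the two SDKs, can be sketched as follows. This is a simplified illustration and not the SDK's algorithm: the class name and the smoothing factor alpha are hypothetical, and a single exponential moving average stands in for the separate between-utterance and between-turn statistics.

```python
class DynamicEndpointingSketch:
    """Keep an exponential moving average of pauses, clamped to the bounds."""

    def __init__(self, min_delay: float = 0.5, max_delay: float = 3.0, alpha: float = 0.2):
        self.min_delay = min_delay  # seconds, lower bound
        self.max_delay = max_delay  # seconds, upper bound
        self.alpha = alpha          # hypothetical smoothing factor
        self._ema = min_delay

    def observe_pause(self, pause_s: float) -> None:
        """Fold a newly observed pause (in seconds) into the moving average."""
        self._ema = self.alpha * pause_s + (1 - self.alpha) * self._ema

    @property
    def delay(self) -> float:
        """Effective delay: the moving average clamped to [min_delay, max_delay]."""
        return min(self.max_delay, max(self.min_delay, self._ema))


def seconds_to_ms(seconds: float) -> int:
    """Convert a Python-style delay (seconds) to a Node.js-style delay (ms)."""
    return round(seconds * 1000)


endpointing = DynamicEndpointingSketch()
for pause in (1.0, 1.5, 2.0):  # made-up pauses from a slow-paced speaker
    endpointing.observe_pause(pause)

print(endpointing.delay)  # between 0.5 and 3.0 seconds
print(seconds_to_ms(endpointing.min_delay), seconds_to_ms(endpointing.max_delay))
```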
Supported languages
The MultilingualModel supports English and 13 other languages. The model relies on your STT model to report the language of the user's speech. To set the language to a fixed value, configure the STT model with a specific language. The language parameter accepts any format supported by LanguageCode. For example, to force the model to use Spanish:
```python
session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
    ),
    stt=inference.STT(language="es"),
    # ... vad, tts, llm, etc.
)
```

```typescript
import { voice, inference } from '@livekit/agents';
import * as livekit from '@livekit/agents-plugin-livekit';

const session = new voice.AgentSession({
  stt: new inference.STT({ language: 'es' }),
  turnHandling: {
    turnDetection: new livekit.turnDetector.MultilingualModel(),
  },
  // ... vad, tts, llm, etc.
});
```
The model currently supports English, Spanish, French, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Indonesian, Turkish, Russian, and Hindi.
Realtime model usage
Realtime models like the OpenAI Realtime API produce user transcripts after the end of the turn, rather than incrementally while the user speaks. The text turn detector requires live STT results to operate, so you must provide an STT plugin to the AgentSession to use it with a realtime model. This incurs extra cost for the STT model.
You must also disable the realtime model's built-in turn detection so it doesn't conflict with the LiveKit turn detector. The following example demonstrates how to do this with the OpenAI Realtime API:
```python
from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from livekit.plugins import deepgram, openai, silero

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
    ),
    vad=silero.VAD.load(),
    stt=deepgram.STT(),
    # OpenAI Realtime API
    llm=openai.realtime.RealtimeModel(
        voice="alloy",
        # Disable the model's built-in turn detection to use
        # the LiveKit turn detector instead
        turn_detection=None,
        input_audio_transcription=None,  # use Deepgram STT instead
    ),
)
```

```typescript
import { voice } from '@livekit/agents';
import * as deepgram from '@livekit/agents-plugin-deepgram';
import * as livekit from '@livekit/agents-plugin-livekit';
import * as openai from '@livekit/agents-plugin-openai';
import * as silero from '@livekit/agents-plugin-silero';

const session = new voice.AgentSession({
  turnHandling: {
    turnDetection: new livekit.turnDetector.MultilingualModel(),
  },
  vad: await silero.VAD.load(),
  stt: new deepgram.STT(),
  // OpenAI Realtime API
  llm: new openai.realtime.RealtimeModel({
    voice: 'alloy',
    // Disable the model's built-in turn detection to use
    // the LiveKit turn detector instead
    turnDetection: null,
    inputAudioTranscription: null, // use Deepgram STT instead
  }),
});
```
Benchmarks
The following data shows the expected performance of the text turn detector model.
Runtime performance
The size on disk and typical CPU inference time for the text turn detector model are as follows:
| Model | Base Model | Size on Disk | Per Turn Latency |
|---|---|---|---|
| Multilingual | Qwen2.5-0.5B-Instruct | 396 MB | ~50-160 ms |
Detection accuracy
The following table shows accuracy metrics for the text turn detector model in each supported language.
- True positive means the model correctly identifies the user has finished speaking.
- True negative means the model correctly identifies the user will continue speaking.
| Language | True Positive Rate | True Negative Rate |
|---|---|---|
| Hindi | 99.4% | 96.30% |
| Korean | 99.3% | 94.50% |
| French | 99.3% | 88.90% |
| Portuguese | 99.4% | 87.40% |
| Indonesian | 99.3% | 89.40% |
| Russian | 99.3% | 88.00% |
| English | 99.3% | 87.00% |
| Chinese | 99.3% | 86.60% |
| Japanese | 99.3% | 88.80% |
| Italian | 99.3% | 85.10% |
| Spanish | 99.3% | 86.00% |
| German | 99.3% | 87.80% |
| Turkish | 99.3% | 87.30% |
| Dutch | 99.3% | 88.10% |
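The two rates follow directly from the definitions above: the true positive rate is the share of finished turns that the model marks as finished, and the true negative rate is the share of unfinished turns that it marks as continuing. The sketch below computes both from labeled samples; the function name and the sample data are made up for illustration and are not taken from the evaluation.

```python
from typing import Iterable, Tuple


def turn_detection_rates(samples: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate).

    Each sample is (actually_finished, predicted_finished).
    """
    tp = fn = tn = fp = 0
    for actually_finished, predicted_finished in samples:
        if actually_finished and predicted_finished:
            tp += 1  # correctly identified the end of the turn
        elif actually_finished:
            fn += 1  # missed the end of the turn
        elif not predicted_finished:
            tn += 1  # correctly identified that the user will continue
        else:
            fp += 1  # would have interrupted the user
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr


# Made-up labels: 4 finished turns (3 detected), 5 unfinished turns (4 detected).
samples = (
    [(True, True)] * 3 + [(True, False)]
    + [(False, False)] * 4 + [(False, True)]
)
tpr, tnr = turn_detection_rates(samples)
print(f"TPR {tpr:.1%}, TNR {tnr:.1%}")
```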
Resources
The following resources cover the open-weights text turn detector:
- Plugin reference: Reference for the livekit-plugins-turn-detector package.
- GitHub repo: View the source or contribute to the text turn detector plugin.
- LiveKit Model License: The license that covers the text turn detector and the v1-mini model.
Additional resources
- Audio turn detector deep dive: Model architecture, benchmarks, and evaluation methodology on the LiveKit blog.
- Evaluation harness: Open source harness for benchmarking end-of-turn models under production endpointing policies.
- Evaluation datasets: Open source English and multilingual end-of-turn test datasets on Hugging Face.