Overview
The LiveKit turn detector plugin is a custom, open-weights language model that adds conversational context as an additional signal on top of voice activity detection (VAD), improving end-of-turn detection in voice AI apps.
Traditional VAD models are effective at determining the presence or absence of speech, but without language understanding they can provide a poor user experience. For instance, a user might say "I need to think about that for a moment" and then take a long pause. The user has more to say, but a VAD-only system interrupts them anyway. A context-aware model can predict that they have more to say and wait for them to finish before responding.
The LiveKit turn detector plugin is free to use with the Agents SDK and includes both English-only and multilingual models.
Turn detector demo
A video showcasing the improvements provided by the LiveKit turn detector.
Quick reference
The following sections provide a quick overview of the turn detector plugin. For more information, see Additional resources.
Requirements
The LiveKit turn detector is designed for use inside an AgentSession
and requires that an STT plugin be provided. Even if you're using a realtime LLM, you must include a separate STT plugin to use the LiveKit turn detector plugin.
LiveKit recommends also using the Silero VAD plugin for maximum performance, but you can rely on your STT plugin's endpointing instead if you prefer.
The model runs locally on the CPU and requires less than 500 MB of RAM, even with multiple concurrent jobs sharing a single inference server.
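Putting the requirements together, a minimal session pairs the turn detector with the Silero VAD plugin and an STT plugin. The sketch below reflects the livekit-agents 1.x plugin layout; treat it as an outline rather than a complete agent:

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, silero
from livekit.plugins.turn_detector.english import EnglishModel

# VAD detects raw speech vs. silence; the turn detector adds a
# language-level signal based on the live STT transcript.
session = AgentSession(
    vad=silero.VAD.load(),
    stt=deepgram.STT(model="nova-3", language="en"),
    turn_detection=EnglishModel(),
    # ... tts, llm, etc.
)
```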
Installation
Install the plugin from PyPI:
```shell
pip install "livekit-agents[turn-detector]~=1.0"
```
Download model weights
You must download the model weights before running your agent for the first time:
```shell
python main.py download-files
```
Usage
Initialize your AgentSession
with the turn detector and initialize your STT plugin with matching language settings. These examples use the Deepgram STT plugin, but more than 10 other STT plugins are available.
English-only model
Use the EnglishModel
and ensure your STT plugin configuration matches:
```python
from livekit.plugins.turn_detector.english import EnglishModel
from livekit.plugins import deepgram

session = AgentSession(
    turn_detection=EnglishModel(),
    stt=deepgram.STT(model="nova-3", language="en"),
    # ... vad, tts, llm, etc.
)
```
Multilingual model
Use the MultilingualModel
and ensure your STT plugin configuration matches. In this example, Deepgram performs automatic language detection and passes that value to the turn detector.
```python
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from livekit.plugins import deepgram

session = AgentSession(
    turn_detection=MultilingualModel(),
    stt=deepgram.STT(model="nova-3", language="multi"),
    # ... vad, tts, llm, etc.
)
```
Parameters
The turn detector itself has no configuration, but the AgentSession
that uses it supports the following related parameters:
- The number of seconds to wait before considering the turn complete. The session uses this delay when no turn detector model is present, or when the model indicates a likely turn boundary.
- The maximum time to wait for the user to speak after the turn detector model indicates the user is likely to continue speaking. This parameter has no effect without the turn detector model.
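As a sketch, these delays are passed to the session constructor. The parameter names `min_endpointing_delay` and `max_endpointing_delay` and the values below follow the livekit-agents 1.x API, but verify them against your installed version:

```python
from livekit.agents import AgentSession
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    turn_detection=MultilingualModel(),
    # Wait at least this long after speech stops before ending the turn.
    # Also the fallback delay when no turn detector model is present.
    min_endpointing_delay=0.5,
    # Upper bound on how long to wait when the model predicts the user
    # has more to say.
    max_endpointing_delay=6.0,
    # ... vad, stt, tts, llm, etc.
)
```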
Supported languages
The MultilingualModel
supports English and 13 other languages. The model relies on your STT plugin to report the language of the user's speech. To set the language to a fixed value, configure the STT plugin with a specific language. For example, to force the model to use Spanish:
```python
session = AgentSession(
    turn_detection=MultilingualModel(),
    stt=deepgram.STT(model="nova-2", language="es"),
    # ... vad, tts, llm, etc.
)
```
The model currently supports English, Spanish, French, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Indonesian, Turkish, and Russian.
Realtime model usage
Realtime models like the OpenAI Realtime API produce user transcripts after the end of the turn, rather than incrementally while the user speaks. The turn detector model requires live STT results to operate, so you must provide an STT plugin to the AgentSession
to use it with a realtime model. This incurs extra cost for the STT model.
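For example, a realtime session might look like the following sketch. The OpenAI realtime plugin import shown here is an assumption based on the livekit-agents plugin layout:

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
    # A separate STT plugin supplies the live transcript the turn
    # detector needs; the realtime model still handles the voice turn.
    stt=deepgram.STT(model="nova-3", language="multi"),
    turn_detection=MultilingualModel(),
    # ... vad, etc.
)
```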
Benchmarks
The following data shows the expected performance of the turn detector model.
Runtime performance
The size on disk and typical CPU inference time for the turn detector models are as follows:
| Model | Base Model | Size on Disk | Per-Turn Latency |
|---|---|---|---|
| English-only | SmolLM2-135M | 66 MB | ~15-45 ms |
| Multilingual | Qwen2.5-0.5B | 281 MB | ~50-160 ms |
Detection accuracy
The following tables show accuracy metrics for the turn detector models in each supported language.
- True positive means the model correctly identifies the user has finished speaking.
- True negative means the model correctly identifies the user will continue speaking.
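In confusion-matrix terms, the rates in the tables below are ratios of correct predictions to all turns of each kind. A small illustrative sketch (the evaluation counts here are made up, not LiveKit's data):

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """Fraction of finished turns the model correctly detected as finished."""
    return tp / (tp + fn)

def true_negative_rate(tn: int, fp: int) -> float:
    """Fraction of unfinished turns the model correctly waited on."""
    return tn / (tn + fp)

# Hypothetical counts for one language: out of 1000 finished turns the
# model caught 988; out of 1000 unfinished turns it held back on 875.
tpr = true_positive_rate(988, 12)
tnr = true_negative_rate(875, 125)
print(f"TPR={tpr:.1%} TNR={tnr:.1%}")
```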
English-only model
Accuracy metrics for the English-only model:
| Language | True Positive Rate | True Negative Rate |
|---|---|---|
| English | 98.8% | 87.5% |
Multilingual model
Accuracy metrics for the multilingual model, when configured with the correct language:
| Language | True Positive Rate | True Negative Rate |
|---|---|---|
| French | 98.8% | 97.3% |
| Indonesian | 98.8% | 97.3% |
| Russian | 98.8% | 97.3% |
| Turkish | 98.8% | 97.2% |
| Dutch | 98.8% | 97.1% |
| Portuguese | 98.8% | 97.1% |
| Spanish | 98.8% | 96.7% |
| German | 98.8% | 96.6% |
| Italian | 98.8% | 96.5% |
| Korean | 98.8% | 89.7% |
| English | 98.8% | 89.5% |
| Japanese | 98.8% | 83.6% |
| Chinese | 98.8% | 75.7% |
Additional resources
The following resources provide more information about using the LiveKit turn detector plugin.