Overview
The Silero VAD plugin provides voice activity detection (VAD) that contributes to accurate turn detection in voice AI applications.
VAD is a crucial component for voice AI applications as it helps determine when a user is speaking versus when they are silent. This enables natural turn-taking in conversations and helps optimize resource usage by only performing speech-to-text while the user speaks.
LiveKit recommends using the Silero VAD plugin in combination with the custom turn detector model for the best performance.
Quick reference
The following sections provide a quick overview of the Silero VAD plugin. For more information, see Additional resources.
Requirements
The model runs locally on the CPU and requires minimal system resources.
Installation
Install the plugin from PyPI:
pip install "livekit-agents[silero]~=1.0"
Download model weights
You must download the model weights before running your agent for the first time:
python main.py download-files
Usage
Initialize your AgentSession with the Silero VAD plugin:
from livekit.agents import AgentSession
from livekit.plugins import silero

session = AgentSession(
    vad=silero.VAD.load(),
    # ... stt, tts, llm, etc.
)
Prewarm
You can prewarm the plugin to improve load times for new jobs:
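from livekit import agents
from livekit.agents import AgentSession
from livekit.plugins import silero

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    # Reuse the VAD instance that prewarm() cached for this process.
    session = AgentSession(
        vad=ctx.proc.userdata["vad"],
        # ... stt, tts, llm, etc.
    )

    # ... session.start etc ...

def prewarm(proc: agents.JobProcess):
    # Runs once per worker process, before any job is assigned to it.
    proc.userdata["vad"] = silero.VAD.load()

if __name__ == "__main__":
    agents.cli.run_app(
        agents.WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm)
    )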
Configuration
The following parameters are available on the load method:
min_speech_duration: Minimum duration of speech required to start a new speech chunk.
min_silence_duration: Duration of silence to wait after speech ends to determine if the user has finished speaking.
prefix_padding_duration: Duration of padding to add to the beginning of each speech chunk.
max_buffered_speech: Maximum duration of speech to keep in the buffer (in seconds).
activation_threshold: Threshold to consider a frame as speech. A higher threshold results in more conservative detection but might miss soft speech. A lower threshold results in more sensitive detection but might identify noise as speech.
sample_rate: Sample rate for the inference (only 8 kHz and 16 kHz are supported).
force_cpu: Force the use of CPU for inference.
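To make the interaction between the speech threshold, the minimum speech duration, and the minimum silence duration more concrete, the following is a minimal sketch of a simplified segmenter. It is not the plugin's implementation: it ignores prefix padding and buffer limits, the per-frame probabilities and the 100 ms frame length are made up, and the option values are illustrative rather than the plugin's defaults. The keyword names passed to load in load_vad are assumed to match the plugin's signature, so check them against the plugin reference before relying on them.

```python
from dataclasses import dataclass


@dataclass
class VadOptions:
    # Field names are assumed to mirror the load() keyword arguments;
    # the values are illustrative, not the plugin's defaults.
    min_speech_duration: float = 0.2   # seconds
    min_silence_duration: float = 0.3  # seconds
    activation_threshold: float = 0.5


def segment_speech(probs, frame_ms, opts):
    """Return (start_ms, end_ms) speech segments from per-frame probabilities.

    Simplified model: speech starts once frames stay at or above the
    threshold for min_speech_duration, and ends once they stay below it
    for min_silence_duration.
    """
    min_speech_ms = round(opts.min_speech_duration * 1000)
    min_silence_ms = round(opts.min_silence_duration * 1000)
    segments = []
    speaking = False
    speech_run_ms = 0   # consecutive time at or above the threshold
    silence_run_ms = 0  # consecutive time below the threshold
    start_ms = 0
    for i, p in enumerate(probs):
        frame_end_ms = (i + 1) * frame_ms
        if p >= opts.activation_threshold:
            speech_run_ms += frame_ms
            silence_run_ms = 0
            if not speaking and speech_run_ms >= min_speech_ms:
                speaking = True
                start_ms = frame_end_ms - speech_run_ms
        else:
            silence_run_ms += frame_ms
            speech_run_ms = 0
            if speaking and silence_run_ms >= min_silence_ms:
                speaking = False
                segments.append((start_ms, frame_end_ms - silence_run_ms))
    if speaking:
        # Input ended mid-utterance; drop any trailing silence.
        segments.append((start_ms, len(probs) * frame_ms - silence_run_ms))
    return segments


def load_vad(opts):
    # Requires livekit-plugins-silero. Imported lazily so the simulation
    # above runs without it. Keyword names are an assumption; verify them.
    from livekit.plugins import silero

    return silero.VAD.load(
        min_speech_duration=opts.min_speech_duration,
        min_silence_duration=opts.min_silence_duration,
        activation_threshold=opts.activation_threshold,
    )


if __name__ == "__main__":
    # Made-up probabilities, one per 100 ms frame.
    probs = [0.1, 0.9, 0.1, 0.8, 0.9, 0.9, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1]
    print(segment_speech(probs, 100, VadOptions()))  # [(300, 800)]
    strict = VadOptions(activation_threshold=0.85)
    print(segment_speech(probs, 100, strict))        # [(400, 800)]
```

In this toy run, the isolated high-probability frame at 100 ms is too short to start a chunk, and the single quiet frame at 600 ms is too short to end one. Raising the threshold to 0.85 delays the detected start because the 0.8 frame no longer counts as speech, which mirrors the trade-off described for the threshold parameter.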
Additional resources
The following resources provide more information about using the LiveKit Silero VAD plugin.
Python package
The livekit-plugins-silero package on PyPI.
Plugin reference
Reference for the LiveKit Silero VAD plugin.
GitHub repo
View the source or contribute to the LiveKit Silero VAD plugin.
Silero VAD project
The open source VAD model that powers the LiveKit Silero VAD plugin.
Transcriber
An example using standalone VAD and STT outside of an AgentSession.