Silero VAD plugin

High-performance voice activity detection for LiveKit Agents.

Overview

The Silero VAD plugin provides voice activity detection (VAD) that contributes to accurate turn detection in voice AI applications.

VAD determines when a user is speaking and when they are silent. This enables natural turn-taking in conversations and reduces resource usage, because speech-to-text runs only while the user speaks.

LiveKit recommends using the Silero VAD plugin in combination with the custom turn detector model for the best performance.

Quick reference

The following sections provide a quick overview of the Silero VAD plugin. For more information, see Additional resources.

Requirements

The model runs locally on the CPU and requires minimal system resources.

Installation

Install the plugin from PyPI:

pip install "livekit-agents[silero]~=1.0"

Download model weights

You must download the model weights before running your agent for the first time:

python main.py download-files

Usage

Initialize your AgentSession with the Silero VAD plugin:

from livekit.agents import AgentSession
from livekit.plugins import silero

session = AgentSession(
    vad=silero.VAD.load(),
    # ... stt, tts, llm, etc.
)
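
The load method also accepts the keyword arguments described under Configuration. The following is a minimal sketch of passing a few of them. The values are illustrative assumptions, not tuned recommendations, and build_vad is a hypothetical helper name. The plugin import is deferred into the helper so the options can be inspected without the plugin installed.

```python
# Keyword arguments for silero.VAD.load(). The names come from the
# Configuration section; the values are illustrative assumptions only.
VAD_OPTIONS = {
    "min_speech_duration": 0.1,    # require slightly longer speech to start
    "min_silence_duration": 0.8,   # wait longer before ending a turn
    "activation_threshold": 0.6,   # somewhat more conservative detection
    "sample_rate": 16000,          # must be 8000 or 16000
}


def build_vad():
    """Hypothetical helper: load Silero VAD with the options above."""
    # Imported here so this module can be read without the plugin installed.
    from livekit.plugins import silero  # needs livekit-agents[silero]

    return silero.VAD.load(**VAD_OPTIONS)
```

The returned object can be passed as the vad argument of AgentSession in place of silero.VAD.load().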

Prewarm

You can prewarm the plugin to improve load times for new jobs:

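from livekit import agents
from livekit.agents import AgentSession
from livekit.plugins import silero


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        vad=ctx.proc.userdata["vad"],  # reuse the model loaded in prewarm
        # ... stt, tts, llm, etc.
    )
    # ... session.start etc ...


def prewarm(proc: agents.JobProcess):
    # Load the model when the process initializes so jobs can reuse it.
    proc.userdata["vad"] = silero.VAD.load()


if __name__ == "__main__":
    agents.cli.run_app(
        agents.WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm)
    )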

Configuration

The following parameters are available on the load method:

min_speech_duration (float, optional, default: 0.05)

Minimum duration of speech, in seconds, required to start a new speech chunk.

min_silence_duration (float, optional, default: 0.55)

Duration of silence, in seconds, to wait after speech ends before deciding that the user has finished speaking.

prefix_padding_duration (float, optional, default: 0.5)

Duration of padding, in seconds, to add to the beginning of each speech chunk.

max_buffered_speech (float, optional, default: 60.0)

Maximum duration of speech to keep in the buffer (in seconds).

activation_threshold (float, optional, default: 0.5)

Threshold for classifying a frame as speech. A higher threshold makes detection more conservative but might miss soft speech. A lower threshold makes detection more sensitive but might identify noise as speech.

sample_rate (Literal[8000, 16000], optional, default: 16000)

Sample rate used for inference, in Hz. Only 8 kHz and 16 kHz are supported.

force_cpu (bool, optional, default: True)

Force the use of CPU for inference.
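
To build intuition for how activation_threshold, min_speech_duration, and min_silence_duration interact, the following is a simplified sketch of a threshold-and-duration state machine over per-frame speech probabilities. It is not the plugin's implementation: VadParams and segment_speech are hypothetical names, the probabilities are made up, and prefix padding and buffering are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VadParams:
    # Hypothetical container mirroring three of the load() parameters.
    activation_threshold: float = 0.5
    min_speech_duration: float = 0.05   # seconds
    min_silence_duration: float = 0.55  # seconds


def segment_speech(probs, frame_duration, params=VadParams()):
    """Return (start_frame, end_frame) pairs; end_frame is exclusive."""
    segments = []
    speaking = False
    run = 0        # length of the current candidate run, in frames
    run_start = 0  # frame index where the current run began
    seg_start = 0  # frame index where the open segment began
    for i, p in enumerate(probs):
        is_speech = p >= params.activation_threshold
        if not speaking:
            if is_speech:
                if run == 0:
                    run_start = i
                run += 1
                # Start a segment only after enough continuous speech.
                if run * frame_duration >= params.min_speech_duration:
                    speaking, seg_start, run = True, run_start, 0
            else:
                run = 0
        else:
            if is_speech:
                run = 0  # a pause shorter than the minimum is ignored
            else:
                if run == 0:
                    run_start = i
                run += 1
                # End the segment only after enough continuous silence.
                if run * frame_duration >= params.min_silence_duration:
                    segments.append((seg_start, run_start))
                    speaking, run = False, 0
    if speaking:
        segments.append((seg_start, len(probs)))
    return segments


# Made-up probabilities for 100 ms frames.
probs = [0.1, 0.9, 0.1, 0.8, 0.9, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1]
params = VadParams(min_speech_duration=0.15, min_silence_duration=0.25)
print(segment_speech(probs, 0.1, params))  # [(3, 7)]
```

In this example the single speech-like frame at index 1 is too short to start a segment, and the one-frame dip at index 5 is too short to end one. Raising min_silence_duration makes the sketch wait longer before closing a segment, at the cost of slower end-of-speech detection.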

Additional resources

The following resources provide more information about using the LiveKit Silero VAD plugin.