Overview
Use the Sarvam STT plugin to add speech recognition for Indian languages, English, and code-mixed audio to your LiveKit Agents. It fits voice agents that need broad Indic coverage with low-latency transcription, plus the option to translate, transliterate, output verbatim text, or return code-mixed transcripts.
For new voice agents, start with saaras:v3 and set the language explicitly.
Authentication
The Sarvam plugin requires a Sarvam API key.
Set SARVAM_API_KEY in your .env file:
SARVAM_API_KEY=<your-sarvam-api-key>
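The following is a minimal sketch of failing fast when the key is missing. The helper name require_sarvam_api_key is hypothetical and not part of the plugin. It only reads the SARVAM_API_KEY variable described above, and it assumes your .env file has already been loaded into the process environment.

```python
import os


def require_sarvam_api_key(env=None):
    """Return the Sarvam API key, or raise a clear error (hypothetical helper)."""
    # Default to the process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("SARVAM_API_KEY", "").strip()
    if not key:
        raise RuntimeError("SARVAM_API_KEY is not set; add it to your .env file")
    return key
```

Calling this helper once at agent startup surfaces a missing key before the first transcription request is made.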
Installation
Install the plugin:
Python:
    uv add "livekit-agents[sarvam]~=1.5"
Node.js:
    pnpm add @livekit/agents-plugin-sarvam@1.x
Usage
Use Sarvam STT in an AgentSession or as a standalone transcription service. For example, you can use this STT in the Voice AI quickstart.
For most LiveKit voice agents, start with the following settings. Explicit configuration keeps examples, debugging, and production rollouts predictable.
- language: Set the expected input language, for example en-IN or hi-IN.
- model: Use saaras:v3 for the latest Sarvam STT model with broader language support and mode control.
- mode: Use transcribe unless you specifically need translation, transliteration, verbatim output, or code-mixed output.
- sample_rate: Use 16000 for Python streaming sessions unless your audio pipeline requires a different rate.
Python:
    from livekit.agents import AgentSession
    from livekit.plugins import sarvam

    session = AgentSession(
        stt=sarvam.STT(
            language="en-IN",
            model="saaras:v3",
            mode="transcribe",  # default
            sample_rate=16000,
            high_vad_sensitivity=True,
            flush_signal=True,
        ),
        # ... llm, tts, etc.
    )
Node.js:
    import { voice } from '@livekit/agents';
    import * as sarvam from '@livekit/agents-plugin-sarvam';

    const session = new voice.AgentSession({
      stt: new sarvam.STT({
        languageCode: "en-IN",
        model: "saaras:v3",
        mode: "transcribe", // default
      }),
      // ... llm, tts, etc.
    });
Parameters
This section describes commonly used parameters. See the plugin reference links in the Additional resources section for a complete list of all available parameters.
language (LanguageCode, default: en-IN): Language code for the input audio. Language support varies by model:
- saaras:v3 supports the full set of plugin-supported languages: as-IN, bn-IN, brx-IN, doi-IN, en-IN, gu-IN, hi-IN, kn-IN, kok-IN, ks-IN, mai-IN, ml-IN, mni-IN, mr-IN, ne-IN, od-IN, pa-IN, sa-IN, sat-IN, sd-IN, ta-IN, te-IN, unknown, and ur-IN.
- saarika:v2.5 and saaras:v2.5 support bn-IN, en-IN, gu-IN, hi-IN, kn-IN, ml-IN, mr-IN, od-IN, pa-IN, ta-IN, te-IN, and unknown.
See Sarvam's language-code documentation for the list of supported languages.
In Node.js this parameter is called languageCode.
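The per-model language lists above can be checked before a session is created. The sketch below is illustrative only: the names SUPPORTED_LANGUAGES and is_language_supported are hypothetical, the code sets are copied from this page, and the plugin may perform its own validation differently.

```python
# Language codes copied from the lists on this page (not read from the plugin).
V3_LANGUAGES = frozenset({
    "as-IN", "bn-IN", "brx-IN", "doi-IN", "en-IN", "gu-IN", "hi-IN", "kn-IN",
    "kok-IN", "ks-IN", "mai-IN", "ml-IN", "mni-IN", "mr-IN", "ne-IN", "od-IN",
    "pa-IN", "sa-IN", "sat-IN", "sd-IN", "ta-IN", "te-IN", "ur-IN", "unknown",
})
V25_LANGUAGES = frozenset({
    "bn-IN", "en-IN", "gu-IN", "hi-IN", "kn-IN", "ml-IN", "mr-IN", "od-IN",
    "pa-IN", "ta-IN", "te-IN", "unknown",
})
SUPPORTED_LANGUAGES = {
    "saaras:v3": V3_LANGUAGES,
    "saarika:v2.5": V25_LANGUAGES,
    "saaras:v2.5": V25_LANGUAGES,
}


def is_language_supported(model, language):
    """Return True if this page lists `language` as supported by `model`."""
    if model not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unknown model: {model!r}")
    return language in SUPPORTED_LANGUAGES[model]
```

For example, is_language_supported("saarika:v2.5", "ur-IN") returns False, which suggests switching to saaras:v3 for Urdu audio.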
model (string, default: saarika:v2.5): The Sarvam STT model to use. Valid values are:
- saarika:v2.5
- saaras:v2.5
- saaras:v3
saaras:v3 is the latest model and the recommended default for new voice agents because it supports advanced mode control and broader language coverage.
The Python plugin automatically selects Sarvam's translate endpoint for saaras:v2.5; other models use the standard speech-to-text endpoint.
mode (string, default: transcribe): The transcription mode for saaras:v3. Valid values are:
- transcribe: Return a standard transcription in the source language.
- translate: Translate the spoken input.
- verbatim: Preserve more of the speaker's exact wording.
- translit: Return transliterated output.
- codemix: Optimize for code-mixed speech.
Only saaras:v3 supports mode selection.
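A small pre-flight check can catch an incompatible model and mode pair before the session starts. This is a sketch, not the plugin's own validation: check_mode is a hypothetical helper, and it assumes that leaving mode at its default of transcribe is acceptable for models other than saaras:v3.

```python
# Mode names copied from the list on this page.
VALID_MODES = ("transcribe", "translate", "verbatim", "translit", "codemix")


def check_mode(model, mode="transcribe"):
    """Return `mode` if the pair looks valid, otherwise raise ValueError."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: only a non-default mode needs saaras:v3.
    if mode != "transcribe" and model != "saaras:v3":
        raise ValueError(f"mode {mode!r} requires saaras:v3, got {model!r}")
    return mode
```

With this check, check_mode("saaras:v3", "codemix") passes, while check_mode("saarika:v2.5", "translate") raises a ValueError.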
sample_rate (integer, default: 16000): Input audio sample rate used for streaming sessions. Must be greater than 0.
high_vad_sensitivity (boolean): Enables Sarvam's high VAD sensitivity option for streaming transcription. Set to True if your agent needs to detect softer or shorter utterances.
flush_signal (boolean): Sends Sarvam's flush_signal streaming option when set.
input_audio_codec (string): Input audio encoding for streaming sessions. When set, it's included in the WebSocket URL and used as the audio message encoding. If omitted, the Python plugin uses audio/wav for streaming audio messages.
Fine-grained VAD options
The following fine-grained VAD parameters are sent to Sarvam only when model is saaras:v3. If unset, Sarvam applies its own defaults.
Tune these only after validating the default behavior with your target microphone, room, telephony, or browser audio path. Changing several VAD values at once can make it harder to understand why an agent starts listening too early, misses short utterances, or waits too long before finalizing a turn.
positive_speech_threshold (float): If a frame's speech probability is above this value (range 0.0 to 1.0), the plugin treats it as speech.
negative_speech_threshold (float): If a frame's speech probability falls below this value (range 0.0 to 1.0), the plugin treats it as silence.
min_speech_frames (integer): How many consecutive speech frames the plugin requires before opening a new speech segment.
first_turn_min_speech_frames (integer): How many speech frames are needed to recognize the first user turn in a session.
negative_frames_count (integer): How many silence frames within the window close out an in-progress speech segment.
negative_frames_window (integer): Window size, in frames, over which silence frames are counted toward end-of-speech.
start_speech_volume_threshold (float): Audio volume floor, in dB. Frames quieter than this are ignored for speech detection.
interrupt_min_speech_frames (integer): How many speech frames are required before incoming audio is treated as a barge-in.
pre_speech_pad_frames (integer): Audio frames included ahead of the detected speech start so the beginning of an utterance is not cut off.
num_initial_ignored_frames (integer): Audio frames discarded at the very start of the WebSocket stream.
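To build intuition for how the two thresholds, min_speech_frames, and the negative-frame window interact, the toy simulation below segments a list of per-frame speech probabilities. It is an illustration written from the parameter descriptions above, not Sarvam's actual VAD. The function name segment_speech and every default value in it are assumptions chosen for the example.

```python
from collections import deque


def segment_speech(
    probs,
    positive_speech_threshold=0.5,   # assumed value, not a Sarvam default
    negative_speech_threshold=0.35,  # assumed value, not a Sarvam default
    min_speech_frames=3,
    negative_frames_count=3,
    negative_frames_window=4,
):
    """Return (start, end) frame-index pairs for a toy hysteresis VAD."""
    segments = []
    in_speech = False
    run = 0        # consecutive speech frames seen while waiting to open
    start = None
    recent = deque(maxlen=negative_frames_window)  # silence flags in the window
    for i, p in enumerate(probs):
        if not in_speech:
            run = run + 1 if p > positive_speech_threshold else 0
            if run >= min_speech_frames:
                in_speech = True
                start = i - run + 1   # first frame of the qualifying run
                recent.clear()
        else:
            recent.append(p < negative_speech_threshold)
            if sum(recent) >= negative_frames_count:
                segments.append((start, i))  # end = frame where closure fired
                in_speech = False
                run = 0
    if in_speech:
        segments.append((start, len(probs) - 1))  # stream ended mid-segment
    return segments
```

In this model, raising min_speech_frames makes short blips less likely to open a segment, while raising negative_frames_count makes the segment stay open longer through pauses. Real behavior should be validated against representative audio.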
Troubleshooting
The following sections include common issues and their solutions.
Unsupported language or model combination
If the plugin rejects your configuration, check that the selected language, model, and mode are compatible. The mode parameter is supported only with saaras:v3.
No or delayed transcripts
Check the audio path first:
- Confirm that the LiveKit participant is publishing audio.
- Confirm that the agent session is using Sarvam as the configured stt provider.
- Use sample_rate=16000 unless your audio pipeline requires another value.
- Try disabling custom VAD options and retest with the defaults.
Short utterances are missed
For short commands, names, or interruptions, test high_vad_sensitivity=True in Python. If you are using fine-grained VAD options, tune one value at a time and validate with representative audio.
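One way to keep tuning disciplined is to generate trial configurations that each differ from a baseline in exactly one option. The sketch below assumes a hypothetical helper named one_at_a_time; the parameter names come from this page, while the candidate values are placeholders rather than recommendations.

```python
def one_at_a_time(baseline, candidates):
    """Yield (name, value, config), changing one option from the baseline per trial."""
    for name, values in candidates.items():
        for value in values:
            if baseline.get(name) == value:
                continue  # identical to the baseline, nothing to test
            config = dict(baseline)  # copy so the baseline is never mutated
            config[name] = value
            yield name, value, config


baseline = {"language": "en-IN", "model": "saaras:v3", "high_vad_sensitivity": True}
candidates = {
    "min_speech_frames": [2, 4],            # placeholder values
    "positive_speech_threshold": [0.4, 0.6],  # placeholder values
}
trials = list(one_at_a_time(baseline, candidates))
```

Each resulting config is a plain dict of keyword arguments, so in Python it could be passed along as sarvam.STT(**config) and evaluated against the same recorded audio.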
Transcripts are in the wrong language or script
Set the language explicitly instead of relying on defaults. If your use case involves translation, transliteration, or code-mixed output, use saaras:v3 and set the corresponding mode.
Additional resources
The following resources provide more information about using Sarvam with LiveKit Agents.