
Sarvam STT plugin guide

How to use the Sarvam STT plugin for LiveKit Agents.

Available in: Python | Node.js

Overview

Use the Sarvam STT plugin to add speech recognition for Indian languages, English, and code-mixed audio to your LiveKit Agents. It fits voice agents that need broad Indic coverage with low-latency transcription, plus the option to translate, transliterate, output verbatim text, or return code-mixed transcripts.

For new voice agents, start with saaras:v3 and set the language explicitly.

Authentication

The Sarvam plugin requires a Sarvam API key.

Set SARVAM_API_KEY in your .env file:

SARVAM_API_KEY=<your-sarvam-api-key>
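
As a minimal sketch of a startup check, the helper below fails fast when the key is missing or still set to the placeholder. The function name require_sarvam_api_key is hypothetical and not part of the plugin, and the sketch assumes that your .env file has already been loaded into the process environment by whatever loader you use.

```python
import os


def require_sarvam_api_key(env=os.environ):
    """Return SARVAM_API_KEY from `env`, or raise a descriptive error.

    Hypothetical helper for illustration; not part of the Sarvam plugin.
    """
    key = env.get("SARVAM_API_KEY", "").strip()
    # Reject an empty value and the unedited "<your-sarvam-api-key>" placeholder.
    if not key or key.startswith("<"):
        raise RuntimeError(
            "SARVAM_API_KEY is not set; add it to your .env file or environment."
        )
    return key
```

Calling this once before you create the agent session surfaces a missing key as a clear startup error rather than a failed request later.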

Installation

Install the plugin:

Python:

    uv add "livekit-agents[sarvam]~=1.5"

Node.js:

    pnpm add @livekit/agents-plugin-sarvam@1.x

Usage

Use Sarvam STT in an AgentSession or as a standalone transcription service. For example, you can use this STT in the Voice AI quickstart.

For most LiveKit voice agents, start with the following settings. Explicit configuration keeps examples, debugging, and production rollouts predictable.

  • language: Set the expected input language, for example en-IN or hi-IN.
  • model: Use saaras:v3 for the latest Sarvam STT model with broader language support and mode control.
  • mode: Use transcribe unless you specifically need translation, transliteration, verbatim output, or code-mixed output.
  • sample_rate: Use 16000 for Python streaming sessions unless your audio pipeline requires a different rate.

Python:

    from livekit.agents import AgentSession
    from livekit.plugins import sarvam

    session = AgentSession(
        stt=sarvam.STT(
            language="en-IN",
            model="saaras:v3",
            mode="transcribe",  # default
            sample_rate=16000,
            high_vad_sensitivity=True,
            flush_signal=True,
        ),
        # ... llm, tts, etc.
    )

Node.js:

    import { voice } from '@livekit/agents';
    import * as sarvam from '@livekit/agents-plugin-sarvam';

    const session = new voice.AgentSession({
      stt: new sarvam.STT({
        languageCode: "en-IN",
        model: "saaras:v3",
        mode: "transcribe", // default
      }),
      // ... llm, tts, etc.
    });

Parameters

This section describes commonly used parameters. See the plugin reference links in the Additional resources section for a complete list of all available parameters.

language | LanguageCode | Default: en-IN

Language code for the input audio. Language support varies by model:

  • saaras:v3 supports the full set of plugin-supported languages: as-IN, bn-IN, brx-IN, doi-IN, en-IN, gu-IN, hi-IN, kn-IN, kok-IN, ks-IN, mai-IN, ml-IN, mni-IN, mr-IN, ne-IN, od-IN, pa-IN, sa-IN, sat-IN, sd-IN, ta-IN, te-IN, unknown, and ur-IN.
  • saarika:v2.5 and saaras:v2.5 support bn-IN, en-IN, gu-IN, hi-IN, kn-IN, ml-IN, mr-IN, od-IN, pa-IN, ta-IN, te-IN, and unknown.

See Sarvam's language-code documentation for the list of supported languages.

In Node.js this parameter is called languageCode.
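
The language lists above can be turned into a pre-flight check. The following is a minimal sketch that copies the codes from this page into lookup sets. The names SUPPORTED_LANGUAGES and is_supported are hypothetical and not part of the plugin, and the sets need updating if Sarvam changes its language coverage.

```python
# Language codes copied from the lists on this page.
V3_LANGUAGES = frozenset({
    "as-IN", "bn-IN", "brx-IN", "doi-IN", "en-IN", "gu-IN", "hi-IN", "kn-IN",
    "kok-IN", "ks-IN", "mai-IN", "ml-IN", "mni-IN", "mr-IN", "ne-IN", "od-IN",
    "pa-IN", "sa-IN", "sat-IN", "sd-IN", "ta-IN", "te-IN", "unknown", "ur-IN",
})
V2_5_LANGUAGES = frozenset({
    "bn-IN", "en-IN", "gu-IN", "hi-IN", "kn-IN", "ml-IN", "mr-IN", "od-IN",
    "pa-IN", "ta-IN", "te-IN", "unknown",
})

# Hypothetical lookup table keyed by model name.
SUPPORTED_LANGUAGES = {
    "saaras:v3": V3_LANGUAGES,
    "saarika:v2.5": V2_5_LANGUAGES,
    "saaras:v2.5": V2_5_LANGUAGES,
}


def is_supported(language, model):
    """Return True if `language` is listed for `model` on this page."""
    return language in SUPPORTED_LANGUAGES.get(model, frozenset())
```

For example, is_supported("ur-IN", "saaras:v3") is True, while is_supported("ur-IN", "saarika:v2.5") is False.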

model | string | Default: saarika:v2.5

The Sarvam STT model to use. Valid values are:

  • saarika:v2.5
  • saaras:v2.5
  • saaras:v3

saaras:v3 is the latest model and the recommended choice for new voice agents because it supports mode control and broader language coverage. Because the plugin default is saarika:v2.5, set model explicitly to use it.

The Python plugin automatically selects Sarvam's translate endpoint for saaras:v2.5; other models use the standard speech-to-text endpoint.

mode | string | Default: transcribe

The transcription mode for saaras:v3. Valid values are:

  • transcribe: Return a standard transcription in the source language.
  • translate: Translate the spoken input.
  • verbatim: Preserve more of the speaker's exact wording.
  • translit: Return transliterated output.
  • codemix: Optimize for code-mixed speech.

Only saaras:v3 supports mode selection.
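
The rule above can be checked before constructing the STT instance. This is a minimal sketch under stated assumptions: check_mode is a hypothetical helper rather than a plugin function, and it assumes that leaving mode at its transcribe default is harmless for the v2.5 models. How the plugin itself reacts to an incompatible combination is not described here.

```python
VALID_MODES = ("transcribe", "translate", "verbatim", "translit", "codemix")


def check_mode(model, mode="transcribe"):
    """Return `mode` if it is valid for `model`, otherwise raise ValueError.

    Hypothetical helper for illustration; not part of the Sarvam plugin.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    # Assumption: only a non-default mode is a problem on models other than saaras:v3.
    if mode != "transcribe" and model != "saaras:v3":
        raise ValueError(f"mode {mode!r} requires model 'saaras:v3', got {model!r}")
    return mode
```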

sample_rate | integer | Default: 16000 | Python only

Input audio sample rate used for streaming sessions. Must be greater than 0.

high_vad_sensitivity | boolean | Python only

Enables Sarvam's high VAD sensitivity option for streaming transcription. Set to True if your agent needs to detect softer or shorter utterances.

flush_signal | boolean | Python only

Sends Sarvam's flush_signal streaming option when set.

input_audio_codec | string | Python only

Input audio encoding for streaming sessions. When set, it's included in the WebSocket URL and used as the audio message encoding. If omitted, the Python plugin uses audio/wav for streaming audio messages.

Fine-grained VAD options

Available in Python only.

The following fine-grained VAD parameters are sent to Sarvam only when model is saaras:v3. If unset, Sarvam applies its own defaults.

Tune these only after validating the default behavior with your target microphone, room, telephony, or browser audio path. Changing several VAD values at once can make it harder to understand why an agent starts listening too early, misses short utterances, or waits too long before finalizing a turn.

positive_speech_threshold | float

If a frame's speech probability is above this value (range 0.0 to 1.0), the plugin treats it as speech.

negative_speech_threshold | float

If a frame's speech probability falls below this value (range 0.0 to 1.0), the plugin treats it as silence.

min_speech_frames | integer

How many consecutive speech frames the plugin requires before opening a new speech segment.

first_turn_min_speech_frames | integer

How many speech frames are needed to recognize the first user turn in a session.

negative_frames_count | integer

How many silence frames within the window close out an in-progress speech segment.

negative_frames_window | integer

Window size, in frames, over which silence frames are counted toward end-of-speech.
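
To build intuition for how the two thresholds and the frame counts described so far interact, here is a generic two-threshold (hysteresis) segmenter written from those descriptions. It is an illustrative sketch only, not Sarvam's actual VAD implementation. The function name segment_speech and all default values are assumptions chosen for the example.

```python
from collections import deque


def segment_speech(probs, positive=0.5, negative=0.35, min_speech_frames=3,
                   negative_frames_count=4, negative_frames_window=6):
    """Return (start, end) frame-index pairs, end exclusive, for speech segments.

    `probs` is a sequence of per-frame speech probabilities in [0.0, 1.0].
    Illustrative only; defaults are made up and this is not Sarvam's algorithm.
    """
    segments = []
    in_speech = False
    run = 0      # consecutive frames above `positive`
    start = 0
    recent = deque(maxlen=negative_frames_window)  # silence flags for recent frames
    for i, p in enumerate(probs):
        if not in_speech:
            run = run + 1 if p > positive else 0
            if run >= min_speech_frames:
                # Enough consecutive speech frames: open a segment at the run start.
                in_speech = True
                start = i - run + 1
                recent.clear()
        else:
            recent.append(p < negative)
            if sum(recent) >= negative_frames_count:
                # Enough silence frames inside the window: close the segment.
                segments.append((start, i + 1))
                in_speech = False
                run = 0
    if in_speech:
        segments.append((start, len(probs)))
    return segments
```

In this sketch, raising min_speech_frames makes short blips less likely to open a segment, while raising negative_frames_count makes the segmenter wait longer before closing one.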

start_speech_volume_threshold | float

Audio volume floor, in dB. Frames quieter than this are ignored for speech detection.

interrupt_min_speech_frames | integer

How many speech frames are required before incoming audio is treated as a barge-in.

pre_speech_pad_frames | integer

Audio frames included ahead of the detected speech start so the beginning of an utterance is not cut off.

num_initial_ignored_frames | integer

Audio frames discarded at the very start of the WebSocket stream.
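
One way to keep VAD tuning explicit is to collect the options in a dictionary, validate them, and pass them to sarvam.STT as keyword arguments. The sketch below assumes a hypothetical helper named build_vad_options. The 0.0 to 1.0 range check comes from the threshold descriptions above, while the requirement that frame counts be non-negative integers is an assumption of this example.

```python
THRESHOLDS = ("positive_speech_threshold", "negative_speech_threshold")
FRAME_COUNTS = (
    "min_speech_frames", "first_turn_min_speech_frames", "negative_frames_count",
    "negative_frames_window", "interrupt_min_speech_frames",
    "pre_speech_pad_frames", "num_initial_ignored_frames",
)
OTHER = ("start_speech_volume_threshold",)


def build_vad_options(model, **options):
    """Return validated VAD keyword arguments, dropping options set to None.

    Hypothetical helper for illustration; not part of the Sarvam plugin.
    """
    unknown = set(options) - set(THRESHOLDS + FRAME_COUNTS + OTHER)
    if unknown:
        raise ValueError(f"unknown VAD option(s): {sorted(unknown)}")
    if model != "saaras:v3":
        # These options are only sent for saaras:v3, so there is nothing to pass.
        return {}
    selected = {name: value for name, value in options.items() if value is not None}
    for name in THRESHOLDS:
        if name in selected and not 0.0 <= selected[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    for name in FRAME_COUNTS:
        # Assumption: frame counts are non-negative integers.
        if name in selected and (not isinstance(selected[name], int) or selected[name] < 0):
            raise ValueError(f"{name} must be a non-negative integer")
    return selected


# Usage with the plugin (values are illustrative, not recommendations):
# stt = sarvam.STT(language="hi-IN", model="saaras:v3",
#                  **build_vad_options("saaras:v3", min_speech_frames=3))
```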

Troubleshooting

The following sections include common issues and their solutions.

Unsupported language or model combination

If the plugin rejects your configuration, check that the selected language, model, and mode are compatible. Mode selection is supported only with saaras:v3, and the v2.5 models support fewer languages than saaras:v3.

No or delayed transcripts

Check the audio path first:

  • Confirm that the LiveKit participant is publishing audio.
  • Confirm that the agent session is using Sarvam as the configured stt provider.
  • Use sample_rate=16000 unless your audio pipeline requires another value.
  • Try disabling custom VAD options and retest with the defaults.

Short utterances are missed

For short commands, names, or interruptions, test high_vad_sensitivity=True in Python. If you are using fine-grained VAD options, tune one value at a time and validate with representative audio.
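
Tuning one value at a time can be made systematic by generating trial configurations that each differ from a baseline in exactly one option. The helper below is a hypothetical sketch; one_at_a_time is not a plugin function, and the candidate values you feed it are your own choices.

```python
def one_at_a_time(baseline, candidates):
    """Yield (name, value, config) where config differs from baseline in one key.

    Hypothetical helper for illustration; not part of the Sarvam plugin.
    """
    for name, values in candidates.items():
        for value in values:
            if baseline.get(name) == value:
                continue  # identical to the baseline, nothing to test
            yield name, value, {**baseline, name: value}
```

Each yielded config can be passed to sarvam.STT as keyword arguments and tested against the same representative recordings, so that any change in behavior can be attributed to a single option.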

Transcripts are in the wrong language or script

Set the language explicitly instead of relying on defaults. If your use case involves translation, transliteration, or code-mixed output, use saaras:v3 and set the corresponding mode.

Additional resources

The following resources provide more information about using Sarvam with LiveKit Agents.