# Sarvam STT plugin guide

> How to use the Sarvam STT plugin for LiveKit Agents.

Available in:
- [x] Node.js
- [x] Python

## Overview

Use the Sarvam STT plugin to add speech recognition for Indian languages, English, and code-mixed audio to your LiveKit Agents. It fits voice agents that need broad Indic coverage with low-latency transcription, plus the option to translate, transliterate, output verbatim text, or return code-mixed transcripts.

For new voice agents, start with `saaras:v3` and set the language explicitly. The plugin's default model is `saarika:v2.5`, so set `model` explicitly as well.

### Authentication

The Sarvam plugin requires a [Sarvam API key](https://docs.sarvam.ai/).

Set `SARVAM_API_KEY` in your `.env` file:

```shell
SARVAM_API_KEY=<your-sarvam-api-key>
```
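If you want the agent to fail fast when the key is missing, a small startup check can help. The following is a minimal sketch using only the standard library. `require_sarvam_api_key` is a hypothetical helper name, not part of the plugin, and it assumes the variable has already been loaded into the process environment (for example, by your process manager or a dotenv loader).

```python
import os


def require_sarvam_api_key(env=None):
    """Return the Sarvam API key, or raise a clear error if it is missing.

    Hypothetical helper for illustration; not part of the Sarvam plugin.
    """
    # Allow a plain dict to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    key = (env.get("SARVAM_API_KEY") or "").strip()
    if not key:
        raise RuntimeError("SARVAM_API_KEY is not set; add it to your .env file")
    return key
```

Calling `require_sarvam_api_key()` once at startup surfaces a missing key before the first audio frame is sent.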

### Installation

Install the plugin:

**Python**:

```shell
uv add "livekit-agents[sarvam]~=1.5"
```

---

**Node.js**:

```shell
pnpm add @livekit/agents-plugin-sarvam@1.x
```

### Usage

Use Sarvam STT in an `AgentSession` or as a standalone transcription service. For example, you can use this STT in the [Voice AI quickstart](https://docs.livekit.io/agents/start/voice-ai.md).

For most LiveKit voice agents, start with the following settings. Explicit configuration keeps examples, debugging, and production rollouts predictable.

- `language`: Set the expected input language, for example `en-IN` or `hi-IN`.
- `model`: Use `saaras:v3` for the latest Sarvam STT model with broader language support and mode control.
- `mode`: Use `transcribe` unless you specifically need translation, transliteration, verbatim output, or code-mixed output.
- `sample_rate`: Use `16000` for Python streaming sessions unless your audio pipeline requires a different rate.

**Python**:

```python
from livekit.agents import AgentSession
from livekit.plugins import sarvam

session = AgentSession(
    stt=sarvam.STT(
        language="en-IN",
        model="saaras:v3",
        mode="transcribe",  # default
        sample_rate=16000,  # default
        # Python-only streaming options
        high_vad_sensitivity=True,
        flush_signal=True,
    ),
    # ... llm, tts, etc.
)
```

---

**Node.js**:

```typescript
import { voice } from '@livekit/agents';
import * as sarvam from '@livekit/agents-plugin-sarvam';

const session = new voice.AgentSession({
    stt: new sarvam.STT({
        languageCode: "en-IN",
        model: "saaras:v3",
        mode: "transcribe",  // default
    }),
    // ... llm, tts, etc.
});
```

### Parameters

This section describes commonly used parameters. See the plugin reference links in the [Additional resources](#additional-resources) section for a complete list of all available parameters.

- **`language`** _(LanguageCode)_ (optional) - Default: `en-IN`: [Language code](https://docs.livekit.io/agents/models/stt.md#language-codes) for the input audio. Language support varies by model:

  - `saaras:v3` supports the full set of plugin-supported languages: `as-IN`, `bn-IN`, `brx-IN`, `doi-IN`, `en-IN`, `gu-IN`, `hi-IN`, `kn-IN`, `kok-IN`, `ks-IN`, `mai-IN`, `ml-IN`, `mni-IN`, `mr-IN`, `ne-IN`, `od-IN`, `pa-IN`, `sa-IN`, `sat-IN`, `sd-IN`, `ta-IN`, `te-IN`, `unknown`, and `ur-IN`.
  - `saarika:v2.5` and `saaras:v2.5` support `bn-IN`, `en-IN`, `gu-IN`, `hi-IN`, `kn-IN`, `ml-IN`, `mr-IN`, `od-IN`, `pa-IN`, `ta-IN`, `te-IN`, and `unknown`.

  See [Sarvam's language-code documentation](https://docs.sarvam.ai/api-reference-docs/speech-to-text/transcribe#request.body.language_code.language_code) for the list of supported languages.

  In Node.js, this parameter is called `languageCode`.

- **`model`** _(string)_ (optional) - Default: `saarika:v2.5`: The Sarvam STT model to use. Valid values are:

  - `saarika:v2.5`
  - `saaras:v2.5`
  - `saaras:v3`

  `saaras:v3` is the latest model and the recommended choice for new voice agents because it supports advanced mode control and broader language coverage.

  The Python plugin automatically selects Sarvam's translate endpoint for `saaras:v2.5`; other models use the standard speech-to-text endpoint.

- **`mode`** _(string)_ (optional) - Default: `transcribe`: The transcription mode for `saaras:v3`. Valid values are:

  - `transcribe`: Return a standard transcription in the source language.
  - `translate`: Translate the spoken input.
  - `verbatim`: Preserve more of the speaker's exact wording.
  - `translit`: Return transliterated output.
  - `codemix`: Optimize for code-mixed speech.

  Only `saaras:v3` supports mode selection.

- **`sample_rate`** _(integer)_ (optional) - Default: `16000`: Input audio sample rate used for streaming sessions. Must be greater than `0`.

  Available in:
  - [ ] Node.js
  - [x] Python

- **`high_vad_sensitivity`** _(boolean)_ (optional): Enables Sarvam's high VAD sensitivity option for streaming transcription. Set to `True` if your agent needs to detect softer or shorter utterances.

  Available in:
  - [ ] Node.js
  - [x] Python

- **`flush_signal`** _(boolean)_ (optional): Sends Sarvam's `flush_signal` streaming option when set.

  Available in:
  - [ ] Node.js
  - [x] Python

- **`input_audio_codec`** _(string)_ (optional): Input audio encoding for streaming sessions. When set, it's included in the WebSocket URL and used as the audio message encoding. If omitted, the Python plugin uses `audio/wav` for streaming audio messages.

  Available in:
  - [ ] Node.js
  - [x] Python
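The per-model language lists given under `language` can be turned into a small lookup, which is handy for validating user-supplied settings before you construct the STT instance. This is a sketch only: the names `V25_LANGUAGES`, `V3_LANGUAGES`, `SUPPORTED_LANGUAGES`, and `is_language_supported` are hypothetical and not part of the plugin, and the sets simply copy the lists on this page, so they may lag behind Sarvam's own documentation.

```python
# Language codes copied from the lists on this page (illustrative only).
V25_LANGUAGES = frozenset({
    "bn-IN", "en-IN", "gu-IN", "hi-IN", "kn-IN", "ml-IN",
    "mr-IN", "od-IN", "pa-IN", "ta-IN", "te-IN", "unknown",
})
# saaras:v3 supports everything above plus these additional codes.
V3_LANGUAGES = V25_LANGUAGES | frozenset({
    "as-IN", "brx-IN", "doi-IN", "kok-IN", "ks-IN", "mai-IN",
    "mni-IN", "ne-IN", "sa-IN", "sat-IN", "sd-IN", "ur-IN",
})

SUPPORTED_LANGUAGES = {
    "saarika:v2.5": V25_LANGUAGES,
    "saaras:v2.5": V25_LANGUAGES,
    "saaras:v3": V3_LANGUAGES,
}


def is_language_supported(model, language):
    """Return True if `language` appears in the list for `model`."""
    languages = SUPPORTED_LANGUAGES.get(model)
    if languages is None:
        raise ValueError(f"unknown model {model!r}")
    return language in languages
```

For example, `is_language_supported("saarika:v2.5", "ur-IN")` is false under these lists, which suggests switching to `saaras:v3` for Urdu input.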

#### Fine-grained VAD options

Available in:
- [ ] Node.js
- [x] Python

The following fine-grained VAD parameters are sent to Sarvam only when `model` is `saaras:v3`. If unset, Sarvam applies its own defaults.

Tune these only after validating the default behavior with your target microphone, room, telephony, or browser audio path. Changing several VAD values at once can make it harder to understand why an agent starts listening too early, misses short utterances, or waits too long before finalizing a turn.

- **`positive_speech_threshold`** _(float)_ (optional): If a frame's speech probability is above this value (range `0.0` to `1.0`), the plugin treats it as speech.

- **`negative_speech_threshold`** _(float)_ (optional): If a frame's speech probability falls below this value (range `0.0` to `1.0`), the plugin treats it as silence.

- **`min_speech_frames`** _(integer)_ (optional): How many consecutive speech frames the plugin requires before opening a new speech segment.

- **`first_turn_min_speech_frames`** _(integer)_ (optional): How many speech frames are needed to recognize the first user turn in a session.

- **`negative_frames_count`** _(integer)_ (optional): How many silence frames within the window close out an in-progress speech segment.

- **`negative_frames_window`** _(integer)_ (optional): Window size, in frames, over which silence frames are counted toward end-of-speech.

- **`start_speech_volume_threshold`** _(float)_ (optional): Audio volume floor, in dB. Frames quieter than this are ignored for speech detection.

- **`interrupt_min_speech_frames`** _(integer)_ (optional): How many speech frames are required before incoming audio is treated as a barge-in.

- **`pre_speech_pad_frames`** _(integer)_ (optional): Audio frames included ahead of the detected speech start so the beginning of an utterance is not cut off.

- **`num_initial_ignored_frames`** _(integer)_ (optional): Audio frames discarded at the very start of the WebSocket stream.
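When VAD settings come from a config file, it can be useful to drop unset values and check the two threshold ranges before passing them on. The sketch below mirrors the behavior described above: options only apply to `saaras:v3`, and unset values are left out so Sarvam's defaults apply. `build_vad_kwargs` is a hypothetical helper, not part of the plugin; the option names are the ones listed in this section.

```python
VAD_OPTION_NAMES = (
    "positive_speech_threshold",
    "negative_speech_threshold",
    "min_speech_frames",
    "first_turn_min_speech_frames",
    "negative_frames_count",
    "negative_frames_window",
    "start_speech_volume_threshold",
    "interrupt_min_speech_frames",
    "pre_speech_pad_frames",
    "num_initial_ignored_frames",
)
# Options documented above with a 0.0 to 1.0 range.
PROBABILITY_OPTIONS = ("positive_speech_threshold", "negative_speech_threshold")


def build_vad_kwargs(model, **options):
    """Return the VAD options worth passing for `model` (hypothetical helper)."""
    unknown = set(options) - set(VAD_OPTION_NAMES)
    if unknown:
        raise ValueError(f"unknown VAD options: {sorted(unknown)}")
    if model != "saaras:v3":
        # Fine-grained VAD options only apply to saaras:v3.
        return {}
    # Drop unset values so Sarvam's own defaults are used for them.
    kwargs = {name: value for name, value in options.items() if value is not None}
    for name in PROBABILITY_OPTIONS:
        value = kwargs.get(name)
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return kwargs
```

The result can be spread into the constructor, for example `sarvam.STT(model="saaras:v3", **build_vad_kwargs("saaras:v3", min_speech_frames=3))`. The value `3` is arbitrary and only illustrates the call shape; it is not a tuning recommendation.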

### Troubleshooting

The following sections include common issues and their solutions.

#### Unsupported language or model combination

If the plugin rejects your configuration, check that the selected `language`, `model`, and `mode` are compatible. `mode` selection is supported only with `saaras:v3`.
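The mode rule can be checked before the session starts. The following is a minimal sketch; `check_mode` is a hypothetical helper rather than a plugin API, and it assumes that leaving `mode` at its default of `transcribe` is acceptable for the older models.

```python
VALID_MODELS = ("saarika:v2.5", "saaras:v2.5", "saaras:v3")
VALID_MODES = ("transcribe", "translate", "verbatim", "translit", "codemix")


def check_mode(model, mode="transcribe"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if model not in VALID_MODELS:
        problems.append(f"unknown model {model!r}")
    if mode not in VALID_MODES:
        problems.append(f"unknown mode {mode!r}")
    elif mode != "transcribe" and model != "saaras:v3":
        # Assumption: the default mode is fine on models without mode selection.
        problems.append(f"mode {mode!r} requires model 'saaras:v3'")
    return problems
```

Logging the returned list at startup makes an incompatible combination visible in one place instead of surfacing later as a rejected request.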

#### No or delayed transcripts

Check the audio path first:

- Confirm that the LiveKit participant is publishing audio.
- Confirm that the agent session is using Sarvam as the configured `stt` provider.
- Use `sample_rate=16000` unless your audio pipeline requires another value.
- Try disabling custom VAD options and retest with the defaults.

#### Short utterances are missed

For short commands, names, or interruptions, test `high_vad_sensitivity=True` in Python. If you are using fine-grained VAD options, tune one value at a time and validate with representative audio.
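Tuning one value at a time is easier to keep honest if the test configurations are generated rather than edited by hand. The sketch below produces variants that each differ from a baseline in exactly one setting. `one_at_a_time` is a hypothetical helper, and any candidate values you feed it are your own test inputs, not recommended settings.

```python
def one_at_a_time(baseline, candidates):
    """Yield (name, value, config) where config changes one key of baseline.

    `baseline` is a dict of current settings; `candidates` maps an option
    name to the list of values you want to try for it.
    """
    for name, values in candidates.items():
        for value in values:
            if name in baseline and baseline[name] == value:
                continue  # same as the baseline, nothing new to test
            config = dict(baseline)  # copy so the baseline is never mutated
            config[name] = value
            yield name, value, config
```

Each yielded `config` can be passed as keyword arguments to the STT constructor for a test run against the same representative recording, so any change in behavior is attributable to a single option.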

#### Transcripts are in the wrong language or script

Set the language explicitly instead of relying on defaults. If your use case involves translation, transliteration, or code-mixed output, use `saaras:v3` and set the corresponding `mode`.

## Additional resources

The following resources provide more information about using Sarvam with LiveKit Agents.

- **[Sarvam docs](https://docs.sarvam.ai/)**: Sarvam's full docs site.

- **[Sarvam STT API reference](https://docs.sarvam.ai/api-reference-docs/speech-to-text/transcribe)**: Sarvam's speech-to-text API documentation.

- **[Voice AI quickstart](https://docs.livekit.io/agents/start/voice-ai.md)**: Get started with LiveKit Agents and Sarvam.

- **[Sarvam TTS](https://docs.livekit.io/agents/models/tts/sarvam.md)**: Guide to the Sarvam TTS plugin with LiveKit Agents.

---

This document was rendered at 2026-06-07T11:35:51.214Z.
For the latest version of this document, see [https://docs.livekit.io/agents/models/stt/sarvam.md](https://docs.livekit.io/agents/models/stt/sarvam.md).

To explore all LiveKit documentation, see [llms.txt](https://docs.livekit.io/llms.txt).