Sarvam TTS plugin guide

How to use the Sarvam TTS plugin for LiveKit Agents.

Available in Python | Node.js

Overview

Use the Sarvam TTS plugin to synthesize Indian-language and English speech in LiveKit Agents. It provides natural Indic voices, low-latency turn-taking, configurable speaking style, and production audio formats for browser, mobile, and telephony use cases.

For new voice agents, start with bulbul:v3, set target_language_code explicitly, and choose a speaker that is compatible with the selected model.

Authentication

The Sarvam plugin requires a Sarvam API key.

Set SARVAM_API_KEY in your .env file:

SARVAM_API_KEY=<your-sarvam-api-key>
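
A missing key often surfaces only when the first synthesis request fails, so it can help to check for it at startup. The following is a minimal sketch using only the standard library. `require_sarvam_api_key` is a hypothetical helper, not part of the plugin, and it assumes your .env file has already been loaded into the process environment.

```python
import os
from typing import Mapping, Optional


def require_sarvam_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return SARVAM_API_KEY, or raise if it is unset or blank.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    source = os.environ if env is None else env
    key = source.get("SARVAM_API_KEY", "").strip()
    if not key:
        raise RuntimeError("SARVAM_API_KEY is not set; add it to your .env file")
    return key
```

Calling this once before constructing the session turns a late request failure into an early, readable error.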

Installation

Install the plugin:

Python:

uv add "livekit-agents[sarvam]~=1.5"

Node.js:

pnpm add @livekit/agents-plugin-sarvam@1.x

Usage

Use Sarvam TTS within an AgentSession or as a standalone speech generator. For example, you can use this TTS in the Voice AI quickstart.

For most LiveKit voice agents, begin with the following settings. Explicit configuration makes voice quality, latency, and deployment behavior easier to reproduce across environments.

  • target_language_code / targetLanguageCode: Set the language your agent should speak, for example hi-IN or en-IN.
  • model: Use bulbul:v3.
  • speaker: Use a speaker supported by the selected model. The default is shubh for bulbul:v3.
  • speech_sample_rate / sampleRate: Use 22050 for general voice agent audio; use 8000 only when your downstream path requires narrowband telephony audio.
  • pace: Start at 1.0, then tune after listening to full agent turns.

Python:

    from livekit.agents import AgentSession
    from livekit.plugins import sarvam

    session = AgentSession(
        tts=sarvam.TTS(
            target_language_code="hi-IN",
            model="bulbul:v3",
            speaker="shubh",
            speech_sample_rate=22050,
            pace=1.0,
            output_audio_bitrate="128k",
            output_audio_codec="mp3",
            min_buffer_size=50,
            max_chunk_length=150,
            send_completion_event=True,
        ),
        # ... llm, stt, etc.
    )

Node.js:

    import { voice } from '@livekit/agents';
    import * as sarvam from '@livekit/agents-plugin-sarvam';

    const session = new voice.AgentSession({
      tts: new sarvam.TTS({
        targetLanguageCode: "hi-IN",
        model: "bulbul:v3",
        speaker: "shubh",
        pace: 1.0,
        temperature: 0.6,
      }),
      // ... llm, stt, etc.
    });

Parameters

This section describes commonly used parameters. See the plugin reference links in the Additional resources section for a complete list of all available parameters.

target_language_code | LanguageCode | Required

The language for synthesized speech. In Node.js, this parameter is called targetLanguageCode.

Set this explicitly instead of relying on defaults. The text you send to TTS should match the selected target language and script for the most predictable output.

See Sarvam's target-language documentation for the list of supported languages.

model | string | Default: bulbul:v3

The Sarvam TTS model to use. Valid values are:

  • bulbul:v3
  • bulbul:v2

Use bulbul:v3 for new voice agent builds unless you need a bulbul:v2-only option such as pitch, loudness, or enable_preprocessing.

The default model for Node.js is bulbul:v2.

speaker | string | Default: varies by model

The voice to use for synthesis. Defaults depend on the selected model:

  • shubh for bulbul:v3
  • anushka for bulbul:v2

Speakers are validated for model compatibility. If synthesis fails after changing model or speaker, check that the speaker is supported by that model. See Speakers for the full list of available voices per model.

pace | float | Default: 1.0

Speech rate multiplier. Valid range: 0.3 to 3.0.

temperature | float | Default: 0.6

Controls output randomness. Valid range: 0.01 to 2.0. Only sent if model is bulbul:v3 or bulbul:v3-beta; ignored for bulbul:v2.

pitch | float | Default: 0.0

Voice pitch adjustment. Accepted range: -0.75 to 0.75. The Python plugin clamps values outside this range to the nearest boundary and logs a warning. Included in the synthesis payload for bulbul:v2.
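
The out-of-range handling for pitch can be sketched as follows. This is an illustration of clamping with a warning, not the plugin's source; `clamp_pitch` and the logger name are hypothetical.

```python
import logging

# Hypothetical logger name, used only for this example.
logger = logging.getLogger("sarvam_tts_example")

# Accepted pitch range, as documented above.
PITCH_MIN, PITCH_MAX = -0.75, 0.75


def clamp_pitch(pitch: float) -> float:
    """Clamp pitch to the accepted range and log a warning if it changed."""
    clamped = max(PITCH_MIN, min(PITCH_MAX, pitch))
    if clamped != pitch:
        logger.warning("pitch %.2f is out of range; using %.2f", pitch, clamped)
    return clamped
```

Validating the value yourself before passing it to the plugin makes the adjustment visible in your own code rather than only in the logs.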

dict_id | string

Custom pronunciation dictionary ID. Only available for, and only sent with, the bulbul:v3 model. Create and manage dictionaries using the Pronunciation Dictionary API.

In Node.js, this parameter is called dictId.

loudness | float | Default: 1.0

Volume multiplier. Valid range: 0.5 to 2.0. Included in synthesis payload for bulbul:v2.

enable_preprocessing | boolean | Default: false

Controls whether English words and numeric entities, such as numbers and dates, are normalized.

This option is only valid if model is bulbul:v2 and is ignored for other models.

In Node.js this parameter is called enablePreprocessing.

speech_sample_rate | int | Default: 22050

Output sample rate in Hz. Supported values: 8000, 16000, 22050, 24000, 32000, 44100, and 48000.

In Node.js this parameter is called sampleRate.

output_audio_bitrate | string | Default: 128k
Only available in Python

Output audio bitrate. Allowed values: 32k, 64k, 96k, 128k, 192k.

output_audio_codec | string | Default: mp3
Only available in Python

Output audio codec. Allowed values are aac, alaw, flac, linear16, mp3, mulaw, opus, and wav. The Python plugin decodes mulaw and alaw to 16-bit PCM before emitting audio frames.
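
To make the mulaw-to-PCM step concrete, here is a sketch of the standard G.711 mu-law expansion in pure Python. It is not taken from the plugin, whose implementation may differ; the function names are hypothetical.

```python
import struct


def mulaw_byte_to_pcm16(value: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    value = ~value & 0xFF            # mu-law bytes are stored bit-inverted
    sign = value & 0x80              # top bit carries the sign
    exponent = (value >> 4) & 0x07   # 3-bit segment number
    mantissa = value & 0x0F          # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def mulaw_to_pcm16(data: bytes) -> bytes:
    """Decode a mu-law byte string to little-endian 16-bit PCM."""
    samples = [mulaw_byte_to_pcm16(b) for b in data]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Each input byte becomes two output bytes, so one second of 8000 Hz mulaw audio (8000 bytes) expands to 16000 bytes of PCM.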

min_buffer_size | integer | Default: 50

Minimum number of buffered characters that triggers a flush to the TTS model. Valid range: 30 to 200.

max_chunk_length | integer | Default: 150

Maximum length for sentence splitting. Valid range: 50 to 500.
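
The numeric ranges documented in this section can be checked before a configuration is deployed. The sketch below is a hypothetical pre-flight check, not plugin behavior; `RANGES` simply restates the ranges from this page, and `out_of_range` is an illustrative name.

```python
# Valid ranges as documented in this section (inclusive bounds assumed).
RANGES = {
    "pace": (0.3, 3.0),
    "temperature": (0.01, 2.0),
    "loudness": (0.5, 2.0),
    "min_buffer_size": (30, 200),
    "max_chunk_length": (50, 500),
}


def out_of_range(settings: dict) -> list[str]:
    """Return one message per setting that falls outside its documented range."""
    problems = []
    for name, value in settings.items():
        if name not in RANGES:
            continue  # settings without a documented numeric range are skipped
        low, high = RANGES[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low} to {high}")
    return problems
```

For example, `out_of_range({"pace": 1.0, "min_buffer_size": 10})` reports only the buffer size.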

enable_cached_responses | boolean

Enables Sarvam's cached responses beta option. Only sent when model is bulbul:v2.

send_completion_event | boolean | Default: true

Controls whether the Sarvam WebSocket connection requests explicit completion events for streaming synthesis.
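
Because several parameters apply to only one model, a setting can be dropped or ignored without any visible error. The following sketch lists which settings would have no effect for a given model, based only on the descriptions in this section. The helper name and the grouping sets are assumptions made for illustration; the plugin itself does not expose them.

```python
# Grouping taken from the parameter descriptions above (assumed complete
# only for the parameters documented on this page).
V3_ONLY = {"temperature", "dict_id"}
V2_ONLY = {"pitch", "loudness", "enable_preprocessing", "enable_cached_responses"}


def ignored_settings(model: str, settings: dict) -> list[str]:
    """Return the names of settings documented as specific to the other model."""
    if model == "bulbul:v2":
        inapplicable = V3_ONLY
    elif model.startswith("bulbul:v3"):
        inapplicable = V2_ONLY
    else:
        raise ValueError(f"unknown model: {model}")
    return sorted(name for name in settings if name in inapplicable)
```

Running this over a configuration after switching models shows which tuning values need to be revisited.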

Speakers

Speaker availability depends on the selected model. The following lists show all speakers supported by the Python plugin. The Node.js plugin supports additional bulbul:v3 speakers not listed here. For the most up-to-date list, see How to change the speaker.

bulbul:v3

The default speaker is shubh.

Female: amelia, ishita, kavitha, kavya, neha, pooja, priya, ritu, roopa, rupali, shruti, shreya, simran, sophia, suhani, tanya.

Male: aayan, aditya, advait, amit, ashutosh, dev, kabir, manan, rahul, ratan, rohan, shubh, sumit, varun.

bulbul:v2

The default speaker is anushka.

Female: anushka, arya, manisha, vidya.

Male: abhilash, hitesh, karun.

Troubleshooting

Common issues and solutions for the Sarvam TTS plugin.

Unsupported speaker or model

If the plugin rejects your configuration, check the model and speaker combination. Speaker availability depends on the selected model, and some parameters are model-specific.

Audio starts too slowly

For streaming voice agents, review chunking and buffering first:

  • Reduce min_buffer_size gradually if the agent waits too long before speaking.
  • Reduce max_chunk_length if long LLM responses are delaying synthesis.
  • Keep punctuation in the generated text so the TTS system can split speech naturally.
  • Avoid changing several latency-related settings at once.

Speech sounds rushed, slow, or unnatural

Start with pace=1.0 and temperature=0.6, then tune one setting at a time. If the agent speaks long paragraphs, consider splitting the LLM response into shorter, conversational sentences before it reaches TTS.
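
One way to split a long LLM response before it reaches TTS is a simple punctuation-based splitter. The sketch below is an assumption about how you might do this in your own agent code, not something the plugin provides. It treats the Devanagari danda (।) as a sentence end alongside ".", "!", and "?", and it will also split after abbreviations such as "Dr.", so treat it as a starting point.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?।])\s+")


def split_into_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the closing punctuation on each."""
    parts = _SENTENCE_END.split(text.strip())
    return [part.strip() for part in parts if part.strip()]
```

Each returned sentence can then be sent to the agent's speech output as its own short turn segment.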

Output format does not match your media path

Check speech_sample_rate, output_audio_codec, and output_audio_bitrate. Browser playback, mobile playback, and telephony paths often need different formats. For phone calls, confirm whether your provider expects 8000 Hz audio, mulaw, alaw, or linear PCM.
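
The three output settings can be checked against the allowed values listed in the Parameters section before a deployment. This is a hypothetical pre-flight check, not part of the plugin; `check_output_format` and the constant names are illustrative.

```python
# Allowed values as listed in the Parameters section of this guide.
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100, 48000}
CODECS = {"aac", "alaw", "flac", "linear16", "mp3", "mulaw", "opus", "wav"}
BITRATES = {"32k", "64k", "96k", "128k", "192k"}


def check_output_format(
    sample_rate: int, codec: str = "mp3", bitrate: str = "128k"
) -> list[str]:
    """Return a message for each output setting that is not an allowed value."""
    problems = []
    if sample_rate not in SAMPLE_RATES:
        problems.append(f"unsupported speech_sample_rate: {sample_rate}")
    if codec not in CODECS:
        problems.append(f"unsupported output_audio_codec: {codec}")
    if bitrate not in BITRATES:
        problems.append(f"unsupported output_audio_bitrate: {bitrate}")
    return problems
```

A telephony path might be checked with `check_output_format(8000, "mulaw")`, which returns an empty list when all values are allowed. Whether your provider actually expects that format still has to be confirmed with the provider.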

Pronunciations are inconsistent

For bulbul:v3, use dict_id when you need consistent pronunciations for names, brands, product terms, acronyms, or domain-specific words, provided you have an existing Sarvam TTS pronunciation dictionary.

Additional resources

The following resources provide more information about using Sarvam with LiveKit Agents.