Overview
Use the Sarvam TTS plugin to synthesize Indian-language and English speech in LiveKit Agents. It provides natural Indic voices, low-latency turn-taking, configurable speaking style, and production audio formats for browser, mobile, and telephony use cases.
For new voice agents, start with bulbul:v3, set target_language_code explicitly, and choose a speaker that is compatible with the selected model.
Authentication
The Sarvam plugin requires a Sarvam API key.
Set SARVAM_API_KEY in your .env file:
SARVAM_API_KEY=<your-sarvam-api-key>
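A minimal sketch of a fail-fast check for the key before the agent starts. The helper name require_sarvam_api_key is hypothetical and not part of the plugin. The sketch assumes the key is already present in the process environment, for example because your launcher loaded the .env file.

```python
import os


def require_sarvam_api_key(env=None):
    """Return SARVAM_API_KEY or raise a clear error (hypothetical helper)."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = env.get("SARVAM_API_KEY", "").strip()
    if not key:
        raise RuntimeError("SARVAM_API_KEY is not set; add it to your .env file")
    return key
```

Calling this once at startup surfaces a missing key immediately instead of at the first synthesis request.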
Installation
Install the plugin:
uv add "livekit-agents[sarvam]~=1.5"
pnpm add @livekit/agents-plugin-sarvam@1.x
Usage
Use Sarvam TTS within an AgentSession or as a standalone speech generator. For example, you can use this TTS in the Voice AI quickstart.
For most LiveKit voice agents, begin with the following settings. Explicit configuration makes voice quality, latency, and deployment behavior easier to reproduce across environments.
- target_language_code / targetLanguageCode: Set the language your agent should speak, for example hi-IN or en-IN.
- model: Use bulbul:v3.
- speaker: Use a speaker supported by the selected model. The default is shubh for bulbul:v3.
- speech_sample_rate / sampleRate: Use 22050 for general voice agent audio; use 8000 only when your downstream path requires narrowband telephony audio.
- pace: Start at 1.0, then tune after listening to full agent turns.
from livekit.agents import AgentSession
from livekit.plugins import sarvam

session = AgentSession(
    tts=sarvam.TTS(
        target_language_code="hi-IN",
        model="bulbul:v3",
        speaker="shubh",
        speech_sample_rate=22050,
        pace=1.0,
        output_audio_bitrate="128k",
        output_audio_codec="mp3",
        min_buffer_size=50,
        max_chunk_length=150,
        send_completion_event=True,
    ),
    # ... llm, stt, etc.
)
import { voice } from '@livekit/agents';
import * as sarvam from '@livekit/agents-plugin-sarvam';

const session = new voice.AgentSession({
  tts: new sarvam.TTS({
    targetLanguageCode: "hi-IN",
    model: "bulbul:v3",
    speaker: "shubh",
    pace: 1.0,
    temperature: 0.6,
  }),
  // ... llm, stt, etc.
});
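The starting values above can also be kept in one place and reused across environments. The following is a sketch only: starting_tts_settings is a hypothetical helper, not a plugin function, and the telephony flag is an assumption used here to switch between the two sample rates mentioned in the list.

```python
def starting_tts_settings(language_code, telephony=False):
    """Build keyword arguments for sarvam.TTS from the starting values.

    Hypothetical helper; not part of the Sarvam plugin.
    """
    return {
        "target_language_code": language_code,
        "model": "bulbul:v3",
        "speaker": "shubh",  # default speaker for bulbul:v3
        # 8000 Hz only for narrowband telephony paths; 22050 Hz otherwise.
        "speech_sample_rate": 8000 if telephony else 22050,
        "pace": 1.0,
    }


settings = starting_tts_settings("hi-IN")
# With the plugin installed and SARVAM_API_KEY set, the dictionary can be
# unpacked into the constructor:
# tts = sarvam.TTS(**settings)
```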
Parameters
This section describes commonly used parameters. See the plugin reference links in the Additional resources section for a complete list of all available parameters.
target_language_code (LanguageCode)
The language for synthesized speech. In Node.js, this parameter is called targetLanguageCode.
Set this explicitly instead of relying on defaults. The text you send to TTS should match the selected target language and script for the most predictable output.
See Sarvam's target-language documentation for the list of supported languages.
model (string, default: bulbul:v3)
The Sarvam TTS model to use. Valid values are:
- bulbul:v3
- bulbul:v2
Use bulbul:v3 for new voice agent builds unless you need a bulbul:v2-only option such as pitch, loudness, or enable_preprocessing.
The default model for Node.js is bulbul:v2.
speaker (string, default: varies by model)
The voice to use for synthesis. Defaults depend on the selected model:
- shubh for bulbul:v3
- anushka for bulbul:v2
Speakers are validated for model compatibility. If synthesis fails after changing model or speaker, check that the speaker is supported by that model. See Speakers for the full list of available voices per model.
pace (float, default: 1.0)
Speech rate multiplier. Valid range: 0.3 to 3.0.
temperature (float, default: 0.6)
Controls output randomness. Valid range: 0.01 to 2.0. Only sent if model is bulbul:v3 or bulbul:v3-beta; ignored for bulbul:v2.
pitch (float, default: 0.0)
Voice pitch adjustment. Accepted range: -0.75 to 0.75. The Python plugin clamps values outside this range to the nearest boundary and logs a warning. Included in the synthesis payload for bulbul:v2.
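The clamping behavior described for pitch can be illustrated as follows. This is a sketch of the idea, not the plugin's source code; the function name clamp_pitch and the logger name are assumptions made for the example.

```python
import logging

logger = logging.getLogger("sarvam-example")  # hypothetical logger name

PITCH_MIN, PITCH_MAX = -0.75, 0.75


def clamp_pitch(pitch):
    """Clamp pitch into the accepted range and warn if it was adjusted."""
    clamped = max(PITCH_MIN, min(PITCH_MAX, pitch))
    if clamped != pitch:
        logger.warning("pitch %.2f is out of range; using %.2f", pitch, clamped)
    return clamped
```

Under this sketch, clamp_pitch(1.2) yields 0.75 and logs a warning, while an in-range value such as 0.1 passes through unchanged.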
dict_id (string)
Custom pronunciation dictionary ID. Only available for the bulbul:v3 model. Create and manage dictionaries using the Pronunciation Dictionary API.
In Node.js this parameter is called dictId.
loudness (float, default: 1.0)
Volume multiplier. Valid range: 0.5 to 2.0. Included in the synthesis payload for bulbul:v2.
enable_preprocessing (boolean, default: false)
Controls whether English words and numeric entities, such as numbers and dates, are normalized before synthesis.
This option is only valid if model is bulbul:v2 and is ignored for other models.
In Node.js this parameter is called enablePreprocessing.
speech_sample_rate (int, default: 22050)
Output sample rate in Hz. Supported values: 8000, 16000, 22050, 24000, 32000, 44100, and 48000.
In Node.js this parameter is called sampleRate.
output_audio_bitrate (string, default: 128k)
Output audio bitrate. Allowed values: 32k, 64k, 96k, 128k, 192k.
output_audio_codec (string, default: mp3)
Output audio codec. Allowed values are aac, alaw, flac, linear16, mp3, mulaw, opus, and wav. The Python plugin decodes mulaw and alaw to 16-bit PCM before emitting audio frames.
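To make the mulaw-to-PCM step concrete, here is a minimal pure-Python sketch of standard G.711 mu-law expansion. It is not the plugin's implementation, which is not shown in this document; the function names are hypothetical, and the output is assumed to be little-endian 16-bit PCM.

```python
import struct


def mulaw_byte_to_pcm16(value):
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    value = ~value & 0xFF  # mu-law bytes are stored bit-inverted
    sign = value & 0x80
    exponent = (value >> 4) & 0x07
    mantissa = value & 0x0F
    # Rebuild the magnitude and remove the encoder bias (0x84 = 132).
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def mulaw_to_pcm16(data):
    """Decode mu-law bytes to little-endian 16-bit PCM bytes."""
    samples = [mulaw_byte_to_pcm16(b) for b in data]
    return struct.pack("<%dh" % len(samples), *samples)
```

Each input byte expands to two output bytes, so one second of 8000 Hz mulaw audio becomes 16000 bytes of PCM.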
min_buffer_size (integer, default: 50)
Minimum character length that triggers buffer flushing for TTS model processing. Valid range: 30 to 200.
max_chunk_length (integer, default: 150)
Maximum length for sentence splitting. Valid range: 50 to 500.
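The numeric ranges listed in this section can be checked before constructing the TTS instance. The sketch below is illustrative: VALID_RANGES and out_of_range are hypothetical names, and the table simply copies the ranges stated for pace, temperature, loudness, min_buffer_size, and max_chunk_length.

```python
# Ranges copied from the parameter descriptions in this section.
VALID_RANGES = {
    "pace": (0.3, 3.0),
    "temperature": (0.01, 2.0),
    "loudness": (0.5, 2.0),
    "min_buffer_size": (30, 200),
    "max_chunk_length": (50, 500),
}


def out_of_range(settings):
    """Return {name: (low, high)} for each setting outside its listed range."""
    problems = {}
    for name, (low, high) in VALID_RANGES.items():
        if name in settings and not low <= settings[name] <= high:
            problems[name] = (low, high)
    return problems
```

An empty result means every supplied value is within its listed range; settings that are not in the table are left unchecked.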
enable_cached_responses (boolean)
Enables Sarvam's cached responses beta option. Only sent when model is bulbul:v2.
send_completion_event (boolean, default: true)
Controls whether the Sarvam WebSocket URL requests explicit completion events for streaming synthesis.
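Several parameters in this section apply to only one model. The following sketch flags settings that would have no effect for a given model, based solely on the rules stated above. The names V2_ONLY, V3_ONLY, and ignored_for_model are hypothetical, and the sketch handles only the two listed model values.

```python
# Model-specific parameters as described in this section.
V2_ONLY = {"pitch", "loudness", "enable_preprocessing", "enable_cached_responses"}
V3_ONLY = {"temperature", "dict_id"}


def ignored_for_model(model, settings):
    """List settings that, per the rules above, are not used for this model."""
    if model == "bulbul:v2":
        unused = V3_ONLY
    elif model == "bulbul:v3":
        unused = V2_ONLY
    else:
        raise ValueError(f"unknown model: {model}")
    return sorted(name for name in settings if name in unused)
```

Running a check like this when switching models helps explain why a previously tuned value, such as pitch, stops affecting the output.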
Speakers
Speaker availability depends on the selected model. The following lists show all speakers supported by the Python plugin. The Node.js plugin supports additional bulbul:v3 speakers not listed here. For the most up-to-date list, see How to change the speaker.
bulbul:v3
The default speaker is shubh.
Female: amelia, ishita, kavitha, kavya, neha, pooja, priya, ritu, roopa, rupali, shruti, shreya, simran, sophia, suhani, tanya.
Male: aayan, aditya, advait, amit, ashutosh, dev, kabir, manan, rahul, ratan, rohan, shubh, sumit, varun.
bulbul:v2
The default speaker is anushka.
Female: anushka, arya, manisha, vidya.
Male: abhilash, hitesh, karun.
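A pre-flight compatibility check can be sketched from the lists above. The speaker sets mirror the Python plugin lists as written in this section and may lag behind the service; resolve_speaker is a hypothetical helper, not a plugin function.

```python
# Speaker names copied from the lists in this section.
SPEAKERS = {
    "bulbul:v3": {
        "amelia", "ishita", "kavitha", "kavya", "neha", "pooja", "priya",
        "ritu", "roopa", "rupali", "shruti", "shreya", "simran", "sophia",
        "suhani", "tanya",
        "aayan", "aditya", "advait", "amit", "ashutosh", "dev", "kabir",
        "manan", "rahul", "ratan", "rohan", "shubh", "sumit", "varun",
    },
    "bulbul:v2": {
        "anushka", "arya", "manisha", "vidya",
        "abhilash", "hitesh", "karun",
    },
}
DEFAULT_SPEAKER = {"bulbul:v3": "shubh", "bulbul:v2": "anushka"}


def resolve_speaker(model, speaker=None):
    """Return a speaker valid for the model, or raise ValueError."""
    if model not in SPEAKERS:
        raise ValueError(f"unknown model: {model}")
    if speaker is None:
        return DEFAULT_SPEAKER[model]
    if speaker not in SPEAKERS[model]:
        raise ValueError(f"speaker {speaker!r} is not listed for {model}")
    return speaker
```

For example, resolve_speaker("bulbul:v3") falls back to shubh, while passing anushka with bulbul:v3 raises an error because that voice is listed only for bulbul:v2.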
Troubleshooting
Common issues and solutions for the Sarvam TTS plugin.
Unsupported speaker or model
If the plugin rejects your configuration, check the model and speaker combination. Speaker availability depends on the selected model, and some parameters are model-specific.
Audio starts too slowly
For streaming voice agents, review chunking and buffering first:
- Reduce min_buffer_size gradually if the agent waits too long before speaking.
- Reduce max_chunk_length if long LLM responses are delaying synthesis.
- Keep punctuation in the generated text so the TTS system can split speech naturally.
- Avoid changing several latency-related settings at once.
Speech sounds rushed, slow, or unnatural
Start with pace=1.0 and temperature=0.6, then tune one setting at a time. If the agent speaks long paragraphs, consider splitting the LLM response into shorter, conversational sentences before it reaches TTS.
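One way to split a long LLM response before it reaches TTS is a simple punctuation-based splitter. This is a rough sketch, not how the plugin or Sarvam segments text: split_into_sentences is a hypothetical helper, and treating the Devanagari danda (।) as a sentence terminator is an assumption added for Hindi text.

```python
import re

# Split after ., !, ?, or the Devanagari danda when followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?।])\s+")


def split_into_sentences(text):
    """Return the non-empty sentences found in text."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [part for part in parts if part]
```

A splitter this simple also breaks after abbreviations such as "Dr.", so treat it as a starting point rather than a complete segmenter.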
Output format does not match your media path
Check speech_sample_rate, output_audio_codec, and output_audio_bitrate. Browser playback, mobile playback, and telephony paths often need different formats. For phone calls, confirm whether your provider expects 8000 Hz audio, mulaw, alaw, or linear PCM.
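The three format settings can be checked against the allowed values listed in the Parameters section. In this sketch, check_output_format is a hypothetical helper, and its defaults repeat the documented Python defaults.

```python
# Allowed values copied from the Parameters section.
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100, 48000}
BITRATES = {"32k", "64k", "96k", "128k", "192k"}
CODECS = {"aac", "alaw", "flac", "linear16", "mp3", "mulaw", "opus", "wav"}


def check_output_format(speech_sample_rate=22050,
                        output_audio_codec="mp3",
                        output_audio_bitrate="128k"):
    """Return a list of problems; an empty list means all values are allowed."""
    errors = []
    if speech_sample_rate not in SAMPLE_RATES:
        errors.append(f"unsupported sample rate: {speech_sample_rate}")
    if output_audio_codec not in CODECS:
        errors.append(f"unsupported codec: {output_audio_codec}")
    if output_audio_bitrate not in BITRATES:
        errors.append(f"unsupported bitrate: {output_audio_bitrate}")
    return errors
```

A telephony configuration such as check_output_format(8000, "mulaw") passes this check; whether it matches your provider's expectations still needs to be confirmed separately.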
Pronunciations are inconsistent
For bulbul:v3, use dict_id when you need consistent pronunciations for names, brands, product terms, acronyms, or domain-specific words, provided you have an existing Sarvam TTS pronunciation dictionary.
Additional resources
The following resources provide more information about using Sarvam with LiveKit Agents.