Overview
Voice agent speech is produced by a TTS model, configured with a voice profile that specifies tone, accent, and other qualitative characteristics of the speech. The TTS model converts the text output of the LLM into speech, so the agent can speak its response to the user.
You can choose a voice model served through LiveKit Inference or you can use a plugin to connect directly to a wider range of model providers with your own account.
LiveKit Inference
The following models are available in LiveKit Inference. Refer to the guide for each model for more details on additional configuration options. A limited selection of suggested voices is listed here; a wider selection is available through each provider's documentation.
Cartesia
Reference for Cartesia TTS in LiveKit Inference.
ElevenLabs
Reference for ElevenLabs TTS with LiveKit Inference.
Inworld
Reference for Inworld TTS in LiveKit Inference.
Rime
Reference for Rime TTS in LiveKit Inference.
Suggested voices
The following voices are good choices for overall quality and performance. Each provider offers a much larger selection, documented on their own site, and any of those voices can also be used through LiveKit Inference.
Usage
To set up TTS in an `AgentSession`, provide a descriptor with both the desired model and voice. LiveKit Inference manages the connection to the model automatically. Consult the suggested voices list above, or view the model reference for more voices.
```python
from livekit.agents import AgentSession

session = AgentSession(
    tts="cartesia/sonic-2:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
    # ... llm, stt, etc.
)
```
```typescript
import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
  tts: "cartesia/sonic-2:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
  // ... llm, stt, etc.
});
```
Additional parameters
More configuration options, such as custom pronunciation, are available for each model. To set additional parameters, use the `TTS` class from the `inference` module. Consult each model reference for examples and available parameters.
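As a sketch, explicit construction might look like the following. The exact parameter names here (`model`, `voice`, and the `extra_kwargs` dict for provider-specific options) are assumptions for illustration; consult the model reference for the real signature and supported options.

```python
from livekit.agents import AgentSession, inference

# Construct the TTS explicitly instead of passing a descriptor string,
# so additional parameters can be set. The `extra_kwargs` name below is
# an assumed placeholder for provider-specific options such as custom
# pronunciation; check the model reference for the actual parameter.
tts = inference.TTS(
    model="cartesia/sonic-2",
    voice="9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
    extra_kwargs={"speed": "normal"},  # hypothetical provider option
)

session = AgentSession(
    tts=tts,
    # ... llm, stt, etc.
)
```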
Plugins
The LiveKit Agents framework also includes a variety of open source plugins for a wide range of TTS providers. Plugins are especially useful if you need custom voices, including voice cloning support. These plugins require you to authenticate with the provider directly, usually via an API key; you are responsible for setting up your own account and managing your own billing and credentials. The plugins are listed below, along with their availability for Python or Node.js.
| Provider | Python | Node.js |
| --- | --- | --- |
|  | ✓ | — |
|  | ✓ | — |
|  | ✓ | — |
|  | ✓ | — |
|  | ✓ | ✓ |
|  | ✓ | — |
|  | ✓ | ✓ |
|  | ✓ | — |
|  | ✓ | — |
|  | ✓ | — |
|  | ✓ | — |
|  | ✓ | — |
|  | ✓ | — |
|  | ✓ | ✓ |
|  | ✓ | ✓ |
|  | ✓ | — |
|  | ✓ | ✓ |
|  | ✓ | ✓ |
|  | ✓ | — |
|  | ✓ | — |
|  | ✓ | — |
|  | ✓ | — |
Have another provider in mind? LiveKit is open source and welcomes new plugin contributions.
Advanced features
The following sections cover more advanced topics common to all TTS providers. For more detailed reference on individual provider configuration, consult the model reference or plugin documentation for that provider.
Custom TTS
To create an entirely custom TTS, implement the TTS node in your agent.
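As a minimal sketch, overriding the TTS node on an agent might look like the following. This assumes a `tts_node` override point and an `Agent.default.tts_node` helper for delegating to the default pipeline; treat the exact names and signatures as illustrative and confirm them against the framework documentation.

```python
from typing import AsyncIterable

from livekit import rtc
from livekit.agents import Agent, ModelSettings


class CustomTTSAgent(Agent):
    async def tts_node(
        self,
        text: AsyncIterable[str],
        model_settings: ModelSettings,
    ) -> AsyncIterable[rtc.AudioFrame]:
        # Pre-process the incoming text stream, then delegate to the
        # default TTS pipeline. For a fully custom TTS, replace the
        # delegation below with your own synthesis code that yields
        # rtc.AudioFrame objects.
        async def adjusted() -> AsyncIterable[str]:
            async for chunk in text:
                # e.g. a simple pronunciation fix before synthesis
                yield chunk.replace("LiveKit", "Live Kit")

        async for frame in Agent.default.tts_node(self, adjusted(), model_settings):
            yield frame
```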
Standalone usage
You can use a `TTS` instance as a standalone component by creating a stream. Use `push_text` to add text to the stream, then consume the resulting stream of `SynthesizedAudio` frames to publish as realtime audio to another participant.
Here is an example of a standalone TTS app:
```python
from typing import AsyncIterable

from livekit import agents, rtc
from livekit.agents.tts import SynthesizedAudio
from livekit.plugins import cartesia


async def entrypoint(ctx: agents.JobContext):
    text_stream: AsyncIterable[str] = ...  # you need to provide a stream of text
    audio_source = rtc.AudioSource(44100, 1)
    track = rtc.LocalAudioTrack.create_audio_track("agent-audio", audio_source)
    await ctx.room.local_participant.publish_track(track)

    tts = cartesia.TTS(model="sonic-english")
    tts_stream = tts.stream()

    async def send_audio(audio_stream: AsyncIterable[SynthesizedAudio]):
        async for a in audio_stream:
            await audio_source.capture_frame(a.frame)

    # create a task to consume and publish audio frames
    ctx.create_task(send_audio(tts_stream))

    # push text into the stream; the TTS stream emits audio frames along
    # with events indicating sentence (or segment) boundaries
    async for text in text_stream:
        tts_stream.push_text(text)
    tts_stream.end_input()


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```
Additional resources
The following resources cover related topics that may be useful for your application.