Overview
Text-to-speech (TTS) models produce realtime synthetic speech from text input. In voice AI, this allows a text-based LLM to speak its response to the user.
Available providers
The agents framework includes plugins for the following TTS providers out-of-the-box. Choose a provider from the list for a step-by-step guide. You can also implement the TTS node to provide custom behavior or an alternative provider.
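For reference, here is a minimal sketch of a custom TTS node. It assumes the `tts_node` override and `Agent.default` helpers from the voice pipeline nodes API; the agent class name and the text adjustment are illustrative only:

```python
from typing import AsyncIterable

from livekit import rtc
from livekit.agents import Agent, ModelSettings


class PreprocessedTTSAgent(Agent):
    async def tts_node(
        self, text: AsyncIterable[str], model_settings: ModelSettings
    ) -> AsyncIterable[rtc.AudioFrame]:
        # Illustrative pre-processing: adjust the text before synthesis,
        # then delegate to the default TTS node (the configured plugin).
        async def adjusted() -> AsyncIterable[str]:
            async for chunk in text:
                yield chunk.replace("LiveKit", "Live Kit")

        async for frame in Agent.default.tts_node(self, adjusted(), model_settings):
            yield frame
```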
All TTS providers support high-quality, low-latency, lifelike multilingual voice synthesis. Support for other features is noted in the following table.
| Provider | Prompt | Custom Voices | Pronunciation | Aligned Transcripts | Available in |
| --- | --- | --- | --- | --- | --- |
|  | — | — | ✓ SSML | — | Python |
|  | — | — | ✓ SSML | — | Python |
|  | ✓ | — | — | — | Python |
|  | — | — | — | — | Python |
|  | — | ✓ | ✓ SSML | ✓ | Python |
|  | — | — | — | — | Python |
|  | — | ✓ | ✓ SSML | ✓ | Python, Node.js |
|  | — | — | ✓ SSML | — | Python, Node.js |
|  | — | — | ✓ SSML | — | Python |
|  | — | — | — | — | Python |
|  | ✓ | ✓ | — | — | Python |
|  | — | ✓ | — | — | Python |
|  | — | ✓ | — | — | Python |
|  | — | ✓ | — | — | Python |
|  | ✓ | — | — | — | Python, Node.js |
|  | — | ✓ | — | — | Python |
|  | — | ✓ | ✓ SSML | — | Python |
|  | — | — | ✓ Custom | — | Python |
|  | — | — | — | — | Python |
|  | — | ✓ | ✓ SSML | — | Python |
|  | — | — | — | — | Python |
Have another provider in mind? LiveKit is open source and welcomes new plugin contributions.
How to use
The following sections describe high-level usage only.
For more detailed information about installing and using plugins, see the plugins overview.
Usage in AgentSession
Construct an AgentSession or Agent with a TTS instance created by your desired plugin:
```python
from livekit.agents import AgentSession
from livekit.plugins import cartesia

session = AgentSession(
    tts=cartesia.TTS(model="sonic-english"),
)
```
AgentSession automatically sends LLM responses to the TTS model, and also supports a say method for one-off responses.
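For example, here is a minimal sketch of a one-off greeting, assuming an Agent subclass with an on_enter lifecycle hook; the class name and greeting text are illustrative:

```python
from livekit.agents import Agent


class Greeter(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a friendly voice assistant.")

    async def on_enter(self) -> None:
        # Speak a fixed greeting through the configured TTS without generating an LLM response.
        await self.session.say("Hi there! How can I help today?")
```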
Standalone usage
You can also use a TTS instance in a standalone fashion by creating a stream. Use push_text to add text to the stream, then consume the resulting stream of SynthesizedAudio to publish realtime audio to another participant.
Here is an example of a standalone TTS app:
```python
from typing import AsyncIterable

from livekit import agents, rtc
from livekit.agents.tts import SynthesizedAudio
from livekit.plugins import cartesia


async def entrypoint(ctx: agents.JobContext):
    text_stream: AsyncIterable[str] = ...  # you need to provide a stream of text

    # create and publish an audio track to carry the synthesized speech
    audio_source = rtc.AudioSource(44100, 1)
    track = rtc.LocalAudioTrack.create_audio_track("agent-audio", audio_source)
    await ctx.room.local_participant.publish_track(track)

    tts = cartesia.TTS(model="sonic-english")
    tts_stream = tts.stream()

    # create a task to consume and publish audio frames
    ctx.create_task(send_audio(tts_stream, audio_source))

    # push text into the stream; the TTS stream will emit audio frames along with
    # events indicating sentence (or segment) boundaries.
    async for text in text_stream:
        tts_stream.push_text(text)

    tts_stream.end_input()


async def send_audio(
    audio_stream: AsyncIterable[SynthesizedAudio], audio_source: rtc.AudioSource
):
    # capture each synthesized frame into the published audio source
    async for a in audio_stream:
        await audio_source.capture_frame(a.frame)


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```