Text-to-speech (TTS) integrations

Guides for adding TTS integrations to your agents.

Text-to-speech (TTS) models produce realtime synthetic speech from text input. In voice AI, this allows a text-based LLM to speak its response to the user.

The agents framework includes plugins for popular TTS providers out of the box. You can also implement the TTS node to provide custom behavior or use an alternative provider.
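One common reason to customize the TTS step is to adjust the text before it is synthesized. The sketch below shows only that text transform, as a plain async generator using the standard library. The names `PRONUNCIATIONS`, `adjust_pronunciation`, and `demo` are hypothetical and are not part of the framework, and wiring the generator into a TTS node is not shown here.

```python
import asyncio
import re
from typing import AsyncIterable, AsyncIterator, List

# Hypothetical replacement table: spellings a voice might mispronounce.
PRONUNCIATIONS = {"LiveKit": "Live Kit", "TTS": "text to speech"}

# Match any key of the table as a whole word.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, PRONUNCIATIONS)) + r")\b")


async def adjust_pronunciation(text: AsyncIterable[str]) -> AsyncIterator[str]:
    """Rewrite each text chunk before it would be handed to a TTS model."""
    async for chunk in text:
        yield _PATTERN.sub(lambda m: PRONUNCIATIONS[m.group(1)], chunk)


async def demo() -> List[str]:
    async def llm_output() -> AsyncIterator[str]:
        # Stand-in for the streamed LLM text an agent would produce.
        for chunk in ["Welcome to LiveKit. ", "TTS is ready."]:
            yield chunk

    return [chunk async for chunk in adjust_pronunciation(llm_output())]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

This sketch rewrites each chunk independently, so a word split across two chunks would not be matched. A fuller version would likely buffer text up to a word or sentence boundary first.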

LiveKit is open source and welcomes new plugin contributions.

How to use

The following sections describe high-level usage only.

For more detailed information about installing and using plugins, see the plugins overview.

Usage in AgentSession

Construct an AgentSession or Agent with a TTS instance created by your desired plugin:

from livekit.agents import AgentSession
from livekit.plugins import cartesia

session = AgentSession(
    tts=cartesia.TTS(model="sonic-english")
)

AgentSession automatically sends LLM responses to the TTS model, and also supports a say method for one-off responses.

Standalone usage

You can also use a TTS instance on its own by creating a stream. Use push_text to add text to the stream, then consume the resulting stream of SynthesizedAudio and publish it as realtime audio to another participant.

Here is an example of a standalone TTS app:

from livekit import agents, rtc
from livekit.agents.tts import SynthesizedAudio
from livekit.plugins import cartesia
from typing import AsyncIterable


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    text_stream: AsyncIterable[str] = ...  # you need to provide a stream of text
    audio_source = rtc.AudioSource(44100, 1)
    track = rtc.LocalAudioTrack.create_audio_track("agent-audio", audio_source)
    await ctx.room.local_participant.publish_track(track)

    # defined before it is used so that it can close over audio_source
    async def send_audio(audio_stream: AsyncIterable[SynthesizedAudio]):
        async for a in audio_stream:
            # each SynthesizedAudio item carries one audio frame
            await audio_source.capture_frame(a.frame)

    tts = cartesia.TTS(model="sonic-english")
    tts_stream = tts.stream()

    # create a task to consume and publish audio frames
    ctx.create_task(send_audio(tts_stream))

    # push text into the stream, TTS stream will emit audio frames along with events
    # indicating sentence (or segment) boundaries.
    async for text in text_stream:
        tts_stream.push_text(text)
    tts_stream.end_input()


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
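The push-and-consume pattern in the example above can be tried without a provider or a room. The following is a minimal sketch using only asyncio. `FakeTTSStream`, `consume`, and `main` are hypothetical stand-ins rather than LiveKit classes, and the "audio" is just the encoded text. The sketch assumes a stream that accepts text through push_text, is closed with end_input, and is read with `async for`.

```python
import asyncio
from typing import AsyncIterator, List


class FakeTTSStream:
    """Hypothetical stand-in for a plugin's TTS stream; not a LiveKit class."""

    def __init__(self) -> None:
        self._queue = asyncio.Queue()

    def push_text(self, text: str) -> None:
        # A real stream would synthesize speech; here each chunk becomes fake audio bytes.
        self._queue.put_nowait(text.encode("utf-8"))

    def end_input(self) -> None:
        # None marks the end of input so the consumer can stop iterating.
        self._queue.put_nowait(None)

    def __aiter__(self) -> AsyncIterator[bytes]:
        return self

    async def __anext__(self) -> bytes:
        item = await self._queue.get()
        if item is None:
            raise StopAsyncIteration
        return item


async def consume(stream: FakeTTSStream, sink: List[bytes]) -> None:
    # Plays the role of send_audio: drain the stream and forward each item.
    async for audio in stream:
        sink.append(audio)


async def main() -> List[bytes]:
    stream = FakeTTSStream()
    published: List[bytes] = []
    # Start the consumer first so output is forwarded while text is still arriving.
    consumer = asyncio.create_task(consume(stream, published))
    for text in ["Hello, ", "world."]:
        stream.push_text(text)
    stream.end_input()
    await consumer
    return published


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch the consumer finishes only because end_input is called. Without it, the `async for` loop would wait indefinitely for more items.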

Available providers

The following table lists the available TTS providers for LiveKit Agents.

Further reading
