Agent speech

Explore the speech capabilities and features of LiveKit Agents.

Overview

Speech capabilities are a core feature of AI agents, enabling them to interact with users through voice. This guide covers the various speech features and functionalities available for agents.

Text to speech (TTS) is a synthesis process that converts written text into spoken audio. TTS models produce realistic and expressive speech, giving AI agents a "voice." Some providers offer custom voice generation so you can give your own voice to your agent.

Differences between TTS and realtime APIs

You can use the TTS class within the STT-LLM-TTS pipeline of an AgentSession to generate spoken audio for responses generated by the LLM. It also functions as a standalone speech generator.
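
Because a TTS instance works on its own, you can also synthesize speech without a full agent pipeline. The following is a minimal sketch under stated assumptions: it assumes the plugin's streaming synthesize() method can be iterated to yield audio events that expose a frame attribute, which you should verify against your provider plugin's documentation.

# Hedged sketch: standalone speech generation outside of an AgentSession.
# Assumes tts.synthesize() returns an async iterable of audio events that
# each expose an rtc.AudioFrame via `.frame`; verify with your plugin docs.
from livekit.plugins import elevenlabs

tts = elevenlabs.tts.TTS()

async def synthesize_to_frames(text: str) -> list:
    frames = []
    async for audio_event in tts.synthesize(text):
        frames.append(audio_event.frame)  # collect the synthesized audio frames
    return frames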

Multimodal, realtime APIs such as the OpenAI Realtime API and Google Multimodal Live API generate speech directly, without a separate TTS step. A provider-specific TTS instance lets you customize the model, voice, and other speech output parameters, whereas creating an agent with a realtime API only requires specifying the voice for your agent.

For more information on the OpenAI Realtime API, see the OpenAI integration guide. For details on the Google Multimodal Live API, see the Google integration guide.

Usage examples

To use the TTS provider plugins, you need to install the livekit-agents package, as well as the specific provider plugin.

Creating a TTS

Create a TTS instance using the specific provider plugin. This example uses ElevenLabs for TTS:

  1. Install the provider plugin:

    pip install "livekit-agents[elevenlabs]~=1.0rc"
  2. Create a TTS instance:

    from livekit.plugins import elevenlabs

    eleven_tts = elevenlabs.tts.TTS(
        model="eleven_turbo_v2_5",
        voice=elevenlabs.tts.Voice(
            id="EXAVITQu4vr4xnSDxMaL",
            name="Bella",
        ),
        language="en",
        streaming_latency=3,
        enable_ssml_parsing=False,
        chunk_length_schedule=[80, 120, 200, 260],
    )

For a more complete example, see the ElevenLabs TTS guide.
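
As a rough sketch, the TTS instance created above can then be supplied to an AgentSession as the TTS stage of its pipeline. The STT and LLM plugins below (Deepgram and OpenAI) are illustrative assumptions, not requirements, and each needs its own plugin installed:

from livekit.agents.voice import AgentSession
from livekit.plugins import deepgram, openai

# Illustrative STT-LLM-TTS pipeline; swap in whichever providers you prefer.
session = AgentSession(
    stt=deepgram.STT(),                   # assumed STT choice for this sketch
    llm=openai.LLM(model="gpt-4o-mini"),  # assumed LLM choice for this sketch
    tts=eleven_tts,                       # the ElevenLabs TTS created above
)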

RealtimeModel usage

This example creates an agent session using the OpenAI Realtime API:

from livekit.plugins import openai
from livekit.agents.voice import AgentSession

session = AgentSession(
    llm=openai.realtime.RealtimeModel(
        voice="alloy",
    ),
)

Synchronized transcription forwarding

You can forward agent speech transcriptions to your frontend client to display text synchronized with agent speech. Synchronized transcriptions display text word by word as the agent speaks. When you interrupt the agent, the transcription stops and truncates to match the speech.

Enable transcriptions by setting the transcription_enabled option of RoomOutputOptions to True, and pass RoomOutputOptions to your session's start() method. To learn more and see example code, see Text and transcriptions.
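
As a minimal sketch, assuming RoomOutputOptions is importable from the voice module and that ctx is the JobContext passed to your entrypoint, this might look like the following:

from livekit.agents.voice import AgentSession
from livekit.agents.voice.room_io import RoomOutputOptions

session = AgentSession(
    # ... stt/llm/tts plugins or a realtime model, as shown above ...
)

await session.start(
    agent=agent,  # your Agent instance
    room=ctx.room,
    room_output_options=RoomOutputOptions(transcription_enabled=True),
)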

Controlling agent speech

In some cases, you want to manually control when an agent speaks. The agent methods in this section allow you to manually stop or generate speech.

Instruct an agent to speak

Use the say() method to have an agent speak using a specified source. The source can be the text you want the agent to speak. For example, say "Hello. How can I help you today?" when a new participant joins a room:

Realtime models and TTS

The say method requires a TTS plugin. If you're using a realtime model, you need to add a TTS plugin to your session or use the generate_reply() method instead.

await session.say("Hello. How can I help you today?", allow_interruptions=True)
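
To greet a participant when they join, you could call say() from a room event handler. The sketch below assumes ctx is the JobContext passed to your entrypoint and that the session has already been started; the wiring is illustrative:

# Illustrative: greet each participant as they connect. say() schedules the
# speech and returns a SpeechHandle, so it can be called from a synchronous
# event handler without awaiting.
@ctx.room.on("participant_connected")
def _on_participant_connected(participant):
    session.say("Hello. How can I help you today?", allow_interruptions=True)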

Parameters

text (string, optional)
The text to speak.

audio (rtc.AudioFrame, optional)
Audio data to play instead of the text.

allow_interruptions (boolean, optional)
If True, allow the user to interrupt the agent while speaking.

add_to_chat_ctx (boolean, optional)
If True, add the text to the agent's chat context.

Returns

Returns a SpeechHandle object.

Manually interrupt and generate responses

If you're handling turn detection manually, for example with a push-to-talk interface, you can manually interrupt the agent and generate a response.

Use the interrupt() method to manually interrupt speech within a session:

session.interrupt()

To learn more about manual interruptions, see Turn detection and interruptions.

Use the generate_reply() method to force the LLM to generate a new conversation turn. You can optionally include user input or specific instructions for this particular reply.

session.generate_reply(user_input="Can you help me build an agent using LiveKit?")
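
Putting the two together, a push-to-talk flow might look roughly like the following. The handler names and the source of the transcribed user input are hypothetical; how your frontend invokes them (for example, via RPC or data messages) is up to your application:

# Hypothetical push-to-talk handlers; names and transcript source are illustrative.
def on_talk_button_pressed():
    session.interrupt()  # stop any in-progress agent speech

def on_talk_button_released(user_transcript: str):
    session.generate_reply(user_input=user_transcript)  # reply to the captured input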

Parameters

user_input (string, optional)
The user input to insert before the reply is generated.

instructions (string, optional)
Instructions for the agent to use for the reply.

allow_interruptions (boolean, optional)
If True, allow the user to interrupt the agent while speaking.

Returns

Returns a SpeechHandle object.

SpeechHandle

The say() and generate_reply() methods return a SpeechHandle object that you can await. This allows you to control the conversational flow.

The following examples show how to await speech returned by the say() and generate_reply() methods:

handle = await session.say("Hello ...")
if handle.interrupted:
    # The speech was interrupted; the user didn't hear everything
    ...

handle = session.generate_reply(instructions="Tell the user we're about to run some slow operations.")
# Do some operations here
await asyncio.sleep(4)  # slow network call
await handle  # finally wait for the speech

The following example makes a web request for the user, and cancels the request when the user interrupts:

async with aiohttp.ClientSession() as client_session:
    # Start the request as a task so it can be cancelled if the user interrupts
    web_request = asyncio.create_task(client_session.get('https://api.example.com/data'))
    handle = await session.generate_reply(instructions="Tell the user we're processing their request.")
    if handle.interrupted:
        # if the user interrupts, cancel the web_request too
        web_request.cancel()

SpeechHandle has an API similar to asyncio.Future, allowing you to add a callback:

handle = session.say("Hello world")
handle.add_done_callback(lambda _: print("speech done"))

Manually interrupt speech. This example assumes handle.allow_interruptions is True:

handle = session.say("Hello world")
handle.interrupt()

You can also interrupt agent speech in a session. To learn more, see Manually interrupt and generate responses.

Customizing pronunciation

Most TTS providers allow you to customize pronunciation with Speech Synthesis Markup Language (SSML). Support varies by provider, but typically includes some or all of the SSML tags in the following table.

SSML Tag | Description
phoneme | Provides a phonetic pronunciation for the enclosed text using a standard phonetic alphabet.
say-as | Specifies how to interpret the enclosed text. For example, use character to speak each character individually, or date to speak a calendar date.
lexicon | A custom dictionary that defines the pronunciation of certain words using phonetic notation or text-to-pronunciation mappings.
emphasis | Speaks the enclosed text with emphasis.
break | Adds a manual pause.
prosody | Controls the pitch, speaking rate, and volume of speech output.
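
For example, a speech string that uses a few of these tags might look like the following sketch. Whether the markup is parsed as SSML depends on the provider and its settings (for instance, the enable_ssml_parsing option shown in the ElevenLabs example above), so treat this as illustrative:

# Illustrative SSML snippet; supported tags and attributes vary by provider.
ssml_text = (
    "<speak>"
    'You say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>, '
    'spelled <say-as interpret-as="characters">tomato</say-as>.'
    '<break time="500ms"/>'
    "</speak>"
)

await session.say(ssml_text)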

To learn more about support for custom pronunciation, see the documentation for each individual TTS provider.

TTS provider plugins

LiveKit supports TTS plugins for a wide range of providers, including ElevenLabs as shown in the example above.

If you want to use a provider that doesn't yet have a plugin, contributions are always welcome. To learn more, see the guidelines for contributions to the Python repository or the Node.js repository.