Overview
Speech capabilities are a core feature of AI agents, enabling them to interact with users through voice. This guide covers the speech features and options available for agents.
Text to speech (TTS) is a synthesis process that converts written text into spoken audio. TTS models produce realistic and expressive speech, giving AI agents a "voice." Some providers offer custom voice generation so you can give your own voice to your agent.
Differences between TTS and realtime APIs
You can use the TTS class within the STT-LLM-TTS pipeline of an AgentSession to generate audio speech for responses generated by the LLM. It can also be used as a standalone speech generator.
Multimodal, realtime APIs such as the OpenAI Realtime API and the Google Multimodal Live API generate speech using their own built-in text-to-speech. A provider-specific TTS instance lets you customize the model, voice, and other speech output parameters; with a realtime API, you only need to specify the voice for your agent.
For more information on the OpenAI Realtime API, see the OpenAI integration guide. For details on the Google Multimodal Live API, see the Google integration guide.
Usage examples
To use the TTS provider plugins, you need to install the `livekit-agents` package, as well as the specific provider plugin.
Creating a TTS
Create a TTS instance using the specific provider plugin. This example uses ElevenLabs for TTS:
Install the provider plugin:
```shell
pip install "livekit-agents[elevenlabs]~=1.0rc"
```

Create a TTS instance:

```python
from livekit.plugins import elevenlabs

eleven_tts = elevenlabs.tts.TTS(
    model="eleven_turbo_v2_5",
    voice=elevenlabs.tts.Voice(
        id="EXAVITQu4vr4xnSDxMaL",
        name="Bella",
    ),
    language="en",
    streaming_latency=3,
    enable_ssml_parsing=False,
    chunk_length_schedule=[80, 120, 200, 260],
)
```
For a more complete example, see the ElevenLabs TTS guide.
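Because the TTS class can also run standalone, you can synthesize speech without an AgentSession. The following is a minimal sketch, assuming the plugin's `synthesize()` method returns an async stream of events that each carry an audio frame; check your plugin's API for the exact interface:

```python
# Minimal sketch: standalone synthesis outside of an AgentSession.
# Assumes synthesize() yields events carrying an audio frame in `.frame`.
async def synthesize_greeting(tts) -> list:
    frames = []
    async for audio in tts.synthesize("Hello from a standalone TTS."):
        frames.append(audio.frame)  # route frames to playback or storage
    return frames
```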
RealtimeModel usage
This example creates an agent session using the OpenAI Realtime API:
```python
from livekit.plugins import openai
from livekit.agents.voice import AgentSession

session = AgentSession(
    llm=openai.realtime.RealtimeModel(
        voice="alloy",
    ),
)
```
Synchronized transcription forwarding
You can forward agent speech transcriptions to your frontend client to display text synchronized with agent speech. Synchronized transcriptions display text word by word as the agent speaks. When you interrupt the agent, the transcription stops and truncates to match the speech.
Enable transcriptions in `RoomOutputOptions` by setting the `transcription_enabled` option to `True`. Pass `RoomOutputOptions` as a parameter to your agent's `start()` method. To learn more and see example code, see Text and transcriptions.
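As a minimal sketch of this setup (assuming `RoomOutputOptions` is importable from `livekit.agents`, and that `agent` and `ctx.room` already exist in your entrypoint):

```python
from livekit.agents import RoomOutputOptions

# Start the session with synchronized transcription output enabled.
await session.start(
    agent=agent,
    room=ctx.room,
    room_output_options=RoomOutputOptions(transcription_enabled=True),
)
```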
Controlling agent speech
In some cases, you want to manually control when an agent speaks. The agent methods in this section allow you to manually stop or generate speech.
Instruct an agent to speak
Use the `say()` method to have an agent speak using a specified source, such as the text you want the agent to speak. For example, say "Hello. How can I help you today?" when a new participant joins a room:

```python
await session.say("Hello. How can I help you today?", allow_interruptions=True)
```

The `say()` method requires a TTS plugin. If you're using a realtime model, you need to add a TTS plugin to your session or use the `generate_reply()` method instead.
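For instance, a sketch of pairing a realtime model with a separate TTS plugin so that `say()` works; the ElevenLabs plugin here is just one option, shown with default settings:

```python
from livekit.agents.voice import AgentSession
from livekit.plugins import elevenlabs, openai

# The realtime model drives the conversation; the TTS plugin gives say() a voice.
session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
    tts=elevenlabs.tts.TTS(),
)
```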
Parameters

- `allow_interruptions`: if `True`, allow the user to interrupt the agent while speaking.
- `add_to_chat_ctx`: if `True`, add the text to the agent's chat context.

Returns

Returns a `SpeechHandle` object.
Manually interrupt and generate responses
If you're using manual VAD (for example, a push-to-talk interface), you can manually interrupt the agent and generate a response.
Use the `interrupt()` method to manually interrupt speech within a session:

```python
session.interrupt()
```
To learn more about manual interruptions, see Turn detection and interruptions.
Use the `generate_reply()` method to force the LLM to generate a new conversation turn. You can optionally include user input or specific instructions for this particular reply:

```python
session.generate_reply(user_input="Can you help me build an agent using LiveKit?")
```
Parameters

- `allow_interruptions`: if `True`, allow the user to interrupt the agent while speaking.

Returns

Returns a `SpeechHandle` object.
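Putting the two together, a hypothetical push-to-talk flow might interrupt on button press and reply on release. The handler names and the source of `transcribed_text` below are placeholders for your own input handling:

```python
# Hypothetical push-to-talk wiring around an existing session.
def on_button_pressed() -> None:
    session.interrupt()  # stop any in-progress agent speech

def on_button_released(transcribed_text: str) -> None:
    # reply to whatever the user said while the button was held
    session.generate_reply(user_input=transcribed_text)
```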
SpeechHandle
The `say()` and `generate_reply()` methods return a `SpeechHandle` object that you can `await`. This allows you to control the conversational flow.

The following examples `await` the `say()` and `generate_reply()` methods:
```python
handle = await session.say("Hello ...")
if handle.interrupted:
    # The speech was interrupted; the user didn't listen to everything
    ...
```
```python
import asyncio

handle = session.generate_reply(
    instructions="Tell the user we're about to run some slow operations."
)
# Do some operations here
await asyncio.sleep(4)  # slow network call
await handle  # finally wait for the speech
```
The following example makes a web request for the user, and cancels the request when the user interrupts:

```python
import asyncio

import aiohttp

async with aiohttp.ClientSession() as client_session:
    # wrap the request in a future so it can be cancelled
    web_request = asyncio.ensure_future(
        client_session.get("https://api.example.com/data")
    )
    handle = await session.generate_reply(
        instructions="Tell the user we're processing their request."
    )
    if handle.interrupted:
        # if the user interrupts, cancel the web_request too
        web_request.cancel()
```
`SpeechHandle` has an API similar to `asyncio.Future`, allowing you to add a callback:

```python
handle = session.say("Hello world")
handle.add_done_callback(lambda _: print("speech done"))
```
You can also manually interrupt speech through the handle. This example assumes `handle.allow_interruptions` is `True`:

```python
handle = session.say("Hello world")
handle.interrupt()
```
You can also interrupt agent speech in a session. To learn more, see Manually interrupt and generate responses.
Customizing pronunciation
Most TTS providers allow you to customize the pronunciation of words with Speech Synthesis Markup Language (SSML), supporting some or all of the SSML tags in the following table.
| SSML tag | Description |
| --- | --- |
| `phoneme` | Provides a phonetic pronunciation for the enclosed text using a standard phonetic alphabet. |
| `say-as` | Specifies how to interpret the enclosed text. For example, use `characters` to speak each character individually, or `date` to specify a calendar date. |
| `lexicon` | A custom dictionary that defines the pronunciation of certain words using phonetic notation or text-to-pronunciation mappings. |
| `emphasis` | Speaks the enclosed text with emphasis. |
| `break` | Adds a manual pause. |
| `prosody` | Controls the pitch, speaking rate, and volume of speech output. |
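As an illustrative sketch, you might pass SSML directly in the text you synthesize, assuming your provider accepts it (for example, the ElevenLabs plugin shown earlier with `enable_ssml_parsing=True`); exact tag support varies by provider:

```python
# Hypothetical SSML input; supported tags vary by TTS provider, and the
# plugin must be configured to parse SSML (e.g. enable_ssml_parsing=True).
ssml_text = (
    "<speak>"
    'Your code is <say-as interpret-as="characters">A1B2</say-as>.'
    '<break time="500ms"/>'
    '<prosody rate="slow">Please write it down.</prosody>'
    "</speak>"
)
await session.say(ssml_text)
```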
To learn more about support for custom pronunciation, see the documentation for each individual TTS provider.
TTS provider plugins
LiveKit supports the following TTS provider plugins:
If you want to use a provider not listed in the table, contributions for plugins are always welcome. To learn more, see the guidelines for contributions to the Python repository or the Node.js repository.