Overview
Speech capabilities are a core feature of LiveKit agents, enabling them to interact with users through voice. This guide covers the speech features and controls available to agents.
LiveKit Agents provide a unified interface for controlling agents using both the STT-LLM-TTS pipeline and realtime models.
To learn more and see usage examples, see the following topics:
Text-to-speech (TTS)
TTS is a synthesis process that converts text into audio, giving AI agents a "voice."
Speech-to-speech
Multimodal, realtime APIs can understand speech input and generate speech output directly.
Initiating speech
By default, the agent waits for user input before responding—the Agents framework automatically handles response generation.
In some cases, though, the agent might need to initiate the conversation. For example, it might greet the user at the start of a session or check in after a period of silence.
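For example, to greet the user as soon as the session starts, you can schedule speech right after `session.start()` using the methods described below. A minimal sketch, assuming a session and agent set up as in the quickstart:

```python
# a sketch of greeting the user at the start of a session;
# `agent` is assumed to be an Agent instance defined elsewhere
await session.start(room=ctx.room, agent=agent)
await session.generate_reply(
    instructions="Greet the user and offer your assistance."
)
```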
session.say
To have the agent speak a predefined message, use `session.say()`. This triggers the configured TTS to synthesize speech and play it back to the user.

You can also optionally provide pre-synthesized audio for playback. This skips the TTS step and reduces response time.

The `say` method requires a TTS plugin. If you're using a realtime model, you need to add a TTS plugin to your session or use the `generate_reply()` method instead.
await session.say("Hello. How can I help you today?",allow_interruptions=False,)
Parameters

- `allow_interruptions`: If `True`, allow the user to interrupt the agent while speaking. (default `True`)
- `add_to_chat_ctx`: If `True`, add the text to the agent's chat context after playback. (default `True`)

Returns

Returns a `SpeechHandle` object.

Events

This method triggers a `speech_created` event.
generate_reply
To make conversations more dynamic, use `session.generate_reply()` to prompt the LLM to generate a response.

There are two ways to use `generate_reply`:
- Give the agent instructions to generate a response:

```python
session.generate_reply(
    instructions="greet the user and ask where they are from",
)
```

- Provide the user's input via text:

```python
session.generate_reply(
    user_input="how is the weather today?",
)
```
When using `generate_reply` with `instructions`, the agent uses the instructions to generate a response, which is added to the chat history. The instructions themselves are not recorded in the history. In contrast, `user_input` is added directly to the chat history.
Parameters

- `allow_interruptions`: If `True`, allow the user to interrupt the agent while speaking. (default `True`)

Returns

Returns a `SpeechHandle` object.

Events

This method triggers a `speech_created` event.
Controlling agent speech
You can control agent speech using the `SpeechHandle` object returned by the `say()` and `generate_reply()` methods, and by handling user interruptions.
SpeechHandle
The `say()` and `generate_reply()` methods return a `SpeechHandle` object, which lets you track the state of the agent's speech. This can be useful for coordinating follow-up actions, such as notifying the user before ending the call.
await session.say("Goodbye for now.", allow_interruptions=False)# the above is a shortcut for# handle = session.say("Goodbye for now.", allow_interruptions=False)# await handle.wait_for_playout()
You can wait for the agent to finish speaking before continuing:
```python
handle = session.generate_reply(
    instructions="Tell the user we're about to run some slow operations."
)

# perform an operation that takes time...

await handle  # finally wait for the speech
```
The following example makes a web request for the user, and cancels the request when the user interrupts:
```python
async with aiohttp.ClientSession() as client_session:
    web_request = client_session.get('https://api.example.com/data')
    handle = await session.generate_reply(
        instructions="Tell the user we're processing their request."
    )
    if handle.interrupted:
        # if the user interrupts, cancel the web_request too
        web_request.cancel()
```
`SpeechHandle` has an API similar to `asyncio.Future`, allowing you to add a callback:
handle = session.say("Hello world")handle.add_done_callback(lambda _: print("speech done"))
Getting the current speech
You can access the current speech handle via `AgentSession.current_speech`. This is useful for checking the status of the agent's speech from anywhere in your program.
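For example, a minimal sketch, assuming `current_speech` returns the active `SpeechHandle`, or `None` when the agent is silent:

```python
# a sketch of checking and stopping the agent's current speech
current = session.current_speech
if current is not None and not current.interrupted:
    # the agent is still speaking; stop it
    current.interrupt()
```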
Interruptions
By default, the agent stops speaking when it detects that the user has started speaking. You can disable this behavior by setting `allow_interruptions=False` when scheduling speech.

To explicitly interrupt the agent, call the `interrupt()` method on the handle or session at any time. This works even when `allow_interruptions` is set to `False`.
handle = session.say("Hello world")handle.interrupt()# or from the sessionsession.interrupt()
Customizing pronunciation
Most TTS providers allow you to customize the pronunciation of words using Speech Synthesis Markup Language (SSML). Providers typically support some or all of the SSML tags in the following table.
| SSML Tag | Description |
|---|---|
| `phoneme` | Provides a phonetic pronunciation for the enclosed text using a standard phonetic alphabet. |
| `say-as` | Specifies how to interpret the enclosed text. For example, use `characters` to speak each character individually, or `date` to specify a calendar date. |
| `lexicon` | A custom dictionary that defines the pronunciation of certain words using phonetic notation or text-to-pronunciation mappings. |
| `emphasis` | Speaks the enclosed text with emphasis. |
| `break` | Adds a manual pause. |
| `prosody` | Controls the pitch, speaking rate, and volume of speech output. |
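For example, markup combining a few of these tags might look like the following. This is a generic SSML sketch; tag support and how you pass the markup to the synthesizer vary by provider, so check your TTS plugin's documentation:

```python
# generic SSML example; provider support for these tags varies
ssml = """
<speak>
  Your confirmation code is
  <say-as interpret-as="characters">A1B2</say-as>.
  <break time="500ms"/>
  <prosody rate="slow">Please write it down.</prosody>
</speak>
"""
```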
Adding background audio
By default, your agent produces no audio besides its synthesized speech. To add more realism, you can publish ambient background audio, such as the noise of an office or call center. Your agent can also adjust the background audio when "thinking," such as adding the sound of keyboard typing.
The `BackgroundAudioPlayer` class manages audio playback to a room and can play the following two types of audio:
- Ambient sound: A looping audio file that plays in the background.
- Thinking sound: An audio file that plays while the agent is thinking.
The following example demonstrates simple usage with built-in audio clips.
```python
from livekit.agents import BackgroundAudioPlayer, AudioConfig, BuiltinAudioClip

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        # ... stt, llm, tts, vad, turn_detection, etc.
    )

    await session.start(
        room=ctx.room,
        # ... agent, etc.
    )

    background_audio = BackgroundAudioPlayer(
        # play office ambience sound looping in the background
        ambient_sound=AudioConfig(BuiltinAudioClip.OFFICE_AMBIENCE, volume=0.8),
        # play keyboard typing sound when the agent is thinking
        thinking_sound=[
            AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8),
            AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING2, volume=0.7),
        ],
    )

    await background_audio.start(room=ctx.room, agent_session=session)
```
See the following example for more details:
Background Audio
Reference
See the following sections for more details on using the `BackgroundAudioPlayer` class.
BackgroundAudioPlayer
The `BackgroundAudioPlayer` class manages audio playback to a room and has the following parameters:

- `ambient_sound`: The audio source, or list of sources, for the ambient sound. Ambient sound plays on a loop in the background.
- `thinking_sound`: The audio source, or list of sources, for the thinking sound. Thinking sound plays while the agent is thinking.
To start background audio, call the `start` method. You can also play arbitrary audio files at any time by calling the `play` method on an instance of the `BackgroundAudioPlayer` class.
AudioConfig
The `AudioConfig` class allows you to control the volume and probability of playback. The probability value determines the chance that a particular sound is selected for playback. If the sum of all probability values is less than 1, there is a chance that only silence plays. This can be useful for creating a more natural-sounding background audio effect.

`AudioConfig` has the following properties:

- `source`: The audio source to play. It can be a path to a file, an async iterator of audio frames, or a built-in audio clip.
- `volume`: The volume at which to play the audio source.
- `probability`: The probability of playback. If the sum of the `probability` values for all audio sources is less than 1, there is a chance that no audio source is selected and only silence plays.
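For example, the following thinking-sound configuration leaves a 30% chance of silence on each selection (a sketch, assuming `probability` is passed as a keyword argument alongside `volume`):

```python
from livekit.agents import AudioConfig, BuiltinAudioClip

# probabilities sum to 0.7, so there is a 30% chance that
# no thinking sound plays at all on a given turn
thinking_sound = [
    AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8, probability=0.4),
    AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING2, volume=0.7, probability=0.3),
]
```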
AudioSource
`AudioSource` can be one of the following types:

- `str`: Path to an audio file.
- `AsyncIterator[rtc.AudioFrame]`: An async iterator of audio frames.
- `BuiltinAudioClip`: A built-in audio clip.
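As a sketch of the async iterator variant, the following generates a short sine tone as a stream of audio frames (the frame construction shown assumes the standard `livekit.rtc` API):

```python
import numpy as np
from livekit import rtc

async def sine_tone(duration_s: float = 2.0, sample_rate: int = 48000):
    """Yield 10 ms mono frames of a quiet 440 Hz tone."""
    samples_per_frame = sample_rate // 100  # 10 ms per frame
    offset = 0
    for _ in range(int(duration_s * 100)):
        ts = (np.arange(samples_per_frame) + offset) / sample_rate
        pcm = (0.1 * 32767 * np.sin(2 * np.pi * 440 * ts)).astype(np.int16)
        yield rtc.AudioFrame(
            data=pcm.tobytes(),
            sample_rate=sample_rate,
            num_channels=1,
            samples_per_channel=samples_per_frame,
        )
        offset += samples_per_frame
```

An instance of this generator, `sine_tone()`, could then be passed anywhere an `AudioSource` is accepted.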
BuiltinAudioClip
The `BuiltinAudioClip` enum provides a list of pre-defined audio clips that you can use with the background audio player:

- `OFFICE_AMBIENCE`: Office ambience sound.
- `KEYBOARD_TYPING`: Keyboard typing sound.
- `KEYBOARD_TYPING2`: A shorter clip of the `KEYBOARD_TYPING` sound.
Start the background audio player
The `start` method takes the following parameters. If you included an ambient sound in the `BackgroundAudioPlayer` parameters, it starts playing immediately. If you included a thinking sound, it plays only while the agent is "thinking."

- `room`: The room to publish the audio to.
- `agent_session`: The agent session to publish the audio to.
Play audio files
You can play arbitrary audio files at any time by calling the `play` method on an instance of the `BackgroundAudioPlayer` class. The `play` method takes the following parameters:

- `audio`: The audio source, or list of sources, to play. To learn more, see AudioSource.
- `loop`: Set to `True` to loop the audio source.
For example, if you created `background_audio` as in the previous example, you can play an audio file like this:
```python
MY_AUDIO_FILE = "<PATH_TO_AUDIO_FILE>"

background_audio.play(MY_AUDIO_FILE)
```
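To loop the clip instead of playing it once, use the `loop` parameter described above:

```python
background_audio.play(MY_AUDIO_FILE, loop=True)
```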
Additional resources
To learn more, see the following resources.
Voice AI quickstart
Use the quickstart as a starting point for adding audio code.
Speech-related events
Learn more about the `speech_created` event, triggered when new agent speech is created.
LiveKit SDK
Learn how to use the LiveKit SDK to play audio tracks.
Background audio example
An example of using the `BackgroundAudioPlayer` class to play ambient office noise and thinking sounds.
Text-to-speech (TTS)
TTS usage and examples for pipeline agents.
Speech-to-speech
Multimodal, realtime APIs understand speech input and generate speech output directly.