Agent speech and audio

Speech and audio capabilities for LiveKit agents.

Overview

Speech capabilities are a core feature of LiveKit agents, enabling them to interact with users through voice. This guide covers the various speech features and functionalities available for agents.

LiveKit Agents provides a unified interface for controlling agents using both the STT-LLM-TTS pipeline and realtime models.

The following sections describe these features and include usage examples.

Initiating speech

By default, the agent waits for user input before responding—the Agents framework automatically handles response generation.

In some cases, though, the agent might need to initiate the conversation. For example, it might greet the user at the start of a session or check in after a period of silence.

session.say

To have the agent speak a predefined message, use session.say(). This triggers the configured TTS to synthesize speech and play it back to the user.

You can also optionally provide pre-synthesized audio for playback. This skips the TTS step and reduces response time.

Realtime models and TTS

The say method requires a TTS plugin. If you're using a realtime model, you need to add a TTS plugin to your session or use the generate_reply() method instead.

await session.say(
    "Hello. How can I help you today?",
    allow_interruptions=False,
)

Parameters

text (str | AsyncIterable[str]) - Required
The text to speak.
audio (AsyncIterable[rtc.AudioFrame]) - Optional
Pre-synthesized audio to play.
allow_interruptions (bool) - Optional
If True, allow the user to interrupt the agent while speaking. (default True)
add_to_chat_ctx (bool) - Optional
If True, add the text to the agent's chat context after playback. (default True)

Returns

Returns a SpeechHandle object.

Events

This method triggers a speech_created event.

generate_reply

To make conversations more dynamic, use session.generate_reply() to prompt the LLM to generate a response.

There are two ways to use generate_reply:

  1. Give the agent instructions to generate a response:

    session.generate_reply(
        instructions="greet the user and ask where they are from",
    )
  2. Provide the user's input via text:

    session.generate_reply(
        user_input="how is the weather today?",
    )
Impact to chat history

When using generate_reply with instructions, the agent uses the instructions to generate a response, which is added to the chat history. The instructions themselves are not recorded in the history.

In contrast, user_input is directly added to the chat history.

Parameters

user_input (str) - Optional
The user input to respond to.
instructions (str) - Optional
Instructions for the agent to use for the reply.
allow_interruptions (bool) - Optional
If True, allow the user to interrupt the agent while speaking. (default True)

Returns

Returns a SpeechHandle object.

Events

This method triggers a speech_created event.

Controlling agent speech

You can control agent speech using the SpeechHandle object returned by the say() and generate_reply() methods, and by allowing or disallowing user interruptions.

SpeechHandle

The say() and generate_reply() methods return a SpeechHandle object, which lets you track the state of the agent's speech. This can be useful for coordinating follow-up actions—for example, notifying the user before ending the call.

await session.say("Goodbye for now.", allow_interruptions=False)
# the above is a shortcut for
# handle = session.say("Goodbye for now.", allow_interruptions=False)
# await handle.wait_for_playout()

You can wait for the agent to finish speaking before continuing:

handle = session.generate_reply(instructions="Tell the user we're about to run some slow operations.")
# perform an operation that takes time
...
await handle # finally wait for the speech

The following example makes a web request for the user, and cancels the request when the user interrupts:

async with aiohttp.ClientSession() as client_session:
    web_request = client_session.get('https://api.example.com/data')
    handle = await session.generate_reply(instructions="Tell the user we're processing their request.")
    if handle.interrupted:
        # if the user interrupts, cancel the web_request too
        web_request.cancel()

SpeechHandle has an API similar to asyncio.Future, allowing you to add a callback:

handle = session.say("Hello world")
handle.add_done_callback(lambda _: print("speech done"))
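Because the handle is future-like, the callback fires once the speech completes. The same pattern can be shown self-contained with a plain asyncio.Future standing in for the SpeechHandle (which would require a live session):

```python
import asyncio

events: list[str] = []

async def main() -> None:
    loop = asyncio.get_running_loop()
    handle = loop.create_future()  # stands in for a SpeechHandle
    handle.add_done_callback(lambda _: events.append("speech done"))
    loop.call_soon(handle.set_result, None)  # simulate playout finishing
    await handle  # like awaiting the SpeechHandle
    events.append("continued")

asyncio.run(main())
print(events)  # ['speech done', 'continued']
```

The done callback runs before the awaiting coroutine resumes, which makes it a reliable place for cleanup or logging.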

Getting the current speech

You can access the current speech handle via AgentSession.current_speech. This is useful for checking the status of the agent's speech from anywhere in your program.

Interruptions

By default, the agent stops speaking when it detects that the user has started speaking. This behavior can be disabled by setting allow_interruptions=False when scheduling speech.

To explicitly interrupt the agent, call the interrupt() method on the handle or session at any time. This can be performed even when allow_interruptions is set to False.

handle = session.say("Hello world")
handle.interrupt()
# or from the session
session.interrupt()

Customizing pronunciation

Most TTS providers allow you to customize pronunciation of words using Speech Synthesis Markup Language (SSML). Support varies by provider, but common tags include some or all of those in the following table.

SSML tags:

  • phoneme: Provides a phonetic pronunciation for the enclosed text using a standard phonetic alphabet.
  • say-as: Specifies how to interpret the enclosed text. For example, use character to speak each character individually, or date to specify a calendar date.
  • lexicon: References a custom dictionary that defines the pronunciation of certain words using phonetic notation or text-to-pronunciation mappings.
  • emphasis: Speaks the enclosed text with emphasis.
  • break: Adds a manual pause.
  • prosody: Controls the pitch, speaking rate, and volume of speech output.
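As an illustration, the snippet below builds an SSML string using a few of these tags. Whether each tag is honored, and whether your TTS plugin accepts raw SSML this way, depends on the provider, so treat this as a sketch rather than a guaranteed recipe:

```python
# Build an SSML fragment; tag support varies by TTS provider.
ssml = (
    "<speak>"
    'Your confirmation code is <say-as interpret-as="characters">AB12</say-as>.'
    '<break time="500ms"/>'
    '<prosody rate="slow" volume="soft">Please write it down.</prosody>'
    "</speak>"
)
print(ssml)
```

If the provider supports it, the code would be spelled out letter by letter, followed by a half-second pause and a slower, softer closing sentence.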

Adding background audio

By default, your agent produces no audio besides its synthesized speech. To add more realism, you can publish ambient background audio such as the noise of an office or call center. Your agent can also change the background audio while it's "thinking," for example by adding the sound of keyboard typing.

The BackgroundAudioPlayer class manages audio playback to a room and can play the following two types of audio:

  • Ambient sound: A looping audio file that plays in the background.
  • Thinking sound: An audio file that plays while the agent is thinking.

The following example demonstrates simple usage with built-in audio clips.

from livekit import agents
from livekit.agents import (
    AgentSession,
    AudioConfig,
    BackgroundAudioPlayer,
    BuiltinAudioClip,
)

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        # ... stt, llm, tts, vad, turn_detection, etc.
    )
    await session.start(
        room=ctx.room,
        # ... agent, etc.
    )

    background_audio = BackgroundAudioPlayer(
        # play office ambience sound looping in the background
        ambient_sound=AudioConfig(BuiltinAudioClip.OFFICE_AMBIENCE, volume=0.8),
        # play keyboard typing sound when the agent is thinking
        thinking_sound=[
            AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8),
            AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING2, volume=0.7),
        ],
    )
    await background_audio.start(room=ctx.room, agent_session=session)

See the following example for more details:

Background Audio

A voice AI agent with background audio for thinking states and ambiance.

Reference

See the following sections for more details on using the BackgroundAudioPlayer class.

BackgroundAudioPlayer

The BackgroundAudioPlayer class manages audio playback to a room and has the following parameters:

ambient_sound (AudioSource | AudioConfig | list[AudioConfig]) - Optional

The audio source or list of sources for the ambient sound. Ambient sound plays on a loop in the background.

thinking_sound (AudioSource | AudioConfig | list[AudioConfig]) - Optional

The audio source or list of sources for the thinking sound. Thinking sound plays while the agent is thinking.

To start background audio, call the start method. You can also play arbitrary audio files at any time by calling the play method of an instance of the BackgroundAudioPlayer class.

AudioConfig

The AudioConfig class allows you to control the volume and the probability of playback. The probability value determines the chance that a particular sound is selected for playback. If the sum of all probability values is less than 1, there is a chance that no sound is selected and only silence plays. This can be useful for creating a more natural-sounding background audio effect.

AudioConfig has the following properties:

source (str | AsyncIterator[rtc.AudioFrame] | BuiltinAudioClip) - Required

The audio source to play. It can be a path to a file, an async iterator of audio frames, or a built-in audio clip.

volume (float) - Optional. Default: 1

The volume at which to play the audio source.

probability (float) - Optional. Default: 1

The probability of playback. If the sum of probability values for all audio sources is less than 1, there is a chance that no audio source is selected and only silence plays.

AudioSource

The AudioSource can be one of the following types:

  • str: A path to an audio file.
  • AsyncIterator[rtc.AudioFrame]: An async iterator of audio frames.
  • BuiltinAudioClip: A built-in audio clip.

BuiltinAudioClip

The BuiltinAudioClip enum provides a list of pre-defined audio clips that you can use with the background audio player:

  • OFFICE_AMBIENCE: Office ambience sound.
  • KEYBOARD_TYPING: Keyboard typing sound.
  • KEYBOARD_TYPING2: Keyboard typing sound. This is a shorter clip of the KEYBOARD_TYPING sound.

Start the background audio player

The start method takes the following parameters:

  • room: The room to publish the audio to.
  • agent_session: The agent session to associate with the player.

If you included an ambient sound in the BackgroundAudioPlayer parameters, it starts playing immediately. If you included a thinking sound, it plays only while the agent is "thinking."

Play audio files

You can play arbitrary audio files at any time by calling the play method of an instance of the BackgroundAudioPlayer class. The play method takes the following parameters:

audio (AudioSource | AudioConfig | list[AudioConfig]) - Required

The audio source or list of sources to play. To learn more, see AudioSource.

loop (bool) - Optional. Default: False

Set to True to loop the audio source.

For example, if you created background_audio in the previous example, you can play an audio file like this:

MY_AUDIO_FILE = "<PATH_TO_AUDIO_FILE>"
background_audio.play(MY_AUDIO_FILE)
