Agent speech and audio

Speech and audio capabilities for LiveKit agents.

Overview

Speech capabilities are a core feature of LiveKit agents, enabling them to interact with users through voice. This guide covers the various speech features and functionalities available for agents.

LiveKit Agents provides a unified interface for controlling agent speech, whether the agent uses an STT-LLM-TTS pipeline or a realtime model.
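
For example, the same AgentSession API accepts either configuration. The following is a minimal sketch; the plugin classes shown (deepgram, openai, cartesia) are illustrative choices, not requirements:

from livekit.agents import AgentSession
from livekit.plugins import cartesia, deepgram, openai

# STT-LLM-TTS pipeline: a separate model for each stage
pipeline_session = AgentSession(
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=cartesia.TTS(),
)

# Realtime model: a single speech-to-speech model
realtime_session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
)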

To learn more and see usage examples, see the following topics:

Initiating speech

By default, the agent waits for user input before responding—the Agents framework automatically handles response generation.

In some cases, though, the agent might need to initiate the conversation. For example, it might greet the user at the start of a session or check in after a period of silence.

session.say

To have the agent speak a predefined message, use session.say(). This triggers the configured TTS to synthesize speech and play it back to the user.

You can also optionally provide pre-synthesized audio for playback. This skips the TTS step and reduces response time.
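
For example, you might play back frames you synthesized ahead of time. In this sketch, load_cached_greeting() is a hypothetical helper that yields rtc.AudioFrame objects:

async def greet(session: AgentSession):
    await session.say(
        "Hello. How can I help you today?",
        # hypothetical helper returning AsyncIterable[rtc.AudioFrame]; skips TTS
        audio=load_cached_greeting(),
    )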

Realtime models and TTS

The say method requires a TTS plugin. If you're using a realtime model, you need to add a TTS plugin to your session or use the generate_reply() method instead.
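
For example, a realtime session with say() support might be configured like the following sketch; the specific plugin classes are illustrative assumptions:

from livekit.agents import AgentSession
from livekit.plugins import cartesia, openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
    tts=cartesia.TTS(),  # added so session.say() can synthesize speech
)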

await session.say(
    "Hello. How can I help you today?",
    allow_interruptions=False,
)

Parameters

  • text (str | AsyncIterable[str]), required: The text to speak.
  • audio (AsyncIterable[rtc.AudioFrame]), optional: Pre-synthesized audio to play.
  • allow_interruptions (boolean), optional: If True, allow the user to interrupt the agent while speaking. Default: True.
  • add_to_chat_ctx (boolean), optional: If True, add the text to the agent's chat context after playback. Default: True.

Returns

Returns a SpeechHandle object.

Events

This method triggers a speech_created event.
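
For example, you can subscribe to the event to log each speech generation. This is a sketch; the field names on the event payload (speech_handle) are an assumption:

@session.on("speech_created")
def on_speech_created(ev):
    # event payload field names are assumptions
    print("speech created:", ev.speech_handle)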

generate_reply

To make conversations more dynamic, use session.generate_reply() to prompt the LLM to generate a response.

There are two ways to use generate_reply:

  1. Give the agent instructions to generate a response:

     session.generate_reply(
         instructions="greet the user and ask where they are from",
     )

  2. Provide the user's input via text:

     session.generate_reply(
         user_input="how is the weather today?",
     )

Impact to chat history

When using generate_reply with instructions, the agent uses the instructions to generate a response, which is added to the chat history. The instructions themselves are not recorded in the history.

In contrast, user_input is directly added to the chat history.
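
As a sketch, assuming the session exposes its conversation history through a history property:

handle = session.generate_reply(instructions="greet the user and ask where they are from")
await handle  # wait for playout

# The generated greeting appears in the history; the instructions string does not.
print(session.history.to_dict())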

Parameters

  • user_input (string), optional: The user input to respond to.
  • instructions (string), optional: Instructions for the agent to use for the reply.
  • allow_interruptions (boolean), optional: If True, allow the user to interrupt the agent while speaking. Default: True.

Returns

Returns a SpeechHandle object.

Events

This method triggers a speech_created event.

Controlling agent speech

You can control agent speech using the SpeechHandle object returned by the say() and generate_reply() methods, and by managing user interruptions.

SpeechHandle

The say() and generate_reply() methods return a SpeechHandle object, which lets you track the state of the agent's speech. This can be useful for coordinating follow-up actions—for example, notifying the user before ending the call.

await session.say("Goodbye for now.", allow_interruptions=False)
# the above is a shortcut for
# handle = session.say("Goodbye for now.", allow_interruptions=False)
# await handle.wait_for_playout()

You can wait for the agent to finish speaking before continuing:

handle = session.generate_reply(instructions="Tell the user we're about to run some slow operations.")
# perform an operation that takes time
...
await handle # finally wait for the speech

The following example makes a web request for the user, and cancels the request when the user interrupts:

async with aiohttp.ClientSession() as client_session:
    web_request = client_session.get('https://api.example.com/data')
    handle = await session.generate_reply(instructions="Tell the user we're processing their request.")
    if handle.interrupted:
        # if the user interrupts, cancel the web_request too
        web_request.cancel()

SpeechHandle has an API similar to asyncio.Future, allowing you to add a callback:

handle = session.say("Hello world")
handle.add_done_callback(lambda _: print("speech done"))

Getting the current speech handle

The agent session's active speech handle, if any, is available with the current_speech property. If no speech is active, this property returns None. Otherwise, it returns the active SpeechHandle.

Use the active speech handle to coordinate with the speaking state. For instance, you can ensure that a hang up occurs only after the current speech has finished, rather than mid-speech:

# to hang up the call as part of a function call
@function_tool
async def end_call(self, ctx: RunContext):
    """Use this tool when the user has signaled they wish to end the current call. The session will end automatically after invoking this tool."""
    # let the agent finish speaking
    current_speech = ctx.session.current_speech
    if current_speech:
        await current_speech.wait_for_playout()

    # call API to delete_room
    ...

Interruptions

By default, the agent stops speaking when it detects that the user has started speaking. This behavior can be disabled by setting allow_interruptions=False when scheduling speech.

To explicitly interrupt the agent, call the interrupt() method on the handle or session at any time. This can be performed even when allow_interruptions is set to False.

handle = session.say("Hello world")
handle.interrupt()
# or from the session
session.interrupt()

Customizing pronunciation

Most TTS providers allow you to customize pronunciation of words using Speech Synthesis Markup Language (SSML). The following example uses the tts_node to add custom pronunciation rules:

async def tts_node(
    self,
    text: AsyncIterable[str],
    model_settings: ModelSettings,
) -> AsyncIterable[rtc.AudioFrame]:
    # Pronunciation replacements for common technical terms and abbreviations.
    # Support for custom pronunciations depends on the TTS provider.
    pronunciations = {
        "API": "A P I",
        "REST": "rest",
        "SQL": "sequel",
        "kubectl": "kube control",
        "AWS": "A W S",
        "UI": "U I",
        "URL": "U R L",
        "npm": "N P M",
        "LiveKit": "Live Kit",
        "async": "a sink",
        "nginx": "engine x",
    }

    async def adjust_pronunciation(input_text: AsyncIterable[str]) -> AsyncIterable[str]:
        async for chunk in input_text:
            modified_chunk = chunk

            # Apply pronunciation rules
            for term, pronunciation in pronunciations.items():
                # Use word boundaries to avoid partial replacements
                modified_chunk = re.sub(
                    rf'\b{term}\b',
                    pronunciation,
                    modified_chunk,
                    flags=re.IGNORECASE,
                )

            yield modified_chunk

    # Process the modified text through the base TTS implementation
    async for frame in Agent.default.tts_node(
        self,
        adjust_pronunciation(text),
        model_settings,
    ):
        yield frame

The following SSML tags are supported by most TTS providers:

  • phoneme: Provides a phonetic pronunciation for the enclosed text using a standard phonetic alphabet.
  • say-as: Specifies how to interpret the enclosed text. For example, use character to speak each character individually, or date to speak a calendar date.
  • lexicon: A custom dictionary that defines the pronunciation of certain words using phonetic notation or text-to-pronunciation mappings.
  • emphasis: Speaks the enclosed text with emphasis.
  • break: Adds a manual pause.
  • prosody: Controls the pitch, speaking rate, and volume of the speech output.
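
For example, a snippet combining several of these tags might look like the following. Whether raw SSML like this is accepted depends on the TTS provider and plugin:

ssml_text = (
    "<speak>"
    'Your code is <say-as interpret-as="characters">A1B2</say-as>.'
    '<break time="500ms"/>'
    '<prosody rate="slow">Please write it down.</prosody>'
    "</speak>"
)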

Adjusting speech volume

To adjust the volume of the agent's speech, add a processor to the tts_node or the realtime_audio_output_node. Alternatively, you can adjust the playback volume in the frontend SDK.

The following example agent has an adjustable volume between 0 and 100, and offers a tool call to change it.

class Assistant(Agent):
    def __init__(self) -> None:
        self.volume: int = 50
        super().__init__(
            instructions=f"You are a helpful voice AI assistant. Your starting volume level is {self.volume}."
        )

    @function_tool()
    async def set_volume(self, volume: int):
        """Set the volume of the audio output.

        Args:
            volume (int): The volume level to set. Must be between 0 and 100.
        """
        self.volume = volume

    # Audio node used by STT-LLM-TTS pipeline models
    async def tts_node(self, text: AsyncIterable[str], model_settings: ModelSettings):
        return self._adjust_volume_in_stream(
            Agent.default.tts_node(self, text, model_settings)
        )

    # Audio node used by realtime models
    async def realtime_audio_output_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ) -> AsyncIterable[rtc.AudioFrame]:
        return self._adjust_volume_in_stream(
            Agent.default.realtime_audio_output_node(self, audio, model_settings)
        )

    async def _adjust_volume_in_stream(
        self, audio: AsyncIterable[rtc.AudioFrame]
    ) -> AsyncIterable[rtc.AudioFrame]:
        stream: utils.audio.AudioByteStream | None = None
        async for frame in audio:
            if stream is None:
                stream = utils.audio.AudioByteStream(
                    sample_rate=frame.sample_rate,
                    num_channels=frame.num_channels,
                    samples_per_channel=frame.sample_rate // 10,  # 100ms
                )
            for f in stream.push(frame.data):
                yield self._adjust_volume_in_frame(f)

        if stream is not None:
            for f in stream.flush():
                yield self._adjust_volume_in_frame(f)

    def _adjust_volume_in_frame(self, frame: rtc.AudioFrame) -> rtc.AudioFrame:
        audio_data = np.frombuffer(frame.data, dtype=np.int16)
        audio_float = audio_data.astype(np.float32) / np.iinfo(np.int16).max
        audio_float = audio_float * max(0, min(self.volume, 100)) / 100.0

        processed = (audio_float * np.iinfo(np.int16).max).astype(np.int16)
        return rtc.AudioFrame(
            data=processed.tobytes(),
            sample_rate=frame.sample_rate,
            num_channels=frame.num_channels,
            samples_per_channel=len(processed) // frame.num_channels,
        )

Adding background audio

By default, your agent produces no audio besides its synthesized speech. To add more realism, you can publish ambient background audio, such as the noise of an office or call center. Your agent can also adjust the background audio while "thinking", for example by adding the sound of keyboard typing.

The BackgroundAudioPlayer class manages audio playback to a room and can play the following two types of audio:

  • Ambient sound: A looping audio file that plays in the background.
  • Thinking sound: An audio file that plays while the agent is thinking.

The following example demonstrates simple usage with built-in audio clips.

from livekit.agents import BackgroundAudioPlayer, AudioConfig, BuiltinAudioClip

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        # ... stt, llm, tts, vad, turn_detection, etc.
    )

    await session.start(
        room=ctx.room,
        # ... agent, etc.
    )

    background_audio = BackgroundAudioPlayer(
        # play office ambience sound looping in the background
        ambient_sound=AudioConfig(BuiltinAudioClip.OFFICE_AMBIENCE, volume=0.8),
        # play keyboard typing sound when the agent is thinking
        thinking_sound=[
            AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8),
            AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING2, volume=0.7),
        ],
    )

    await background_audio.start(room=ctx.room, agent_session=session)

See the following example for more details:

Background Audio

A voice AI agent with background audio for thinking states and ambiance.

Reference

See the following sections for more details on using the BackgroundAudioPlayer class.

BackgroundAudioPlayer

The BackgroundAudioPlayer class manages audio playback to a room and has the following parameters:

  • ambient_sound (AudioSource | AudioConfig | list[AudioConfig]), optional: The audio source or list of sources for the ambient sound. Ambient sound plays on a loop in the background.
  • thinking_sound (AudioSource | AudioConfig | list[AudioConfig]), optional: The audio source or list of sources for the thinking sound. Thinking sound plays while the agent is thinking.

To start background audio, call the start method. You can also play arbitrary audio files at any time by calling the play method of an instance of the BackgroundAudioPlayer class.

AudioConfig

The AudioConfig class allows you to control the volume and the probability of playback. The probability value determines the chance that a particular sound is selected for playback. If the sum of all probability values is less than 1, there is a chance that no sound is selected and only silence plays. This can be useful for creating a more natural-sounding background audio effect.

AudioConfig has the following properties:

  • source (str | AsyncIterator[rtc.AudioFrame] | BuiltinAudioClip), required: The audio source to play. It can be a path to a file, an async iterator of audio frames, or a built-in audio clip.
  • volume (float), optional: The volume at which to play the audio source. Default: 1.
  • probability (float), optional: The probability of playback. If the sum of the probability values for all audio sources is less than 1, there is a chance that no source is selected and only silence plays. Default: 1.
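
For example, the following thinking-sound configuration selects the first clip 50% of the time and the second 30% of the time, leaving a 20% chance of silence:

thinking_sound = [
    AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8, probability=0.5),
    AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING2, volume=0.7, probability=0.3),
    # the remaining 0.2 probability plays silence
]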

AudioSource

The AudioSource can be one of the following types:

  • String: Path to an audio file.
  • AsyncIterator[rtc.AudioFrame]: An async iterator of audio frames.
  • BuiltinAudioClip: A built-in audio clip.

BuiltinAudioClip

The BuiltinAudioClip enum provides a list of pre-defined audio clips that you can use with the background audio player:

  • OFFICE_AMBIENCE: Office ambience sound.
  • KEYBOARD_TYPING: Keyboard typing sound.
  • KEYBOARD_TYPING2: Keyboard typing sound. This is a shorter clip of the KEYBOARD_TYPING sound.

Start the background audio player

The start method takes the following parameters. If you included an ambient sound in the BackgroundAudioPlayer parameters, it starts playing immediately when you call start. A thinking sound plays only while the agent is "thinking."

  • room: The room to publish the audio to.
  • agent_session: The agent session to publish the audio to.

Play audio files

You can play arbitrary audio files at any time by calling the play method of an instance of the BackgroundAudioPlayer class. The play method takes the following parameters:

  • audio (AudioSource | AudioConfig | list[AudioConfig]), required: The audio source or list of sources to play. To learn more, see AudioSource.
  • loop (boolean), optional: Set to True to loop the audio source. Default: False.

For example, if you created background_audio in the previous example, you can play an audio file like this:

MY_AUDIO_FILE = "<PATH_TO_AUDIO_FILE>"
background_audio.play(MY_AUDIO_FILE)
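
To loop the file instead, pass loop=True:

background_audio.play(MY_AUDIO_FILE, loop=True)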
