Overview
Speech capabilities are a core feature of LiveKit agents, enabling them to interact with users through voice. This guide covers the various speech features and functionalities available for agents.
LiveKit Agents provides a unified interface for controlling agent speech, whether you use the STT-LLM-TTS pipeline or a realtime model.
In this section
This page covers core speech control features like initiating speech, managing speech handles, and handling interruptions. The following pages in this section cover additional topics:
| Topic | Description |
|---|---|
| Audio customization | Cache TTS responses, customize pronunciation, and adjust speech volume. |
| Background audio | Add ambient sounds, thinking sounds, and on-demand audio playback. |
To learn more and see usage examples, see the following topics:
Text-to-speech (TTS)
TTS is a synthesis process that converts text into audio, giving AI agents a "voice."
Speech-to-speech
Multimodal, realtime APIs can understand speech input and generate speech output directly.
Instant connect
The instant connect feature reduces perceived connection time by capturing microphone input before the agent connection is established. This pre-connect audio buffer sends speech as context to the agent, avoiding awkward gaps between a user's connection and their ability to interact with an agent.
Microphone capture begins locally while the agent is connecting. Once the connection is established, the speech and metadata are sent over a byte stream with the topic `lk.agent.pre-connect-audio-buffer`. If no agent connects before the timeout, the buffer is discarded.
You can enable this feature using `withPreConnectAudio` (the exact method name varies slightly by SDK). In the JavaScript SDK, this functionality is exposed via `TrackPublishOptions`:
```javascript
await room.localParticipant.setMicrophoneEnabled(!enabled, undefined, {
  preConnectBuffer: true,
});
```

```swift
try await room.withPreConnectAudio(timeout: 10) {
    try await room.connect(url: serverURL, token: token)
} onError: { err in
    print("Pre-connect audio send failed:", err)
}
```

```kotlin
try {
    room.withPreconnectAudio {
        // Audio is being captured automatically
        // Perform other async setup
        val (url, token) = tokenService.fetchConnectionDetails()
        room.connect(
            url = url,
            token = token,
        )
        room.localParticipant.setMicrophoneEnabled(true)
    }
} catch (e: Throwable) {
    Log.e(TAG, "Error!")
}
```

```dart
try {
  await room.withPreConnectAudio(() async {
    // Audio is being captured automatically, perform other async setup
    // Get connection details from token service etc.
    final connectionDetails = await tokenService.fetchConnectionDetails();
    await room.connect(
      connectionDetails.serverUrl,
      connectionDetails.participantToken,
    );
    // Mic already enabled
  });
} catch (error) {
  print("Error: $error");
}
```
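Conceptually, the pre-connect buffer behaves like the following sketch. This is illustrative Python only; `PreConnectBuffer`, `capture`, and `flush` are hypothetical names, not the SDK API:

```python
import time

class PreConnectBuffer:
    """Illustrative sketch of the buffering behavior; not the SDK implementation."""

    def __init__(self, timeout: float = 10.0):
        self.timeout = timeout
        self.frames: list[bytes] = []
        self.started = time.monotonic()

    def capture(self, frame: bytes) -> None:
        # Microphone frames accumulate locally while the agent is connecting
        self.frames.append(frame)

    def flush(self, connected_at: float) -> list[bytes]:
        # If no agent connected before the timeout, the buffer is discarded
        if connected_at - self.started > self.timeout:
            self.frames.clear()
        return self.frames

buf = PreConnectBuffer(timeout=10.0)
buf.capture(b"frame-1")
buf.capture(b"frame-2")
print(len(buf.flush(connected_at=buf.started + 1.0)))  # -> 2
```

The key design point is that capture starts before any network round trip completes, so the agent receives the user's first words as context rather than silence.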
Preemptive speech generation
Preemptive generation allows the agent to begin generating a response before the user's end of turn is committed. The response is based on partial transcription or early signals from user input, helping reduce perceived response delay and improving conversational flow.
When enabled, the agent starts generating a response as soon as the final transcript is available. If the chat context or tools change in the on_user_turn_completed node, the preemptive response is canceled and replaced with a new one based on the final transcript.
This feature reduces latency when both of the following are true:

- The STT node returns the final transcript faster than VAD emits the `end_of_speech` event.
- The turn detection model is enabled.
You can enable this feature for STT-LLM-TTS pipeline agents using the preemptive_generation parameter for AgentSession:
```python
session = AgentSession(
    preemptive_generation=True,
    ...  # STT, LLM, TTS, etc.
)
```

```typescript
const session = new voice.AgentSession({
  // ... llm, stt, etc.
  voiceOptions: {
    preemptiveGeneration: true,
  },
});
```
Preemptive generation doesn't guarantee reduced latency. Use Agent observability to validate and fine-tune agent performance.
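The cancel-and-replace behavior described above can be sketched with plain asyncio tasks. This is an illustration of the control flow, not the framework's implementation; `generate_response` and `preemptive_flow` are hypothetical names:

```python
import asyncio
import contextlib

async def generate_response(transcript: str) -> str:
    # Stand-in for an LLM call (hypothetical; not the LiveKit API)
    await asyncio.sleep(0.05)
    return f"reply to: {transcript}"

async def preemptive_flow(chat_ctx_changed: bool) -> str:
    # Begin generating as soon as the final transcript arrives,
    # before the user's end of turn is committed
    task = asyncio.create_task(generate_response("how is the weather today?"))
    if chat_ctx_changed:
        # The chat context or tools changed in on_user_turn_completed:
        # cancel the preemptive response and regenerate
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
        task = asyncio.create_task(
            generate_response("how is the weather today? [updated context]")
        )
    return await task

print(asyncio.run(preemptive_flow(chat_ctx_changed=True)))
# -> reply to: how is the weather today? [updated context]
```

When the context is unchanged, the preemptive task runs to completion and its result is used directly, which is where the latency saving comes from.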
Initiating speech
By default, the agent waits for user input before responding—the Agents framework automatically handles response generation.
In some cases, though, the agent might need to initiate the conversation. For example, it might greet the user at the start of a session or check in after a period of silence. For fixed phrases like these, you can cache TTS and use pre-synthesized audio to avoid redundant TTS calls and reduce latency.
session.say
To have the agent speak a predefined message, use session.say(). This triggers the configured TTS to synthesize speech and play it back to the user.
You can also optionally provide pre-synthesized audio for playback. This skips the TTS step and reduces response time.
The say method requires a TTS plugin. If you're using a realtime model, you need to add a TTS plugin to your session or use the generate_reply() method instead.
```python
await session.say(
    "Hello. How can I help you today?",
    allow_interruptions=False,
)
```

```typescript
await session.say('Hello. How can I help you today?', {
  allowInterruptions: false,
});
```
Parameters
You can call session.say() with the following options:
- `text` only: Synthesizes speech using TTS; the text is added to the transcript and chat context (unless `add_to_chat_ctx=False`).
- `audio` only: Plays the audio, which is not added to the transcript or chat context.
- `text` + `audio`: Plays the provided audio, and the `text` is used for the transcript and chat context.

- `text` (`str | AsyncIterable[str]`): Text for TTS playback, added to the transcript and, by default, to the chat context.
- `audio` (`AsyncIterable[rtc.AudioFrame]`): Pre-synthesized audio to play. If used without `text`, nothing is added to the transcript or chat context.
- `allow_interruptions` (`bool`, default: `True`): If `True`, allow the user to interrupt the agent while speaking.
- `add_to_chat_ctx` (`bool`, default: `True`): If `True`, add the `text` to the agent's chat context after playback. Has no effect if `text` is not provided.
Returns
Returns a SpeechHandle object.
Events
This method triggers a speech_created event.
generate_reply
To make conversations more dynamic, use session.generate_reply() to prompt the LLM to generate a response.
There are two ways to use `generate_reply()`:

1. Give the agent instructions to generate a response:

   ```python
   session.generate_reply(
       instructions="greet the user and ask where they are from",
   )
   ```

   ```typescript
   session.generateReply({
     instructions: 'greet the user and ask where they are from',
   });
   ```

2. Provide the user's input via text:

   ```python
   session.generate_reply(
       user_input="how is the weather today?",
   )
   ```

   ```typescript
   session.generateReply({
     userInput: 'how is the weather today?',
   });
   ```
How instructions interact with session-level instructions
The instructions parameter behaves differently depending on the model type:
- STT-LLM-TTS pipeline: `instructions` are appended to the agent's session-level instructions, and both are active for the reply. For full control over the instructions used for a reply, use a custom chat context (available in Python).
- Realtime models: `instructions` replace the agent's session-level instructions for that response only. The `Agent(instructions=...)` you set at startup doesn't apply to that reply.

If you're using a realtime model and need to preserve the agent's persona or context, include the relevant session instructions explicitly:

```python
await session.generate_reply(
    instructions=f"{session.current_agent.instructions}\n\nGreet the user warmly.",
)
```

```typescript
session.generateReply({
  instructions: `${session.currentAgent.instructions}\n\nGreet the user warmly.`,
});
```
Using a custom chat context
For pipeline agents, you can use the chat_ctx parameter to generate_reply to fully control the context used for that reply, including replacing the agent's session-level instructions entirely rather than appending to them.
This is useful when the instructions parameter isn't enough. For example, if you need to switch contexts for a specific reply, exclude certain messages from the conversation history, or inject additional context before the LLM call. Pass a custom chat context and omit the instructions parameter.
The following example uses a modified copy of the agent's chat context:
```python
# Copy the current chat context to modify for this reply
ctx = session.current_agent.chat_ctx.copy()

# Modify context as needed: replace instructions, trim history, inject context, etc.

# Then pass the modified context to generate_reply without instructions
await session.generate_reply(chat_ctx=ctx)
```
Parameters
The generate_reply() method accepts the following parameters. For a full list of parameters, see the Python reference and Node.js reference.
- `user_input` (`string`)
- `instructions` (`string`)
- `allow_interruptions` (`bool`, default: `True`): If `True`, allow the user to interrupt the agent while speaking.
- `chat_ctx` (`ChatContext`): The chat context to use for generating the reply. Defaults to the agent's current chat context. Pass a modified copy to fully control the context for this reply. To learn more, see Using a custom chat context.
Returns
Returns a SpeechHandle object.
Events
This method triggers a speech_created event.
Controlling agent speech
You can control agent speech using the SpeechHandle object returned by the say() and generate_reply() methods, and by allowing user interruptions.
SpeechHandle
The say() and generate_reply() methods return a SpeechHandle object, which lets you track the state of the agent's speech. This can be useful for coordinating follow-up actions—for example, notifying the user before ending the call.
```python
# The following is a shortcut for:
# handle = session.say("Goodbye for now.", allow_interruptions=False)
# await handle.wait_for_playout()

await session.say("Goodbye for now.", allow_interruptions=False)
```

```typescript
// The following is a shortcut for:
// const handle = session.say('Goodbye for now.', { allowInterruptions: false });
// await handle.waitForPlayout();

await session.say('Goodbye for now.', { allowInterruptions: false });
```
You can wait for the agent to finish speaking before continuing:
```python
handle = session.generate_reply(
    instructions="Tell the user we're about to run some slow operations."
)

# perform an operation that takes time
# ...

await handle  # finally wait for the speech
```

```typescript
const handle = session.generateReply({
  instructions: "Tell the user we're about to run some slow operations.",
});

// perform an operation that takes time
// ...

await handle.waitForPlayout(); // finally wait for the speech
```
The following example makes a web request for the user, and cancels the request when the user interrupts:
```python
async with aiohttp.ClientSession() as client_session:
    web_request = client_session.get('https://api.example.com/data')
    handle = await session.generate_reply(
        instructions="Tell the user we're processing their request."
    )
    if handle.interrupted:
        # if the user interrupts, cancel the web_request too
        web_request.cancel()
```

```typescript
import { Task } from '@livekit/agents';

const webRequestTask = Task.from(async (controller) => {
  const response = await fetch('https://api.example.com/data', {
    signal: controller.signal,
  });
  return response.json();
});

const handle = session.generateReply({
  instructions: "Tell the user we're processing their request.",
});
await handle.waitForPlayout();

if (handle.interrupted) {
  // if the user interrupts, cancel the web request too
  webRequestTask.cancel();
}
```
SpeechHandle has an API similar to asyncio.Future, allowing you to add a callback:
```python
handle = session.say("Hello world")
handle.add_done_callback(lambda _: print("speech done"))
```

```typescript
const handle = session.say('Hello world');
handle.then(() => console.log('speech done'));
```
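To see what "similar to `asyncio.Future`" means concretely, here is the analogous pattern with a plain `asyncio.Future`. This is standard-library Python only, not LiveKit code:

```python
import asyncio

async def main() -> list[str]:
    done: list[str] = []
    fut = asyncio.get_running_loop().create_future()
    # Like SpeechHandle.add_done_callback: fires once the future resolves
    fut.add_done_callback(lambda _: done.append("speech done"))
    fut.set_result(None)    # analogous to playback finishing
    await asyncio.sleep(0)  # give the event loop a tick to run the callback
    return done

print(asyncio.run(main()))  # -> ['speech done']
```

As with futures, the callback is invoked by the event loop after completion, so it should be fast and non-blocking.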
Getting the current speech handle
The agent session's active speech handle, if any, is available with the current_speech property. If no speech is active, this property returns None. Otherwise, it returns the active SpeechHandle.
Use the active speech handle to coordinate with the speaking state. For instance, you can ensure that a hang-up occurs only after the current speech has finished, rather than mid-speech:
```python
# to hang up the call as part of a function call
@function_tool
async def end_call(self, ctx: RunContext):
    """Use this tool when the user has signaled they wish to end the current call. The session ends automatically after invoking this tool."""
    await ctx.wait_for_playout()  # let the agent finish speaking
    # call API to delete_room...
```

```typescript
const endCall = llm.tool({
  description: 'End the call.',
  parameters: z.object({
    reason: z
      .enum([
        'assistant-ended-call',
        'sip-call-transferred',
        'user-ended-call',
        'unknown-error',
      ])
      .describe('The reason to end the call'),
  }),
  execute: async ({ reason }, { ctx }) => {
    ctx.session.generateReply({
      userInput: `You are about to end the call due to ${reason}, notify the user with one last message`,
    });
    await ctx.waitForPlayout();
    ctx.session.shutdown({ reason });
  },
});
```
Interruptions
By default, the agent stops speaking when it detects that the user has started speaking. You can customize this behavior. To learn more, see Interruptions in the Turn detection topic.
Additional resources
To learn more, see the following resources.
Audio customization
Customize pronunciation, adjust speech volume, and cache TTS responses.
Background audio
Add ambient sounds, thinking sounds, and on-demand audio playback.
Voice AI quickstart
Use the quickstart as a starting base for adding audio code.
Speech-related event
Learn more about the speech_created event, triggered when new agent speech is created.
Text-to-speech (TTS)
TTS models for pipeline agents.
Speech-to-speech
Realtime models that understand speech input and generate speech output directly.