Overview
Speech capabilities are a core feature of LiveKit agents, enabling them to interact with users through voice. This guide covers the various speech features and functionalities available for agents.
LiveKit Agents provides a unified interface for controlling agent speech, whether you use the STT-LLM-TTS pipeline or a realtime model.
In this section
This page covers core speech control features like initiating speech, managing speech handles, and handling interruptions. The following pages in this section cover additional topics:
| Topic | Description |
|---|---|
| Audio customization | Cache TTS responses, customize pronunciation, and adjust speech volume. |
| Background audio | Add ambient sounds, thinking sounds, and on-demand audio playback. |
To learn more and see usage examples, see the following topics:
Text-to-speech (TTS)
TTS is a synthesis process that converts text into audio, giving AI agents a "voice."
Speech-to-speech
Multimodal, realtime APIs can understand speech input and generate speech output directly.
Instant connect
The instant connect feature reduces perceived connection time by capturing microphone input before the agent connection is established. This pre-connect audio buffer sends speech as context to the agent, avoiding awkward gaps between a user's connection and their ability to interact with an agent.
Microphone capture begins locally while the agent is connecting. Once the connection is established, the speech and metadata are sent over a byte stream with the topic `lk.agent.pre-connect-audio-buffer`. If no agent connects before the timeout, the buffer is discarded.
You can enable this feature using `withPreConnectAudio` (the exact method name varies slightly by SDK). In the JavaScript SDK, this functionality is exposed via `TrackPublishOptions`:
```javascript
await room.localParticipant.setMicrophoneEnabled(!enabled, undefined, {
  preConnectBuffer: true,
});
```

```swift
try await room.withPreConnectAudio(timeout: 10) {
    try await room.connect(url: serverURL, token: token)
} onError: { err in
    print("Pre-connect audio send failed:", err)
}
```

```kotlin
try {
    room.withPreconnectAudio {
        // Audio is being captured automatically
        // Perform other async setup
        val (url, token) = tokenService.fetchConnectionDetails()
        room.connect(
            url = url,
            token = token,
        )
        room.localParticipant.setMicrophoneEnabled(true)
    }
} catch (e: Throwable) {
    Log.e(TAG, "Error!")
}
```

```dart
try {
  await room.withPreConnectAudio(() async {
    // Audio is being captured automatically, perform other async setup
    // Get connection details from token service etc.
    final connectionDetails = await tokenService.fetchConnectionDetails();
    await room.connect(
      connectionDetails.serverUrl,
      connectionDetails.participantToken,
    );
    // Mic already enabled
  });
} catch (error) {
  print("Error: $error");
}
```
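Conceptually, the pre-connect buffer behaves like the following sketch. This is illustrative Python only; `PreConnectBuffer`, `capture`, and `flush` are hypothetical names, not the SDK API:

```python
import time

class PreConnectBuffer:
    """Illustrative sketch of the buffering behavior; not the SDK implementation."""

    def __init__(self, timeout: float = 10.0):
        self.timeout = timeout
        self.frames: list[bytes] = []
        self.started = time.monotonic()

    def capture(self, frame: bytes) -> None:
        # Microphone frames accumulate locally while the agent is connecting
        self.frames.append(frame)

    def flush(self, connected_at: float) -> list[bytes]:
        # If no agent connected before the timeout, the buffer is discarded
        if connected_at - self.started > self.timeout:
            self.frames.clear()
        return self.frames

buf = PreConnectBuffer(timeout=10.0)
buf.capture(b"frame-1")
buf.capture(b"frame-2")
print(len(buf.flush(connected_at=buf.started + 1.0)))  # -> 2
```

The key design point is that capture starts before any network round trip completes, so the agent receives the user's first words as context rather than silence.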
Preemptive speech generation
Preemptive generation allows the agent to begin generating a response before the user's end of turn is committed. The response is based on partial transcription or early signals from user input, helping reduce perceived response delay and improving conversational flow.
When enabled, the agent starts generating a response as soon as the final transcript is available. If the chat context or tools change in the on_user_turn_completed node, the preemptive response is canceled and replaced with a new one based on the final transcript.
This feature reduces latency when both of the following are true:

- The STT node returns the final transcript faster than VAD emits the `end_of_speech` event.
- The turn detection model is enabled.
You can enable this feature for STT-LLM-TTS pipeline agents using the preemptive_generation parameter for AgentSession:
```python
session = AgentSession(
    preemptive_generation=True,
    ...  # STT, LLM, TTS, etc.
)
```

```typescript
const session = new voice.AgentSession({
  // ... llm, stt, etc.
  voiceOptions: {
    preemptiveGeneration: true,
  },
});
```
Preemptive generation doesn't guarantee reduced latency. Use Agent observability to validate and fine-tune agent performance.
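The cancel-and-replace behavior described above can be sketched with plain asyncio tasks. This is an illustration of the control flow, not the framework's implementation; `generate_response` and `preemptive_flow` are hypothetical names:

```python
import asyncio
import contextlib

async def generate_response(transcript: str) -> str:
    # Stand-in for an LLM call (hypothetical; not the LiveKit API)
    await asyncio.sleep(0.05)
    return f"reply to: {transcript}"

async def preemptive_flow(chat_ctx_changed: bool) -> str:
    # Begin generating as soon as the final transcript arrives,
    # before the user's end of turn is committed
    task = asyncio.create_task(generate_response("how is the weather today?"))
    if chat_ctx_changed:
        # The chat context or tools changed in on_user_turn_completed:
        # cancel the preemptive response and regenerate
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
        task = asyncio.create_task(
            generate_response("how is the weather today? [updated context]")
        )
    return await task

print(asyncio.run(preemptive_flow(chat_ctx_changed=True)))
# -> reply to: how is the weather today? [updated context]
```

When the context is unchanged, the preemptive task runs to completion and its result is used directly, which is where the latency saving comes from.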
Initiating speech
By default, the agent waits for user input before responding—the Agents framework automatically handles response generation.
In some cases, though, the agent might need to initiate the conversation. For example, it might greet the user at the start of a session or check in after a period of silence. For fixed phrases like these, you can cache TTS and use pre-synthesized audio to avoid redundant TTS calls and reduce latency.
session.say
To have the agent speak a predefined message, use session.say(). This triggers the configured TTS to synthesize speech and play it back to the user.
You can also optionally provide pre-synthesized audio for playback. This skips the TTS step and reduces response time.
The say method requires a TTS plugin. If you're using a realtime model, you need to add a TTS plugin to your session or use the generate_reply() method instead.
```python
await session.say(
    "Hello. How can I help you today?",
    allow_interruptions=False,
)
```

```typescript
await session.say('Hello. How can I help you today?', {
  allowInterruptions: false,
});
```
Parameters
You can call session.say() with the following options:
- `text` only: Synthesizes speech using TTS; the text is added to the transcript and chat context (unless `add_to_chat_ctx=False`).
- `audio` only: Plays the audio, which is not added to the transcript or chat context.
- `text` + `audio`: Plays the provided audio, and the `text` is used for the transcript and chat context.

- `text` (`str | AsyncIterable[str]`): Text for TTS playback, added to the transcript and, by default, to the chat context.
- `audio` (`AsyncIterable[rtc.AudioFrame]`): Pre-synthesized audio to play. If used without `text`, nothing is added to the transcript or chat context.
- `allow_interruptions` (`bool`, default: `True`): If `True`, allow the user to interrupt the agent while speaking.
- `add_to_chat_ctx` (`bool`, default: `True`): If `True`, add the `text` to the agent's chat context after playback. Has no effect if `text` is not provided.
Returns
Returns a SpeechHandle object.
Events
This method triggers a speech_created event.
generate_reply
To make conversations more dynamic, use session.generate_reply() to prompt the LLM to generate a response.
There are two ways to use `generate_reply()`:

1. Give the agent instructions to generate a response:

   ```python
   session.generate_reply(
       instructions="greet the user and ask where they are from",
   )
   ```

   ```typescript
   session.generateReply({
     instructions: 'greet the user and ask where they are from',
   });
   ```

2. Provide the user's input via text:

   ```python
   session.generate_reply(
       user_input="how is the weather today?",
   )
   ```

   ```typescript
   session.generateReply({
     userInput: 'how is the weather today?',
   });
   ```
How instructions interact with session-level instructions
The instructions parameter behaves differently depending on the model type:
- STT-LLM-TTS pipeline: `instructions` are appended to the agent's session-level instructions, and both are active for the reply. For full control over the instructions used for a reply, use a custom chat context (available in Python).
- Realtime models: `instructions` replace the agent's session-level instructions for that response only. The `Agent(instructions=...)` you set at startup doesn't apply to that reply.

If you're using a realtime model and need to preserve the agent's persona or context, include the relevant session instructions explicitly:

```python
await session.generate_reply(
    instructions=f"{session.current_agent.instructions}\n\nGreet the user warmly.",
)
```

```typescript
session.generateReply({
  instructions: `${session.currentAgent.instructions}\n\nGreet the user warmly.`,
});
```
Using a custom chat context
For pipeline agents, you can use the chat_ctx parameter to generate_reply to fully control the context used for that reply, including replacing the agent's session-level instructions entirely rather than appending to them.
This is useful when the instructions parameter isn't enough. For example, if you need to switch contexts for a specific reply, exclude certain messages from the conversation history, or inject additional context before the LLM call. Pass a custom chat context and omit the instructions parameter.
The following example uses a modified copy of the agent's chat context:
```python
# Copy the current chat context to modify for this reply
ctx = session.current_agent.chat_ctx.copy()

# Modify context as needed: replace instructions, trim history, inject context, etc.

# Then pass the modified context to generate_reply without instructions
await session.generate_reply(chat_ctx=ctx)
```
Parameters
The generate_reply() method accepts the following parameters. For a full list of parameters, see the Python reference and Node.js reference.
- `user_input` (`string`)
- `instructions` (`string`)
- `allow_interruptions` (`bool`, default: `True`): If `True`, allow the user to interrupt the agent while speaking.
- `chat_ctx` (`ChatContext`): The chat context to use for generating the reply. Defaults to the agent's current chat context. Pass a modified copy to fully control the context for this reply. To learn more, see Using a custom chat context.
Returns
Returns a SpeechHandle object.
Events
This method triggers a speech_created event.
Controlling agent speech
You can control agent speech using the SpeechHandle object returned by the say() and generate_reply() methods, and by allowing user interruptions.
SpeechHandle
The say() and generate_reply() methods return a SpeechHandle object, which lets you track the state of the agent's speech. This can be useful for coordinating follow-up actions—for example, notifying the user before ending the call.
```python
# The following is a shortcut for:
# handle = session.say("Goodbye for now.", allow_interruptions=False)
# await handle.wait_for_playout()

await session.say("Goodbye for now.", allow_interruptions=False)
```

```typescript
// The following is a shortcut for:
// const handle = session.say('Goodbye for now.', { allowInterruptions: false });
// await handle.waitForPlayout();

await session.say('Goodbye for now.', { allowInterruptions: false });
```
You can wait for the agent to finish speaking before continuing:
```python
handle = session.generate_reply(
    instructions="Tell the user we're about to run some slow operations."
)

# perform an operation that takes time
# ...

await handle  # finally wait for the speech
```

```typescript
const handle = session.generateReply({
  instructions: "Tell the user we're about to run some slow operations.",
});

// perform an operation that takes time
// ...

await handle.waitForPlayout(); // finally wait for the speech
```
The following example makes a web request for the user, and cancels the request when the user interrupts:
```python
async with aiohttp.ClientSession() as client_session:
    web_request = client_session.get('https://api.example.com/data')
    handle = await session.generate_reply(
        instructions="Tell the user we're processing their request."
    )
    if handle.interrupted:
        # if the user interrupts, cancel the web_request too
        web_request.cancel()
```

```typescript
import { Task } from '@livekit/agents';

const webRequestTask = Task.from(async (controller) => {
  const response = await fetch('https://api.example.com/data', {
    signal: controller.signal,
  });
  return response.json();
});

const handle = session.generateReply({
  instructions: "Tell the user we're processing their request.",
});
await handle.waitForPlayout();

if (handle.interrupted) {
  // if the user interrupts, cancel the web request too
  webRequestTask.cancel();
}
```
SpeechHandle has an API similar to asyncio.Future, allowing you to add a callback:
```python
handle = session.say("Hello world")
handle.add_done_callback(lambda _: print("speech done"))
```

```typescript
const handle = session.say('Hello world');
handle.then(() => console.log('speech done'));
```
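To see what "similar to `asyncio.Future`" means concretely, here is the analogous pattern with a plain `asyncio.Future`. This is standard-library Python only, not LiveKit code:

```python
import asyncio

async def main() -> list[str]:
    done: list[str] = []
    fut = asyncio.get_running_loop().create_future()
    # Like SpeechHandle.add_done_callback: fires once the future resolves
    fut.add_done_callback(lambda _: done.append("speech done"))
    fut.set_result(None)    # analogous to playback finishing
    await asyncio.sleep(0)  # give the event loop a tick to run the callback
    return done

print(asyncio.run(main()))  # -> ['speech done']
```

As with futures, the callback is invoked by the event loop after completion, so it should be fast and non-blocking.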
Getting the current speech handle
The agent session's active speech handle, if any, is available with the current_speech property. If no speech is active, this property returns None. Otherwise, it returns the active SpeechHandle.
Use the active speech handle to coordinate with the speaking state. For instance, you can ensure that a hang-up occurs only after the current speech has finished, rather than mid-speech:
```python
# to hang up the call as part of a function call
@function_tool
async def end_call(self, ctx: RunContext):
    """Use this tool when the user has signaled they wish to end the current call. The session ends automatically after invoking this tool."""
    await ctx.wait_for_playout()  # let the agent finish speaking
    # call API to delete_room...
```

```typescript
const endCall = llm.tool({
  description: 'End the call.',
  parameters: z.object({
    reason: z
      .enum([
        'assistant-ended-call',
        'sip-call-transferred',
        'user-ended-call',
        'unknown-error',
      ])
      .describe('The reason to end the call'),
  }),
  execute: async ({ reason }, { ctx }) => {
    ctx.session.generateReply({
      userInput: `You are about to end the call due to ${reason}, notify the user with one last message`,
    });
    await ctx.waitForPlayout();
    ctx.session.shutdown({ reason });
  },
});
```
Interruptions
By default, the agent stops speaking when it detects that the user has started speaking. You can customize this behavior. To learn more, see Interruptions in the Turn detection topic.
Additional resources
To learn more, see the following resources.
Audio customization
Customize pronunciation, adjust speech volume, and cache TTS responses.
Background audio
Add ambient sounds, thinking sounds, and on-demand audio playback.
Voice AI quickstart
Use the quickstart as a starting base for adding audio code.
Speech-related event
Learn more about the speech_created event, triggered when new agent speech is created.
Text-to-speech (TTS)
TTS models for pipeline agents.
Speech-to-speech
Realtime models that understand speech input and generate speech output directly.