Overview
A single agent can serve both voice and text users in the same session, but the two input types can benefit from different instructions. Spoken input arrives as imperfect transcription and can contain relative expressions (for example, "next Tuesday"), self-corrections, and filler words, so the LLM might need additional guidance to interpret it correctly. Typed input, on the other hand, is usually more precise and literal, so these same instructions can degrade text responses by adding spoken-style confirmations or stripping useful formatting.
The Instructions class holds two variants of your system prompt, one for audio and one for text. The framework applies the variant that matches each turn's input modality before calling the LLM, so voice turns get the audio prompt and text turns get the text prompt automatically.
In Python, Instructions is exported from livekit.agents.beta and is subject to change. In Node.js, it's a stable export of the main llm namespace (and also re-exported from beta).
Define instructions per modality
Create an Instructions object with audio and text variants and pass it wherever you would pass an instructions string, such as the Agent constructor. The text variant is optional and falls back to the audio variant when omitted.
Python:

    from livekit.agents import Agent
    from livekit.agents.beta import Instructions

    instructions = Instructions(
        audio=(
            "You are a scheduling assistant. The user is speaking, so their input may be "
            "imperfect. Resolve spoken expressions like 'next Tuesday' to concrete dates, "
            "honor verbal self-corrections, and confirm the date and time out loud before booking."
        ),
        text=(
            "You are a scheduling assistant. The user is typing, so take their input literally. "
            "Accept exact dates and times in any common format and skip verbal confirmations."
        ),
    )


    class SchedulingAgent(Agent):
        def __init__(self) -> None:
            super().__init__(instructions=instructions)

Node.js:

    import { llm, voice } from '@livekit/agents';

    const instructions = new llm.Instructions({
      audio:
        'You are a scheduling assistant. The user is speaking, so their input may be ' +
        "imperfect. Resolve spoken expressions like 'next Tuesday' to concrete dates, " +
        'honor verbal self-corrections, and confirm the date and time out loud before booking.',
      text:
        'You are a scheduling assistant. The user is typing, so take their input literally. ' +
        'Accept exact dates and times in any common format and skip verbal confirmations.',
    });

    class SchedulingAgent extends voice.Agent {
      constructor() {
        super({ instructions });
      }
    }
How variants are applied
During a session, the framework selects the variant that matches the input modality of each turn: the audio variant for spoken turns and the text variant for typed turns. Both variants are preserved across turns, so an agent that handles a voice turn followed by a text turn uses the correct prompt for each.
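The selection rule above, together with the fallback to the audio variant when no text variant is given, can be modelled in a few lines of plain Python. This is a toy sketch for illustration only, not the framework's implementation: `PromptVariants` and `select_prompt` are hypothetical names that are not part of livekit.agents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVariants:
    """Hypothetical stand-in for an object that holds two prompt variants."""

    audio: str
    text: Optional[str] = None  # optional; text turns fall back to audio


def select_prompt(variants: PromptVariants, modality: str) -> str:
    """Return the prompt for one turn, given that turn's input modality."""
    if modality == "text" and variants.text is not None:
        return variants.text
    return variants.audio


variants = PromptVariants(
    audio="Confirm dates out loud.",
    text="Take dates literally.",
)

# A voice turn followed by a text turn: each turn gets its own prompt.
prompts = [select_prompt(variants, m) for m in ["audio", "text"]]

# With no text variant defined, a text turn reuses the audio prompt.
audio_only = PromptVariants(audio="Confirm dates out loud.")
fallback = select_prompt(audio_only, "text")
```

Because the variants object itself never changes between turns, the order of voice and text turns does not matter in this model.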
Select the active variant
When you generate a reply manually, specify the variant with the input_modality (Python) or inputModality (Node.js) parameter:
Python:

    # Use the audio variant for this reply
    session.generate_reply(input_modality="audio")

Node.js:

    // Use the audio variant for this reply
    session.generateReply({ inputModality: 'audio' });
To explicitly set the active variant, use as_modality (Python) or asModality (Node.js). This returns a copy of the instructions with the selected variant active. Both variants are preserved, so you can switch between them as needed.
Python:

    # Return a copy whose active value is the text variant
    text_first = instructions.as_modality("text")

Node.js:

    // Return a copy whose active value is the text variant
    const textFirst = instructions.asModality('text');
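To make the copy semantics concrete, here is a toy model of a string subclass that carries both variants and exposes one of them as its string value. It is a sketch under stated assumptions, not the library's code: `ToyInstructions` is a hypothetical class, and the choice of the audio variant as the default active value is an assumption of this sketch.

```python
from typing import Optional


class ToyInstructions(str):
    """Hypothetical str subclass whose string value is the active variant."""

    def __new__(cls, audio: str, text: Optional[str] = None, active: str = "audio"):
        # Fall back to the audio variant when no text variant is given.
        variants = {"audio": audio, "text": audio if text is None else text}
        obj = super().__new__(cls, variants[active])
        obj.variants = variants  # both variants travel with every copy
        return obj

    def as_modality(self, modality: str) -> "ToyInstructions":
        # Build a new object; the original is left unchanged.
        return ToyInstructions(
            self.variants["audio"], self.variants["text"], active=modality
        )


original = ToyInstructions(audio="Confirm out loud.", text="Skip confirmations.")
text_first = original.as_modality("text")
back_to_audio = text_first.as_modality("audio")
```

In this model, switching modality never loses information: `text_first` still holds the audio variant, so switching back yields the original value.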
Compose instructions
You can build instructions from reusable pieces while keeping both variants intact. A shared base prompt can be combined with modality-specific guidance using concatenation and templating.
In Python, Instructions subclasses str. Use + to concatenate and format to substitute values. Both handle each variant separately:
    from livekit.agents.beta import Instructions

    base = Instructions(
        audio="You are Alex, a scheduling assistant.\n{modality_specific}",
        text="You are Alex, a scheduling assistant.\n{modality_specific}",
    )
    modality_specific = Instructions(
        audio="Resolve spoken dates and confirm out loud.",
        text="Accept literal dates and skip confirmations.",
    )

    # `format` applies to both variants at once
    instructions = base.format(modality_specific=modality_specific)

    # `+` also works and preserves both variants
    instructions = instructions + "\nThe current date is 2026-05-29."
In Node.js, use the Instructions.tpl tagged template to compose with template literals, or concatInstructions to join a mix of strings and Instructions. Both handle each variant separately:
    import { llm } from '@livekit/agents';

    const modalitySpecific = new llm.Instructions({
      audio: 'Resolve spoken dates and confirm out loud.',
      text: 'Accept literal dates and skip confirmations.',
    });

    // `tpl` interpolates each variant from any embedded Instructions
    const instructions = llm.Instructions.tpl`You are Alex, a scheduling assistant.
    ${modalitySpecific}
    The current date is 2026-05-29.`;

    // `concatInstructions` joins strings and Instructions, preserving both variants
    const combined = llm.concatInstructions('Base prompt. ', modalitySpecific);
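What "handle each variant separately" means can be illustrated without the library, using plain dictionaries keyed by modality. The helpers below (`pick`, `format_variants`, `concat_variants`) are hypothetical names for this sketch and do not exist in either SDK; the sketch only models the idea that a plain string contributes to both variants while a modality-aware piece contributes its matching variant.

```python
from typing import Dict, Union

Variants = Dict[str, str]  # keys: "audio" and "text"
Piece = Union[str, Variants]
MODALITIES = ("audio", "text")


def pick(piece: Piece, modality: str) -> str:
    """A plain string applies to both modalities; a dict is per-modality."""
    return piece if isinstance(piece, str) else piece[modality]


def format_variants(template: Variants, **values: Piece) -> Variants:
    """Fill each variant's placeholders with the same-modality values."""
    return {
        m: template[m].format(**{k: pick(v, m) for k, v in values.items()})
        for m in MODALITIES
    }


def concat_variants(*pieces: Piece) -> Variants:
    """Join pieces once per modality, so both variants are preserved."""
    return {m: "".join(pick(p, m) for p in pieces) for m in MODALITIES}


base = {
    "audio": "You are Alex, a scheduling assistant.\n{modality_specific}",
    "text": "You are Alex, a scheduling assistant.\n{modality_specific}",
}
modality_specific = {
    "audio": "Resolve spoken dates and confirm out loud.",
    "text": "Accept literal dates and skip confirmations.",
}

composed = format_variants(base, modality_specific=modality_specific)
composed = concat_variants(composed, "\nThe current date is 2026-05-29.")
```

After composition, the audio variant contains only the audio guidance and the text variant only the text guidance, while the shared base prompt and date line appear in both.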
Customize built-in tasks
Prebuilt tasks ship with their own default prompts. The beta InstructionParts type lets you customize those prompts without rewriting them. Set persona to change the agent's identity and extra to append domain-specific context. Leave a field unset to keep the task's built-in default, or set it to an empty string to remove that section entirely. Each field accepts a plain string or an Instructions object, so customizations can themselves be modality-aware.
To apply a customization, pass an InstructionParts object as a task's instructions argument:
    from livekit.agents.beta import Instructions
    from livekit.agents.beta.workflows import GetEmailTask, InstructionParts

    task = GetEmailTask(
        instructions=InstructionParts(
            persona="You are Riley, a friendly intake assistant collecting a contact email.",
            # `extra` is itself modality-aware: confirm out loud for voice, stay quiet for text
            extra=Instructions(
                audio="Confirm the spelling out loud, letter by letter, for unusual domains.",
                text="Accept the email exactly as typed; only re-prompt if it's clearly malformed.",
            ),
        )
    )
For a complete example that runs the task inside a function tool, see the email registration example.
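The difference between leaving a field unset and setting it to an empty string is easy to miss, so the following sketch models it in plain Python. Everything here is hypothetical: `build_prompt`, `resolve`, and the two default strings are invented for illustration and are not the prompts or functions that the prebuilt tasks actually use.

```python
from typing import Optional

# Hypothetical built-in defaults for a task; the real defaults are not shown here.
DEFAULT_PERSONA = "You are an assistant collecting an email address."
DEFAULT_EXTRA = "Ask the user to repeat anything unclear."


def resolve(custom: Optional[str], default: str) -> str:
    """None means 'keep the default'; any string, even empty, overrides it."""
    return default if custom is None else custom


def build_prompt(persona: Optional[str] = None, extra: Optional[str] = None) -> str:
    sections = [resolve(persona, DEFAULT_PERSONA), resolve(extra, DEFAULT_EXTRA)]
    # Empty sections are dropped, which is how "" removes a section entirely.
    return "\n".join(section for section in sections if section)


defaults_kept = build_prompt()
custom_persona = build_prompt(persona="You are Riley, a friendly intake assistant.")
extra_removed = build_prompt(extra="")
```

In this model, `custom_persona` replaces only the persona section and keeps the default extra section, while `extra_removed` contains the default persona and nothing else.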
Additional resources
Complete, runnable example agents that set different instructions for voice and text users: