Overview
LiveKit Agents supports multimodality, enabling your agents to communicate through multiple channels simultaneously. Agents can process and generate speech, text, images, and live video, drawing context from multiple sources and responding in whichever format fits best. This flexibility enables richer, more natural interactions in which an agent can see what a user shows it, read conversation transcriptions, send text messages, and speak, all within a single session.
Modality options
Just as humans can see, hear, speak, and read, LiveKit agents can process and produce audio, text, images, and video. You can build agents that use a single modality or combine several for more flexible interactions.
| Modality | Description | Use cases |
|---|---|---|
| Speech and audio | Process realtime audio input from users' microphones, with support for speech-to-text, turn detection, and interruptions. Generate speech output with TTS. | Voice assistants, call center automation, and voice-controlled applications. |
| Text and transcriptions | Handle text messages and transcriptions, enabling text-only sessions or hybrid voice and text interactions. Send text responses and transcriptions. | Chatbots, text-based customer support, and accessibility features for users who prefer typing. |
| Images and video | Process images and live video feeds for visual understanding. Send images to the frontend with byte streams, or add a virtual avatar for lifelike video output. | Visual assistants, avatar-based agents, screen sharing analysis, and image-based question answering. |
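The hybrid voice-and-text pattern in the table can be sketched in a few lines. The snippet below is an illustrative, self-contained sketch only, not the LiveKit Agents API: the `Message` class and `route_reply` function are hypothetical names chosen for this example, standing in for the session plumbing a real agent would use.

```python
from dataclasses import dataclass

# Hypothetical container for illustration; a real LiveKit agent receives
# audio frames and text through its session, not through a class like this.
@dataclass
class Message:
    modality: str  # "audio", "text", or "image"
    content: str

def route_reply(incoming: Message) -> str:
    """Choose an output modality that mirrors how the user reached the agent.

    Illustrates the hybrid pattern from the table above: voice input gets a
    spoken (TTS) reply, while text and image input get text replies.
    """
    if incoming.modality == "audio":
        return "tts"   # synthesize speech for voice users
    return "text"      # typed messages and images are answered in text

# Usage: a spoken question is answered with speech, a typed one with text.
print(route_reply(Message("audio", "what's the weather?")))  # tts
print(route_reply(Message("text", "hello")))                 # text
```

In a real agent this decision is handled by the session configuration rather than hand-written routing; the sketch only makes the modality mapping explicit.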
In this section
Read more about each modality.