Overview
LiveKit Agents supports multimodality, enabling your agents to communicate through multiple channels simultaneously. Agents can process and generate speech, text, images, and live video, allowing them to understand context from different sources and respond in the most appropriate format. This flexibility enables richer, more natural interactions where agents can see what users show them, read transcriptions of conversations, send text messages, and speak—all within a single session.
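For example, a single session can be wired for speech input, spoken replies, and text transcriptions at once. The following is a minimal sketch using the Python SDK's `AgentSession`, assuming the Deepgram, OpenAI, Cartesia, and Silero plugins are installed; swap in whichever providers your project uses.

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    # One session carries user audio, agent speech, and text transcriptions.
    session = AgentSession(
        stt=deepgram.STT(),                   # speech-to-text for user audio
        llm=openai.LLM(model="gpt-4o-mini"),  # reasoning over the conversation
        tts=cartesia.TTS(),                   # text-to-speech for agent replies
        vad=silero.VAD.load(),                # voice activity detection for turn-taking
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful multimodal assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```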
Modality options
Just as humans can see, hear, speak, and read, LiveKit agents can work with vision, audio, text, and transcriptions. LiveKit Agents supports three main modalities: speech and audio, text and transcriptions, and vision. You can build agents that use a single modality or combine several for richer, more flexible interactions. A short sketch of mixing voice and text in one session follows the table below.
| Modality | Description | Use cases |
|---|---|---|
| Speech and audio | Process realtime audio input from users' microphones, with support for speech-to-text, turn detection, and interruptions. | Voice assistants, call center automation, and voice-controlled applications. |
| Text and transcriptions | Handle text messages and transcriptions, enabling text-only sessions or hybrid voice and text interactions. | Chatbots, text-based customer support, and accessibility features for users who prefer typing. |
| Vision | Process images and live video feeds, enabling visual understanding and multimodal AI experiences. | Visual assistants that can see what users show them, screen sharing analysis, and image-based question answering. |
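The same session can accept and emit text alongside speech. The snippet below is a hedged sketch, assuming the `say()` and `generate_reply()` methods of the Python `AgentSession` API; it would run inside the entrypoint above, after `session.start()`.

```python
# Greet the user out loud; a synchronized text transcription is published
# alongside the audio so text-only clients can follow the conversation.
session.say("Hi! You can speak to me or send me a text message.")

# Feed a typed user message into the conversation; the agent's reply is
# delivered both as synthesized speech and as text.
session.generate_reply(user_input="Can you repeat that as a text summary?")
```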
In this section
Read more about each modality.