Multimodality overview

Build agents that communicate through multiple channels for richer, more natural interactions.

Overview

LiveKit Agents supports multimodality, enabling your agents to communicate through multiple channels simultaneously. Agents can process and generate speech, text, images, and live video, allowing them to understand context from different sources and respond in the most appropriate format. This flexibility enables richer, more natural interactions where agents can see what users show them, read transcriptions of conversations, send text messages, and speak—all within a single session.

Modality options

Just as humans can see, hear, speak, and read, LiveKit agents can process and produce audio, text, images, and video. You can build agents that use a single modality or combine multiple modalities for richer, more flexible interactions.

| Modality | Description | Use cases |
| --- | --- | --- |
| Speech and audio | Process realtime audio input from users' microphones, with support for speech-to-text, turn detection, and interruptions. Generate speech output with TTS. | Voice assistants, call center automation, and voice-controlled applications. |
| Text and transcriptions | Handle text messages and transcriptions, enabling text-only sessions or hybrid voice and text interactions. Send text responses and transcriptions. | Chatbots, text-based customer support, and accessibility features for users who prefer typing. |
| Images and video | Process images and live video feeds for visual understanding. Send images to the frontend with byte streams, or add a virtual avatar for lifelike video output. | Visual assistants, avatar-based agents, screen sharing analysis, and image-based question answering. |
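As a minimal sketch of how these modalities come together, the following example starts an agent session that pairs speech input (STT with voice activity detection) with an LLM and speech output (TTS), using the LiveKit Agents Python SDK. The specific plugin choices here (Deepgram, OpenAI, Silero) are illustrative assumptions; any supported STT, LLM, and TTS provider can be substituted.

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: agents.JobContext):
    # Connect to the room, then start a session that combines
    # audio input, text reasoning, and audio output.
    await ctx.connect()

    session = AgentSession(
        stt=deepgram.STT(),     # speech-to-text for audio input
        llm=openai.LLM(),       # language model for generating responses
        tts=openai.TTS(),       # text-to-speech for audio output
        vad=silero.VAD.load(),  # voice activity detection for turn taking
    )

    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful multimodal assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

Because the session handles transcription and text streams alongside audio, the same agent can also receive typed messages and publish text transcripts without any change to this structure.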

In this section

Read more about each modality.