Overview
LiveKit Agents supports multimodality, enabling your agents to communicate through multiple channels simultaneously. Agents can process and generate speech, text, images, and live video, drawing context from multiple sources and responding in whichever format fits best. This flexibility enables richer, more natural interactions in which an agent can see what a user shows it, read conversation transcriptions, send text messages, and speak, all within a single session.
Modality options
Just as humans can see, hear, speak, and read, LiveKit agents can process and produce audio, text, images, and video. You can build agents that use a single modality or combine several for more flexible interactions.
| Modality | Description | Use cases |
|---|---|---|
| Speech and audio | Process realtime audio input from users' microphones, with support for speech-to-text, turn detection, and interruptions. Generate speech output with TTS. | Voice assistants, call center automation, and voice-controlled applications. |
| Text and transcriptions | Handle text messages and transcriptions, enabling text-only sessions or hybrid voice and text interactions. Send text responses and transcriptions. | Chatbots, text-based customer support, and accessibility features for users who prefer typing. |
| Images and video | Process images and live video feeds for visual understanding. Send images to the frontend with byte streams, or add a virtual avatar for lifelike video output. | Visual assistants, avatar-based agents, screen sharing analysis, and image-based question answering. |
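The hybrid voice-and-text pattern in the table can be sketched in a few lines. The snippet below is an illustrative, self-contained sketch only, not the LiveKit Agents API: the `Message` class and `route_reply` function are hypothetical names chosen for this example, standing in for the session plumbing a real agent would use.

```python
from dataclasses import dataclass

# Hypothetical container for illustration; a real LiveKit agent receives
# audio frames and text through its session, not through a class like this.
@dataclass
class Message:
    modality: str  # "audio", "text", or "image"
    content: str

def route_reply(incoming: Message) -> str:
    """Choose an output modality that mirrors how the user reached the agent.

    Illustrates the hybrid pattern from the table above: voice input gets a
    spoken (TTS) reply, while text and image input get text replies.
    """
    if incoming.modality == "audio":
        return "tts"   # synthesize speech for voice users
    return "text"      # typed messages and images are answered in text

# Usage: a spoken question is answered with speech, a typed one with text.
print(route_reply(Message("audio", "what's the weather?")))  # tts
print(route_reply(Message("text", "hello")))                 # text
```

In a real agent this decision is handled by the session configuration rather than hand-written routing; the sketch only makes the modality mapping explicit.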
In this section
Read more about each modality.