Overview
LiveKit Agents supports multimodality, enabling your agents to communicate through multiple channels simultaneously. Agents can process and generate speech, text, images, and live video, allowing them to understand context from different sources and respond in the most appropriate format. This flexibility enables richer, more natural interactions where agents can see what users show them, read transcriptions of conversations, send text messages, and speak—all within a single session.
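For example, a single session can be wired for speech input, spoken replies, and text transcriptions at once. The following is a minimal sketch using the Python SDK's `AgentSession`, assuming the Deepgram, OpenAI, Cartesia, and Silero plugins are installed; swap in whichever providers your project uses.

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    # One session carries user audio, agent speech, and text transcriptions.
    session = AgentSession(
        stt=deepgram.STT(),                   # speech-to-text for user audio
        llm=openai.LLM(model="gpt-4o-mini"),  # reasoning over the conversation
        tts=cartesia.TTS(),                   # text-to-speech for agent replies
        vad=silero.VAD.load(),                # voice activity detection for turn-taking
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful multimodal assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```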
Modality options
Just as humans can see, hear, speak, and read, LiveKit agents can work with vision, audio, text, and transcriptions. LiveKit Agents supports three main modalities: speech and audio, text and transcriptions, and vision. You can build agents that use a single modality or combine several for richer, more flexible interactions. A short sketch of mixing voice and text in one session follows the table below.
| Modality | Description | Use cases |
|---|---|---|
| Speech and audio | Process realtime audio input from users' microphones, with support for speech-to-text, turn detection, and interruptions. | Voice assistants, call center automation, and voice-controlled applications. |
| Text and transcriptions | Handle text messages and transcriptions, enabling text-only sessions or hybrid voice and text interactions. | Chatbots, text-based customer support, and accessibility features for users who prefer typing. |
| Vision | Process images and live video feeds, enabling visual understanding and multimodal AI experiences. | Visual assistants that can see what users show them, screen sharing analysis, and image-based question answering. |
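The same session can accept and emit text alongside speech. The snippet below is a hedged sketch, assuming the `say()` and `generate_reply()` methods of the Python `AgentSession` API; it would run inside the entrypoint above, after `session.start()`.

```python
# Greet the user out loud; a synchronized text transcription is published
# alongside the audio so text-only clients can follow the conversation.
session.say("Hi! You can speak to me or send me a text message.")

# Feed a typed user message into the conversation; the agent's reply is
# delivered both as synthesized speech and as text.
session.generate_reply(user_input="Can you repeat that as a text summary?")
```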
In this section
Read more about each modality.