Overview
If your LLM includes support for vision, you can add images to its chat context to leverage its full capabilities. LiveKit has tools for adding raw images loaded from disk, fetched from the network, or uploaded directly from your frontend app. Additionally, you can use live video, either by sampling frames in an STT-LLM-TTS pipeline or with true video input to a realtime model such as Gemini Live.
This guide includes an overview of the vision features and code samples for each use case.
Images
The agent's chat context supports images as well as text. You can add as many images as you want to the chat context, but keep in mind that a larger context window leads to slower response times.
To add an image to the chat context, create an `ImageContent` object and include it in a chat message. The image content can be a base64 data URL, an external URL, or a frame from a video track.
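As a quick illustration, here's a minimal sketch of each of these forms (assuming `ImageContent` is imported from `livekit.agents.llm`; the URLs and file name are placeholders):

```python
import base64

from livekit.agents.llm import ImageContent

# 1. An external URL (provider support varies)
from_url = ImageContent(image="https://example.com/photo.jpg")

# 2. A base64 data URL built from raw bytes, such as a file read from disk
with open("photo.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
from_bytes = ImageContent(image=f"data:image/png;base64,{encoded}")

# 3. A frame captured from a video track (an rtc.VideoFrame object)
# from_frame = ImageContent(image=latest_video_frame)
```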
Load into initial context
The following example shows an agent initialized with an image at startup. This example uses an external URL, but you can modify it to load a local file as a base64 data URL instead:
```python
async def entrypoint(ctx: JobContext):
    # ctx.connect, etc.

    session = AgentSession(
        # ... stt, tts, llm, etc.
    )

    initial_ctx = ChatContext()
    initial_ctx.add_message(
        role="user",
        content=[
            "Here is a picture of me",
            ImageContent(image="https://example.com/image.jpg"),
        ],
    )

    await session.start(
        room=ctx.room,
        agent=Agent(
            chat_ctx=initial_ctx,
        ),
        # ... room_input_options, etc.
    )
```
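For example, to seed the context from a local file instead, one approach (a sketch; the path and MIME type are placeholders) is to build the data URL yourself:

```python
import base64
from pathlib import Path

image_bytes = Path("assets/profile.png").read_bytes()
data_url = f"data:image/png;base64,{base64.b64encode(image_bytes).decode('utf-8')}"

initial_ctx.add_message(
    role="user",
    content=["Here is a picture of me", ImageContent(image=data_url)],
)
```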
Not every provider supports external image URLs. Consult their documentation for details.
Upload from frontend
To upload an image from your frontend app, use the `sendFile` method of the LiveKit SDK. Add a byte stream handler to your agent to receive the image data and add it to the chat context. Here is a simple agent capable of receiving images from the user on the byte stream topic `"images"`:
```python
class Assistant(Agent):
    def __init__(self) -> None:
        self._tasks = []  # Prevent garbage collection of running tasks
        super().__init__(instructions="You are a helpful voice AI assistant.")

    async def on_enter(self):
        def _image_received_handler(reader, participant_identity):
            task = asyncio.create_task(
                self._image_received(reader, participant_identity)
            )
            self._tasks.append(task)
            task.add_done_callback(lambda t: self._tasks.remove(t))

        # Add the handler when the agent joins
        get_job_context().room.register_byte_stream_handler("images", _image_received_handler)

    async def _image_received(self, reader, participant_identity):
        image_bytes = bytes()
        async for chunk in reader:
            image_bytes += chunk

        chat_ctx = self.chat_ctx.copy()

        # Encode the image to base64 and add it to the chat context
        chat_ctx.add_message(
            role="user",
            content=[
                ImageContent(
                    image=f"data:image/png;base64,{base64.b64encode(image_bytes).decode('utf-8')}"
                )
            ],
        )
        await self.update_chat_ctx(chat_ctx)
```
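On the sending side, frontend SDKs expose `sendFile`. If you want to exercise the handler from another Python participant instead, a rough sketch (assuming the Python SDK's `local_participant.send_file` helper; check the byte streams documentation for the exact signature) might look like this:

```python
# Hypothetical test sender; assumes rtc.LocalParticipant.send_file with this signature.
await room.local_participant.send_file(
    file_path="photo.png",
    topic="images",  # must match the topic the agent registered above
)
```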
Sample video frames
LLMs can process video in the form of still images, but many models aren't trained for this use case and may struggle to interpret motion and other changes across a video feed. Realtime models, such as Gemini Live, are trained on video, and you can enable live video input for automatic support.

If you're using an STT-LLM-TTS pipeline, you can still work with video by sampling the video track at suitable times. For instance, in the following example the agent includes the latest video frame with each user conversation turn. This gives the model additional context without overwhelming it with data or expecting it to interpret many sequential frames at a time:
```python
class Assistant(Agent):
    def __init__(self) -> None:
        self._latest_frame = None
        self._video_stream = None
        self._tasks = []
        super().__init__(instructions="You are a helpful voice AI assistant.")

    async def on_enter(self):
        room = get_job_context().room

        # Find the first video track (if any) from the remote participant
        remote_participant = list(room.remote_participants.values())[0]
        video_tracks = [
            publication.track
            for publication in list(remote_participant.track_publications.values())
            if publication.track.kind == rtc.TrackKind.KIND_VIDEO
        ]
        if video_tracks:
            self._create_video_stream(video_tracks[0])

        # Watch for new video tracks not yet published
        @room.on("track_subscribed")
        def on_track_subscribed(track: rtc.Track, publication: rtc.RemoteTrackPublication, participant: rtc.RemoteParticipant):
            if track.kind == rtc.TrackKind.KIND_VIDEO:
                self._create_video_stream(track)

    async def on_user_turn_completed(self, turn_ctx: ChatContext, new_message: ChatMessage) -> None:
        # Add the latest video frame, if any, to the new message
        if self._latest_frame:
            new_message.content.append(ImageContent(image=self._latest_frame))
            self._latest_frame = None

    # Helper method to buffer the latest video frame from the user's track
    def _create_video_stream(self, track: rtc.Track):
        # Close any existing stream (we only want one at a time)
        if self._video_stream is not None:
            self._video_stream.close()

        # Create a new stream to receive frames
        self._video_stream = rtc.VideoStream(track)

        async def read_stream():
            async for event in self._video_stream:
                # Store the latest frame for use later
                self._latest_frame = event.frame

        # Store the async task
        task = asyncio.create_task(read_stream())
        task.add_done_callback(lambda t: self._tasks.remove(t))
        self._tasks.append(task)
```
Video frame encoding
By default, `ImageContent` encodes video frames as JPEG at their native size. To adjust the size of the encoded frames, set the `inference_width` and `inference_height` parameters. Each frame is resized to fit within the provided dimensions while maintaining the original aspect ratio. For more control, use the `encode` method of the `livekit.agents.utils.images` module and pass the result as a data URL:
```python
import base64

from livekit.agents.utils.images import EncodeOptions, ResizeOptions, encode

image_bytes = encode(
    event.frame,
    EncodeOptions(
        format="PNG",
        resize_options=ResizeOptions(
            width=512,
            height=512,
            strategy="scale_aspect_fit",
        ),
    ),
)
image_content = ImageContent(
    image=f"data:image/png;base64,{base64.b64encode(image_bytes).decode('utf-8')}"
)
```
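If the default JPEG encoding is sufficient and you only need resizing, a minimal sketch using the `inference_width` and `inference_height` parameters described above:

```python
image_content = ImageContent(
    image=event.frame,     # an rtc.VideoFrame
    inference_width=512,   # the frame is scaled to fit within 512x512
    inference_height=512,  # while preserving its aspect ratio
)
```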
Inference detail
If your LLM provider supports it, you can set the `inference_detail` parameter to `"high"` or `"low"` to control the token usage and inference quality applied. The default is `"auto"`, which uses the provider's default.
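For example, a sketch requesting lower-cost inference (only meaningful with providers that support detail levels):

```python
image_content = ImageContent(
    image="https://example.com/image.jpg",
    inference_detail="low",  # or "high"; defaults to "auto"
)
```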
Live video
Live video input requires a realtime model with video support. Currently, only Gemini Live has this feature.
Set the `video_enabled` parameter to `True` in `RoomInputOptions` to enable live video input. Your agent automatically receives frames from the user's camera or screen sharing tracks, if available. Only the single most recently published video track is used.
By default, the agent samples one frame per second while the user is speaking and one frame every three seconds otherwise. Each frame is resized to fit within 1024x1024 and encoded as JPEG. To override the frame rate, set `video_sampler` on the `AgentSession` to a custom instance.
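As a sketch of overriding the sampling rates, assuming the built-in voice-activity sampler is exported as `VoiceActivityVideoSampler` from `livekit.agents.voice.room_io` (verify the class name and import path for your SDK version):

```python
from livekit.agents import AgentSession
from livekit.agents.voice.room_io import VoiceActivityVideoSampler  # assumed import path

session = AgentSession(
    # ... llm, etc.
    video_sampler=VoiceActivityVideoSampler(
        speaking_fps=2.0,  # sample more often while the user is speaking
        silent_fps=0.5,    # sample less often otherwise
    ),
)
```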
Video input is passive and has no effect on turn detection. To leverage live video input in a non-conversational context, use manual turn control and trigger LLM responses or tool calls on a timer or other schedule.
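For example, a non-conversational sketch that asks the model to describe the scene on a fixed schedule (assuming manual turn detection is enabled with `turn_detection="manual"` on `AgentSession` and using `session.generate_reply`; adapt to your setup):

```python
import asyncio

session = AgentSession(
    turn_detection="manual",  # assumed manual turn control setting
    # ... llm, etc.
)

async def describe_scene_periodically() -> None:
    # With manual turn control, the model only responds when explicitly asked.
    while True:
        await asyncio.sleep(30)
        session.generate_reply(
            instructions="Briefly describe what you currently see in the video."
        )

# After session.start(...), schedule the periodic task:
# asyncio.create_task(describe_scene_periodically())
```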
The following example shows how to add Gemini Live vision to your voice AI quickstart agent:
```python
class VideoAssistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You are a helpful voice assistant with live video input from your user.",
            llm=google.beta.realtime.RealtimeModel(
                voice="Puck",
                temperature=0.8,
            ),
        )


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    session = AgentSession()

    await session.start(
        agent=VideoAssistant(),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            video_enabled=True,
            # ... noise_cancellation, etc.
        ),
    )
```
Additional resources
The following documentation and examples can help you get started with vision in LiveKit Agents.
Voice AI quickstart
Use the quickstart as a starting base for adding vision code.
Byte streams
Send images from your frontend to your agent with byte streams.
RoomIO
Learn more about RoomIO and how it manages tracks.
Vision Assistant
Camera and microphone
Publish camera and microphone tracks from your frontend.
Screen sharing
Publish screen sharing tracks from your frontend.