Vision Agent Quickstart

Build an AI-powered assistant that engages in realtime conversations with access to vision.

This quickstart walks you through building a Python AI application that has access to your camera. It uses LiveKit's Agents Framework to create a voice assistant that can hold realtime conversations with users and analyze images captured from your camera.

Prerequisites

Note

By default, the example agent uses Deepgram for STT and OpenAI for TTS and LLM. However, you aren't required to use these providers.

Step 1: Set up a LiveKit account and install the CLI

  1. Create an account or sign in to your LiveKit Cloud account.

  2. Install the LiveKit CLI and authenticate using lk cloud auth.

Step 2: Bootstrap an agent from template

The template provides a working voice assistant to build on. It includes:

  • Basic voice interaction
  • Audio-only track subscription
  • Voice activity detection (VAD)
  • Speech-to-text (STT)
  • Language model (LLM)
  • Text-to-speech (TTS)

  1. Clone the starter template for a simple Python voice agent:

    lk app create --template voice-pipeline-agent-python

  2. Enter your OpenAI API Key and Deepgram API Key when prompted. If you aren't using Deepgram and OpenAI, see Customizing plugins.

    Note

    If you want to use OpenAI for STT as well as TTS and LLM, you can change the stt plugin to openai.STT().

  3. Install dependencies and start your agent:

    cd <agent_dir>
    python3 -m venv venv
    source venv/bin/activate
    python3 -m pip install -r requirements.txt
    python3 agent.py dev

Step 3: Add video-related imports

At the top of your agent.py file, add these imports alongside the existing ones:

from livekit import rtc
from livekit.agents.llm import ChatMessage, ChatImage

These new imports include:

  • rtc: Access to LiveKit's video functionality
  • ChatMessage and ChatImage: Classes we'll use to send images to the LLM

Step 4: Enable video subscription

Find the ctx.connect() line in the entrypoint function. Change AutoSubscribe.AUDIO_ONLY to AutoSubscribe.SUBSCRIBE_ALL:

await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)

This enables the assistant to receive video tracks as well as audio.

Step 5: Add video frame handling

Add these two helper functions after your imports but before the prewarm function:

async def get_video_track(room: rtc.Room):
    """Find and return the first available remote video track in the room."""
    for participant_id, participant in room.remote_participants.items():
        for track_id, track_publication in participant.track_publications.items():
            if track_publication.track and isinstance(
                track_publication.track, rtc.RemoteVideoTrack
            ):
                logger.info(
                    f"Found video track {track_publication.track.sid} "
                    f"from participant {participant_id}"
                )
                return track_publication.track
    raise ValueError("No remote video track found in the room")

This function searches through all participants to find an available video track. It's used to locate the video feed to process.
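
Because get_video_track raises ValueError when no participant has published video yet, a short retry loop can help when the agent joins before the camera is live. The following is a minimal sketch under that assumption. The helper name retry_until_found and its attempts and delay parameters are hypothetical and not part of the LiveKit API; it accepts any zero-argument async callable.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def retry_until_found(
    find: Callable[[], Awaitable[T]],
    attempts: int = 5,
    delay: float = 0.5,
) -> T:
    """Call `find` until it stops raising ValueError or attempts run out.

    Hypothetical helper for illustration; not provided by LiveKit.
    """
    last_error: Optional[ValueError] = None
    for _ in range(attempts):
        try:
            return await find()
        except ValueError as e:
            # Nothing found yet: remember the error and wait before retrying.
            last_error = e
            await asyncio.sleep(delay)
    raise ValueError(f"Gave up after {attempts} attempts") from last_error
```

Inside get_latest_image, the lookup could then read video_track = await retry_until_found(lambda: get_video_track(room)). Keep attempts and delay small, since the lookup runs while the user waits for a spoken reply.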

Next, add the frame capture function:

async def get_latest_image(room: rtc.Room):
    """Capture and return a single frame from the video track."""
    video_stream = None
    try:
        video_track = await get_video_track(room)
        video_stream = rtc.VideoStream(video_track)
        async for event in video_stream:
            logger.debug("Captured latest video frame")
            return event.frame
    except Exception as e:
        logger.error(f"Failed to get latest image: {e}")
        return None
    finally:
        if video_stream:
            await video_stream.aclose()

This function captures a single frame from the video track and ensures proper cleanup of resources. Using aclose() releases system resources like memory buffers and video decoder instances, which helps prevent memory leaks.
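
The cleanup guarantee comes from try/finally: the finally block runs even when the function returns from inside the async for loop. The sketch below demonstrates that pattern in isolation. FakeVideoStream and first_frame are hypothetical stand-ins written for this illustration; they are not LiveKit classes and only mimic the iterate-then-aclose shape used above.

```python
import asyncio


class FakeVideoStream:
    """Stand-in async iterator for illustration; NOT the LiveKit VideoStream."""

    def __init__(self, frames):
        self._frames = list(frames)
        self.closed = False

    def __aiter__(self):
        return self

    async def __anext__(self):
        if not self._frames:
            raise StopAsyncIteration
        return self._frames.pop(0)

    async def aclose(self):
        self.closed = True


async def first_frame(stream):
    """Return the first item from the stream, always closing it afterwards."""
    try:
        async for frame in stream:
            return frame  # early return; the finally block still runs
        return None  # the stream ended without yielding a frame
    finally:
        await stream.aclose()
```

Running asyncio.run(first_frame(FakeVideoStream(["f1", "f2"]))) yields the first item and leaves the stream's closed flag set, which is the behavior get_latest_image relies on.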

Step 6: Add the LLM Callback

Inside the entrypoint function, add this callback, which injects the latest video frame just before the LLM generates a response:

async def before_llm_cb(assistant: VoicePipelineAgent, chat_ctx: llm.ChatContext):
    """
    Callback that runs right before the LLM generates a response.
    Captures the current video frame and adds it to the conversation context.
    """
    latest_image = await get_latest_image(ctx.room)
    if latest_image:
        image_content = [ChatImage(image=latest_image)]
        chat_ctx.messages.append(ChatMessage(role="user", content=image_content))
        logger.debug("Added latest frame to conversation context")

This callback is the key to efficient context management: it adds visual information only when the assistant is about to respond. If visual information were added to every message, it would quickly fill up the LLM's context window.
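
Even with one frame per response, image messages accumulate over a long conversation because the callback appends a new one each time. One possible mitigation, not part of the template, is to keep only the most recent few frames. The sketch below uses plain stand-in classes (Image and Message are hypothetical substitutes for ChatImage and ChatMessage) so it runs without LiveKit installed; trim_image_messages and its keep parameter are likewise assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass
class Image:
    """Stand-in for ChatImage (illustration only)."""
    image: Any


@dataclass
class Message:
    """Stand-in for ChatMessage (illustration only)."""
    role: str
    content: Union[str, List[Any]]


def has_image(msg: Message) -> bool:
    """True if the message content is a list containing at least one Image."""
    return isinstance(msg.content, list) and any(
        isinstance(part, Image) for part in msg.content
    )


def trim_image_messages(messages: List[Message], keep: int = 2) -> None:
    """Remove, in place, all but the last `keep` image-bearing messages."""
    image_idx = [i for i, m in enumerate(messages) if has_image(m)]
    # Indices of older image messages to drop; text messages are never dropped.
    drop = set(image_idx[:-keep]) if keep > 0 else set(image_idx)
    messages[:] = [m for i, m in enumerate(messages) if i not in drop]
```

A call such as trim_image_messages(chat_ctx.messages, keep=1) would go right after the callback appends the new frame. Keeping more than one frame is a trade-off: it costs more tokens but lets the model compare the current scene with an earlier one.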

Step 7: Update the system prompt

Find the initial_ctx creation in the entrypoint function and update it to include vision capabilities:

initial_ctx = llm.ChatContext().append(
    role="system",
    text=(
        "You are a voice assistant created by LiveKit that can both see and hear. "
        "You should use short and concise responses, avoiding unpronounceable punctuation. "
        "When you see an image in our conversation, naturally incorporate what you see "
        "into your response. Keep visual descriptions brief but informative."
    ),
)

Step 8: Update the assistant configuration

Find the VoicePipelineAgent creation in the entrypoint function and add the callback:

assistant = VoicePipelineAgent(
    vad=ctx.proc.userdata["vad"],
    stt=deepgram.STT(),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=openai.TTS(),
    chat_ctx=initial_ctx,
    before_llm_cb=before_llm_cb,
)

The key change here is the before_llm_cb parameter, which uses the callback created earlier to inject the latest video frame into the conversation context.

Testing your agent

  1. Start your assistant (if your agent is already running, skip this step):

    python3 agent.py dev

  2. Connect to the LiveKit room with a client that publishes both audio and video. The easiest way to do this is by using the Agents Playground.

  3. Once connected, try asking your agent questions like:

    • "What do you see right now?"
    • "Can you describe what's happening?"
    • "Has anything changed in the scene?"

How it works

With these changes, your assistant now:

  1. Connects to both audio and video streams.
  2. Listens for user speech as before.
  3. Just before generating each response:
    • Captures the current video frame.
    • Adds it to the conversation context.
    • Uses it to inform the response.
  4. Keeps the context clean by only adding frames when needed.
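
The ordering in step 3 can be illustrated with a toy model of a turn. This sketch is not how VoicePipelineAgent is implemented; ToyPipeline, on_user_turn, and add_frame are hypothetical names, and the context is reduced to a list of strings so that the point at which the hook runs is easy to see.

```python
import asyncio
from typing import Awaitable, Callable, List

Hook = Callable[[List[str]], Awaitable[None]]


class ToyPipeline:
    """Toy stand-in (NOT VoicePipelineAgent) showing when a before-LLM hook runs."""

    def __init__(self, before_llm: Hook):
        self.context: List[str] = []
        self._before_llm = before_llm

    async def on_user_turn(self, transcript: str) -> str:
        # 1. The user's speech has been transcribed and is stored.
        self.context.append(f"user: {transcript}")
        # 2. The hook runs once, just before the reply is generated.
        await self._before_llm(self.context)
        # 3. The reply is produced from the context, which now includes the frame.
        reply = f"reply based on {len(self.context)} context items"
        self.context.append(f"assistant: {reply}")
        return reply


async def add_frame(context: List[str]) -> None:
    """Plays the role of before_llm_cb: append one frame placeholder."""
    context.append("image: <latest frame>")
```

After asyncio.run(ToyPipeline(add_frame).on_user_turn("What do you see?")), the context holds the user text, then the frame, then the reply, which mirrors the order described in the list above.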

Next steps