This quickstart tutorial walks you through the steps to build an AI application using Python that has access to your camera. It uses LiveKit's Agents Framework to create an AI-powered voice assistant that can engage in realtime conversations with users, and analyze images captured from your camera.
Prerequisites
By default, the example agent uses Deepgram for STT and OpenAI for TTS and LLM. However, you aren't required to use these providers.
Step 1: Setup a LiveKit account and install the CLI
Create an account or sign in to your LiveKit Cloud account.
Install the LiveKit CLI and authenticate using
lk cloud auth
.
Step 2: Bootstrap an agent from template
The template provides a working voice assistant to build on. The template includes:
- Basic voice interaction
- Audio-only track subscription
- Voice activity detection (VAD)
- Speech-to-text (STT)
- Language model (LLM)
- Text-to-speech (TTS)
Clone the starter template for a simple Python voice agent:
lk app create --template voice-pipeline-agent-pythonEnter your OpenAI API Key and Deepgram API Key when prompted. If you aren't using Deepgram and OpenAI, see Customizing plugins.
NoteIf you want to use OpenAI for STT as well as TTS and LLM, you can change the
stt
plugin toopenai.STT()
.Install dependencies and start your agent:
cd <agent_dir>python3 -m venv venvsource venv/bin/activatepython3 -m pip install -r requirements.txtpython3 agent.py dev
Step 3: Add video-related imports
Add the video-related content to our agent. At the top of your agent.py
file, add these imports alongside the existing ones:
from livekit import rtcfrom livekit.agents.llm import ChatMessage, ChatImage
These new imports include:
rtc
: Access to LiveKit's video functionalityChatMessage
andChatImage
: Classes we'll use to send images to the LLM
Step 4: Enable video subscription
Find the ctx.connect()
line in the entrypoint
function. Change AutoSubscribe.AUDIO_ONLY
to AutoSubscribe.SUBSCRIBE_ALL
:
await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)
This enables the assistant to receive video tracks as well as audio.
Step 5: Add video frame handling
Add these two helper functions after your imports but before the prewarm
function:
async def get_video_track(room: rtc.Room):"""Find and return the first available remote video track in the room."""for participant_id, participant in room.remote_participants.items():for track_id, track_publication in participant.track_publications.items():if track_publication.track and isinstance(track_publication.track, rtc.RemoteVideoTrack):logger.info(f"Found video track {track_publication.track.sid} "f"from participant {participant_id}")return track_publication.trackraise ValueError("No remote video track found in the room")
This function searches through all participants to find an available video track. It's used to locate the video feed to process.
Next, add the frame capture function:
async def get_latest_image(room: rtc.Room):"""Capture and return a single frame from the video track."""video_stream = Nonetry:video_track = await get_video_track(room)video_stream = rtc.VideoStream(video_track)async for event in video_stream:logger.debug("Captured latest video frame")return event.frameexcept Exception as e:logger.error(f"Failed to get latest image: {e}")return Nonefinally:if video_stream:await video_stream.aclose()
This function captures a single frame from the video track and ensures proper cleanup of resources. Using aclose()
releases system resources like memory buffers and video decoder instances, which helps prevent memory leaks.
Step 6: Add the LLM Callback
Inside the entrypoint
function, add this callback function which will inject the latest video frame just before the LLM generates a response:
async def before_llm_cb(assistant: VoicePipelineAgent, chat_ctx: llm.ChatContext):"""Callback that runs right before the LLM generates a response.Captures the current video frame and adds it to the conversation context."""latest_image = await get_latest_image(ctx.room)if latest_image:image_content = [ChatImage(image=latest_image)]chat_ctx.messages.append(ChatMessage(role="user", content=image_content))logger.debug("Added latest frame to conversation context")
This callback is the key to efficient context management — it only adds visual information when the assistant is about to respond. If visual information was added to every message, it would quickly fill up the LLM's context window.
Step 7: Update the system prompt
Find the initial_ctx
creation in the entrypoint
function and update it to include vision capabilities:
initial_ctx = llm.ChatContext().append(role="system",text=("You are a voice assistant created by LiveKit that can both see and hear. ""You should use short and concise responses, avoiding unpronounceable punctuation. ""When you see an image in our conversation, naturally incorporate what you see ""into your response. Keep visual descriptions brief but informative."),)
Step 8: Update the assistant configuration
Find the VoicePipelineAgent
creation in the entrypoint
function and add the callback:
assistant = VoicePipelineAgent(vad=ctx.proc.userdata["vad"],stt=deepgram.STT(),llm=openai.LLM(model="gpt-4o-mini"),tts=openai.TTS(),chat_ctx=initial_ctx,before_llm_cb=before_llm_cb)
The key change here is the before_llm_cb
parameter, which uses the callback created earlier to inject the latest video frame into the conversation context.
Testing your agent
Start your assistant (if your agent is already running, skip this step):
python agent.py devConnect to the LiveKit room with a client that publishes both audio and video. The easiest way to do this is by using the Agents Playground.
Connect to the room, and try asking your agent some questions like:
- "What do you see right now?"
- "Can you describe what's happening?"
- "Has anything changed in the scene?"
How it works
With these changes, your assistant now:
- Connects to both audio and video streams.
- Listens for user speech as before.
- Just before generating each response:
- Captures the current video frame.
- Adds it to the conversation context.
- Uses it to inform the response.
- Keeps the context clean by only adding frames when needed.
Next steps
- For a list of additional plugins you can use, see Available LiveKit integrations.
- Let your friends and colleagues talk to your agent by connecting it to a LiveKit Sandbox.