Overview
This guide covers the full testing API for LiveKit Agents, including test setup, result navigation, assertions, mocking, and multi-turn conversation testing. The examples use pytest for Python and Vitest for Node.js, but are adaptable to other testing frameworks.
If you move your agent entrypoint file while restructuring your project to add tests, update your Dockerfile to match. The default template assumes src/agent.py for Python projects. See Builds and Dockerfiles for details.
Installation
Python: install both the pytest and pytest-asyncio packages to write tests for your agent.

    uv add pytest pytest-asyncio

Node.js: install vitest to write tests for your agent.

    pnpm add -D vitest

In Node.js, always call initializeLogger({ pretty: false, level: 'warn' }) at the top of your test files to suppress verbose CLI output.
Test setup
Each test typically follows the same pattern:
Python:

    import pytest
    from livekit.agents import AgentSession, inference

    from agent import Assistant  # Import your agent class


    @pytest.mark.asyncio  # Or your async testing framework of choice
    async def test_your_agent() -> None:
        async with (
            # You must create an LLM instance for the `judge` method
            inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
            # Create a session for the life of this test.
            # LLM is not required - it will use the agent's LLM if you don't provide one here
            AgentSession(llm=llm) as session,
        ):
            # Start the agent in the session
            await session.start(Assistant())

            # Run a single conversation turn based on the given user input
            result = await session.run(user_input="Hello")

            # ...your assertions go here...

Node.js:

    import { inference, initializeLogger, voice } from '@livekit/agents';
    import { describe, it, beforeAll, afterAll } from 'vitest';
    // Import your agent class
    import { Agent } from './agent';

    // Initialize logger to suppress CLI output
    initializeLogger({ pretty: false, level: 'warn' });

    const { AgentSession } = voice;

    describe('YourAgent', () => {
      let session: voice.AgentSession;
      let llm: inference.LLM;

      beforeAll(async () => {
        // You must create an LLM instance for the `judge` method
        llm = new inference.LLM({ model: 'openai/gpt-5.3-chat-latest' });

        // Create a session for the life of this test.
        // LLM is not required - it will use the agent's LLM if you don't provide one here
        session = new AgentSession({ llm });

        // Start the agent in the session
        await session.start({ agent: new Agent() });
      });

      afterAll(async () => {
        await session?.close();
      });

      it('should test your agent', async () => {
        // Run a single conversation turn based on the given user input
        const result = await session.run({ userInput: 'Hello' }).wait();

        // ...your assertions go here...
      });
    });
Result structure
The run method executes a single conversation turn and returns a RunResult, which contains each of the events that occurred during the turn, in order, and offers a fluent assertion API.
A simple turn with no tool calls produces a single event: the assistant's message.
However, a more complex turn may contain tool calls, tool outputs, handoffs, and one or more messages.
To validate these multi-part turns, you can use any of the following approaches.
Sequential navigation
- Step through events one at a time with next_event().
- Validate each event with is_* assertions like is_message().
- Call no_more_events() at the end to assert no unexpected events remain.
For example, to validate that the next event is a message from the assistant, you can use the following code:
Python:

    result.expect.next_event().is_message(role="assistant")

Node.js:

    result.expect.nextEvent().isMessage({ role: 'assistant' });
Skipping events
You can also skip events without validation:
- skip_next(n): Skip one or more events. Defaults to 1.
- skip_next_event_if(type, ...): Skip the next event only if it matches the given type and optional filters (for example, role for messages, name for function calls). Returns the matching Assert, or None if the next event doesn't match.
- next_event(type=...): Advance to the next event of the given type, skipping everything else. Raises an assertion error if no match is found.
Example:
Python:

    result.expect.skip_next()  # skips one event
    result.expect.skip_next(2)  # skips two events

    # Skips the next event if it's an assistant message
    result.expect.skip_next_event_if(type="message", role="assistant")

    # Skips the next event if it's a call to lookup_weather
    result.expect.skip_next_event_if(type="function_call", name="lookup_weather")

    # Advances to the next function call, skipping non-function-call events.
    # Raises an assertion error if not found.
    result.expect.next_event(type="function_call")

Node.js:

    result.expect.skipNext(); // skips one event
    result.expect.skipNext(2); // skips two events

    // Skips the next event if it's an assistant message
    result.expect.skipNextEventIf({ type: 'message', role: 'assistant' });

    // Advances to the next assistant message, skipping anything else.
    // If no matching event is found, an assertion error is raised.
    result.expect.nextEvent({ type: 'message', role: 'assistant' });
Passing a type to next_event() returns a type-specific Assert (for example, FunctionCallAssert) that doesn't have is_* methods. Don't chain .is_function_call() after next_event(type="function_call").
To assert additional properties like function name, either omit type and chain the is_* method, or check the event directly:
    # Option 1: chain is_function_call on a generic EventAssert
    result.expect.next_event().is_function_call(name="lookup_weather")

    # Option 2: advance to any function call, then check the name
    fnc = result.expect.next_event(type="function_call")
    assert fnc.event().item.name == "lookup_weather"
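The cursor behavior described in this section can be pictured with a small toy model. The sketch below is not the LiveKit implementation: ToyCursor and Event are hypothetical names, events are reduced to a type and a label, and raising when skip_next runs past the end is an assumption. It only mirrors the semantics stated above (skip, conditional skip, typed advance, and the final no-more-events check).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    type: str         # e.g. "message", "function_call"
    detail: str = ""  # role or tool name, for illustration only


class ToyCursor:
    """Hypothetical stand-in for result.expect; not the LiveKit class."""

    def __init__(self, events: List[Event]) -> None:
        self._events = events
        self._pos = 0

    def skip_next(self, n: int = 1) -> None:
        # Skip n events without validating them (raising past the end is an assumption).
        if self._pos + n > len(self._events):
            raise AssertionError("tried to skip past the last event")
        self._pos += n

    def skip_next_event_if(self, type: str) -> Optional[Event]:
        # Consume the next event only when its type matches; otherwise leave the cursor alone.
        if self._pos < len(self._events) and self._events[self._pos].type == type:
            event = self._events[self._pos]
            self._pos += 1
            return event
        return None

    def next_event(self, type: Optional[str] = None) -> Event:
        # Without a type, return the next event; with a type, skip ahead to the first match.
        while self._pos < len(self._events):
            event = self._events[self._pos]
            self._pos += 1
            if type is None or event.type == type:
                return event
        raise AssertionError(f"no remaining event of type {type!r}")

    def no_more_events(self) -> None:
        if self._pos < len(self._events):
            raise AssertionError("unexpected events remain")


cursor = ToyCursor([
    Event("message", "assistant"),
    Event("function_call", "lookup_weather"),
    Event("function_call_output"),
    Event("message", "assistant"),
])
skipped = cursor.skip_next_event_if("function_call")  # next event is a message: nothing consumed
call = cursor.next_event(type="function_call")        # skips the leading message
cursor.skip_next()                                    # skips the tool output
final = cursor.next_event()
cursor.no_more_events()                               # passes: every event was consumed
```

Because the cursor only moves forward, the order of calls matters: a typed next_event consumes everything it skips, so those events can't be asserted on afterwards through the cursor.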
Indexed access
Access a specific event by index without advancing the cursor. You can use negative indices to access events from the end of the list (for example, -1 for the last event).

Python:

    result.expect[0].is_message(role="assistant")

Node.js:

    result.expect.at(0).isMessage({ role: 'assistant' });
Search
Search for events regardless of position with contains_* methods like contains_message(). You can also search within a range using slices ([:] in Python, .range() in Node.js).
Python:

    result.expect.contains_message(role="assistant")
    result.expect[0:2].contains_message(role="assistant")

Node.js:

    result.expect.containsMessage({ role: 'assistant' });
    result.expect.range(0, 2).containsMessage({ role: 'assistant' });
Assertions
The test framework includes assertion helpers to validate messages, tool calls, and agent handoffs within each result. Use exact assertions like is_message() to check a specific event, or search assertions like contains_message() to find a match anywhere in a range of events.
Message assertions
Use is_message() and contains_message() to test individual messages. Both accept an optional role argument.
Python:

    result.expect.next_event().is_message(role="assistant")
    result.expect[0:2].contains_message(role="assistant")

Node.js:

    result.expect.nextEvent().isMessage({ role: 'assistant' });
    result.expect.range(0, 2).containsMessage({ role: 'assistant' });
Access additional properties with the event() method:
- event().item.content - Message content
- event().item.role - Message role
LLM-based judgment
Use judge() to evaluate whether a message matches a given intent. Pass an LLM instance and an intent string describing the expected content. The LLM judges the message against the intent without surrounding conversation context.
Python:

    result = await session.run(user_input="Hello")

    await (
        result.expect.next_event()
        .is_message(role="assistant")
        .judge(llm, intent="Offers a friendly introduction and offer of assistance.")
    )

Node.js:

    const result = await session.run({ userInput: 'Hello' }).wait();

    await result.expect
      .nextEvent()
      .isMessage({ role: 'assistant' })
      .judge(llm, {
        intent: 'Offers a friendly introduction and offer of assistance.',
      });
The llm argument can be any LLM instance and does not need to be the same one used in the agent itself.
Tool call assertions
Test three aspects of tool use:
- Function calls: The agent calls the correct tool with the correct arguments.
- Function call outputs: The tool returns the expected output.
- Agent response: The agent responds appropriately based on the tool output.
The following example tests all three:
Python:

    result = await session.run(user_input="What's the weather in Tokyo?")

    # Test that the agent's first conversation item is a function call
    fnc_call = result.expect.next_event().is_function_call(
        name="lookup_weather", arguments={"location": "Tokyo"}
    )

    # Test that the tool returned the expected output to the agent
    result.expect.next_event().is_function_call_output(
        output="sunny with a temperature of 70 degrees."
    )

    # Test that the agent's response is appropriate based on the tool output
    await (
        result.expect.next_event()
        .is_message(role="assistant")
        .judge(
            llm,
            intent="Informs the user that the weather in Tokyo is sunny with a temperature of 70 degrees.",
        )
    )

    # Verify the agent's turn is complete, with no additional messages or function calls
    result.expect.no_more_events()

Node.js:

    const result = await session.run({ userInput: "What's the weather in Tokyo?" }).wait();

    // Test that the agent's first conversation item is a function call
    result.expect.nextEvent().isFunctionCall({ name: 'getWeather', args: { location: 'Tokyo' } });

    // Test that the tool returned the expected output to the agent
    result.expect.nextEvent().isFunctionCallOutput();

    // Test that the agent's response is appropriate based on the tool output
    await result.expect
      .nextEvent()
      .isMessage({ role: 'assistant' })
      .judge(llm, {
        intent: 'Informs the user that the weather in Tokyo is sunny with a temperature of 70 degrees.',
      });

    // Verify the agent's turn is complete, with no additional messages or function calls
    result.expect.noMoreEvents();
Access individual properties with the event() method:
- is_function_call().event().item.name - Function name
- is_function_call().event().item.arguments - Function arguments
- is_function_call_output().event().item.output - Raw function output
- is_function_call_output().event().item.is_error - Whether the output is an error
- is_function_call_output().event().item.call_id - The function call ID
Agent handoff assertions
Use is_agent_handoff() and contains_agent_handoff() to test that the agent performs a handoff to a new agent.
Python:

    # The next event must be an agent handoff to the specified agent
    result.expect.next_event().is_agent_handoff(new_agent_type=MyAgent)

    # A handoff must occur somewhere in the turn
    result.expect.contains_agent_handoff(new_agent_type=MyAgent)

Node.js:

    // The next event must be an agent handoff to the specified agent
    result.expect.nextEvent().isAgentHandoff({ newAgentType: MyAgent });

    // A handoff must occur somewhere in the turn
    result.expect.containsAgentHandoff({ newAgentType: MyAgent });
Evaluating full conversations
The judge() method evaluates a single message against an intent. To evaluate the full conversation against multiple criteria at once, use JudgeGroup, which runs a list of judges concurrently against a ChatContext and returns an aggregate EvaluationResult.
Built-in judges
LiveKit Agents ships eight judges as factory functions in livekit.agents.evals. Each one returns an LLM-based judge with preset evaluation criteria:
- accuracy_judge: Verifies the agent grounds information in tool outputs. Catches hallucinations and contradictions.
- coherence_judge: Checks that responses follow a logical structure and don't jump between topics or contradict themselves.
- conciseness_judge: Catches unnecessary verbosity, repetition, or redundant detail.
- handoff_judge: Checks that the agent retains context across handoffs. Passes automatically when no handoffs occur, so it's safe to include in every test.
- relevancy_judge: Checks that responses stay on topic and address what the user asked.
- safety_judge: Catches unauthorized advice, improper disclosure, missed escalation, and harmful language.
- task_completion_judge: Checks if the agent completed its goal based on the latest agent instructions in the chat context.
- tool_use_judge: Checks tool selection, parameter accuracy, output handling, and error recovery.
Run a JudgeGroup in pytest
Run a multi-turn conversation, then evaluate it with JudgeGroup. The llm argument accepts either an LLM instance or a model string like "openai/gpt-4o-mini", which routes through LiveKit Inference.
    import pytest
    from livekit.agents import AgentSession, inference
    from livekit.agents.evals import (
        JudgeGroup,
        accuracy_judge,
        relevancy_judge,
        task_completion_judge,
        tool_use_judge,
    )

    from agent import Assistant


    @pytest.mark.asyncio
    async def test_assistant_conversation() -> None:
        async with (
            inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
            AgentSession(llm=llm) as session,
        ):
            await session.start(Assistant())

            await session.run(user_input="Hello")
            await session.run(user_input="What's the weather in Tokyo?")

            judges = JudgeGroup(
                llm="openai/gpt-4o-mini",
                # Pick the judges relevant for this test
                judges=[
                    task_completion_judge(),
                    accuracy_judge(),
                    tool_use_judge(),
                    relevancy_judge(),
                ],
            )

            result = await judges.evaluate(session.history)
            assert result.all_passed, f"Some judges failed: {result.judgments}"
Result properties
JudgeGroup.evaluate() returns an EvaluationResult with the following properties:
- score: Float from 0.0 to 1.0. Pass counts as 1, maybe as 0.5, fail as 0.
- all_passed: True only if every judge returned a pass verdict.
- any_passed: True if at least one judge passed.
- majority_passed: True if more than half of the judges passed.
- none_failed: True if no judge explicitly failed. Maybes are allowed.
- judgments: A dict keyed by judge name. Each value is a JudgmentResult with verdict ("pass", "fail", or "maybe"), reasoning, instructions, and the convenience properties passed, failed, and uncertain.
Use score and all_passed for assertions, and inspect judgments[name].reasoning to debug failures:
    result = await judges.evaluate(session.history)

    for name, judgment in result.judgments.items():
        print(f"{name}: {judgment.verdict} ({judgment.reasoning})")
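To make the aggregate properties concrete, here is a small worked example in plain Python. It is a sketch of the arithmetic only, not the library's code: summarize and WEIGHTS are hypothetical names, and it assumes score is the mean of the per-judge weights (1, 0.5, 0), which the property descriptions above suggest but don't state outright.

```python
from collections import Counter
from typing import Dict

# Weights as described for score: pass = 1, maybe = 0.5, fail = 0
WEIGHTS = {"pass": 1.0, "maybe": 0.5, "fail": 0.0}


def summarize(verdicts: Dict[str, str]) -> Dict[str, object]:
    """Aggregate per-judge verdicts (assumption: score is the mean weight)."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts.values())
    total = len(verdicts)
    return {
        "score": sum(WEIGHTS[v] for v in verdicts.values()) / total,
        "all_passed": counts["pass"] == total,
        "any_passed": counts["pass"] > 0,
        "majority_passed": counts["pass"] > total / 2,  # strictly more than half
        "none_failed": counts["fail"] == 0,             # maybes are allowed
    }


summary = summarize({
    "task_completion": "pass",
    "accuracy": "pass",
    "tool_use": "maybe",
    "relevancy": "fail",
})
print(summary)
```

With two passes, one maybe, and one fail, the mean weight is (1 + 1 + 0.5 + 0) / 4 = 0.625. Two of four judges passing is exactly half, so majority_passed is false under the "more than half" rule.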
Custom judges
For deterministic checks that don't need an LLM, subclass Judge and override evaluate:
    from livekit.agents.evals import Judge, JudgeGroup, JudgmentResult, accuracy_judge


    class CitationJudge(Judge):
        def __init__(self) -> None:
            super().__init__(name="citation")

        async def evaluate(self, *, chat_ctx, reference=None, llm=None) -> JudgmentResult:
            # Pass if any message in the conversation contains a citation marker
            has_citation = any(
                "[source]" in (item.text_content or "")
                for item in chat_ctx.items
                if item.type == "message"
            )
            return JudgmentResult(
                verdict="pass" if has_citation else "fail",
                reasoning="Found citation markers" if has_citation else "No citations found",
            )


    judges = JudgeGroup(
        llm="openai/gpt-4o-mini",
        judges=[accuracy_judge(), CitationJudge()],
    )
Subclassing Judge is the standard approach. As an escape hatch, any object that satisfies the Evaluator protocol can also be passed alongside the built-in judges: the protocol requires a name property and an async evaluate(*, chat_ctx, reference, llm) method that returns a JudgmentResult.
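The shape of that protocol can be sketched with standard-library typing alone. Everything below is a stand-in: ToyEvaluator and ToyJudgment are hypothetical names that mimic the described Evaluator and JudgmentResult, the real classes may carry more fields, and chat_ctx is a plain list here rather than a ChatContext.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class ToyJudgment:
    # Stand-in for JudgmentResult, limited to the two fields set by the custom judge above
    verdict: str
    reasoning: str


@runtime_checkable
class ToyEvaluator(Protocol):
    # Stand-in for the Evaluator protocol: a name property plus an async evaluate method
    @property
    def name(self) -> str: ...

    async def evaluate(
        self, *, chat_ctx: Any, reference: Any = None, llm: Any = None
    ) -> ToyJudgment: ...


class MaxItemsJudge:
    """Inherits from nothing; it only matches the protocol's shape."""

    def __init__(self, limit: int) -> None:
        self._limit = limit

    @property
    def name(self) -> str:
        return "max_items"

    async def evaluate(
        self, *, chat_ctx: Any, reference: Any = None, llm: Any = None
    ) -> ToyJudgment:
        count = len(chat_ctx)  # chat_ctx is a plain list in this sketch
        ok = count <= self._limit
        return ToyJudgment("pass" if ok else "fail", f"{count} items, limit {self._limit}")


judge = MaxItemsJudge(limit=3)
is_evaluator = isinstance(judge, ToyEvaluator)  # structural check, no inheritance needed
judgment = asyncio.run(judge.evaluate(chat_ctx=["hi", "hello", "bye"]))
```

The runtime isinstance check only confirms that the members exist, not that their signatures match, so a static type checker is the better guard for protocol conformance.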
When JudgeGroup.evaluate() runs inside a job context, such as an on_session_end callback in production, it tags the session with each judgment as lk.judge.<name>:<verdict> so the results surface in LiveKit Cloud. In a pytest environment there's no job context, so tagging silently no-ops. The same JudgeGroup works in both places. For the production wiring, see the front-desk example.
Mocking tools
In many cases, you should mock your tools for testing. Mocking makes it easy to test edge cases, such as errors or other unexpected behavior, and removes dependencies on external services that you don't need to test against.
mock_tools requires LiveKit Agents 1.2.6 or later.
Use the mock_tools helper in a with block to mock one or more tools for a specific Agent. To mock a tool that raises an error:
    from livekit.agents import mock_tools

    # Mock a tool error
    with mock_tools(
        Assistant,
        {"lookup_weather": lambda: RuntimeError("Weather service is unavailable")},
    ):
        result = await session.run(user_input="What's the weather in Tokyo?")

        await result.expect.next_event(type="message").judge(
            llm, intent="Should inform the user that an error occurred while looking up the weather."
        )
Mock function signatures
The mock function receives only the parameters it declares. Tool arguments are matched against the mock's signature, and anything not declared, including self and RunContext, is dropped. That's why the error mock above takes no arguments even though lookup_weather accepts location. The unused argument is trimmed away.
For more complex mocks, pass a function instead of a lambda:
    def _mock_weather_tool(location: str) -> str:
        if location == "Tokyo":
            return "sunny with a temperature of 70 degrees."
        else:
            return "UNSUPPORTED_LOCATION"


    # Mock a specific tool response
    with mock_tools(Assistant, {"lookup_weather": _mock_weather_tool}):
        result = await session.run(user_input="What's the weather in Tokyo?")

        await result.expect.next_event(type="message").judge(
            llm,
            intent="Should indicate the weather in Tokyo is sunny with a temperature of 70 degrees.",
        )

        result = await session.run(user_input="What's the weather in Paris?")

        await result.expect.next_event(type="message").judge(
            llm,
            intent="Should indicate that weather lookups in Paris are not supported.",
        )
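The signature matching described under "Mock function signatures" can be illustrated with the standard inspect module. This is a sketch of the idea under stated assumptions, not how mock_tools is implemented: call_with_declared_args is a hypothetical helper, and the "context" key simply stands in for an undeclared argument such as RunContext.

```python
import inspect
from typing import Any, Callable, Dict


def call_with_declared_args(mock: Callable[..., Any], tool_args: Dict[str, Any]) -> Any:
    """Call mock with only the keyword arguments its signature declares."""
    declared = inspect.signature(mock).parameters
    accepted = {name: value for name, value in tool_args.items() if name in declared}
    return mock(**accepted)


def mock_weather(location: str) -> str:
    return f"sunny in {location}"


# "context" stands in for RunContext; neither mock below declares it, so it is dropped
tool_args = {"location": "Tokyo", "context": object()}

with_location = call_with_declared_args(mock_weather, tool_args)
no_args = call_with_declared_args(lambda: "service unavailable", tool_args)
```

The zero-argument lambda works for the same reason the error mock in this section does: with nothing declared, every tool argument is filtered out before the call.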
Testing multiple turns
You can test multiple turns of a conversation by executing the run method multiple times. The conversation history builds automatically across turns.
Python:

    # First turn
    result1 = await session.run(user_input="Hello")
    await result1.expect.next_event().is_message(role="assistant").judge(
        llm, intent="Friendly greeting"
    )

    # Second turn builds on conversation history
    result2 = await session.run(user_input="What's the weather like in Tokyo?")
    result2.expect.next_event().is_function_call(name="lookup_weather")
    result2.expect.next_event().is_function_call_output()
    await result2.expect.next_event().is_message(role="assistant").judge(
        llm, intent="Provides weather information"
    )

Node.js:

    // First turn
    const result1 = await session.run({ userInput: 'Hello' }).wait();
    await result1.expect
      .nextEvent()
      .isMessage({ role: 'assistant' })
      .judge(llm, {
        intent: 'Friendly greeting',
      });

    // Second turn builds on conversation history
    const result2 = await session.run({ userInput: "What's the weather like in Tokyo?" }).wait();
    result2.expect.nextEvent().isFunctionCall({ name: 'getWeather' });
    result2.expect.nextEvent().isFunctionCallOutput();
    await result2.expect
      .nextEvent()
      .isMessage({ role: 'assistant' })
      .judge(llm, {
        intent: 'Provides weather information',
      });
Loading conversation history
To load conversation history manually, use the ChatContext class just as in your agent code:
Python:

    from livekit.agents import ChatContext

    agent = Assistant()
    await session.start(agent)

    # update_chat_ctx is on the Agent instance, not the session.
    # In tests where you don't hold a reference, use session.current_agent.
    chat_ctx = ChatContext()
    chat_ctx.add_message(role="user", content="My name is Alice")
    chat_ctx.add_message(role="assistant", content="Nice to meet you, Alice!")
    await agent.update_chat_ctx(chat_ctx)

    # Test that the agent remembers the context
    result = await session.run(user_input="What's my name?")
    await result.expect.next_event().is_message(role="assistant").judge(
        llm, intent="Should remember and mention the user's name is Alice"
    )

Node.js:

    // Alias the `llm` namespace so it doesn't collide with the `llm` instance passed to judge()
    import { llm as llmLib } from '@livekit/agents';

    const { ChatContext } = llmLib;

    const agent = new Assistant();
    await session.start({ agent });

    // updateChatCtx is on the Agent instance, not the session.
    // In tests where you don't hold a reference, use session.currentAgent.
    const chatCtx = new ChatContext();
    chatCtx.addMessage({ role: 'user', content: 'My name is Alice' });
    chatCtx.addMessage({ role: 'assistant', content: 'Nice to meet you, Alice!' });
    await agent.updateChatCtx(chatCtx);

    // Test that the agent remembers the context
    const result = await session.run({ userInput: "What's my name?" }).wait();
    await result.expect
      .nextEvent()
      .isMessage({ role: 'assistant' })
      .judge(llm, {
        intent: "Should remember and mention the user's name is Alice",
      });