Test framework

Overview

This guide covers the full testing API for LiveKit Agents, including test setup, result navigation, assertions, mocking, and multi-turn conversation testing. The examples use pytest for Python and Vitest for Node.js, but are adaptable to other testing frameworks.

Project structure and deployment

If you move your agent entrypoint file while restructuring your project to add tests, update your Dockerfile to match. The default template assumes src/agent.py for Python projects. See Builds and Dockerfiles for details.

Installation

For Python, you must install both the pytest and pytest-asyncio packages to write tests for your agent.

uv add pytest pytest-asyncio

For Node.js, you must install vitest to write tests for your agent.

pnpm add -D vitest

Suppress CLI output

In Node.js, always call initializeLogger({ pretty: false, level: 'warn' }) at the top of your test files to suppress verbose CLI output.

Test setup

Each test typically follows the same pattern:

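import pytest
from livekit.agents import AgentSession, inference

# Import your agent class
from agent import Assistant

@pytest.mark.asyncio # Or your async testing framework of choice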
async def test_your_agent() -> None:
    async with (
        # You must create an LLM instance for the `judge` method
        inference.LLM(model="google/gemma-4-31b-it") as llm,

        # Create a session for the life of this test.
        # LLM is not required - it will use the agent's LLM if you don't provide one here
        AgentSession(llm=llm) as session,
    ):
        # Start the agent in the session
        await session.start(Assistant())

        # Run a single conversation turn based on the given user input
        result = await session.run(user_input="Hello")

        # ...your assertions go here...

import { inference, initializeLogger, voice } from '@livekit/agents';
import { describe, it, beforeAll, afterAll } from 'vitest';
// Import your agent class
import { Agent } from './agent';

// Initialize logger to suppress CLI output
initializeLogger({ pretty: false, level: 'warn' });

const { AgentSession } = voice;

describe('YourAgent', () => {
  let session: voice.AgentSession;
  let llm: inference.LLM;

  beforeAll(async () => {
    // You must create an LLM instance for the `judge` method
    llm = new inference.LLM({ model: 'google/gemma-4-31b-it' });

    // Create a session for the life of this test.
    // LLM is not required - it will use the agent's LLM if you don't provide one here
    session = new AgentSession({ llm });

    // Start the agent in the session
    await session.start({ agent: new Agent() });
  });

  afterAll(async () => {
    await session?.close();
  });

  it('should test your agent', async () => {
    // Run a single conversation turn based on the given user input
    const result = await session.run({ userInput: 'Hello' }).wait();

    // ...your assertions go here...
  });
});

Result structure

The run method executes a single conversation turn and returns a RunResult, which contains each of the events that occurred during the turn, in order, and offers a fluent assertion API.

A simple turn with no tool calls produces a single event, the assistant's message.

However, a more complex turn may contain tool calls, tool outputs, handoffs, and one or more messages.

To validate these multi-part turns, you can step through events in order, skip events, access them by index, or search across them. To step through events in order:

Step through events one at a time with next_event().
Validate each event with is_* assertions like is_message().
Call no_more_events() at the end to assert no unexpected events remain.

For example, to validate that the next event in the turn is a message from the assistant, you can use the following code:

result.expect.next_event().is_message(role="assistant")

result.expect.nextEvent().isMessage({ role: 'assistant' });

Skipping events

You can also skip events without validation:

skip_next(n): Skip one or more events. Defaults to 1.
skip_next_event_if(type, ...): Skip the next event only if it matches the given type and optional filters (for example, role for messages, name for function calls). Returns the matching Assert, or None if the next event doesn't match.
next_event(type=...): Advance to the next event of the given type, skipping everything else. Raises an assertion error if no match is found.

Example:

result.expect.skip_next() # skips one event
result.expect.skip_next(2) # skips two events
result.expect.skip_next_event_if(type="message", role="assistant") # Skips the next event if it's an assistant message
result.expect.skip_next_event_if(type="function_call", name="lookup_weather") # Skips the next event if it's a call to lookup_weather

result.expect.next_event(type="function_call") # Advances to the next function call, skipping non-function-call events. Raises an assertion error if not found.

result.expect.skipNext(); // skips one event
result.expect.skipNext(2); // skips two events
result.expect.skipNextEventIf({ type: 'message', role: 'assistant' }); // Skips the next event if it's an assistant message

result.expect.nextEvent({ type: 'message', role: 'assistant' }); // Advances to the next assistant message, skipping anything else. If no matching event is found, an assertion error is raised.

Return types for next_event(type=...)

Passing a type to next_event() returns a type-specific Assert (for example, FunctionCallAssert) that doesn't have is_* methods. Don't chain .is_function_call() after next_event(type="function_call").

To assert additional properties like function name, either omit type and chain the is_* method, or check the event directly:

# Option 1: chain is_function_call on a generic EventAssert
result.expect.next_event().is_function_call(name="lookup_weather")

# Option 2: advance to any function call, then check the name
fnc = result.expect.next_event(type="function_call")
assert fnc.event().item.name == "lookup_weather"

Indexed access

Access a specific event by index without advancing the cursor. You can use negative indices to access events from the end of the list, for example, -1 for the last event.

result.expect[0].is_message(role="assistant")

result.expect.at(0).isMessage({ role: 'assistant' });

Search

Search for events regardless of position with contains_* methods like contains_message(). You can also search within a range using slices ([:] in Python, .range() in Node.js).

result.expect.contains_message(role="assistant")
result.expect[0:2].contains_message(role="assistant")

result.expect.containsMessage({ role: 'assistant' });
result.expect.range(0, 2).containsMessage({ role: 'assistant' });
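
The navigation styles above differ mainly in whether they move the cursor. The following toy model is not LiveKit's implementation: it uses a hypothetical EventCursor class over plain strings, purely to sketch the semantics described in this section. Sequential reads and skips advance the cursor, while indexed access and search leave it where it is.

```python
from __future__ import annotations


class EventCursor:
    """Toy stand-in for the expect API: a list of event types plus a cursor."""

    def __init__(self, events: list[str]) -> None:
        self._events = events
        self._pos = 0

    def next_event(self, type: str | None = None) -> str:
        # Advance to the next event; with a type, skip events until one matches.
        while self._pos < len(self._events):
            event = self._events[self._pos]
            self._pos += 1
            if type is None or event == type:
                return event
        raise AssertionError(f"no more events (wanted type={type!r})")

    def skip_next(self, n: int = 1) -> None:
        # Skip n events without validating them.
        if self._pos + n > len(self._events):
            raise AssertionError("tried to skip past the last event")
        self._pos += n

    def __getitem__(self, index: int | slice) -> str | list[str]:
        # Indexed and sliced access never move the cursor.
        return self._events[index]

    def contains(self, type: str) -> bool:
        # Search ignores the cursor position entirely.
        return type in self._events

    def no_more_events(self) -> None:
        if self._pos < len(self._events):
            raise AssertionError(f"unexpected events: {self._events[self._pos:]}")


events = EventCursor(["function_call", "function_call_output", "message"])

assert events[-1] == "message"  # peek at the last event; the cursor stays at 0
assert events.contains("function_call_output")
assert events.next_event() == "function_call"
events.skip_next()  # skip the tool output
assert events.next_event(type="message") == "message"
events.no_more_events()  # passes: every event was consumed
```

One consequence worth noting: because next_event(type=...) consumes everything it skips, an earlier event can no longer be reached sequentially afterwards, only by index or search.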

Assertions

The test framework includes assertion helpers to validate messages, tool calls, and agent handoffs within each result. Use exact assertions like is_message() to check a specific event, or search assertions like contains_message() to find a match anywhere in a range of events.

Message assertions

Use is_message() and contains_message() to test individual messages. Both accept an optional role argument.

result.expect.next_event().is_message(role="assistant")
result.expect[0:2].contains_message(role="assistant")

result.expect.nextEvent().isMessage({ role: 'assistant' });
result.expect.range(0, 2).containsMessage({ role: 'assistant' });

Access additional properties with the event() method:

event().item.content - Message content
event().item.role - Message role

LLM-based judgment

Use judge() to evaluate whether a message matches a given intent. Pass an LLM instance and an intent string describing the expected content. The LLM judges the message against the intent without surrounding conversation context.

result = await session.run(user_input="Hello")

await (
    result.expect.next_event().is_message(role="assistant")
    .judge(
        llm, intent="Offers a friendly introduction and offer of assistance."
    )
)

const result = await session.run({ userInput: 'Hello' }).wait();

await result.expect
  .nextEvent()
  .isMessage({ role: 'assistant' })
  .judge(llm, {
    intent: 'Offers a friendly introduction and offer of assistance.',
  });

The llm argument can be any LLM instance and does not need to be the same one used in the agent itself.

Tool call assertions

Test three aspects of tool use:

Function calls: The agent calls the correct tool with the correct arguments.
Function call outputs: The tool returns the expected output.
Agent response: The agent responds appropriately based on the tool output.

The following example tests all three:

result = await session.run(user_input="What's the weather in Tokyo?")

# Test that the agent's first conversation item is a function call
fnc_call = result.expect.next_event().is_function_call(name="lookup_weather", arguments={"location": "Tokyo"})

# Test that the tool returned the expected output to the agent
result.expect.next_event().is_function_call_output(output="sunny with a temperature of 70 degrees.")

# Test that the agent's response is appropriate based on the tool output
await (
    result.expect.next_event()
    .is_message(role="assistant")
    .judge(
        llm,
        intent="Informs the user that the weather in Tokyo is sunny with a temperature of 70 degrees.",
    )
)

# Verify the agent's turn is complete, with no additional messages or function calls
result.expect.no_more_events()

const result = await session
  .run({ userInput: "What's the weather in Tokyo?" })
  .wait();

// Test that the agent's first conversation item is a function call
result.expect
  .nextEvent()
  .isFunctionCall({ name: 'getWeather', args: { location: 'Tokyo' } });

// Test that the tool returned the expected output to the agent
result.expect.nextEvent().isFunctionCallOutput();

// Test that the agent's response is appropriate based on the tool output
await result.expect
  .nextEvent()
  .isMessage({ role: 'assistant' })
  .judge(llm, {
    intent: 'Informs the user that the weather in Tokyo is sunny with a temperature of 70 degrees.',
  });

// Verify the agent's turn is complete, with no additional messages or function calls
result.expect.noMoreEvents();

Access individual properties with the event() method:

is_function_call().event().item.name - Function name
is_function_call().event().item.arguments - Function arguments
is_function_call_output().event().item.output - Raw function output
is_function_call_output().event().item.is_error - Whether the output is an error
is_function_call_output().event().item.call_id - The function call ID

Agent handoff assertions

Use is_agent_handoff() and contains_agent_handoff() to test that the agent performs a handoff to a new agent.

# The next event must be an agent handoff to the specified agent
result.expect.next_event().is_agent_handoff(new_agent_type=MyAgent)

# A handoff must occur somewhere in the turn
result.expect.contains_agent_handoff(new_agent_type=MyAgent)

// The next event must be an agent handoff to the specified agent
result.expect.nextEvent().isAgentHandoff({ newAgentType: MyAgent });

// A handoff must occur somewhere in the turn
result.expect.containsAgentHandoff({ newAgentType: MyAgent });

Evaluating full conversations

Only available in Python

The judge() method evaluates a single message against an intent. To evaluate the full conversation against multiple criteria at once, use JudgeGroup, which runs a list of judges concurrently against a ChatContext and returns an aggregate EvaluationResult.

Built-in judges

LiveKit Agents ships eight judges as factory functions in livekit.agents.evals. Each one returns an LLM-based judge with preset evaluation criteria:

accuracy_judge: Verifies the agent grounds information in tool outputs. Catches hallucinations and contradictions.
coherence_judge: Checks that responses follow a logical structure and don't jump between topics or contradict themselves.
conciseness_judge: Catches unnecessary verbosity, repetition, or redundant detail.
handoff_judge: Checks that the agent retains context across handoffs. Passes automatically when no handoffs occur, so it's safe to include in every test.
relevancy_judge: Checks that responses stay on topic and address what the user asked.
safety_judge: Catches unauthorized advice, improper disclosure, missed escalation, and harmful language.
task_completion_judge: Checks if the agent completed its goal based on the latest agent instructions in the chat context.
tool_use_judge: Checks tool selection, parameter accuracy, output handling, and error recovery.

Run a JudgeGroup in pytest

Run a multi-turn conversation, then evaluate it with JudgeGroup. The llm argument accepts either an LLM instance or a model string like "openai/gpt-4o-mini", which routes through LiveKit Inference.

import pytest
from livekit.agents import AgentSession, inference
from livekit.agents.evals import (
    JudgeGroup,
    accuracy_judge,
    relevancy_judge,
    task_completion_judge,
    tool_use_judge,
)

from agent import Assistant

@pytest.mark.asyncio
async def test_assistant_conversation() -> None:
    async with (
        inference.LLM(model="google/gemma-4-31b-it") as llm,
        AgentSession(llm=llm) as session,
    ):
        await session.start(Assistant())

        await session.run(user_input="Hello")
        await session.run(user_input="What's the weather in Tokyo?")

        judges = JudgeGroup(
            llm="openai/gpt-4o-mini",
            # Pick the judges relevant for this test
            judges=[
                task_completion_judge(),
                accuracy_judge(),
                tool_use_judge(),
                relevancy_judge(),
            ],
        )

        result = await judges.evaluate(session.history)

        assert result.all_passed, f"Some judges failed: {result.judgments}"

Result properties

JudgeGroup.evaluate() returns an EvaluationResult with the following properties:

score: Float from 0.0 to 1.0. Pass counts as 1, maybe as 0.5, fail as 0.
all_passed: True only if every judge returned a pass verdict.
any_passed: True if at least one judge passed.
majority_passed: True if more than half of the judges passed.
none_failed: True if no judge explicitly failed. Maybes are allowed.
judgments: A dict keyed by judge name. Each value is a JudgmentResult with verdict ("pass", "fail", or "maybe"), reasoning, instructions, and the convenience properties passed, failed, and uncertain.

Use score and all_passed for assertions, and inspect judgments[name].reasoning to debug failures:

result = await judges.evaluate(session.history)

for name, judgment in result.judgments.items():
    print(f"{name}: {judgment.verdict} ({judgment.reasoning})")
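
To make the aggregate properties concrete, the sketch below recomputes them from a list of verdict strings. It is an illustration rather than the library's code: the summarize helper is hypothetical, and it assumes that score is the mean of the per-judge values listed above.

```python
from __future__ import annotations


def summarize(verdicts: list[str]) -> dict[str, object]:
    """Recompute the aggregate properties from raw verdicts (hypothetical helper)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    points = {"pass": 1.0, "maybe": 0.5, "fail": 0.0}
    passed = sum(v == "pass" for v in verdicts)
    return {
        # Assumption: score is the average of the per-judge point values.
        "score": sum(points[v] for v in verdicts) / len(verdicts),
        "all_passed": passed == len(verdicts),
        "any_passed": passed > 0,
        "majority_passed": passed > len(verdicts) / 2,
        "none_failed": "fail" not in verdicts,
    }


summary = summarize(["pass", "pass", "maybe", "fail"])
assert summary["score"] == 0.625  # (1 + 1 + 0.5 + 0) / 4
assert not summary["all_passed"]
assert summary["any_passed"]
assert not summary["majority_passed"]  # 2 of 4 is not more than half
assert not summary["none_failed"]
```

In a test, a threshold such as result.score >= 0.75 may be a reasonable softer gate when an occasional maybe is acceptable, while all_passed remains the strict check.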

Custom judges

For deterministic checks that don't need an LLM, subclass Judge and override evaluate:

from livekit.agents.evals import Judge, JudgmentResult

class CitationJudge(Judge):
    def __init__(self) -> None:
        super().__init__(name="citation")

    async def evaluate(self, *, chat_ctx, reference=None, llm=None) -> JudgmentResult:
        has_citation = any(
            "[source]" in (item.text_content or "")
            for item in chat_ctx.items
            if item.type == "message"
        )
        return JudgmentResult(
            verdict="pass" if has_citation else "fail",
            reasoning="Found citation markers" if has_citation else "No citations found",
        )

judges = JudgeGroup(
    llm="openai/gpt-4o-mini",
    judges=[accuracy_judge(), CitationJudge()],
)

Subclassing Judge is the standard approach. As an escape hatch, any object that satisfies the Evaluator protocol can also be passed alongside the built-in judges: the protocol requires a name property and an async evaluate(*, chat_ctx, reference, llm) method that returns a JudgmentResult.
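
As a rough sketch of that protocol's shape, the example below uses typing.Protocol with stand-in types so that it runs without LiveKit installed. The JudgmentResult dataclass, the Evaluator protocol definition, and LengthEvaluator are all hypothetical local stand-ins. In real tests you would import JudgmentResult from livekit.agents.evals and receive a real ChatContext.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class JudgmentResult:
    """Local stand-in for livekit.agents.evals.JudgmentResult."""

    verdict: str
    reasoning: str


@runtime_checkable
class Evaluator(Protocol):
    """Stand-in for the protocol: a name plus an async evaluate method."""

    @property
    def name(self) -> str: ...

    async def evaluate(
        self, *, chat_ctx: Any, reference: Any = None, llm: Any = None
    ) -> JudgmentResult: ...


class LengthEvaluator:
    """Hypothetical evaluator: fails if any message exceeds max_chars."""

    def __init__(self, max_chars: int = 200) -> None:
        self._max_chars = max_chars

    @property
    def name(self) -> str:
        return "length"

    async def evaluate(
        self, *, chat_ctx: Any, reference: Any = None, llm: Any = None
    ) -> JudgmentResult:
        # chat_ctx is a plain list of strings here, not a real ChatContext.
        too_long = [m for m in chat_ctx if len(m) > self._max_chars]
        return JudgmentResult(
            verdict="fail" if too_long else "pass",
            reasoning=f"{len(too_long)} message(s) over {self._max_chars} chars",
        )


evaluator = LengthEvaluator(max_chars=20)
assert isinstance(evaluator, Evaluator)  # structural check, no inheritance needed
judgment = asyncio.run(evaluator.evaluate(chat_ctx=["Hi!", "How can I help?"]))
assert judgment.verdict == "pass"
```

LengthEvaluator never inherits from anything. It matches structurally, which is what makes the protocol an escape hatch for objects you can't or don't want to derive from Judge.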

Auto-tagging in production vs. tests

When JudgeGroup.evaluate() runs inside a job context, such as an on_session_end callback in production, it tags the session with each judgment as lk.judge.<name>:<verdict> so the results surface in LiveKit Cloud. In a pytest environment there's no job context, so tagging silently no-ops. The same JudgeGroup works in both places. For the production wiring, see the front-desk example.

Mocking tools

In many cases, you should mock your tools for testing. Mocking makes it easy to test edge cases, such as errors or other unexpected behavior, and removes dependencies on external services that you don't need to test against.

In Python, use the mock_tools helper in a with block. In Node.js, use voice.testing.withMockTools, which returns a Disposable you scope with using. Both override one or more tools for a specific Agent. Returning an error object from a mock (an exception instance in Python, an Error in Node.js) makes the tool raise. To mock a tool that raises an error:

from livekit.agents import mock_tools

# Mock a tool error
with mock_tools(
    Assistant,
    {"lookup_weather": lambda: RuntimeError("Weather service is unavailable")},
):
    result = await session.run(user_input="What's the weather in Tokyo?")

    await result.expect.next_event(type="message").judge(
        llm, intent="Should inform the user that an error occurred while looking up the weather."
    )

import { voice } from '@livekit/agents';

// Mock a tool error
{
  using _mock = voice.testing.withMockTools(Assistant, {
    lookupWeather: () => new Error('Weather service is unavailable'),
  });

  const result = await session.run({ userInput: "What's the weather in Tokyo?" }).wait();

  await result.expect.nextEvent({ type: 'message' }).judge(llm, {
    intent: 'Should inform the user that an error occurred while looking up the weather.',
  });
}

Mock function signatures

In Python, the mock function receives only the parameters it declares. Tool arguments are matched against the mock's signature, and anything not declared, including self and RunContext, is dropped. That's why the error mock above takes no arguments even though lookup_weather accepts location. The unused argument is trimmed away. In Node.js, a mock is a function (...args) => result; declare the parameters you need (for example location) and ignore the rest.
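
The argument trimming can be pictured with inspect.signature. The call_with_declared_args helper below is hypothetical and not part of LiveKit. It only sketches the idea of passing a mock the subset of arguments that its signature declares.

```python
from __future__ import annotations

import inspect
from typing import Any, Callable


def call_with_declared_args(mock: Callable[..., Any], available: dict[str, Any]) -> Any:
    """Call mock with only the keyword arguments its signature declares."""
    declared = inspect.signature(mock).parameters
    kwargs = {name: value for name, value in available.items() if name in declared}
    return mock(**kwargs)


# Everything the real tool would receive (names and values are illustrative).
available = {"self": object(), "context": object(), "location": "Tokyo"}

# Declares no parameters, so every argument is dropped.
error = call_with_declared_args(lambda: RuntimeError("unavailable"), available)
assert isinstance(error, RuntimeError)

# Declares only location, so self and context are dropped.
forecast = call_with_declared_args(lambda location: f"sunny in {location}", available)
assert forecast == "sunny in Tokyo"
```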

For more complex mocks, pass a named function instead of an inline one:

def _mock_weather_tool(location: str) -> str:
    if location == "Tokyo":
        return "sunny with a temperature of 70 degrees."
    else:
        return "UNSUPPORTED_LOCATION"

# Mock a specific tool response
with mock_tools(Assistant, {"lookup_weather": _mock_weather_tool}):
    result = await session.run(user_input="What's the weather in Tokyo?")

    await result.expect.next_event(type="message").judge(
        llm,
        intent="Should indicate the weather in Tokyo is sunny with a temperature of 70 degrees.",
    )

    result = await session.run(user_input="What's the weather in Paris?")

    await result.expect.next_event(type="message").judge(
        llm,
        intent="Should indicate that weather lookups in Paris are not supported.",
    )

function mockWeatherTool(location: string): string {
  if (location === 'Tokyo') {
    return 'sunny with a temperature of 70 degrees.';
  }
  return 'UNSUPPORTED_LOCATION';
}

// Mock a specific tool response
{
  using _mock = voice.testing.withMockTools(Assistant, { lookupWeather: mockWeatherTool });

  let result = await session.run({ userInput: "What's the weather in Tokyo?" }).wait();
  await result.expect.nextEvent({ type: 'message' }).judge(llm, {
    intent: 'Should indicate the weather in Tokyo is sunny with a temperature of 70 degrees.',
  });

  result = await session.run({ userInput: "What's the weather in Paris?" }).wait();
  await result.expect.nextEvent({ type: 'message' }).judge(llm, {
    intent: 'Should indicate that weather lookups in Paris are not supported.',
  });
}

Testing multiple turns

You can test multiple turns of a conversation by executing the run method multiple times. The conversation history builds automatically across turns.

# First turn
result1 = await session.run(user_input="Hello")
await result1.expect.next_event().is_message(role="assistant").judge(
    llm, intent="Friendly greeting"
)

# Second turn builds on conversation history
result2 = await session.run(user_input="What's the weather like in Tokyo?")
result2.expect.next_event().is_function_call(name="lookup_weather")
result2.expect.next_event().is_function_call_output()
await result2.expect.next_event().is_message(role="assistant").judge(
    llm, intent="Provides weather information"
)

// First turn
const result1 = await session.run({ userInput: 'Hello' }).wait();
await result1.expect
  .nextEvent()
  .isMessage({ role: 'assistant' })
  .judge(llm, {
    intent: 'Friendly greeting',
  });

// Second turn builds on conversation history
const result2 = await session.run({ userInput: "What's the weather like in Tokyo?" }).wait();
result2.expect.nextEvent().isFunctionCall({ name: 'getWeather' });
result2.expect.nextEvent().isFunctionCallOutput();
await result2.expect
  .nextEvent()
  .isMessage({ role: 'assistant' })
  .judge(llm, {
    intent: 'Provides weather information',
  });

Loading conversation history

To load conversation history manually, use the ChatContext class just as in your agent code:

from livekit.agents import ChatContext

agent = Assistant()
await session.start(agent)
# update_chat_ctx is on the Agent instance, not the session.
# In tests where you don't hold a reference, use session.current_agent.

chat_ctx = ChatContext()
chat_ctx.add_message(role="user", content="My name is Alice")
chat_ctx.add_message(role="assistant", content="Nice to meet you, Alice!")
await agent.update_chat_ctx(chat_ctx)

# Test that the agent remembers the context
result = await session.run(user_input="What's my name?")
await result.expect.next_event().is_message(role="assistant").judge(
    llm, intent="Should remember and mention the user's name is Alice"
)

import { llm } from '@livekit/agents';

const { ChatContext } = llm;

const agent = new Assistant();
await session.start({ agent });
// updateChatCtx is on the Agent instance, not the session.
// In tests where you don't hold a reference, use session.currentAgent.

const chatCtx = new ChatContext();
chatCtx.addMessage({ role: 'user', content: 'My name is Alice' });
chatCtx.addMessage({ role: 'assistant', content: 'Nice to meet you, Alice!' });
await agent.updateChatCtx(chatCtx);

// Test that the agent remembers the context
const result = await session.run({ userInput: "What's my name?" }).wait();
await result.expect
  .nextEvent()
  .isMessage({ role: 'assistant' })
  .judge(llm, {
    intent: "Should remember and mention the user's name is Alice",
  });