# Test framework

> Set up tests, navigate results, write assertions, and test multi-turn conversations.

## Overview

This guide covers the full testing API for LiveKit Agents, including test setup, result navigation, assertions, mocking, and multi-turn conversation testing. The examples use [pytest](https://docs.pytest.org/en/stable/) for Python and [Vitest](https://vitest.dev/) for Node.js, but are adaptable to other testing frameworks.

> ℹ️ **Project structure and deployment**
> 
> If you move your agent entrypoint file while restructuring your project to add tests, update your Dockerfile to match. The default template assumes `src/agent.py` for Python projects. See [Builds and Dockerfiles](https://docs.livekit.io/deploy/agents/builds.md) for details.

## Installation

**Python**:

You must install both the `pytest` and `pytest-asyncio` packages to write tests for your agent.

```shell
uv add pytest pytest-asyncio

```

---

**Node.js**:

You must install `vitest` to write tests for your agent.

```shell
pnpm add -D vitest

```

> ℹ️ **Suppress CLI output**
> 
> Always call `initializeLogger({ pretty: false, level: 'warn' })` at the top of your test files to suppress verbose CLI output.

## Test setup

Each test typically follows the same pattern:

**Python**:

```python
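import pytest
from livekit.agents import AgentSession, inference

# Import your agent class
from agent import Assistant


@pytest.mark.asyncio  # Or your async testing framework of choice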
async def test_your_agent() -> None:
    async with (
        # You must create an LLM instance for the `judge` method
        inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,

        # Create a session for the life of this test.
        # LLM is not required - it will use the agent's LLM if you don't provide one here
        AgentSession(llm=llm) as session,
    ):
        # Start the agent in the session
        await session.start(Assistant())

        # Run a single conversation turn based on the given user input
        result = await session.run(user_input="Hello")

        # ...your assertions go here...

```

---

**Node.js**:

```typescript
import { inference, initializeLogger, voice } from '@livekit/agents';
import { describe, it, beforeAll, afterAll } from 'vitest';
// Import your agent class
import { Agent } from './agent';

// Initialize logger to suppress CLI output
initializeLogger({ pretty: false, level: 'warn' });

const { AgentSession } = voice;

describe('YourAgent', () => {
  let session: voice.AgentSession;
  let llm: inference.LLM;

  beforeAll(async () => {
    // You must create an LLM instance for the `judge` method
    llm = new inference.LLM({ model: 'openai/gpt-5.3-chat-latest' });

    // Create a session for the life of this test.
    // LLM is not required - it will use the agent's LLM if you don't provide one here
    session = new AgentSession({ llm });

    // Start the agent in the session
    await session.start({ agent: new Agent() });
  });

  afterAll(async () => {
    await session?.close();
  });

  it('should test your agent', async () => {
    // Run a single conversation turn based on the given user input
    const result = await session.run({ userInput: 'Hello' }).wait();

    // ...your assertions go here...
  });
});

```

## Result structure

The `run` method executes a single conversation turn and returns a `RunResult`, which contains each of the events that occurred during the turn, in order, and offers a fluent assertion API.

A simple turn with no tool calls produces a single event:

```mermaid
flowchart LR
greeting("User: 'Hello'") --> response("Agent: 'How can I help you today?'")
```

However, a more complex turn may contain tool calls, tool outputs, handoffs, and one or more messages.

```mermaid
flowchart TD
greeting("User: 'What's the weather in Tokyo?'") --> tool_call("ToolCall: lookup_weather(location='Tokyo')")
tool_call --> tool_output("ToolOutput: 'sunny with a temperature of 70 degrees.'")
tool_output --> response("Agent: 'The weather in Tokyo is sunny with a temperature of 70 degrees.'")
```

To validate these multi-part turns, you can use any of the following approaches.

### Sequential navigation

- Step through events one at a time with `next_event()`.
- Validate each event with `is_*` assertions like `is_message()`.
- Call `no_more_events()` at the end to assert no unexpected events remain.

For example, to validate that the next event is a message from the assistant, you can use the following code:

**Python**:

```python
result.expect.next_event().is_message(role="assistant")

```

---

**Node.js**:

```typescript
result.expect.nextEvent().isMessage({ role: 'assistant' });

```

#### Skipping events

You can also skip events without validation:

- **`skip_next(n)`**: Skip one or more events. Defaults to 1.
- **`skip_next_event_if(type, ...)`**: Skip the next event only if it matches the given type and optional filters (for example, `role` for messages, `name` for function calls). Returns the matching Assert, or `None` if the next event doesn't match.
- **`next_event(type=...)`**: Advance to the next event of the given type, skipping everything else. Raises an assertion error if no match is found.

Example:

**Python**:

```python
result.expect.skip_next() # skips one event
result.expect.skip_next(2) # skips two events
result.expect.skip_next_event_if(type="message", role="assistant") # Skips the next event if it's an assistant message
result.expect.skip_next_event_if(type="function_call", name="lookup_weather") # Skips the next event if it's a call to lookup_weather

result.expect.next_event(type="function_call") # Advances to the next function call, skipping non-function-call events. Raises an assertion error if not found.

```

---

**Node.js**:

```typescript
result.expect.skipNext(); // skips one event
result.expect.skipNext(2); // skips two events
result.expect.skipNextEventIf({ type: 'message', role: 'assistant' }); // Skips the next event if it's an assistant message

result.expect.nextEvent({ type: 'message', role: 'assistant' }); // Advances to the next assistant message, skipping anything else. If no matching event is found, an assertion error is raised.

```

> ℹ️ **Return types for next_event(type=...)**
> 
> Passing a `type` to `next_event()` returns a type-specific Assert (for example, `FunctionCallAssert`) that doesn't have `is_*` methods. Don't chain `.is_function_call()` after `next_event(type="function_call")`.
> 
> To assert additional properties like function name, either omit `type` and chain the `is_*` method, or check the event directly:
> 
> ```python
> # Option 1: chain is_function_call on a generic EventAssert
> result.expect.next_event().is_function_call(name="lookup_weather")
> 
> # Option 2: advance to any function call, then check the name
> fnc = result.expect.next_event(type="function_call")
> assert fnc.event().item.name == "lookup_weather"
> 
> ```
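To make the cursor behavior described above concrete, here is a toy model of the navigation methods that uses only the standard library. It is not LiveKit's implementation: `ToyEvent` and `ToyCursor` are hypothetical names, events are reduced to a `type` string and a `name`, and the methods return the raw event rather than an Assert object. It only sketches how `skip_next`, `skip_next_event_if`, and `next_event(type=...)` move through a turn's events.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyEvent:
    """Hypothetical stand-in for a run event; not a LiveKit type."""
    type: str
    name: str = ""


class ToyCursor:
    """Toy model of the cursor behavior described in this section."""

    def __init__(self, events: list[ToyEvent]) -> None:
        self._events = events
        self._pos = 0

    def skip_next(self, count: int = 1) -> None:
        # Skip `count` events without validating them.
        if self._pos + count > len(self._events):
            raise AssertionError("Tried to skip past the last event")
        self._pos += count

    def skip_next_event_if(self, type: str) -> Optional[ToyEvent]:
        # Consume the next event only if it matches; otherwise leave the cursor alone.
        if self._pos < len(self._events) and self._events[self._pos].type == type:
            event = self._events[self._pos]
            self._pos += 1
            return event
        return None

    def next_event(self, type: Optional[str] = None) -> ToyEvent:
        # Without a type, return the very next event.
        # With a type, skip non-matching events until one matches.
        while self._pos < len(self._events):
            event = self._events[self._pos]
            self._pos += 1
            if type is None or event.type == type:
                return event
        raise AssertionError(f"No matching event found (type={type!r})")

    def no_more_events(self) -> None:
        # Fail if any event has not been consumed yet.
        if self._pos < len(self._events):
            raise AssertionError("Unexpected events remain")


turn = [
    ToyEvent("function_call", "lookup_weather"),
    ToyEvent("function_call_output"),
    ToyEvent("message"),
]
cursor = ToyCursor(turn)
skipped = cursor.skip_next_event_if("message")  # next event is a function call, so nothing is skipped
call = cursor.next_event(type="function_call")  # consumes the function call
reply = cursor.next_event(type="message")       # skips the tool output and lands on the message
cursor.no_more_events()                         # passes: every event has been consumed
```

The point to notice is that `skip_next_event_if` leaves the cursor in place when the next event doesn't match, while `next_event(type=...)` consumes everything up to and including the first match.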

### Indexed access

Access a specific event by index without advancing the cursor. Negative indices count from the end of the list, so `-1` refers to the last event.

**Python**:

```python
result.expect[0].is_message(role="assistant")

```

---

**Node.js**:

```typescript
result.expect.at(0).isMessage({ role: 'assistant' });

```

### Search

Search for events regardless of position with `contains_*` methods like `contains_message()`. You can also search within a range using slices (`[:]` in Python, `.range()` in Node.js).

**Python**:

```python
result.expect.contains_message(role="assistant")
result.expect[0:2].contains_message(role="assistant")

```

---

**Node.js**:

```typescript
result.expect.containsMessage({ role: 'assistant' });
result.expect.range(0, 2).containsMessage({ role: 'assistant' });

```
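As a refresher on the index and slice notation used above, the following sketch applies it to a plain Python list. The list of event type names is a hypothetical stand-in for a turn's events, and the sketch assumes `result.expect[...]` follows ordinary Python sequence semantics, as the negative-index behavior suggests. No LiveKit objects are involved.

```python
# Hypothetical event types for a weather turn, in order.
event_types = ["function_call", "function_call_output", "message"]

first = event_types[0]     # first event
last = event_types[-1]     # a negative index counts from the end
window = event_types[0:2]  # slice: indices 0 and 1, the end index is excluded

# A "contains" style check over a range is a search within that slice only.
message_in_window = "message" in window
message_anywhere = "message" in event_types
```

Because the end index of a slice is exclusive, a `[0:2]` range over this three-event turn covers the function call and its output but not the final message.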

## Assertions

The test framework includes assertion helpers to validate messages, tool calls, and agent handoffs within each result. Use exact assertions like `is_message()` to check a specific event, or search assertions like `contains_message()` to find a match anywhere in a range of events.

### Message assertions

Use `is_message()` and `contains_message()` to test individual messages. Both accept an optional `role` argument.

**Python**:

```python
result.expect.next_event().is_message(role="assistant")
result.expect[0:2].contains_message(role="assistant")

```

---

**Node.js**:

```typescript
result.expect.nextEvent().isMessage({ role: 'assistant' });
result.expect.range(0, 2).containsMessage({ role: 'assistant' });

```

Access additional properties with the `event()` method:

- **`event().item.content`** - Message content
- **`event().item.role`** - Message role

### LLM-based judgment

Use `judge()` to evaluate whether a message matches a given intent. Pass an [LLM](https://docs.livekit.io/agents/models/llm.md) instance and an intent string describing the expected content. The LLM judges the message against the intent without surrounding conversation context.

**Python**:

```python
result = await session.run(user_input="Hello")

await (
    result.expect.next_event().is_message(role="assistant")
    .judge(
        llm, intent="Offers a friendly introduction and offer of assistance."
    )
)

```

---

**Node.js**:

```typescript
const result = await session.run({ userInput: 'Hello' }).wait();

await result.expect
  .nextEvent()
  .isMessage({ role: 'assistant' })
  .judge(llm, {
    intent: 'Offers a friendly introduction and offer of assistance.',
  });

```

The `llm` argument can be any LLM instance and does not need to be the same one used in the agent itself.

### Tool call assertions

Test three aspects of tool use:

1. **Function calls**: The agent calls the correct tool with the correct arguments.
2. **Function call outputs**: The tool returns the expected output.
3. **Agent response**: The agent responds appropriately based on the tool output.

The following example tests all three:

**Python**:

```python
result = await session.run(user_input="What's the weather in Tokyo?")

# Test that the agent's first conversation item is a function call
fnc_call = result.expect.next_event().is_function_call(name="lookup_weather", arguments={"location": "Tokyo"})

# Test that the tool returned the expected output to the agent
result.expect.next_event().is_function_call_output(output="sunny with a temperature of 70 degrees.")

# Test that the agent's response is appropriate based on the tool output
await (
    result.expect.next_event()
    .is_message(role="assistant")
    .judge(
        llm,
        intent="Informs the user that the weather in Tokyo is sunny with a temperature of 70 degrees.",
    )
)

# Verify the agent's turn is complete, with no additional messages or function calls
result.expect.no_more_events()

```

---

**Node.js**:

```typescript
const result = await session
  .run({ userInput: "What's the weather in Tokyo?" })
  .wait();

// Test that the agent's first conversation item is a function call
result.expect
  .nextEvent()
  .isFunctionCall({ name: 'getWeather', args: { location: 'Tokyo' } });

// Test that the tool returned the expected output to the agent
result.expect.nextEvent().isFunctionCallOutput();

// Test that the agent's response is appropriate based on the tool output
await result.expect
  .nextEvent()
  .isMessage({ role: 'assistant' })
  .judge(llm, {
    intent: 'Informs the user that the weather in Tokyo is sunny with a temperature of 70 degrees.',
  });

// Verify the agent's turn is complete, with no additional messages or function calls
result.expect.noMoreEvents();

```

Access individual properties with the `event()` method:

- **`is_function_call().event().item.name`** - Function name
- **`is_function_call().event().item.arguments`** - Function arguments
- **`is_function_call_output().event().item.output`** - Raw function output
- **`is_function_call_output().event().item.is_error`** - Whether the output is an error
- **`is_function_call_output().event().item.call_id`** - The function call ID

### Agent handoff assertions

Use `is_agent_handoff()` and `contains_agent_handoff()` to test that the agent performs a [handoff](https://docs.livekit.io/agents/logic/workflows.md) to a new agent.

**Python**:

```python
# The next event must be an agent handoff to the specified agent
result.expect.next_event().is_agent_handoff(new_agent_type=MyAgent)

# A handoff must occur somewhere in the turn
result.expect.contains_agent_handoff(new_agent_type=MyAgent)

```

---

**Node.js**:

```typescript
// The next event must be an agent handoff to the specified agent
result.expect.nextEvent().isAgentHandoff({ newAgentType: MyAgent });

// A handoff must occur somewhere in the turn
result.expect.containsAgentHandoff({ newAgentType: MyAgent });

```

## Evaluating full conversations

Available in:
- [ ] Node.js
- [x] Python

The `judge()` method evaluates a single message against an intent. To evaluate the full conversation against multiple criteria at once, use `JudgeGroup`, which runs a list of judges concurrently against a `ChatContext` and returns an aggregate `EvaluationResult`.

### Built-in judges

LiveKit Agents ships eight judges as factory functions in `livekit.agents.evals`. Each one returns an LLM-based judge with preset evaluation criteria:

- **`accuracy_judge`**: Verifies the agent grounds information in tool outputs. Catches hallucinations and contradictions.
- **`coherence_judge`**: Checks that responses follow a logical structure and don't jump between topics or contradict themselves.
- **`conciseness_judge`**: Catches unnecessary verbosity, repetition, or redundant detail.
- **`handoff_judge`**: Checks that the agent retains context across handoffs. Passes automatically when no handoffs occur, so it's safe to include in every test.
- **`relevancy_judge`**: Checks that responses stay on topic and address what the user asked.
- **`safety_judge`**: Catches unauthorized advice, improper disclosure, missed escalation, and harmful language.
- **`task_completion_judge`**: Checks if the agent completed its goal based on the latest agent instructions in the chat context.
- **`tool_use_judge`**: Checks tool selection, parameter accuracy, output handling, and error recovery.

### Run a JudgeGroup in pytest

Run a multi-turn conversation, then evaluate it with `JudgeGroup`. The `llm` argument accepts either an `LLM` instance or a model string like `"openai/gpt-4o-mini"`, which routes through [LiveKit Inference](https://docs.livekit.io/agents/models/inference.md).

```python
import pytest
from livekit.agents import AgentSession, inference
from livekit.agents.evals import (
    JudgeGroup,
    accuracy_judge,
    relevancy_judge,
    task_completion_judge,
    tool_use_judge,
)

from agent import Assistant

@pytest.mark.asyncio
async def test_assistant_conversation() -> None:
    async with (
        inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
        AgentSession(llm=llm) as session,
    ):
        await session.start(Assistant())

        await session.run(user_input="Hello")
        await session.run(user_input="What's the weather in Tokyo?")

        judges = JudgeGroup(
            llm="openai/gpt-4o-mini",
            # Pick the judges relevant for this test
            judges=[
                task_completion_judge(),
                accuracy_judge(),
                tool_use_judge(),
                relevancy_judge(),
            ],
        )

        result = await judges.evaluate(session.history)

        assert result.all_passed, f"Some judges failed: {result.judgments}"

```

### Result properties

`JudgeGroup.evaluate()` returns an `EvaluationResult` with the following properties:

- **`score`**: Float from 0.0 to 1.0. Pass counts as 1, maybe as 0.5, fail as 0.
- **`all_passed`**: True only if every judge returned a pass verdict.
- **`any_passed`**: True if at least one judge passed.
- **`majority_passed`**: True if more than half of the judges passed.
- **`none_failed`**: True if no judge explicitly failed. Maybes are allowed.
- **`judgments`**: A dict keyed by judge name. Each value is a `JudgmentResult` with `verdict` (`"pass"`, `"fail"`, or `"maybe"`), `reasoning`, `instructions`, and the convenience properties `passed`, `failed`, and `uncertain`.

Use `score` and `all_passed` for assertions, and inspect `judgments[name].reasoning` to debug failures:

```python
result = await judges.evaluate(session.history)

for name, judgment in result.judgments.items():
    print(f"{name}: {judgment.verdict} ({judgment.reasoning})")

```
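The aggregate properties listed above can be reasoned about with a small standard-library sketch. This is not LiveKit's implementation: `summarize` is a hypothetical helper, and it assumes that `score` is the mean of the per-judge weights (pass = 1, maybe = 0.5, fail = 0), which the description implies but doesn't state outright.

```python
from collections import Counter

# Per-verdict weights, taken from the `score` description in this section.
VERDICT_WEIGHTS = {"pass": 1.0, "maybe": 0.5, "fail": 0.0}


def summarize(verdicts: dict[str, str]) -> dict[str, object]:
    """Toy aggregate over {judge_name: verdict}; hypothetical helper."""
    if not verdicts:
        raise ValueError("At least one verdict is required")
    counts = Counter(verdicts.values())
    total = len(verdicts)
    return {
        # Assumption: the score is the mean of the per-judge weights.
        "score": sum(VERDICT_WEIGHTS[v] for v in verdicts.values()) / total,
        "all_passed": counts["pass"] == total,
        "any_passed": counts["pass"] > 0,
        "majority_passed": counts["pass"] > total / 2,
        "none_failed": counts["fail"] == 0,
    }


summary = summarize(
    {"accuracy": "pass", "relevancy": "pass", "tool_use": "maybe", "safety": "fail"}
)
```

With two passes, one maybe, and one fail, the sketch yields a score of 0.625. `majority_passed` is false here because exactly half of the judges passed, which is not more than half.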

### Custom judges

For deterministic checks that don't need an LLM, subclass `Judge` and override `evaluate`:

```python
from livekit.agents.evals import Judge, JudgeGroup, JudgmentResult, accuracy_judge

class CitationJudge(Judge):
    def __init__(self) -> None:
        super().__init__(name="citation")

    async def evaluate(self, *, chat_ctx, reference=None, llm=None) -> JudgmentResult:
        has_citation = any(
            "[source]" in (item.text_content or "")
            for item in chat_ctx.items
            if item.type == "message"
        )
        return JudgmentResult(
            verdict="pass" if has_citation else "fail",
            reasoning="Found citation markers" if has_citation else "No citations found",
        )

judges = JudgeGroup(
    llm="openai/gpt-4o-mini",
    judges=[accuracy_judge(), CitationJudge()],
)

```

Subclassing `Judge` is the standard approach. As an escape hatch, any object that satisfies the `Evaluator` protocol can also be passed alongside the built-in judges: the protocol requires a `name` property and an `async evaluate(*, chat_ctx, reference, llm)` method that returns a `JudgmentResult`.
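To illustrate what satisfying a protocol by shape means, the following standard-library sketch defines a local mirror of the protocol described above and a plain class that conforms to it without inheriting from anything. `ToyEvaluator`, `ToyJudgment`, and `MaxItemsEvaluator` are hypothetical names, not LiveKit types, and `chat_ctx` is modeled as a plain list.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class ToyJudgment:
    """Hypothetical stand-in for JudgmentResult (verdict and reasoning only)."""
    verdict: str
    reasoning: str


@runtime_checkable
class ToyEvaluator(Protocol):
    """Hypothetical mirror of the protocol shape described above."""

    @property
    def name(self) -> str: ...

    async def evaluate(
        self, *, chat_ctx: Any, reference: Any = None, llm: Any = None
    ) -> ToyJudgment: ...


class MaxItemsEvaluator:
    """Plain class with no base class: it satisfies the protocol by shape alone."""

    def __init__(self, max_items: int) -> None:
        self._max_items = max_items

    @property
    def name(self) -> str:
        return "max_items"

    async def evaluate(
        self, *, chat_ctx: Any, reference: Any = None, llm: Any = None
    ) -> ToyJudgment:
        # `chat_ctx` is modeled here as a plain list of messages.
        ok = len(chat_ctx) <= self._max_items
        return ToyJudgment(
            verdict="pass" if ok else "fail",
            reasoning=f"{len(chat_ctx)} items (limit {self._max_items})",
        )


evaluator = MaxItemsEvaluator(max_items=4)
judgment = asyncio.run(evaluator.evaluate(chat_ctx=["hi", "hello", "bye"]))
conforms = isinstance(evaluator, ToyEvaluator)  # structural check, no inheritance needed
```

The runtime `isinstance` check only confirms that the required members exist. It doesn't validate their signatures, so a type checker remains the better tool for catching mismatches.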

> ℹ️ **Auto-tagging in production vs. tests**
> 
> When `JudgeGroup.evaluate()` runs inside a job context, such as an `on_session_end` callback in production, it tags the session with each judgment as `lk.judge.<name>:<verdict>` so the results surface in LiveKit Cloud. In a pytest environment there's no job context, so tagging silently no-ops. The same `JudgeGroup` works in both places. For the production wiring, see the [front-desk example](https://github.com/livekit/agents/blob/main/examples/frontdesk/agent.py).

## Mocking tools

Available in:
- [ ] Node.js
- [x] Python

In many cases, you should mock your tools for testing. Mocking makes it easy to test edge cases, such as errors or other unexpected behavior, and removes dependencies on external services that you don't need to test against.

> ℹ️ **Version requirement**
> 
> `mock_tools` requires LiveKit Agents 1.2.6 or later.

Use the `mock_tools` helper in a `with` block to mock one or more tools for a specific Agent. To mock a tool that raises an error:

```python
from livekit.agents import mock_tools

# Mock a tool error
with mock_tools(
    Assistant,
    {"lookup_weather": lambda: RuntimeError("Weather service is unavailable")},
):
    result = await session.run(user_input="What's the weather in Tokyo?")

    await result.expect.next_event(type="message").judge(
        llm, intent="Should inform the user that an error occurred while looking up the weather."
    )

```

### Mock function signatures

The mock function receives only the parameters it declares. Tool arguments are matched against the mock's signature, and anything not declared, including `self` and [`RunContext`](https://docs.livekit.io/agents/logic/tools/definition.md#runcontext), is dropped. That's why the error mock above takes no arguments even though `lookup_weather` accepts `location`. The unused argument is trimmed away.
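The argument trimming described above can be sketched with `inspect.signature` from the standard library. This is an illustration of the idea, not LiveKit's actual code: `call_with_declared_args` is a hypothetical helper, and the `context` key stands in for framework-supplied arguments such as `RunContext`.

```python
import inspect
from typing import Any, Callable


def call_with_declared_args(mock: Callable[..., Any], tool_args: dict[str, Any]) -> Any:
    """Call `mock` with only the keyword arguments its signature declares."""
    declared = inspect.signature(mock).parameters
    filtered = {name: value for name, value in tool_args.items() if name in declared}
    return mock(**filtered)


# Arguments a (hypothetical) tool invocation would carry.
tool_args = {"context": object(), "location": "Tokyo"}

# A zero-argument mock: every argument is dropped before the call.
no_args_result = call_with_declared_args(lambda: "service unavailable", tool_args)

# A mock that declares `location`: only that argument is passed through.
location_result = call_with_declared_args(
    lambda location: f"weather for {location}", tool_args
)
```

This simplified helper matches arguments by name only and does not handle mocks that declare `**kwargs`.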

For more complex mocks, pass a named function instead of a lambda:

```python
def _mock_weather_tool(location: str) -> str:
    if location == "Tokyo":
        return "sunny with a temperature of 70 degrees."
    else:
        return "UNSUPPORTED_LOCATION"

# Mock a specific tool response
with mock_tools(Assistant, {"lookup_weather": _mock_weather_tool}):
    result = await session.run(user_input="What's the weather in Tokyo?")

    await result.expect.next_event(type="message").judge(
        llm,
        intent="Should indicate the weather in Tokyo is sunny with a temperature of 70 degrees.",
    )

    result = await session.run(user_input="What's the weather in Paris?")

    await result.expect.next_event(type="message").judge(
        llm,
        intent="Should indicate that weather lookups in Paris are not supported.",
    )

```

## Testing multiple turns

You can test multiple turns of a conversation by executing the `run` method multiple times. The conversation history builds automatically across turns.

**Python**:

```python
# First turn
result1 = await session.run(user_input="Hello")
await result1.expect.next_event().is_message(role="assistant").judge(
    llm, intent="Friendly greeting"
)

# Second turn builds on conversation history
result2 = await session.run(user_input="What's the weather like in Tokyo?")
result2.expect.next_event().is_function_call(name="lookup_weather")
result2.expect.next_event().is_function_call_output()
await result2.expect.next_event().is_message(role="assistant").judge(
    llm, intent="Provides weather information"
)

```

---

**Node.js**:

```typescript
// First turn
const result1 = await session.run({ userInput: 'Hello' }).wait();
await result1.expect
  .nextEvent()
  .isMessage({ role: 'assistant' })
  .judge(llm, {
    intent: 'Friendly greeting',
  });

// Second turn builds on conversation history
const result2 = await session.run({ userInput: "What's the weather like in Tokyo?" }).wait();
result2.expect.nextEvent().isFunctionCall({ name: 'getWeather' });
result2.expect.nextEvent().isFunctionCallOutput();
await result2.expect
  .nextEvent()
  .isMessage({ role: 'assistant' })
  .judge(llm, {
    intent: 'Provides weather information',
  });

```

## Loading conversation history

To load conversation history manually, use the `ChatContext` class just as in your agent code:

**Python**:

```python
from livekit.agents import ChatContext

agent = Assistant()
await session.start(agent)
# update_chat_ctx is on the Agent instance, not the session.
# In tests where you don't hold a reference, use session.current_agent.

chat_ctx = ChatContext()
chat_ctx.add_message(role="user", content="My name is Alice")
chat_ctx.add_message(role="assistant", content="Nice to meet you, Alice!")
await agent.update_chat_ctx(chat_ctx)

# Test that the agent remembers the context
result = await session.run(user_input="What's my name?")
await result.expect.next_event().is_message(role="assistant").judge(
    llm, intent="Should remember and mention the user's name is Alice"
)

```

---

**Node.js**:

```typescript
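// Alias the `llm` namespace so it doesn't shadow the `llm` judge instance used below
import { llm as llmNamespace } from '@livekit/agents';

const { ChatContext } = llmNamespace;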

const agent = new Assistant();
await session.start({ agent });
// updateChatCtx is on the Agent instance, not the session.
// In tests where you don't hold a reference, use session.currentAgent.

const chatCtx = new ChatContext();
chatCtx.addMessage({ role: 'user', content: 'My name is Alice' });
chatCtx.addMessage({ role: 'assistant', content: 'Nice to meet you, Alice!' });
await agent.updateChatCtx(chatCtx);

// Test that the agent remembers the context
const result = await session.run({ userInput: "What's my name?" }).wait();
await result.expect
  .nextEvent()
  .isMessage({ role: 'assistant' })
  .judge(llm, {
    intent: "Should remember and mention the user's name is Alice",
  });

```

---

This document was rendered at 2026-06-07T11:32:47.376Z.
For the latest version of this document, see [https://docs.livekit.io/agents/start/testing/test-framework.md](https://docs.livekit.io/agents/start/testing/test-framework.md).

To explore all LiveKit documentation, see [llms.txt](https://docs.livekit.io/llms.txt).