Overview
Writing effective tests and evaluations is a key part of developing a reliable, production-ready AI agent. LiveKit Agents includes helpers that work with testing frameworks such as pytest for Python and Vitest for Node.js, so you can write behavioral tests and evaluations alongside your existing unit and integration tests.
Use these tools to fine-tune your agent's behavior, handle tricky edge cases, and iterate on your agent's capabilities without breaking existing functionality.
What to test
You should plan to test your agent's behavior in the following areas:
- Expected behavior: Does your agent respond with the right intent and tone for typical use cases?
- Tool usage: Are functions called with correct arguments and proper context? (See the sketch after this list.)
- Error handling: How does your agent respond to invalid inputs or tool failures?
- Grounding: Does your agent stay factual and avoid hallucinating information?
- Misuse resistance: How does your agent handle intentional attempts to misuse or manipulate it?
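As an example of the tool usage point, a test can assert that the right function runs before judging the final reply. The following is a minimal sketch, assuming a hypothetical `lookup_weather` tool on the `Assistant` agent; the `is_function_call` and `is_function_call_output` assertions are part of the test framework described in the Test framework guide.

```python
import pytest

from livekit.agents import AgentSession, inference

from agent import Assistant  # assumes an Assistant agent exposing a lookup_weather tool


@pytest.mark.asyncio
async def test_weather_tool_usage() -> None:
    async with (
        inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
        AgentSession(llm=llm) as session,
    ):
        await session.start(Assistant())

        result = await session.run(user_input="What's the weather in Tokyo?")

        # The agent should call the hypothetical lookup_weather tool with the right city...
        result.expect.next_event().is_function_call(
            name="lookup_weather", arguments={"location": "Tokyo"}
        )
        # ...receive the tool output...
        result.expect.next_event().is_function_call_output()
        # ...and then answer with a message grounded in that output.
        await (
            result.expect.next_event()
            .is_message(role="assistant")
            .judge(llm, intent="Reports the current weather in Tokyo.")
        )
```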
The built-in testing helpers are designed to work with text input and output, using an LLM via LiveKit Inference or a plugin in text-only mode. This is the most cost-effective and intuitive way to write comprehensive tests of your agent's behavior.
For testing options that exercise the entire audio pipeline, see the third-party testing tools section at the end of this guide.
Example test
Here is a simple behavioral test for the agent created in the voice AI quickstart. It ensures that the agent responds with a friendly greeting and offers assistance.
Python:

```python
import pytest

from livekit.agents import AgentSession, inference

from agent import Assistant


@pytest.mark.asyncio
async def test_assistant_greeting() -> None:
    async with (
        inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
        AgentSession(llm=llm) as session,
    ):
        await session.start(Assistant())

        result = await session.run(user_input="Hello")

        await (
            result.expect.next_event()
            .is_message(role="assistant")
            .judge(llm, intent="Makes a friendly introduction and offers assistance.")
        )

        result.expect.no_more_events()
```
Node.js:

```typescript
import { inference, initializeLogger, voice } from '@livekit/agents';
import { describe, it, beforeAll, afterAll } from 'vitest';

// Import your agent class
import { Agent } from './agent';

// Initialize logger to suppress CLI output
initializeLogger({ pretty: false, level: 'warn' });

const { AgentSession } = voice;

describe('Assistant', () => {
  let session: voice.AgentSession;
  let llm: inference.LLM;

  beforeAll(async () => {
    llm = new inference.LLM({ model: 'openai/gpt-5.3-chat-latest' });
    session = new AgentSession({ llm });
    await session.start({ agent: new Agent() });
  });

  afterAll(async () => {
    await session?.close();
  });

  it('should greet and offer assistance', async () => {
    const result = await session.run({ userInput: 'Hello' }).wait();

    await result.expect
      .nextEvent()
      .isMessage({ role: 'assistant' })
      .judge(llm, {
        intent: 'Makes a friendly introduction and offers assistance.',
      });

    result.expect.noMoreEvents();
  });
});
```
For the full testing API, including setup, assertions, mocking, and multi-turn testing, see Test framework.
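As a taste of multi-turn testing, the sketch below sends two inputs through one session. It assumes, per the Test framework guide, that the session retains chat history between `run` calls; the `Assistant` agent and the name-recall scenario mirror the examples above.

```python
import pytest

from livekit.agents import AgentSession, inference

from agent import Assistant


@pytest.mark.asyncio
async def test_remembers_user_name() -> None:
    async with (
        inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
        AgentSession(llm=llm) as session,
    ):
        await session.start(Assistant())

        # First turn: give the agent a fact to remember.
        result = await session.run(user_input="My name is Alice.")
        await (
            result.expect.next_event()
            .is_message(role="assistant")
            .judge(llm, intent="Acknowledges the user's name.")
        )

        # Second turn: the session keeps the chat history, so the
        # agent should be able to recall the name.
        result = await session.run(user_input="What's my name?")
        await (
            result.expect.next_event()
            .is_message(role="assistant")
            .judge(llm, intent="Recalls that the user's name is Alice.")
        )
```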
Verbose output
The `LIVEKIT_EVALS_VERBOSE` environment variable turns on detailed output for each agent execution.

To use it with pytest, you must also pass the `-s` flag to disable pytest's automatic capture of stdout:

```shell
LIVEKIT_EVALS_VERBOSE=1 uv run pytest -s -o log_cli=true <your-test-file>
```

For Node.js, set the same variable in the environment of your Vitest command:

```shell
LIVEKIT_EVALS_VERBOSE=1
```
Sample verbose output:
```
evals/test_agent.py::test_offers_assistance
+ RunResult(
    user_input=`Hello`
    events:
      [0] ChatMessageEvent(item={'role': 'assistant', 'content': ['Hi there! How can I assist you today?']})
)
- Judgment succeeded for `Hi there! How can I assist...`: `The message provides a friendly greeting and explicitly offers assistance, fulfilling the intent.`
PASSED
```
```
stdout | conversation-history.test.ts > RunResult > should greet user by name
+ RunResult {
  userInput: "What's my name?"
  events: [
    [0] { type: "message", role: "assistant", content: "Your name is Alice.", interrupted: false }
  ]
}

stdout | conversation-history.test.js > RunResult > should greet user by name
- Judgment succeeded for `Your name is Alice.`: `The message explicitly states the user's name is Alice, fulfilling the intent to remember and mention the user's name.`
```
Integrating with CI
The testing helpers work live against your LLM provider to test real agent behavior. If you're using LiveKit Inference, set LIVEKIT_API_KEY and LIVEKIT_API_SECRET in your CI environment. If you're using a plugin directly, set the appropriate provider API keys instead. Testing does not make a LiveKit room connection.
For GitHub Actions, see the guide on using secrets in GitHub Actions.
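For example, a GitHub Actions job might look like the following minimal sketch. The workflow layout, the `astral-sh/setup-uv` step, and the assumption of a uv-based Python project are illustrative; adapt them to your repository.

```yaml
# Hypothetical workflow: run agent evals on every push.
name: agent-evals
on: [push]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5  # assumed: a uv-based Python project
      - run: uv sync
      - name: Run behavioral tests
        run: uv run pytest
        env:
          # Provided as repository secrets; never committed to the repo.
          LIVEKIT_API_KEY: ${{ secrets.LIVEKIT_API_KEY }}
          LIVEKIT_API_SECRET: ${{ secrets.LIVEKIT_API_SECRET }}
```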
Never commit API keys to your repository. Use environment variables and CI secrets instead.
Considerations
The following considerations apply when testing agents:
- `get_job_context()` is unavailable in test environments and raises a `RuntimeError` when called. If your agent uses `get_job_context()`, avoid testing code paths that invoke it, or mock the call using `unittest.mock` (Python only; see the sketch after this list).
- When testing agents that use task groups, consider testing each task in isolation as well as the overall flow. Test transitions between tasks, regression to previous steps, and proper completion with summarized results. For specific guidelines, see Best practices for testing task groups.
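If you do need to exercise a code path that calls `get_job_context()`, one option is to patch it with `unittest.mock`. This is a rough sketch, assuming a hypothetical room-aware agent whose module imports `get_job_context` directly; adjust the patch target to wherever your code looks it up.

```python
from unittest.mock import patch

import pytest

from livekit.agents import AgentSession, inference

from agent import Assistant


@pytest.mark.asyncio
async def test_room_aware_reply() -> None:
    async with (
        inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
        AgentSession(llm=llm) as session,
    ):
        # Hypothetical: agent.py does `from livekit.agents import get_job_context`,
        # so the patch target is the name as imported by the agent module.
        with patch("agent.get_job_context") as mock_get_job_context:
            mock_get_job_context.return_value.room.name = "test-room"

            await session.start(Assistant())

            result = await session.run(user_input="Which room am I in?")
            await (
                result.expect.next_event()
                .is_message(role="assistant")
                .judge(llm, intent="Mentions the room name test-room.")
            )
```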
Third-party testing tools
To perform end-to-end testing of deployed agents, including the audio pipeline, consider these third-party services:
- Bluejay: End-to-end testing for voice agents powered by real-world simulations.
- Cekura: Testing and monitoring for voice AI agents.
- Coval: Simulation and evaluations for voice and chat agents to manage your AI conversational agents.
- Hamming: At-scale testing and production monitoring for AI voice agents.
Additional resources
These examples and resources provide more help with testing and evaluation:

- Python agent evals
- Node.js agent evals
- Agent starter project: Starter project with a complete testing integration.
- Agent starter project (Node.js): Starter project with a complete testing integration.
- Testing framework API reference (Python): API reference for the `RunResult` class.
- Testing framework API reference (Node.js): API reference for the `RunResult` class.