Overview
Writing effective tests and evaluations is a key part of developing a reliable, production-ready AI agent. LiveKit Agents includes helpers that work with testing frameworks such as pytest for Python and Vitest for Node.js, so you can write behavioral tests and evaluations alongside your existing unit and integration tests.
Use these tools to fine-tune your agent's behavior, handle tricky edge cases, and iterate on your agent's capabilities without breaking existing functionality.
What to test
You should plan to test your agent's behavior in the following areas:
- Expected behavior: Does your agent respond with the right intent and tone for typical use cases?
- Tool usage: Are functions called with correct arguments and proper context? (See the sketch after this list.)
- Error handling: How does your agent respond to invalid inputs or tool failures?
- Grounding: Does your agent stay factual and avoid hallucinating information?
- Misuse resistance: How does your agent handle intentional attempts to misuse or manipulate it?
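As an example of the tool usage point, a test can assert that the right function runs before judging the final reply. The following is a minimal sketch, assuming a hypothetical `lookup_weather` tool on the `Assistant` agent; the `is_function_call` and `is_function_call_output` assertions are part of the test framework described in the Test framework guide.

```python
import pytest

from livekit.agents import AgentSession, inference

from agent import Assistant  # assumes an Assistant agent exposing a lookup_weather tool


@pytest.mark.asyncio
async def test_weather_tool_usage() -> None:
    async with (
        inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
        AgentSession(llm=llm) as session,
    ):
        await session.start(Assistant())

        result = await session.run(user_input="What's the weather in Tokyo?")

        # The agent should call the hypothetical lookup_weather tool with the right city...
        result.expect.next_event().is_function_call(
            name="lookup_weather", arguments={"location": "Tokyo"}
        )
        # ...receive the tool output...
        result.expect.next_event().is_function_call_output()
        # ...and then answer with a message grounded in that output.
        await (
            result.expect.next_event()
            .is_message(role="assistant")
            .judge(llm, intent="Reports the current weather in Tokyo.")
        )
```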
The built-in testing helpers are designed to work with text input and output, using an LLM via LiveKit Inference or a plugin in text-only mode. This is the most cost-effective and intuitive way to write comprehensive tests of your agent's behavior.
For testing options that exercise the entire audio pipeline, see the third-party testing tools section at the end of this guide.
Example test
Here is a simple behavioral test for the agent created in the voice AI quickstart. It ensures that the agent responds with a friendly greeting and offers assistance.
Python:

```python
import pytest

from livekit.agents import AgentSession, inference

from agent import Assistant


@pytest.mark.asyncio
async def test_assistant_greeting() -> None:
    async with (
        inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
        AgentSession(llm=llm) as session,
    ):
        await session.start(Assistant())

        result = await session.run(user_input="Hello")

        await (
            result.expect.next_event()
            .is_message(role="assistant")
            .judge(llm, intent="Makes a friendly introduction and offers assistance.")
        )

        result.expect.no_more_events()
```
Node.js:

```typescript
import { inference, initializeLogger, voice } from '@livekit/agents';
import { describe, it, beforeAll, afterAll } from 'vitest';

// Import your agent class
import { Agent } from './agent';

// Initialize logger to suppress CLI output
initializeLogger({ pretty: false, level: 'warn' });

const { AgentSession } = voice;

describe('Assistant', () => {
  let session: voice.AgentSession;
  let llm: inference.LLM;

  beforeAll(async () => {
    llm = new inference.LLM({ model: 'openai/gpt-5.3-chat-latest' });
    session = new AgentSession({ llm });
    await session.start({ agent: new Agent() });
  });

  afterAll(async () => {
    await session?.close();
  });

  it('should greet and offer assistance', async () => {
    const result = await session.run({ userInput: 'Hello' }).wait();

    await result.expect
      .nextEvent()
      .isMessage({ role: 'assistant' })
      .judge(llm, {
        intent: 'Makes a friendly introduction and offers assistance.',
      });

    result.expect.noMoreEvents();
  });
});
```
For the full testing API, including setup, assertions, mocking, and multi-turn testing, see Test framework.
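As a taste of multi-turn testing, the sketch below sends two inputs through one session. It assumes, per the Test framework guide, that the session retains chat history between `run` calls; the `Assistant` agent and the name-recall scenario mirror the examples above.

```python
import pytest

from livekit.agents import AgentSession, inference

from agent import Assistant


@pytest.mark.asyncio
async def test_remembers_user_name() -> None:
    async with (
        inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
        AgentSession(llm=llm) as session,
    ):
        await session.start(Assistant())

        # First turn: give the agent a fact to remember.
        result = await session.run(user_input="My name is Alice.")
        await (
            result.expect.next_event()
            .is_message(role="assistant")
            .judge(llm, intent="Acknowledges the user's name.")
        )

        # Second turn: the session keeps the chat history, so the
        # agent should be able to recall the name.
        result = await session.run(user_input="What's my name?")
        await (
            result.expect.next_event()
            .is_message(role="assistant")
            .judge(llm, intent="Recalls that the user's name is Alice.")
        )
```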
Verbose output
The `LIVEKIT_EVALS_VERBOSE` environment variable turns on detailed output for each agent execution.

To use it with pytest, you must also pass the `-s` flag to disable pytest's automatic capture of stdout:

```shell
LIVEKIT_EVALS_VERBOSE=1 uv run pytest -s -o log_cli=true <your-test-file>
```

For Node.js, set the same variable in the environment of your Vitest command:

```shell
LIVEKIT_EVALS_VERBOSE=1
```
Sample verbose output:
```
evals/test_agent.py::test_offers_assistance
+ RunResult(
    user_input=`Hello`
    events:
      [0] ChatMessageEvent(item={'role': 'assistant', 'content': ['Hi there! How can I assist you today?']})
)
- Judgment succeeded for `Hi there! How can I assist...`: `The message provides a friendly greeting and explicitly offers assistance, fulfilling the intent.`
PASSED
```
```
stdout | conversation-history.test.ts > RunResult > should greet user by name
+ RunResult {
  userInput: "What's my name?"
  events: [
    [0] { type: "message", role: "assistant", content: "Your name is Alice.", interrupted: false }
  ]
}

stdout | conversation-history.test.js > RunResult > should greet user by name
- Judgment succeeded for `Your name is Alice.`: `The message explicitly states the user's name is Alice, fulfilling the intent to remember and mention the user's name.`
```
Integrating with CI
The testing helpers work live against your LLM provider to test real agent behavior. If you're using LiveKit Inference, set LIVEKIT_API_KEY and LIVEKIT_API_SECRET in your CI environment. If you're using a plugin directly, set the appropriate provider API keys instead. Testing does not make a LiveKit room connection.
For GitHub Actions, see the guide on using secrets in GitHub Actions.
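For example, a GitHub Actions job might look like the following minimal sketch. The workflow layout, the `astral-sh/setup-uv` step, and the assumption of a uv-based Python project are illustrative; adapt them to your repository.

```yaml
# Hypothetical workflow: run agent evals on every push.
name: agent-evals
on: [push]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5  # assumed: a uv-based Python project
      - run: uv sync
      - name: Run behavioral tests
        run: uv run pytest
        env:
          # Provided as repository secrets; never committed to the repo.
          LIVEKIT_API_KEY: ${{ secrets.LIVEKIT_API_KEY }}
          LIVEKIT_API_SECRET: ${{ secrets.LIVEKIT_API_SECRET }}
```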
Never commit API keys to your repository. Use environment variables and CI secrets instead.
Considerations
The following considerations apply when testing agents:
- `get_job_context()` is unavailable in test environments and raises a `RuntimeError` when called. If your agent uses `get_job_context()`, avoid testing code paths that invoke it, or mock the call using `unittest.mock` (Python only; see the sketch after this list).
- When testing agents that use task groups, consider testing each task in isolation as well as the overall flow. Test transitions between tasks, regression to previous steps, and proper completion with summarized results. For specific guidelines, see Best practices for testing task groups.
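If you do need to exercise a code path that calls `get_job_context()`, one option is to patch it with `unittest.mock`. This is a rough sketch, assuming a hypothetical room-aware agent whose module imports `get_job_context` directly; adjust the patch target to wherever your code looks it up.

```python
from unittest.mock import patch

import pytest

from livekit.agents import AgentSession, inference

from agent import Assistant


@pytest.mark.asyncio
async def test_room_aware_reply() -> None:
    async with (
        inference.LLM(model="openai/gpt-5.3-chat-latest") as llm,
        AgentSession(llm=llm) as session,
    ):
        # Hypothetical: agent.py does `from livekit.agents import get_job_context`,
        # so the patch target is the name as imported by the agent module.
        with patch("agent.get_job_context") as mock_get_job_context:
            mock_get_job_context.return_value.room.name = "test-room"

            await session.start(Assistant())

            result = await session.run(user_input="Which room am I in?")
            await (
                result.expect.next_event()
                .is_message(role="assistant")
                .judge(llm, intent="Mentions the room name test-room.")
            )
```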
Third-party testing tools
To perform end-to-end testing of deployed agents, including the audio pipeline, consider these third-party services:
- Bluejay: End-to-end testing for voice agents powered by real-world simulations.
- Cekura: Testing and monitoring for voice AI agents.
- Coval: Simulation and evaluations for voice and chat agents to manage your AI conversational agents.
- Hamming: At-scale testing and production monitoring for AI voice agents.
Additional resources
These examples and resources provide more help with testing and evaluation:

- Python agent evals
- Node.js agent evals
- Agent starter project: Starter project with a complete testing integration.
- Agent starter project (Node.js): Starter project with a complete testing integration.
- Testing framework API reference (Python): API reference for the `RunResult` class.
- Testing framework API reference (Node.js): API reference for the `RunResult` class.