OpenAI Realtime API integration guide

A guide on how to use OpenAI's Realtime API with LiveKit's WebRTC infrastructure.

OpenAI Realtime API and LiveKit

OpenAI’s Realtime API is a WebSocket interface for low-latency audio streaming, best suited for server-to-server use rather than direct consumption by end-user devices.
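
Because the API is exposed as a plain WebSocket endpoint, a backend process can connect to it directly. The following is a minimal sketch, assuming the `websockets` Python package (pre-14 API, which uses the `extra_headers` argument) and the `gpt-4o-realtime-preview` model name; check OpenAI's current documentation for the exact endpoint and headers.

```python
# A minimal sketch of a direct server-to-server connection to the
# Realtime API, assuming the `websockets` package (<14) and the
# gpt-4o-realtime-preview model name.
import asyncio
import json
import os

import websockets

async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Every Realtime API event is a JSON message; the server sends
        # a session.created event once the session is established.
        event = json.loads(await ws.recv())
        print(event["type"])

asyncio.run(main())
```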

LiveKit offers Python and Node.js integrations for the API, enabling developers to build realtime conversational AI applications using LiveKit’s Agents framework. This framework integrates with LiveKit’s SDKs and telephony solutions, allowing you to build applications for any platform.

How it works

Diagram: How the OpenAI Realtime API works with LiveKit Agents

WebSocket runs over TCP, so on long routes or lossy networks, retransmissions and head-of-line blocking add latency that realtime audio and video cannot hide. LiveKit bridges this gap by converting the transport to WebRTC and routing data through our global edge network to minimize transmission latency.

With the Agents framework, user audio is first transmitted to LiveKit’s edge network via WebRTC and routed to your backend agent over low-latency connections. The agent then uses the framework’s OpenAI integration to relay the audio to OpenAI’s model over WebSocket. Speech from OpenAI streams back over the same WebSocket to the agent, which relays it to the user via WebRTC.

The Agents framework

The Agents framework provides everything needed to build conversational applications using OpenAI's Realtime API, including:

  • Support for Python and Node.js
  • SDKs for nearly every platform
  • Inbound and outbound calling (using SIP trunks)
  • WebRTC transport via LiveKit Cloud or a self-hosted open-source LiveKit server
  • Worker load balancing and request distribution (see Agent lifecycle and the worker sketch after this list)
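
To make the worker model concrete, here is a minimal sketch assuming the Python `livekit-agents` package; the entrypoint function name is illustrative.

```python
# A minimal worker sketch, assuming the Python Agents framework
# (pip install livekit-agents). The worker registers with LiveKit and is
# handed jobs as rooms need an agent; the framework distributes jobs
# across available workers.
from livekit.agents import JobContext, WorkerOptions, cli

async def entrypoint(ctx: JobContext):
    # Called once per dispatched job; connects this agent to the room.
    await ctx.connect()

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```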

LiveKit concepts

The LiveKit Agents framework uses the following concepts, illustrated in the sketch after this list:

  • Room: a realtime session with participants. The room acts as a bridge between your end user and your agent. Each room has a name and is identified by a unique ID.
  • Participant: a user or process (e.g. an agent) participating in a room.
  • Agent: a programmable AI participant in a room.
  • Track: audio, video, text, or data published by a user or agent, and subscribed to by other participants in the room.
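
The sketch below shows these concepts using the LiveKit Python SDK (`livekit` package); the server URL and access token are placeholders.

```python
# A minimal sketch of rooms, participants, and tracks, assuming the
# LiveKit Python SDK (pip install livekit). URL and token are placeholders.
import asyncio

from livekit import rtc

async def main() -> None:
    room = rtc.Room()

    # Fires when a track published by another participant (a user or an
    # agent) is subscribed to in this room.
    @room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.RemoteTrackPublication,
        participant: rtc.RemoteParticipant,
    ) -> None:
        print(f"subscribed to {track.kind} from {participant.identity}")

    await room.connect("wss://your-livekit-server-url", "your-access-token")
    # Keep the session alive while participants interact.
    await asyncio.sleep(60)

asyncio.run(main())
```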

MultimodalAgent

The framework includes the MultimodalAgent class for building speech-to-speech agents that use the OpenAI Realtime API. To learn more about the differences between speech-to-speech and voice pipeline agents, see Voice agents comparison.
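
As a sketch, a minimal speech-to-speech agent built on MultimodalAgent might look like the following, assuming the `livekit-agents` and `livekit-plugins-openai` Python packages; the instructions and voice values are illustrative.

```python
# A minimal MultimodalAgent sketch, assuming the Python Agents framework
# and the OpenAI plugin (pip install livekit-agents livekit-plugins-openai).
from livekit.agents import JobContext, WorkerOptions, cli
from livekit.agents.multimodal import MultimodalAgent
from livekit.plugins import openai

async def entrypoint(ctx: JobContext):
    await ctx.connect()

    # RealtimeModel maintains the WebSocket session with OpenAI;
    # MultimodalAgent bridges it to the room's WebRTC audio tracks.
    model = openai.realtime.RealtimeModel(
        instructions="You are a helpful assistant.",
        voice="alloy",  # illustrative voice name
    )
    agent = MultimodalAgent(model=model)
    agent.start(ctx.room)

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```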