Voice activity detection (VAD) parameters

Using server-side or client-side VAD for OpenAI's Realtime API.

VAD is a technique used to determine when a user has finished speaking (that is, ended their turn), which tells the assistant when to respond. Accurate turn detection is key to maintaining a natural conversational flow and avoiding interruptions or awkward pauses. To learn more, see Turn detection.

Modifying the VAD parameters

By default, OpenAI's Realtime API handles turn detection using VAD on the server side. You can disable this to manually handle turn detection.

Server-side VAD

Server-side VAD is enabled by default. This means the API determines when the user has started or stopped speaking, and responds automatically. For server-side VAD, you can fine-tune the behavior by adjusting various parameters to suit your application's needs. Here are the parameters you can adjust:

  • threshold: Adjusts the sensitivity of the VAD. A lower threshold makes the VAD more sensitive to speech (detects quieter sounds), while a higher threshold makes it less sensitive. The default value is 0.5.
  • prefix_padding_ms: Amount of audio (in milliseconds) to include before the detected start of speech. This preserves the beginning of the user's utterance so the first syllables are not clipped.
  • silence_duration_ms: Minimum duration of silence (in milliseconds) at the end of speech before ending the speech segment. This ensures brief pauses do not prematurely end a speech segment.
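
You can pass these options when creating the realtime model. For example, using LiveKit's Agents framework with the OpenAI plugin:
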
from livekit.agents import multimodal
from livekit.plugins import openai

model = openai.realtime.RealtimeModel(
    voice="alloy",
    temperature=0.8,
    instructions="You are a helpful assistant",
    turn_detection=openai.realtime.ServerVadOptions(
        threshold=0.6,
        prefix_padding_ms=200,
        silence_duration_ms=500,
    ),
)
agent = multimodal.MultimodalAgent(model=model)
agent.start(ctx.room)
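
As a rule of thumb, raising threshold and silence_duration_ms makes the agent less likely to respond to background noise or jump in during brief pauses, at the cost of slower turn-taking; lowering them makes responses snappier but more prone to false triggers.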

Client-side VAD

Note

This option is currently only available for Python.

If you want more control over audio input, you can turn off VAD and implement turn detection manually. This is useful for push-to-talk interfaces, where there is an explicit signal that the user has started and stopped speaking. When you turn off VAD, you have to trigger audio responses explicitly.

Usage

To turn off server-side VAD, update the turn detection parameter:

model = openai.realtime.RealtimeModel(
    voice="alloy",
    temperature=0.8,
    instructions="You are a helpful assistant",
    turn_detection=None,
)
agent = multimodal.MultimodalAgent(model=model)
agent.start(ctx.room)

To trigger a response manually, use the generate_reply method:

# When it's time to generate a new response, call generate_reply
agent.generate_reply(on_duplicate="cancel_existing")
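
In a push-to-talk interface, you would call generate_reply when the user releases the talk button. Here is a minimal sketch, assuming a hypothetical on_talk_button_released callback supplied by your own UI layer (not part of the LiveKit API):

# Hypothetical UI callback -- your application provides the push-to-talk signal.
def on_talk_button_released():
    # The user has finished speaking, so ask the model to respond now.
    # on_duplicate="cancel_existing" cancels any in-flight response first.
    agent.generate_reply(on_duplicate="cancel_existing")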
