Turn detection

Build natural conversations with accurate turn detection.

Turn detection is crucial in AI voice applications, helping the assistant know when the user has finished speaking and when to respond. Accurate turn detection is key to maintaining a natural conversational flow and avoiding interruptions or awkward pauses.

Modifying the VAD parameters

By default, OpenAI's Realtime API handles turn detection using voice activity detection (VAD) on the server side. You can disable this to manually handle turn detection.

Server-side VAD

Server-side VAD is enabled by default: the API determines when the user has started or stopped speaking and responds automatically. You can fine-tune this behavior with the following parameters:

  • threshold: Adjusts the sensitivity of the VAD. A lower threshold makes the VAD more sensitive to speech (detects quieter sounds), while a higher threshold makes it less sensitive. The default value is 0.5.
  • prefix_padding_ms: Amount of audio (in milliseconds) to include before detected speech. This padding preserves the beginning of an utterance that would otherwise be clipped.
  • silence_duration_ms: Minimum duration of silence (in milliseconds) at the end of speech before ending the speech segment. This ensures brief pauses do not prematurely end a speech segment.
For example:

    model = openai.realtime.RealtimeModel(
        voice="alloy",
        temperature=0.8,
        instructions="You are a helpful assistant",
        turn_detection=openai.realtime.ServerVadOptions(
            threshold=0.6,
            prefix_padding_ms=200,
            silence_duration_ms=500,
        ),
    )
    agent = multimodal.MultimodalAgent(model=model)
    agent.start(ctx.room)
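To build intuition for how threshold and silence_duration_ms interact, here is a toy frame-based VAD. This is a simplified sketch for illustration only — the API's server-side VAD is more sophisticated — and the function name and its silence_frames parameter (standing in for silence_duration_ms) are illustrative:

    def detect_turn_end(levels, threshold=0.5, silence_frames=5):
        """Toy frame-based VAD (illustrative only).

        levels: per-frame speech levels in [0, 1]
        threshold: frames at or above this count as speech
        silence_frames: consecutive sub-threshold frames needed to end the turn

        Returns the index of the frame where the turn is considered
        finished, or None if the speaker is still talking.
        """
        speaking = False
        silent_run = 0
        for i, level in enumerate(levels):
            if level >= threshold:
                speaking = True
                silent_run = 0
            elif speaking:
                silent_run += 1
                if silent_run >= silence_frames:
                    # Silence has lasted long enough: the turn ends here
                    return i
        return None

Note how a brief dip below the threshold (a pause between words) does not end the turn, but a sustained run of quiet frames does — raising silence_frames, like raising silence_duration_ms, makes the detector more tolerant of pauses.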

Client-side VAD

Note

This option is currently only available for Python.

If you want more control over audio input, you can turn off server-side VAD and handle turn detection yourself. This is useful for push-to-talk interfaces, where there is an explicit signal that the user has started and stopped speaking. When you turn off VAD, you have to trigger audio responses explicitly.

Usage

To turn off server-side VAD, update the turn detection parameter:

    model = openai.realtime.RealtimeModel(
        voice="alloy",
        temperature=0.8,
        instructions="You are a helpful assistant",
        turn_detection=None,
    )
    agent = multimodal.MultimodalAgent(model=model)
    agent.start(ctx.room)

To manually generate speech, use the generate_reply method:

    # When it's time to generate a new response, call generate_reply
    agent.generate_reply(on_duplicate="cancel_existing")
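A common pattern with VAD disabled is push-to-talk: treat a button press as the start of the user's turn and the release as the end, then request a response. The controller below is a hedged sketch — PushToTalk, on_press, and on_release are illustrative names, not part of the SDK, and the agent is whatever object exposes generate_reply (in a real app, your MultimodalAgent):

    class PushToTalk:
        """Minimal push-to-talk turn controller (illustrative only)."""

        def __init__(self, agent):
            self.agent = agent
            self.talking = False

        def on_press(self):
            # User started speaking; audio keeps streaming to the model,
            # but no response is generated yet.
            self.talking = True

        def on_release(self):
            # User finished speaking: explicitly request a response,
            # cancelling any response already in flight.
            if self.talking:
                self.talking = False
                self.agent.generate_reply(on_duplicate="cancel_existing")

Passing on_duplicate="cancel_existing" means a rapid press-and-release while a reply is still being generated cancels the stale reply instead of queueing a second one.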