Overview
Turn detection and user interruptions help conversations between users and voice AI agents feel natural. This topic describes the configuration options for voice activity detection (VAD), the turn-detection model, and user interruptions. To learn more about these features, see Turn detection and interruptions.
The following concepts are useful to understand when configuring turn detection and user interruptions:
- Speech segmentation: VAD segments audio into "speech" and "non-speech" segments.
- Speech probability: The probability that there is speech in an audio frame.
- Speech activity duration: The duration of the user's speech detected by VAD.
- Silence duration: The duration of silence that must pass before the agent considers a user to have finished speaking.
- Speech interruption duration: The duration of user speech that identifies an intentional interruption.
Enhanced noise cancellation is available for LiveKit Cloud users and can improve the accuracy of turn detection by reducing background noise. Your optimal configuration settings might vary depending on the use of noise cancellation and your specific use case. To learn more about noise cancellation, see Noise cancellation.
VAD parameters for speech detection
The following parameters are used to configure Silero VAD options for AgentSession and Node.js VoicePipelineAgent:
Minimum speech duration required to consider the interruption intentional.
Duration of silence to wait after speech ends to determine if the user has finished speaking.
Threshold that determines if there is speech in an audio frame. A higher threshold results in more conservative detection, but might miss soft speech. A lower threshold results in more sensitive detection, but might identify noise as speech.
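To make the interaction between these three parameters concrete, the following is a minimal sketch of threshold-based speech segmentation. It is not the Silero VAD implementation or the LiveKit API. The class, field names, and values (VadSettings, speech_threshold, min_speech_ms, min_silence_ms, frame_ms) are hypothetical and chosen only for illustration, and the per-frame speech probabilities are assumed to come from some VAD model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VadSettings:
    # Hypothetical names and values for illustration; not LiveKit parameters.
    speech_threshold: float = 0.5  # per-frame speech probability cutoff
    min_speech_ms: int = 50        # speech required before a segment starts
    min_silence_ms: int = 500      # silence required before a segment ends
    frame_ms: int = 10             # duration of one audio frame


def segment(probs: List[float], cfg: VadSettings) -> List[Tuple[int, int]]:
    """Return speech segments as (start_frame, end_frame), end exclusive."""
    segments: List[Tuple[int, int]] = []
    in_speech = False
    start = 0
    speech_run = 0   # consecutive frames at or above the threshold
    silence_run = 0  # consecutive frames below the threshold
    for i, p in enumerate(probs):
        if p >= cfg.speech_threshold:
            speech_run += 1
            silence_run = 0
            # Open a segment only after enough continuous speech.
            if not in_speech and speech_run * cfg.frame_ms >= cfg.min_speech_ms:
                in_speech = True
                start = i - speech_run + 1
        else:
            silence_run += 1
            speech_run = 0
            # Close the segment only after enough continuous silence.
            if in_speech and silence_run * cfg.frame_ms >= cfg.min_silence_ms:
                in_speech = False
                segments.append((start, i - silence_run + 1))
    if in_speech:
        # Audio ended mid-segment: trim any trailing silence.
        segments.append((start, len(probs) - silence_run))
    return segments


# 30 ms of quiet, 100 ms of speech, 600 ms of quiet, a 20 ms blip, 50 ms of quiet.
probs = [0.1] * 3 + [0.9] * 10 + [0.1] * 60 + [0.9] * 2 + [0.1] * 5
result = segment(probs, VadSettings())
strict = segment(probs, VadSettings(speech_threshold=0.95))
```

In this sketch, the 20 ms blip never opens a segment because it is shorter than the minimum speech duration, and raising the threshold above the speaker's probability level causes the speech to be missed entirely, which mirrors the trade-off described above.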
User interruptions
The following parameters control interruption behavior for voice AI agents:
Set interruption options for VoicePipelineAgent using the following parameters for VPAOptions.
Whether to allow the user to interrupt the agent. Set to False to disable user interruptions.
Minimum number of transcribed words needed for the interruption to be considered intentional.
Minimum speech duration required to consider the interruption intentional.
Delay to wait before considering the user's speech finished.
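The way the first three options combine can be sketched as a single decision function. This is an illustrative model only, not LiveKit's implementation. The names InterruptionSettings, allow_interruptions, min_words, and min_speech_s are hypothetical stand-ins rather than the actual VPAOptions fields, and the default values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class InterruptionSettings:
    # Hypothetical field names and placeholder values for illustration only.
    allow_interruptions: bool = True  # master switch for user interruptions
    min_words: int = 0                # transcribed words required
    min_speech_s: float = 0.5         # seconds of user speech required


def is_intentional_interruption(
    speech_s: float, transcript: str, cfg: InterruptionSettings
) -> bool:
    """Decide whether user speech during agent output counts as an interruption."""
    if not cfg.allow_interruptions:
        return False
    if speech_s < cfg.min_speech_s:
        # Too short: likely a cough, a backchannel, or noise.
        return False
    # Count whitespace-separated words in the transcript so far.
    return len(transcript.split()) >= cfg.min_words


cfg = InterruptionSettings(min_words=2, min_speech_s=0.5)
long_enough = is_intentional_interruption(0.8, "wait stop", cfg)
too_short = is_intentional_interruption(0.2, "wait stop", cfg)
too_few_words = is_intentional_interruption(0.8, "uh", cfg)
```

Requiring both a minimum duration and a minimum word count means a brief acknowledgement such as "uh" does not cut the agent off, at the cost of a slightly slower reaction to genuine interruptions.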
Turn detection settings
LiveKit's turn detection model has one configuration parameter: unlikely_threshold. This is the speech probability threshold for the turn detection model. If the endpoint probability is below this threshold, the user has not finished speaking and the agent waits longer before responding.
This is an advanced configuration option, and LiveKit recommends you don't change the unlikely_threshold parameter for most use cases.
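As a rough illustration of how such a threshold could affect response timing, the sketch below picks a wait time from the endpoint probability. Only unlikely_threshold comes from this topic. The function name, the two delay parameters, and all numeric values are assumptions made for the example, and the model's actual behavior might differ.

```python
def response_delay_s(
    endpoint_probability: float,
    unlikely_threshold: float,
    base_delay_s: float,      # hypothetical: normal wait before responding
    extended_delay_s: float,  # hypothetical: longer wait when the turn seems unfinished
) -> float:
    """Return how long the agent waits after silence before responding."""
    if endpoint_probability < unlikely_threshold:
        # The model considers an end of turn unlikely, so give the user more time.
        return extended_delay_s
    return base_delay_s


# Placeholder values chosen only to show both branches.
unfinished = response_delay_s(0.05, 0.15, base_delay_s=0.5, extended_delay_s=6.0)
finished = response_delay_s(0.80, 0.15, base_delay_s=0.5, extended_delay_s=6.0)
```

In this model, raising the threshold makes the agent wait longer more often, and lowering it makes the agent respond sooner at the risk of cutting in mid-thought.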