For start_of_speech events, this contains the audio chunks that triggered the detection.
For inference_done events, this contains the audio chunks that were processed.
For end_of_speech events, this contains the complete user speech.
Time taken to perform the inference, in seconds (only for INFERENCE_DONE events).
Probability that speech is present (only for INFERENCE_DONE events).
Threshold used to detect silence.
Threshold used to detect speech.
Index of the audio sample where the event occurred, relative to the inference sample rate.
Duration of the silence segment.
Indicates whether speech was detected in the frames.
Duration of the speech segment.
Timestamp when the event was fired.
Type of the VAD event (e.g., start of speech, end of speech, inference done).
List of audio frames associated with the speech.
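To make the field descriptions above concrete, here is a minimal sketch of how such an event could be modeled and how the sample index can be turned into a time offset. All class, field, and function names below are assumptions chosen for illustration; they are not taken from this reference and may differ from the library's actual definitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class VADEventType(str, Enum):
    # Hypothetical enum mirroring the three event kinds described above.
    START_OF_SPEECH = "start_of_speech"
    INFERENCE_DONE = "inference_done"
    END_OF_SPEECH = "end_of_speech"


@dataclass
class VADEvent:
    # Field names are illustrative assumptions, not the library's API.
    type: VADEventType          # kind of VAD event
    samples_index: int          # sample index, relative to the inference sample rate
    timestamp: float            # when the event was fired
    speech_duration: float      # duration of the speech segment
    silence_duration: float     # duration of the silence segment
    frames: list = field(default_factory=list)  # audio frames tied to the event
    probability: float = 0.0         # only meaningful for INFERENCE_DONE
    inference_duration: float = 0.0  # seconds, only meaningful for INFERENCE_DONE
    speaking: bool = False           # whether speech was detected in the frames


def event_offset_seconds(event: VADEvent, inference_sample_rate: int) -> float:
    """Convert the sample index into seconds from the start of the stream."""
    return event.samples_index / inference_sample_rate
```

For example, an event with a sample index of 8000 at an inference sample rate of 16000 Hz corresponds to an offset of 0.5 seconds. The probability and inference-time fields default to zero here only so that events of other types can be built without them; a real implementation may handle that differently.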