The RealtimeModel class is used to create a realtime conversational AI session. Below are the key parameters that can be passed when initializing the model, with a focus on the modalities, instructions, voice, turn_detection, temperature, and max_output_tokens options.
Parameters
modalities
Type: list[api_proto.Modality]
Default: ["text", "audio"]
Description: Specifies the input/output modalities supported by the model. This can be either or both of:
- "text": The model processes text input and generates text responses.
- "audio": The model processes audio input and can generate audio responses.
Example:
modalities=["text", "audio"]
instructions
Type: str | None
Default: None
Description: Custom instructions serve as the 'system prompt' for the model to follow during the conversation. They can be used to guide the model's behavior or set specific goals.
Example:
instructions="Please provide responses that are brief and informative."
voice
Type: api_proto.Voice
Default: "alloy"
Description: Determines the voice used for audio responses. Examples of available voices include:
- "alloy"
- "echo"
- "shimmer"
Example:
voice="alloy"
turn_detection
Type: api_proto.TurnDetectionType
Default: {"type": "server_vad"}
Description: Controls how the model detects when a speaker has finished talking, which is critical in realtime interactions.
- "server_vad": OpenAI uses server-side Voice Activity Detection (VAD) to detect when the user has stopped speaking. This can be fine-tuned using the following parameters:
  - threshold (optional): Float value controlling the sensitivity of speech detection; a higher threshold requires louder audio to trigger detection.
  - prefix_padding_ms (optional): The amount of audio (in milliseconds) to include before the detected speech.
  - silence_duration_ms (optional): The duration of silence (in milliseconds) required to consider the speech finished.
Example:
turn_detection={"type": "server_vad", "threshold": 0.6, "prefix_padding_ms": 300, "silence_duration_ms": 500}
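To make the tuning trade-offs concrete, here is a minimal sketch of a helper that assembles and sanity-checks a server_vad configuration dict before passing it to the model. The helper itself (`server_vad_options`) is hypothetical and not part of the plugin; only the resulting dict shape comes from the documentation above.

```python
def server_vad_options(
    threshold: float = 0.5,
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 500,
) -> dict:
    """Hypothetical helper: build a server_vad turn_detection dict
    and validate the tuning values before use."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if prefix_padding_ms < 0 or silence_duration_ms < 0:
        raise ValueError("durations must be non-negative")
    return {
        "type": "server_vad",
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }

# A stricter configuration: less sensitive detection, longer pause
# required before the turn is considered finished.
turn_detection = server_vad_options(threshold=0.6, silence_duration_ms=700)
```

Raising the threshold and silence duration together makes the model less likely to interrupt a user who pauses mid-sentence, at the cost of slower turn-taking.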
temperature
Type: float
Default: 0.8
Description: Controls the randomness of the model's output. Higher values (e.g., 1.0 and above) make the output more diverse and creative, while lower values (e.g., 0.6) make it more focused and deterministic.
Example:
temperature=0.7
max_output_tokens
Type: int
Default: 2048
Description: Limits the maximum number of tokens in the generated output. This helps control the length of the model's responses; as a rule of thumb, one token corresponds to roughly three-quarters of an English word.
Example:
max_output_tokens=1500
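Exact token counts depend on the model's tokenizer, but a rough heuristic (an approximation, not the official tokenizer) is about four characters of English text per token. A quick sketch for sizing a response budget:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters of English text per token.
    This is a heuristic only, not the model's actual tokenizer."""
    return max(1, round(len(text) / 4))

# A max_output_tokens budget of 1500 therefore allows roughly
# 6000 characters (about 1100 words) of generated output.
print(approx_tokens("Give brief, concise answers."))
```

This kind of estimate is useful for choosing a budget, but the actual cutoff is enforced on real tokenizer counts, so leave some headroom.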
Example Initialization
Here is a full example of how to initialize the RealtimeModel
with these parameters:
realtime_model = RealtimeModel(
    modalities=["text", "audio"],
    instructions="Give brief, concise answers.",
    voice="alloy",
    turn_detection=openai.realtime.ServerVadOptions(
        threshold=0.6,
        prefix_padding_ms=200,
        silence_duration_ms=500,
    ),
    temperature=0.7,
    max_output_tokens=1500,
)