Overview
LiveKit Inference lets you customize the behavior of the language model when generating responses by passing additional parameters. Specify these model parameters when creating an instance of the LLM class in the inference module, using extra_kwargs in Python or modelOptions in Node.js.
Model parameters
The following is a complete list of supported Chat Completion options. Not every model supports every parameter; unsupported parameters are silently ignored. For model-specific details, see the documentation for the model you're using.
Parameters not supported by reasoning models are automatically stripped at request time.
- `temperature` (float, default: 1): Sampling temperature that controls the randomness of the model's output. Higher values make the output more random, while lower values make it more focused and deterministic. The range of valid values can vary by model. You can set this or `top_p`, but not both. Not supported by reasoning models.
- `top_p` (float, default: 1): An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with `top_p` probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. You can set this or `temperature`, but not both. Not supported by reasoning models.
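Because setting both `temperature` and `top_p` is disallowed, it can help to validate the option dict before passing it as `extra_kwargs`. A minimal sketch, assuming a hypothetical `check_sampling_options` helper (not part of the LiveKit SDK):

```python
def check_sampling_options(extra_kwargs: dict) -> dict:
    """Reject option dicts that set both temperature and top_p.

    Hypothetical helper; not part of the LiveKit SDK.
    """
    if "temperature" in extra_kwargs and "top_p" in extra_kwargs:
        raise ValueError("Set either temperature or top_p, not both.")
    return extra_kwargs

# A valid option dict passes through unchanged.
opts = check_sampling_options({"temperature": 0.7})
```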
- `max_tokens` (int): The maximum number of tokens that can be generated in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. Not supported by newer models; use `max_completion_tokens` instead.
- `max_completion_tokens` (int): An upper bound on the number of tokens that can be generated for a completion, including visible output tokens and reasoning tokens. Preferred over `max_tokens` for newer models.
- `reasoning_effort` ("low" | "medium" | "high"): Controls how much reasoning effort the model spends. Only supported by reasoning models.
- `frequency_penalty` (float, default: 0): Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. Not supported by reasoning models.
- `presence_penalty` (float, default: 0): Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. Not supported by reasoning models.
- `seed` (int): If specified, the system makes a best effort to sample deterministically. Repeated requests with the same seed and parameters should return the same result.
- `stop` (str | list[str]): Sequences that cause the API to stop generating further tokens. For example, `stop=["\n"]` stops generation when the model outputs a newline character.
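To see the effect of stop sequences, the truncation the provider applies can be mimicked locally. An illustrative sketch (the `truncate_at_stop` helper below is hypothetical, not a LiveKit API; real truncation happens server-side):

```python
def truncate_at_stop(text: str, stop: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence.

    Illustrates what the provider does server-side when `stop` is set;
    this helper itself is hypothetical, not part of the LiveKit SDK.
    """
    cut = len(text)
    for seq in stop:
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Generation stops at the first newline when stop=["\n"].
print(truncate_at_stop("first line\nsecond line", ["\n"]))  # first line
```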
- `n` (int): Number of completions to generate for each prompt. Not supported by reasoning models.
- `logprobs` (bool): If true, returns the log probabilities of each output token in the content of `message`. Not supported by reasoning models.
- `top_logprobs` (int): The number of most likely tokens to return at each token position, each with an associated log probability. The valid range varies by provider. Requires `logprobs: true`. Not supported by reasoning models.
- `logit_bias` (dict[str, int]): Modify the likelihood of specified tokens appearing in the completion. Not supported by reasoning models.
- `parallel_tool_calls` (bool): Whether the model can make multiple tool calls in a single response.
- `tool_choice` (ToolChoice | Literal['auto', 'required', 'none'], default: "auto"): Controls how the model uses tools. String options are as follows:
  - 'auto': Let the model decide.
  - 'required': Force tool usage.
  - 'none': Disable tool usage.
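As a sketch, the string forms of `tool_choice` can be combined with `parallel_tool_calls` in an option dict like the ones passed as `extra_kwargs` (only the string form is shown; the `ToolChoice` object form is omitted):

```python
# String forms of tool_choice, paired with parallel_tool_calls.
# These keys mirror the Chat Completion options listed above.
force_tools = {
    "tool_choice": "required",     # the model must call a tool
    "parallel_tool_calls": False,  # at most one tool call per response
}

no_tools = {"tool_choice": "none"}  # disable tool usage entirely
```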
- `user` (str): Unique identifier for the end user, used for abuse monitoring. Deprecated: see `safety_identifier` and `prompt_cache_key` instead.
- `service_tier` ("auto" | "default" | "flex" | "scale" | "priority"): Specifies the latency tier for processing the request.
- `metadata` (Metadata): Developer-defined tags and values for filtering completions in the dashboard.
- `store` (bool): Whether to store the output for model distillation or evals.
- `prediction` (ChatCompletionPredictionContentParam): Configuration for predicted output to reduce latency for known response patterns.
- `modalities` (list[Literal["text", "audio"]]): Output types the model can generate.
- `web_search_options` (WebSearchOptions): Configuration for searching the web for relevant results to use in a response.
- `verbosity` ("low" | "medium" | "high"): Constrains the verbosity of the model's response. Lower values result in more concise responses; higher values result in more verbose responses.
- `prompt_cache_key` (str): Key for caching responses for similar requests. See prompt caching.
- `safety_identifier` (str): String that uniquely identifies each user. Hash the username or email address to avoid sending any identifying information. For non-logged-in users, you can send a session ID instead. Supersedes the `user` parameter.
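The recommendation above is to hash the username or email so no identifying information is sent. A minimal sketch using only the standard library (the `make_safety_identifier` helper is illustrative, not part of the LiveKit SDK):

```python
import hashlib

def make_safety_identifier(email: str) -> str:
    """Derive a stable, non-identifying ID from an email address.

    Illustrative helper: normalizes the address, then hashes it so the
    raw email is never sent as the safety_identifier value.
    """
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The same input always yields the same identifier.
sid = make_safety_identifier("user@example.com")
```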
Usage
The following example sets the temperature and max_completion_tokens parameters when creating an LLM instance:
```python
from livekit.agents import AgentSession, inference

session = AgentSession(
    llm=inference.LLM(
        model="openai/gpt-4.1-mini",
        extra_kwargs={
            "temperature": 0.7,
            "max_completion_tokens": 1000,
        },
    ),
    # ... tts, stt, vad, turnHandling, etc.
)
```
```typescript
import { AgentSession, inference } from '@livekit/agents';

const session = new AgentSession({
  llm: new inference.LLM({
    model: "openai/gpt-4.1-mini",
    provider: "openai",
    modelOptions: {
      temperature: 0.7,
      max_completion_tokens: 1000,
    },
  }),
  // ... tts, stt, vad, turnHandling, etc.
});
```