Overview
LiveKit Inference lets you customize the behavior of the language model when generating responses by passing additional parameters. Specify these model parameters when creating an instance of the LLM class in the inference module, using extra_kwargs in Python or modelOptions in Node.js.
Model parameters
The following is a complete list of supported Chat Completion options. Not every model supports every parameter; unsupported parameters are silently ignored. For model-specific details, see the documentation for the model you're using.
Parameters not supported by reasoning models are automatically stripped at request time.
- `temperature` (float, default: 1): Sampling temperature that controls the randomness of the model's output. Higher values make the output more random, while lower values make it more focused and deterministic. The range of valid values can vary by model. You can set this or `top_p`, but not both. Not supported by reasoning models.
- `top_p` (float, default: 1): An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with `top_p` probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. You can set this or `temperature`, but not both. Not supported by reasoning models.
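Because setting both `temperature` and `top_p` is disallowed, it can help to validate the option dict before passing it as `extra_kwargs`. A minimal sketch, assuming a hypothetical `check_sampling_options` helper (not part of the LiveKit SDK):

```python
def check_sampling_options(extra_kwargs: dict) -> dict:
    """Reject option dicts that set both temperature and top_p.

    Hypothetical helper; not part of the LiveKit SDK.
    """
    if "temperature" in extra_kwargs and "top_p" in extra_kwargs:
        raise ValueError("Set either temperature or top_p, not both.")
    return extra_kwargs

# A valid option dict passes through unchanged.
opts = check_sampling_options({"temperature": 0.7})
```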
- `max_tokens` (int): The maximum number of tokens that can be generated in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. Not supported by newer models; use `max_completion_tokens` instead.
- `max_completion_tokens` (int): An upper bound on the number of tokens that can be generated for a completion, including visible output tokens and reasoning tokens. Preferred over `max_tokens` for newer models.
- `reasoning_effort` ("low" | "medium" | "high"): Controls how much reasoning effort the model spends. Only supported by reasoning models.
- `frequency_penalty` (float, default: 0): Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. Not supported by reasoning models.
- `presence_penalty` (float, default: 0): Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. Not supported by reasoning models.
- `seed` (int): If specified, the system makes a best effort to sample deterministically. Repeated requests with the same seed and parameters should return the same result.
- `stop` (str | list[str]): Sequences that cause the API to stop generating further tokens. For example, `stop=["\n"]` stops generation when the model outputs a newline character.
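To see the effect of stop sequences, the truncation the provider applies can be mimicked locally. An illustrative sketch (the `truncate_at_stop` helper below is hypothetical, not a LiveKit API; real truncation happens server-side):

```python
def truncate_at_stop(text: str, stop: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence.

    Illustrates what the provider does server-side when `stop` is set;
    this helper itself is hypothetical, not part of the LiveKit SDK.
    """
    cut = len(text)
    for seq in stop:
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Generation stops at the first newline when stop=["\n"].
print(truncate_at_stop("first line\nsecond line", ["\n"]))  # first line
```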
- `n` (int): Number of completions to generate for each prompt. Not supported by reasoning models.
- `logprobs` (bool): If true, returns the log probabilities of each output token in the content of `message`. Not supported by reasoning models.
- `top_logprobs` (int): The number of most likely tokens to return at each token position, each with an associated log probability. The valid range varies by provider. Requires `logprobs: true`. Not supported by reasoning models.
- `logit_bias` (dict[str, int]): Modify the likelihood of specified tokens appearing in the completion. Not supported by reasoning models.
- `parallel_tool_calls` (bool): Whether the model can make multiple tool calls in a single response.
- `tool_choice` (ToolChoice | Literal['auto', 'required', 'none'], default: "auto"): Controls how the model uses tools. String options are as follows:
  - 'auto': Let the model decide.
  - 'required': Force tool usage.
  - 'none': Disable tool usage.
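As a sketch, the string forms of `tool_choice` can be combined with `parallel_tool_calls` in an option dict like the ones passed as `extra_kwargs` (only the string form is shown; the `ToolChoice` object form is omitted):

```python
# String forms of tool_choice, paired with parallel_tool_calls.
# These keys mirror the Chat Completion options listed above.
force_tools = {
    "tool_choice": "required",     # the model must call a tool
    "parallel_tool_calls": False,  # at most one tool call per response
}

no_tools = {"tool_choice": "none"}  # disable tool usage entirely
```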
- `user` (str): Unique identifier for the end user, used for abuse monitoring. Deprecated: see `safety_identifier` and `prompt_cache_key` instead.
- `service_tier` ("auto" | "default" | "flex" | "scale" | "priority"): Specifies the latency tier for processing the request.
- `metadata` (Metadata): Developer-defined tags and values for filtering completions in the dashboard.
- `store` (bool): Whether to store the output for model distillation or evals.
- `prediction` (ChatCompletionPredictionContentParam): Configuration for predicted output to reduce latency for known response patterns.
- `modalities` (list[Literal["text", "audio"]]): Output types the model can generate.
- `web_search_options` (WebSearchOptions): Configuration for searching the web for relevant results to use in a response.
- `verbosity` ("low" | "medium" | "high"): Constrains the verbosity of the model's response. Lower values result in more concise responses; higher values result in more verbose responses.
- `prompt_cache_key` (str): Key for caching responses for similar requests. See prompt caching.
- `safety_identifier` (str): String that uniquely identifies each user. Hash the username or email address to avoid sending any identifying information. For non-logged-in users, you can send a session ID instead. Supersedes the `user` parameter.
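The recommendation above is to hash the username or email so no identifying information is sent. A minimal sketch using only the standard library (the `make_safety_identifier` helper is illustrative, not part of the LiveKit SDK):

```python
import hashlib

def make_safety_identifier(email: str) -> str:
    """Derive a stable, non-identifying ID from an email address.

    Illustrative helper: normalizes the address, then hashes it so the
    raw email is never sent as the safety_identifier value.
    """
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The same input always yields the same identifier.
sid = make_safety_identifier("user@example.com")
```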
Usage
The following example sets the temperature and max_completion_tokens parameters when creating an LLM instance:
```python
from livekit.agents import AgentSession, inference

session = AgentSession(
    llm=inference.LLM(
        model="openai/gpt-4.1-mini",
        extra_kwargs={
            "temperature": 0.7,
            "max_completion_tokens": 1000,
        },
    ),
    # ... tts, stt, vad, turnHandling, etc.
)
```
```typescript
import { AgentSession, inference } from '@livekit/agents';

const session = new AgentSession({
  llm: new inference.LLM({
    model: "openai/gpt-4.1-mini",
    provider: "openai",
    modelOptions: {
      temperature: 0.7,
      max_completion_tokens: 1000,
    },
  }),
  // ... tts, stt, vad, turnHandling, etc.
});
```