Hosted by LiveKit | LiveKit Documentation

Overview

LiveKit hosts a fast, open-weight model through LiveKit Inference and tunes the deployment for the low latency that voice agents need, making it the recommended default LLM for your agents. You don't manage a separate provider API key, and usage and rate limits are handled through LiveKit Cloud. See the pricing page for current rates.

LiveKit Inference

Use LiveKit Inference to access LiveKit's open-weight hosted models.

Model name	Model ID
Gemma 4 31B	google/gemma-4-31b-it

Usage

To use LiveKit hosted models, use the LLM class from the inference module. You can use this LLM in the Voice AI quickstart. The examples below show Python first, then Node.js:

from livekit.agents import AgentSession, inference

session = AgentSession(
    llm=inference.LLM(
        model="google/gemma-4-31b-it",
        extra_kwargs={
            "max_completion_tokens": 1000
        }
    ),
    # ... tts, stt, vad, turn_handling, etc.
)

import { AgentSession, inference } from '@livekit/agents';

const session = new AgentSession({
    llm: new inference.LLM({
        model: "google/gemma-4-31b-it",
        modelOptions: {
            max_completion_tokens: 1000
        }
    }),
    // ... tts, stt, vad, turnHandling, etc.
});

Parameters

The following are parameters for configuring LiveKit open-weight hosted models with LiveKit Inference. For model behavior parameters like temperature and max_completion_tokens, see model parameters.

model

Required

string

The model ID from the models list.

provider

string

Set a specific provider to use for the LLM. If not set, LiveKit Inference uses the best available provider, and bills accordingly.

extra_kwargs

dict

Additional parameters to pass to the (OpenAI-compatible) Chat Completions API, such as max_tokens or temperature. See model parameters for supported fields.

In Node.js this parameter is called modelOptions.
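
As a minimal sketch of how these parameters fit together in Python, the helper below assembles the keyword arguments for `inference.LLM`. The name `build_llm_kwargs` and its `enable_reasoning` flag are hypothetical and not part of the LiveKit SDK; only `model`, `extra_kwargs`, `max_completion_tokens`, and `reasoning_effort` come from this page, and the value `"low"` is an arbitrary placeholder.

```python
from typing import Any, Dict, Optional


def build_llm_kwargs(
    model: str,
    max_completion_tokens: Optional[int] = None,
    enable_reasoning: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for inference.LLM."""
    extra: Dict[str, Any] = {}
    if max_completion_tokens is not None:
        extra["max_completion_tokens"] = max_completion_tokens
    if enable_reasoning:
        # Per the model parameters table, any value other than "none"
        # turns reasoning on for Gemma; "low" is an arbitrary choice.
        extra["reasoning_effort"] = "low"

    kwargs: Dict[str, Any] = {"model": model}
    if extra:
        # Only include extra_kwargs when there is something to pass.
        kwargs["extra_kwargs"] = extra
    return kwargs


kwargs = build_llm_kwargs("google/gemma-4-31b-it", max_completion_tokens=1000)
print(kwargs)
# Assumed usage, with livekit-agents installed:
#   from livekit.agents import inference
#   llm = inference.LLM(**kwargs)
```

Keeping the options in a plain dictionary makes them easy to log or unit test before the LLM is constructed.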

Model parameters

Pass the following parameters inside extra_kwargs (Python) or modelOptions (Node.js). For more details about each parameter in the list, see Inference parameters.

Parameter	Type	Default	Notes
`temperature`	`float`	`1`	Controls the randomness of the model's output. Valid range: `0` to `2`. Not supported by reasoning models.
`top_p`	`float`	`1`	Alternative to `temperature`. Valid range: `0` to `1`. Not supported by reasoning models.
`max_tokens`	`int`		Maximum tokens to generate. Use `max_completion_tokens` for newer models.
`max_completion_tokens`	`int`		Maximum tokens to generate, including reasoning tokens. Preferred over `max_tokens` for newer models.
`frequency_penalty`	`float`	`0`	Reduces the model's likelihood of repeating the same line verbatim. Valid range: `-2.0` to `2.0`. Not supported by reasoning models.
`presence_penalty`	`float`	`0`	Increases the model's likelihood of talking about new topics. Valid range: `-2.0` to `2.0`. Not supported by reasoning models.
`seed`	`int`		Enables deterministic sampling. The system makes a best effort to return the same result for identical requests.
`stop`	`str \| list[str]`		Sequences that stop generation. Up to 4 sequences.
`n`	`int`		Number of completions to generate. Not supported by reasoning models.
`logprobs`	`bool`		Returns log probabilities of each output token. Not supported by reasoning models.
`top_logprobs`	`int`		Number of most likely tokens to return at each position. Valid range: `0` to `20`. Requires `logprobs: true`. Not supported by reasoning models.
`logit_bias`	`dict[str, int]`		Adjusts likelihood of specified tokens appearing in the output. Not supported by reasoning models.
`parallel_tool_calls`	`bool`		Whether the model can make multiple tool calls in a single response.
`tool_choice`	`ToolChoice \| Literal['auto', 'required', 'none']`	`"auto"`	Controls how the model uses tools.
`reasoning_effort`	`str`	`"none"`	Enables reasoning when set to any value other than `none`. Gemma doesn't support multiple reasoning levels, so, unlike with other reasoning models, the value itself doesn't control effort: any non-`none` string turns reasoning on.
`add_generation_prompt`	`bool`	`true`	Whether to append the assistant generation prompt to the chat template, signaling the model to begin its response. Set to `false` to omit it. Specific to open-weight models.
`continue_final_message`	`bool`	`false`	Whether to continue the final message in the conversation instead of starting a new turn. Useful for prefilling the start of the model's response. Cannot be used together with `add_generation_prompt`. Specific to open-weight models.
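
The constraints in the table above can be checked before a request is sent. The following is a minimal sketch, not part of the LiveKit SDK: `check_model_params` and `RANGES` are hypothetical names, and the check assumes that leaving `add_generation_prompt` at its default of `true` counts as using it together with `continue_final_message`.

```python
from typing import Any, Dict, List

# Inclusive (low, high) ranges copied from the model parameters table.
RANGES = {
    "temperature": (0, 2),
    "top_p": (0, 1),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "top_logprobs": (0, 20),
}


def check_model_params(params: Dict[str, Any]) -> List[str]:
    """Hypothetical pre-flight check; returns a list of problems found."""
    problems: List[str] = []
    for name, (low, high) in RANGES.items():
        if name in params and not (low <= params[name] <= high):
            problems.append(f"{name} must be between {low} and {high}")
    if "top_logprobs" in params and not params.get("logprobs"):
        problems.append("top_logprobs requires logprobs to be true")
    stop = params.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts up to 4 sequences")
    # Assumption: the default of true counts as "used together", so
    # add_generation_prompt must be set to false explicitly.
    if params.get("continue_final_message") and params.get(
        "add_generation_prompt", True
    ):
        problems.append(
            "continue_final_message cannot be combined with add_generation_prompt"
        )
    return problems


print(check_model_params({"temperature": 0.7, "max_completion_tokens": 1000}))
print(check_model_params({"temperature": 2.5, "top_logprobs": 5}))
```

A dictionary that passes the check can then be supplied as `extra_kwargs` in Python or mirrored in `modelOptions` in Node.js. The service remains the authority on what it accepts.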

String descriptors

As a shortcut, you can also pass a model ID directly to the llm argument in your AgentSession:

from livekit.agents import AgentSession

session = AgentSession(
    llm="google/gemma-4-31b-it",
    # ... tts, stt, vad, turn_handling, etc.
)

import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
    llm: "google/gemma-4-31b-it",
    // ... tts, stt, vad, turnHandling, etc.
});

Additional resources

The following resources provide more information about using open-weight hosted models with LiveKit Agents.

Voice AI quickstart

Get started with LiveKit Agents.

Workflows

How to model repeatable, accurate tasks with multiple agents.

Tool definition and usage

Let your agents call external tools and more.

Inference pricing

The latest pricing information for all models in LiveKit Inference.