Inworld STT plugin guide

How to use the Inworld STT plugin for LiveKit Agents.

Available in Python | Node.js

Overview

This plugin lets you use Inworld as a speech-to-text (STT) provider for your voice agents.

Installation

Install the plugin:

Python:

    uv add "livekit-agents[inworld]~=1.5"

Node.js:

    pnpm add @livekit/agents-plugin-inworld@1.x

Authentication

The Inworld plugin requires a Base64-encoded Inworld API key.

Set INWORLD_API_KEY in your .env file.
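
A missing or malformed key is easier to diagnose at startup than on the first transcription request. The following is a minimal sketch of such a check. The helper name load_inworld_api_key is hypothetical and not part of the plugin, and the sketch assumes the .env file has already been loaded into the process environment and that the key uses the standard Base64 alphabet.

```python
import base64
import binascii
import os


def load_inworld_api_key(env=None):
    """Return INWORLD_API_KEY after checking that it is set and valid Base64.

    Hypothetical helper for illustration; the plugin does not provide it.
    """
    env = os.environ if env is None else env
    key = env.get("INWORLD_API_KEY", "").strip()
    if not key:
        raise RuntimeError("INWORLD_API_KEY is not set")
    try:
        # validate=True rejects characters outside the standard Base64 alphabet.
        base64.b64decode(key, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise RuntimeError("INWORLD_API_KEY is not valid Base64") from exc
    return key
```

Calling load_inworld_api_key() once before creating the session turns a configuration mistake into an immediate, readable error.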

Usage

Use Inworld STT in an AgentSession or as a standalone transcription service. For example, you can use this STT in the Voice AI quickstart.

Python:

    from livekit.agents import AgentSession
    from livekit.plugins import inworld

    session = AgentSession(
        stt=inworld.STT(
            model="inworld/inworld-stt-1",
            language="en-US",
        ),
        # ... llm, tts, etc.
    )

Node.js:

    import { voice } from '@livekit/agents';
    import * as inworld from '@livekit/agents-plugin-inworld';

    const session = new voice.AgentSession({
      stt: new inworld.STT({
        model: "inworld/inworld-stt-1",
        language: "en-US",
      }),
      // ... llm, tts, etc.
    });

Parameters

This section describes commonly used parameters. See the plugin reference links in the Additional resources section for a complete list of all available parameters.

model (string, default: inworld/inworld-stt-1)

The Inworld STT model to use. Inworld serves several models through the same API, including inworld/inworld-stt-1, assemblyai/universal-streaming-multilingual, and soniox/stt-rt-v4. See the Inworld STT docs for the current list of supported models.

language (LanguageCode, default: en-US)

Language code for the input audio. See the Inworld STT docs for supported languages.

sample_rate (integer, default: 16000)

Input audio sample rate in Hz.

num_channels (integer, default: 1)

Number of audio channels in the input stream.

enable_voice_profile (boolean, default: true)

Enables voice profiling, which detects speaker characteristics such as age, gender, emotion, and accent on each transcript.

voice_profile_top_n (integer, default: 1)

Number of top voice profile results to return per category when enable_voice_profile is set.

vad_threshold (float)

Voice activity detection sensitivity. If unset, Inworld applies its own default.

min_end_of_turn_silence_when_confident (integer, default: 200)

Minimum silence, in milliseconds, required to end a turn when the model is confident the speaker has finished.

end_of_turn_confidence_threshold (float, default: 0.3)

Confidence threshold used to decide when a turn has ended.

Valid range: 0.0 to 1.0
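
The parameters in this section can be gathered in one place and checked before they reach the plugin. The sketch below uses a hypothetical InworldSTTOptions dataclass, which is not part of the plugin. It mirrors the parameter names and defaults listed in this section, treats the threshold range as 0.0 to 1.0, and leaves vad_threshold unset so that Inworld applies its own default.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class InworldSTTOptions:
    """Hypothetical container mirroring the parameters documented in this section."""

    model: str = "inworld/inworld-stt-1"
    language: str = "en-US"
    sample_rate: int = 16000  # Hz
    num_channels: int = 1
    enable_voice_profile: bool = True
    voice_profile_top_n: int = 1
    vad_threshold: Optional[float] = None  # None: let Inworld apply its default
    min_end_of_turn_silence_when_confident: int = 200  # milliseconds
    end_of_turn_confidence_threshold: float = 0.3

    def __post_init__(self):
        if not 0.0 <= self.end_of_turn_confidence_threshold <= 1.0:
            raise ValueError("end_of_turn_confidence_threshold must be in [0.0, 1.0]")
        if self.sample_rate <= 0 or self.num_channels <= 0:
            raise ValueError("sample_rate and num_channels must be positive")
        if self.min_end_of_turn_silence_when_confident < 0:
            raise ValueError("min_end_of_turn_silence_when_confident must be >= 0")

    def to_kwargs(self):
        """Return keyword arguments, dropping options that were left unset."""
        return {name: value for name, value in asdict(self).items() if value is not None}
```

Assuming inworld.STT accepts these names as keyword arguments, as the parameter list suggests, the options could be passed as inworld.STT(**InworldSTTOptions(language="en-US").to_kwargs()). Check the plugin reference before relying on that.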

Voice profile

When voice profiling is enabled (the default), each transcript exposes the detected voice profile on its metadata field, with attributes such as age, emotion, pitch, vocal style, accent, and gender. In Python, read it from metadata["voice_profile"]. In Node.js, read it from metadata.voiceProfile. Use voice_profile_top_n to control how many results are returned per category. To turn profiling off, set enable_voice_profile to false.

For an example of reading metadata from transcript events, see Provider-specific metadata on the STT overview.
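
Because profiling can be disabled and a transcript may carry no metadata, it helps to guard the lookup in Python. The sketch below uses a hypothetical get_voice_profile helper that takes the transcript's metadata mapping. The sample payload is invented for illustration, and the real schema of the voice profile may differ.

```python
from typing import Any, Mapping, Optional


def get_voice_profile(metadata: Optional[Mapping[str, Any]]) -> Optional[Any]:
    """Return metadata["voice_profile"], or None when it is absent.

    Hypothetical helper: the metadata may be missing entirely, or may lack
    the voice_profile key when profiling is disabled.
    """
    if not metadata:
        return None
    return metadata.get("voice_profile")


# Invented sample payload for illustration only; the real schema may differ.
sample_metadata = {
    "voice_profile": {"emotion": [{"label": "neutral", "confidence": 0.9}]}
}
profile = get_voice_profile(sample_metadata)
```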

Additional resources

The following resources provide more information about using Inworld with LiveKit Agents.