AssemblyAI STT | LiveKit Documentation

Overview

AssemblyAI speech-to-text is available in LiveKit Agents through LiveKit Inference and the AssemblyAI plugin. With LiveKit Inference, your agent runs on LiveKit's infrastructure to minimize latency. No separate provider API key is required, and usage and rate limits are managed through LiveKit Cloud. Use the plugin instead if you want to manage your own billing and rate limits. Pricing for LiveKit Inference is available on the pricing page.

LiveKit Inference

Use LiveKit Inference to access AssemblyAI STT without a separate AssemblyAI API key.

Model name	Model ID	Languages
Universal-3 Pro Streaming	assemblyai/u3-rt-pro	en, en-US, en-GB, en-AU, en-CA, en-IN, en-NZ, es, es-ES, es-MX, es-AR, es-CO, es-CL, es-PE, es-VE, es-EC, es-GT, es-CU, es-BO, es-DO, es-HN, es-PY, es-SV, es-NI, es-CR, es-PA, es-UY, es-PR, fr, fr-FR, fr-CA, fr-BE, fr-CH, de, de-DE, de-AT, de-CH, it, it-IT, it-CH, pt, pt-BR, pt-PT
Universal-Streaming	assemblyai/universal-streaming	en, en-US
Universal-Streaming-Multilingual	assemblyai/universal-streaming-multilingual	multi, en, en-US, en-GB, en-AU, en-CA, en-IN, en-NZ, es, es-ES, es-MX, es-AR, es-CO, es-CL, es-PE, es-VE, es-EC, es-GT, es-CU, es-BO, es-DO, es-HN, es-PY, es-SV, es-NI, es-CR, es-PA, es-UY, es-PR, fr, fr-FR, fr-CA, fr-BE, fr-CH, de, de-DE, de-AT, de-CH, it, it-IT, it-CH, pt, pt-BR, pt-PT

Usage

To use AssemblyAI, use the STT class from the inference module:

from livekit.agents import AgentSession, inference

session = AgentSession(
    stt=inference.STT(
        model="assemblyai/u3-rt-pro", 
        language="en"
    ),
    # ... llm, tts, vad, turn_handling, etc.
)

import { AgentSession, inference } from '@livekit/agents';

const session = new AgentSession({
    stt: new inference.STT({ 
        model: "assemblyai/u3-rt-pro", 
        language: "en" 
    }),
    // ... llm, tts, vad, turnHandling, etc.
});

Parameters

model

Required

string

The model to use for the STT. Available models: assemblyai/u3-rt-pro, assemblyai/universal-streaming, assemblyai/universal-streaming-multilingual.

language LanguageCode

Language code for the transcription. If not set, the provider default applies. Universal-3 Pro and Universal-Streaming Multilingual automatically detect between English, Spanish, German, French, Portuguese, and Italian.

extra_kwargs dict

Additional parameters to pass to the AssemblyAI streaming API. Supported fields depend on the selected model. See model parameters for supported fields.

In Node.js this parameter is called modelOptions.

Model parameters

Pass the following parameters inside extra_kwargs (Python) or modelOptions (Node.js).

All models:

Parameter	Type	Default	Notes
`keyterms_prompt`	`list[str]`		List of terms to boost recognition accuracy for.
`language_detection`	`bool`		Whether to include `language_code` and `language_confidence` in turn messages. Defaults to `True` for Universal-3 Pro and Universal-Streaming Multilingual; `False` for Universal-Streaming.
`inactivity_timeout`	`float`		Duration of inactivity in seconds before the session closes.
`min_turn_silence`	`int`		Minimum duration of silence in milliseconds before the model checks for end of turn. Universal-3 Pro defaults to `100` ms (triggers the punctuation-based EOT check); Universal-Streaming uses it as the confident-EOT silence floor. Replaces the deprecated `min_end_of_turn_silence_when_confident`.
`max_turn_silence`	`int`		Maximum duration of silence in milliseconds allowed in a turn before end of turn is triggered.
`vad_threshold`	`float`		Confidence threshold for classifying audio frames as silence. Frames below this value are considered silent. Increase in noisy environments. Server-side defaults: `0.3` (Universal-3 Pro), `0.4` (Universal-Streaming). Valid range: `0.0`–`1.0`.
`domain`	`string`		Enables domain-specific recognition. Set to `medical-v1` to use AssemblyAI's Medical Mode. Works with all three streaming models. Supported languages: English, Spanish, German, French. Other languages are ignored with a warning.
`speaker_labels`	`bool`	`False`	Set to `True` to enable speaker diarization.

Model-specific parameters (Universal-3 Pro):

Parameter	Type	Default	Notes
`prompt`	`str`		Custom transcription instructions for the model. When not set, a default prompt optimized for turn detection is used.
`continuous_partials`	`bool`	`false`	Emit a non-final partial transcript approximately every 3 seconds while speech continues, regardless of silence. Useful for long, uninterrupted turns. The first partial still arrives at the early-partial timing controlled by `interruption_delay`.
`interruption_delay`	`int`	`500`	Milliseconds before the first early partial is emitted. Lower values produce a faster time-to-first-token for barge-in; higher values produce more confident first partials. Valid range: `0`–`1000`.
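
The ranges listed in the tables above can be checked on the client before the options are sent. The following is a minimal sketch, not part of LiveKit or AssemblyAI: `build_u3_pro_options` is a hypothetical helper, and it assumes the returned dict is passed as `extra_kwargs` to `inference.STT`.

```python
from typing import Any, Optional


def build_u3_pro_options(
    vad_threshold: Optional[float] = None,
    interruption_delay: Optional[int] = None,
    continuous_partials: Optional[bool] = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate documented ranges, return a dict for extra_kwargs."""
    options: dict[str, Any] = {}
    if vad_threshold is not None:
        # Documented valid range for vad_threshold is 0.0-1.0.
        if not 0.0 <= vad_threshold <= 1.0:
            raise ValueError("vad_threshold must be within 0.0-1.0")
        options["vad_threshold"] = vad_threshold
    if interruption_delay is not None:
        # Documented valid range for interruption_delay is 0-1000 ms.
        if not 0 <= interruption_delay <= 1000:
            raise ValueError("interruption_delay must be within 0-1000 ms")
        options["interruption_delay"] = interruption_delay
    if continuous_partials is not None:
        options["continuous_partials"] = continuous_partials
    return options
```

Only the keys that are explicitly set end up in the dict, so unset parameters fall back to the server-side defaults.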

Prompt and Keyterms Prompt

You can use prompt and keyterms_prompt together in the same streaming request. When you use keyterms_prompt, your boosted words are appended to the default prompt (or your custom prompt if provided) automatically.

Model-specific parameters (Universal-Streaming models):

Parameter	Type	Default	Notes
`format_turns`	`bool`	`False`	Whether to return formatted final transcripts.
`end_of_turn_confidence_threshold`	`float`	`0.01`	Confidence threshold for determining the end of a turn.

String descriptors

As a shortcut, you can also pass a model ID string directly to the stt argument in your AgentSession:

from livekit.agents import AgentSession

session = AgentSession(
    stt="assemblyai/u3-rt-pro:en",
    # ... llm, tts, vad, turn_handling, etc.
)

import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
    stt: "assemblyai/u3-rt-pro:en",
    // ... llm, tts, vad, turnHandling, etc.
});
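
The descriptor in the examples above appears to combine a model ID and a language code, separated by a colon. The sketch below illustrates that reading only. `split_stt_descriptor` is a hypothetical helper, not LiveKit's parser, and the `model:language` format is an assumption inferred from the example string.

```python
from typing import Optional


def split_stt_descriptor(descriptor: str) -> tuple[str, Optional[str]]:
    """Hypothetical helper: split 'model:language' into its two parts."""
    # partition() splits on the first colon; model IDs here contain "/" but no ":".
    model, sep, language = descriptor.partition(":")
    return model, (language if sep and language else None)


model, language = split_stt_descriptor("assemblyai/u3-rt-pro:en")
print(model, language)
```

A descriptor without a language suffix yields `None` for the language, which corresponds to leaving `language` unset.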

Turn detection

Universal-3 Pro uses punctuation-based turn detection. It checks for terminal punctuation (. ? !) after periods of silence rather than using a confidence score. To use this for turn detection, set turn_detection="stt" in the turn handling options.

Default parameter differences: The LiveKit plugin defaults to min_turn_silence=100 and max_turn_silence=100. The AssemblyAI API defaults are min_turn_silence=100 and max_turn_silence=1000. When using turn_detection="stt", explicitly set max_turn_silence=1000 to restore AssemblyAI's intended behavior.

Endpointing delay is additive in STT mode: LiveKit's default min_delay (0.5 seconds) in the turn handling endpointing options is applied on top of AssemblyAI's own endpointing. Set endpointing.min_delay to 0 in the turn handling options to avoid extra latency — AssemblyAI's min_turn_silence and max_turn_silence already control the timing.

VAD threshold alignment: Universal-3 Pro defaults to a vad_threshold of 0.3. Set LiveKit's Silero activation_threshold to 0.3 as well to ensure consistent barge-in behavior.

Tuning guidance: Experiment with min_turn_silence and max_turn_silence. Settings can vary depending on your use case. Increase min_turn_silence if brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking. Increase max_turn_silence if the forced turn end is cutting off users mid-thought.

For a detailed guide on configuring Universal-3 Pro with LiveKit — including entity splitting tradeoffs, VAD threshold alignment, and prompt engineering — see the AssemblyAI LiveKit guide.

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
        endpointing={"min_delay": 0},
    ),
    stt=inference.STT(
        model="assemblyai/u3-rt-pro",
        extra_kwargs={
            "min_turn_silence": 100,
            "max_turn_silence": 1000,
            "vad_threshold": 0.3,
        }
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    # ... llm, tts, etc.
)
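
To see why `endpointing.min_delay` is set to 0 in this configuration, it helps to add up the delays. The sketch below uses a deliberately simplified model, assuming the total wait is just the AssemblyAI silence window plus LiveKit's `min_delay`. `total_turn_delay_ms` is a hypothetical helper, and real latency also includes network and processing time.

```python
def total_turn_delay_ms(stt_silence_ms: int, min_delay_s: float) -> int:
    """Simplified model: STT silence window plus LiveKit's endpointing min_delay."""
    return stt_silence_ms + round(min_delay_s * 1000)


# With LiveKit's default min_delay of 0.5 s added on top:
print(total_turn_delay_ms(100, 0.5))   # min_turn_silence=100
print(total_turn_delay_ms(1000, 0.5))  # max_turn_silence=1000

# With min_delay set to 0, only AssemblyAI's own timing applies:
print(total_turn_delay_ms(100, 0))
print(total_turn_delay_ms(1000, 0))
```

Under this model the default `min_delay` adds 500 ms to every turn end, whether the turn ended on punctuation or on the forced `max_turn_silence` limit.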

AssemblyAI Universal-Streaming includes a custom phrase endpointing model that uses both audio and linguistic information to detect turn boundaries. To use this model for turn detection, set turn_detection="stt" in the turn handling options. You should also provide a VAD plugin for responsive interruption handling.

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
    ),
    stt=inference.STT(
        model="assemblyai/universal-streaming", 
        language="en"
    ),
    vad=silero.VAD.load(),  # Recommended for responsive interruption handling
    # ... llm, tts, etc.
)

Plugin

LiveKit's plugin support for AssemblyAI lets you connect directly to AssemblyAI's API with your own API key. For Node.js, use LiveKit Inference.

Available in Python

Installation

Install the plugin from PyPI:

uv add "livekit-agents[assemblyai]~=1.5"

Authentication

The AssemblyAI plugin requires an AssemblyAI API key.

Set ASSEMBLYAI_API_KEY in your .env file.
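
A missing key is easier to diagnose at startup than at the first transcription request. The following is a minimal sketch of a fail-fast check. `require_assemblyai_key` is a hypothetical helper, and it assumes the .env file has already been loaded into the process environment by whatever mechanism your project uses.

```python
import os
from collections.abc import Mapping


def require_assemblyai_key(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: return the API key or raise if it is missing or blank."""
    key = env.get("ASSEMBLYAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ASSEMBLYAI_API_KEY is not set")
    return key
```

The `env` argument defaults to the real environment and exists only so the check can be exercised with a plain dict.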

Usage

Use AssemblyAI STT in an AgentSession or as a standalone transcription service. For example, you can use this STT in the Voice AI quickstart.

from livekit.agents import AgentSession
from livekit.plugins import assemblyai, silero

session = AgentSession(
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    # ... llm, tts, etc.
)

Parameters

This section describes some of the available parameters. See the plugin reference for a complete list of all available parameters.

Shared parameters

These parameters apply to all AssemblyAI streaming models.

model string Default: universal-streaming-english

STT model to use. Accepted options are u3-rt-pro, universal-streaming-english, and universal-streaming-multilingual.

keyterms_prompt list[str]

List of terms to boost recognition for.

vad_threshold float

AssemblyAI's internal Silero VAD onset threshold. Defaults to 0.3 for Universal-3 Pro and 0.4 for Universal-Streaming. For best results, align this with LiveKit's Silero activation_threshold.

language_detection bool

Whether to include language_code and language_confidence in turn messages. Defaults to true for Universal-3 Pro and Universal-Streaming Multilingual, false for Universal-Streaming English.

min_turn_silence int Default: 100

The minimum duration of silence (in milliseconds) before the model checks for end of turn. The LiveKit plugin defaults this to 100 for all streaming models. Replaces the deprecated min_end_of_turn_silence_when_confident. See the model-specific sections below for how each model uses this parameter.

max_turn_silence int

The maximum duration of silence (in milliseconds) allowed in a turn before end of turn is triggered. See the model-specific sections below for defaults.

speaker_labels bool

Enable speaker diarization. When set to True, each transcript event includes a speaker_id identifying the speaker ("A", "B", etc.). Short utterances under ~1 second return speaker_id=None. Use with MultiSpeakerAdapter to detect the primary speaker or format transcripts by speaker.

max_speakers int

Maximum number of speakers to detect. If not set, AssemblyAI detects the number of speakers automatically.

domain string

Enables domain-specific recognition. Set to medical-v1 to use AssemblyAI's Medical Mode for improved accuracy on medical terminology such as medication names, procedures, conditions, and dosages. Works with all three streaming models.

Model-specific parameters

Universal-3 Pro:

min_turn_silence int Default: 100

Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation (. ? !) to decide whether the turn has ended. If no terminal punctuation is found, a partial is emitted and the turn continues.

This parameter replaces the now deprecated min_end_of_turn_silence_when_confident.

max_turn_silence int Default: 100

Maximum milliseconds of silence before the turn is forced to end, regardless of punctuation. The LiveKit plugin defaults to 100. When using turn_detection="stt", set this to 1000 to match AssemblyAI's API default.
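
The interaction between the two silence settings can be sketched as a small decision function. This is a simplified model of the behavior described above, not AssemblyAI's implementation. `eot_decision` and its return labels are hypothetical names chosen for illustration.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def eot_decision(
    transcript: str,
    silence_ms: int,
    min_turn_silence: int = 100,
    max_turn_silence: int = 1000,
) -> str:
    """Simplified model of punctuation-based end-of-turn detection."""
    if silence_ms >= max_turn_silence:
        # Forced end of turn, regardless of punctuation.
        return "end_of_turn"
    if silence_ms >= min_turn_silence:
        # Speculative check: terminal punctuation ends the turn.
        if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
            return "end_of_turn"
        # Otherwise a partial is emitted and the turn continues.
        return "partial"
    return "waiting"


print(eot_decision("How are you?", 150))            # punctuation found
print(eot_decision("I want to", 150))               # no punctuation yet
print(eot_decision("I want to", 100, 100, 100))     # plugin defaults
```

With the plugin defaults (both values at 100), the forced limit is reached at the same moment as the speculative check, so in this model an unfinished phrase such as "I want to" ends the turn after 100 ms of silence. Raising `max_turn_silence` to 1000 leaves room for the punctuation check to do its work.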

prompt string

Custom transcription instructions for the model. When not provided, a default prompt optimized for turn detection is used automatically. This parameter is only supported with Universal-3 Pro.

Note: Prompting is a beta feature for Universal-3 Pro. Start without a prompt to establish baseline performance.

continuous_partials bool Default: True

LiveKit plugin default is True — AssemblyAI's server default is False. When True, the model emits additional partial transcripts at a steady ~3 second cadence during long turns, on top of the baseline partials emitted at the first-partial point (interruption_delay) and at each min_turn_silence silence period. Useful for long, uninterrupted turns where silence-based partials don't fire often enough for downstream consumers. Can be updated mid-session via update_options(). Only supported with Universal-3 Pro (u3-rt-pro); passing it with any other model raises a ValueError.

interruption_delay int Default: 500

How soon (in milliseconds) the first early partial is emitted. Lower values produce a faster time-to-first-token for barge-in; higher values produce more confident first partials. Set at construction only — it cannot be changed mid-session via update_options(). Only supported with Universal-3 Pro (u3-rt-pro); passing it with any other model raises a ValueError.

Valid range: 0–1000.
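
If the model name comes from configuration, it can be useful to catch an unsupported combination before constructing the STT. The sketch below mirrors the rule described above. `check_u3_pro_only_options` is a hypothetical helper, and the plugin's own check and error message may differ.

```python
# Options documented as raising ValueError on models other than u3-rt-pro.
U3_PRO_ONLY = {"continuous_partials", "interruption_delay"}


def check_u3_pro_only_options(model: str, **options: object) -> None:
    """Hypothetical pre-check: reject Universal-3 Pro options on other models."""
    unsupported = sorted(U3_PRO_ONLY.intersection(options))
    if model != "u3-rt-pro" and unsupported:
        raise ValueError(
            f"{', '.join(unsupported)} only supported with u3-rt-pro, got {model}"
        )


check_u3_pro_only_options("u3-rt-pro", interruption_delay=300)  # passes silently
```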

Universal-Streaming models:

end_of_turn_confidence_threshold float Default: 0.4

The confidence threshold to use when determining if the end of a turn has been reached. Not applicable to Universal-3 Pro.

min_end_of_turn_silence_when_confident int

The minimum duration of silence (in milliseconds) required to detect end of turn when confident.

Deprecated: This parameter has been renamed to min_turn_silence. Use min_turn_silence instead. Note that the LiveKit plugin defaults min_turn_silence to 100 for all streaming models (not just Universal-3 Pro), so the effective default is 100 ms.

max_turn_silence int Default: 1280

The maximum duration of silence (in milliseconds) allowed in a turn before end of turn is triggered.

format_turns bool

Whether to return formatted final transcripts. Not applicable to Universal-3 Pro (always returns formatted transcripts).

Turn detection

Universal-3 Pro uses punctuation-based turn detection — it checks for terminal punctuation (. ? !) after periods of silence rather than using a confidence score. To use this for turn detection, set turn_detection="stt" in the turn handling options.

VAD threshold alignment: Universal-3 Pro defaults to a vad_threshold of 0.3. Set LiveKit's Silero activation_threshold to 0.3 as well to ensure consistent barge-in behavior.

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
        endpointing={"min_delay": 0},
    ),
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    # ... llm, tts, etc.
)

You can also use LiveKit's MultilingualModel() turn detector instead of turn_detection="stt". The plugin defaults (min_turn_silence=100, max_turn_silence=100) are automatically tuned to provide transcripts to the turn detection model as fast as possible. However, raising these values (e.g., 200–300 ms) may help by giving the model more time before finalizing transcripts, which can reduce over-segmentation.

For a detailed guide on configuring Universal-3 Pro with LiveKit — including entity splitting tradeoffs, VAD threshold alignment, and prompt engineering — see the AssemblyAI LiveKit guide.

AssemblyAI Universal-Streaming includes a custom phrase endpointing model that uses both audio and linguistic information to detect turn boundaries. To use this model for turn detection, set turn_detection="stt" in the turn handling options. You should also provide a VAD plugin for responsive interruption handling.

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
    ),
    stt=assemblyai.STT(
        end_of_turn_confidence_threshold=0.4,
        min_turn_silence=400,  # replaces the deprecated min_end_of_turn_silence_when_confident
        max_turn_silence=1280,
    ),
    vad=silero.VAD.load(),  # Recommended for responsive interruption handling
    # ... llm, tts, etc.
)

Session information

When a WebSocket session starts, AssemblyAI sends a Begin event that includes a session ID and expiry timestamp. The plugin exposes the following information on the SpeechStream object:

Field	Description
`session_id`	UUID string identifying the transcription session. The session ID is also logged automatically at INFO level. Share it with AssemblyAI support when troubleshooting transcription issues.
`expires_at`	Unix timestamp indicating when the session expires.

stream = stt.stream()
async for event in stream:
    # session_id is set before any speech events arrive
    print(stream.session_id)   # e.g. "676d673c-83fc-4d8a-bd95-bfe23b1c5a50"
    print(stream.expires_at)   # e.g. 1773775624

These properties are None until the Begin event is received from AssemblyAI, which happens shortly after the stream starts.

The session ID is also automatically logged:

AssemblyAI session started id=676d673c-83fc-4d8a-bd95-bfe23b1c5a50 expires_at=1773775624
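
Because `expires_at` is a Unix timestamp and may still be None, code that reads it usually needs a small guard. The following is a minimal sketch, assuming the value is in seconds as in the example above. `seconds_until_expiry` is a hypothetical helper and not part of the plugin.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_until_expiry(expires_at: Optional[int], now: datetime) -> Optional[float]:
    """Hypothetical helper: seconds left in the session, or None before Begin arrives."""
    if expires_at is None:
        # The Begin event has not been received yet.
        return None
    expiry = datetime.fromtimestamp(expires_at, tz=timezone.utc)
    return (expiry - now).total_seconds()


remaining = seconds_until_expiry(1773775624, datetime.now(timezone.utc))
print(remaining)
```

Passing `now` explicitly keeps the function deterministic. A negative result means the session has already expired.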

Speaker diarization

Enable speaker diarization so the STT assigns a speaker identifier to each word or segment. When enabled, transcript events include a speaker_id, and the STT reports capabilities.diarization = True.

With diarization enabled, you can wrap the AssemblyAI STT with MultiSpeakerAdapter for primary speaker detection and transcript formatting.

Enable speaker diarization in the STT constructor:

stt = inference.STT(
    model="assemblyai/u3-rt-pro",
    extra_kwargs={
        "speaker_labels": True,
    },
)

stt = assemblyai.STT(
    model="u3-rt-pro",
    speaker_labels=True,
)

Speaker labels are assigned alphabetically ("A", "B", etc.) per session. Short utterances under ~1 second return speaker_id=None.
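
Since short utterances can arrive without a speaker label, any code that formats transcripts by speaker should handle None. The sketch below shows one way to do that on plain `(speaker_id, text)` pairs. `format_by_speaker` is a hypothetical helper and does not use `MultiSpeakerAdapter` or the plugin's event types.

```python
from typing import Optional


def format_by_speaker(segments: list[tuple[Optional[str], str]]) -> list[str]:
    """Hypothetical helper: prefix each transcript with its speaker label."""
    lines = []
    for speaker_id, text in segments:
        # Utterances under ~1 second may have no speaker_id.
        label = f"Speaker {speaker_id}" if speaker_id is not None else "Unknown speaker"
        lines.append(f"{label}: {text}")
    return lines


for line in format_by_speaker([("A", "Hi, how can I help?"), (None, "Yeah."), ("B", "I need a refund.")]):
    print(line)
```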

Additional resources

The following resources provide more information about using AssemblyAI with LiveKit Agents.

Python package

The livekit-plugins-assemblyai package on PyPI.

Plugin reference

Reference for the AssemblyAI STT plugin.

GitHub repo

View the source or contribute to the LiveKit AssemblyAI STT plugin.

AssemblyAI docs

AssemblyAI's full docs for the Universal Streaming API.

Universal-3 Pro docs

AssemblyAI's docs for the Universal-3 Pro streaming model.

Voice AI quickstart

Get started with LiveKit Agents and AssemblyAI.

AssemblyAI LiveKit guide

Guide to using AssemblyAI Universal Streaming STT with LiveKit.