Overview
AssemblyAI speech-to-text is available in LiveKit Agents through LiveKit Inference and the AssemblyAI plugin. Pricing for LiveKit Inference is available on the pricing page.
| Model name | Model ID | Languages |
|---|---|---|
| Universal-3 Pro Streaming | assemblyai/u3-rt-pro | en, en-US, en-GB, en-AU, en-CA, en-IN, en-NZ, es, es-ES, es-MX, es-AR, es-CO, es-CL, es-PE, es-VE, es-EC, es-GT, es-CU, es-BO, es-DO, es-HN, es-PY, es-SV, es-NI, es-CR, es-PA, es-UY, es-PR, fr, fr-FR, fr-CA, fr-BE, fr-CH, de, de-DE, de-AT, de-CH, it, it-IT, it-CH, pt, pt-BR, pt-PT |
| Universal-Streaming | assemblyai/universal-streaming | en, en-US |
| Universal-Streaming-Multilingual | assemblyai/universal-streaming-multilingual | en, en-US, en-GB, en-AU, en-CA, en-IN, en-NZ, es, es-ES, es-MX, es-AR, es-CO, es-CL, es-PE, es-VE, es-EC, es-GT, es-CU, es-BO, es-DO, es-HN, es-PY, es-SV, es-NI, es-CR, es-PA, es-UY, es-PR, fr, fr-FR, fr-CA, fr-BE, fr-CH, de, de-DE, de-AT, de-CH, it, it-IT, it-CH, pt, pt-BR, pt-PT |
LiveKit Inference
Use LiveKit Inference to access AssemblyAI STT without a separate AssemblyAI API key.
Usage
To use AssemblyAI, use the STT class from the inference module:
```python
from livekit.agents import AgentSession, inference

session = AgentSession(
    stt=inference.STT(
        model="assemblyai/u3-rt-pro",
        language="en",
    ),
    # ... llm, tts, vad, turn_detection, etc.
)
```
```typescript
import { AgentSession, inference } from '@livekit/agents';

const session = new AgentSession({
  stt: new inference.STT({
    model: "assemblyai/u3-rt-pro",
    language: "en",
  }),
  // ... llm, tts, vad, turn_detection, etc.
});
```
Parameters
model (string, Required): The model to use for STT. Available models: assemblyai/u3-rt-pro, assemblyai/universal-streaming, assemblyai/universal-streaming-multilingual.
language (string, Optional): Language code for the transcription. If not set, the provider default applies. Universal-3 Pro and Universal-Streaming Multilingual automatically detect between English, Spanish, German, French, Portuguese, and Italian.
extra_kwargs (dict, Optional): Additional parameters to pass to the AssemblyAI streaming API. Available parameters depend on the model:
- All models: keyterms_prompt, vad_threshold, language_detection, max_turn_silence, min_turn_silence
- Universal-3 Pro: prompt
- Universal-Streaming: format_turns, end_of_turn_confidence_threshold, min_end_of_turn_silence_when_confident (deprecated; use min_turn_silence)
See the provider's documentation for more information.
In Node.js this parameter is called modelOptions.
String descriptors
As a shortcut, you can also pass a model descriptor string directly to the stt argument in your AgentSession:
```python
from livekit.agents import AgentSession

session = AgentSession(
    stt="assemblyai/u3-rt-pro:en",
    # ... llm, tts, vad, turn_detection, etc.
)
```
```typescript
import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
  stt: "assemblyai/u3-rt-pro:en",
  // ... llm, tts, vad, turn_detection, etc.
});
```
Turn detection
Universal-3 Pro uses punctuation-based turn detection — it checks for terminal punctuation (. ? !) after periods of silence rather than using a confidence score. To use this for turn detection, set turn_detection="stt" in the AgentSession constructor.
Default parameter differences: The LiveKit plugin defaults to min_turn_silence=100 and max_turn_silence=100. The AssemblyAI API defaults are min_turn_silence=100 and max_turn_silence=1000. When using turn_detection="stt", explicitly set max_turn_silence=1000 to restore AssemblyAI's intended behavior.
min_endpointing_delay is additive in STT mode: LiveKit's min_endpointing_delay (default 0.5 seconds) is applied on top of AssemblyAI's own endpointing. Set min_endpointing_delay=0 to avoid extra latency — AssemblyAI's min_turn_silence and max_turn_silence already control the timing.
VAD threshold alignment: Universal-3 Pro defaults to a vad_threshold of 0.3. Set LiveKit's Silero activation_threshold to 0.3 as well to ensure consistent barge-in behavior.
Tuning guidance: You will likely need to experiment with min_turn_silence and max_turn_silence depending on your use case. Increase min_turn_silence if brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking. Increase max_turn_silence if the forced turn end is cutting off users mid-thought.
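To build intuition for how min_turn_silence and max_turn_silence interact, the two-stage check described above can be sketched in plain Python. This is a hypothetical illustration of the documented behavior, not AssemblyAI's actual implementation:

```python
def turn_ended(silence_ms: int, transcript: str,
               min_turn_silence: int = 100,
               max_turn_silence: int = 1000) -> bool:
    """Illustrative two-stage endpointing: a speculative punctuation check
    at min_turn_silence, then a forced turn end at max_turn_silence."""
    if silence_ms >= max_turn_silence:
        return True  # forced end, regardless of punctuation
    if silence_ms >= min_turn_silence:
        # speculative check: end only on terminal punctuation
        return transcript.rstrip().endswith(('.', '?', '!'))
    return False

# A brief pause after a complete sentence ends the turn:
assert turn_ended(150, "How can I help you today?")
# The same pause mid-sentence does not:
assert not turn_ended(150, "My account number is")
# A long enough silence forces the turn to end anyway:
assert turn_ended(1200, "My account number is")
```

Under this model, raising min_turn_silence delays the punctuation check (helping with brief pauses), while raising max_turn_silence gives users longer before a forced cutoff.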
For a detailed guide on configuring Universal-3 Pro with LiveKit — including entity splitting tradeoffs, VAD threshold alignment, and prompt engineering — see the AssemblyAI LiveKit guide.
```python
session = AgentSession(
    turn_detection="stt",
    stt=inference.STT(
        model="assemblyai/u3-rt-pro",
        extra_kwargs={
            "min_turn_silence": 100,
            "max_turn_silence": 1000,
            "vad_threshold": 0.3,
        },
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    min_endpointing_delay=0,
    # ... llm, tts, etc.
)
```
AssemblyAI Universal-Streaming includes a custom phrase endpointing model that uses both audio and linguistic information to detect turn boundaries. To use this model for turn detection, set turn_detection="stt" in the AgentSession constructor. You should also provide a VAD plugin for responsive interruption handling.
```python
session = AgentSession(
    turn_detection="stt",
    stt=inference.STT(
        model="assemblyai/universal-streaming",
        language="en",
    ),
    vad=silero.VAD.load(),  # Recommended for responsive interruption handling
    # ... llm, tts, etc.
)
```
Plugin
Use the AssemblyAI plugin to connect directly to AssemblyAI's API with your own API key.
Installation
Install the plugin from PyPI:
uv add "livekit-agents[assemblyai]~=1.4"
Authentication
The AssemblyAI plugin requires an AssemblyAI API key.
Set ASSEMBLYAI_API_KEY in your .env file.
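For example, a minimal .env entry (the placeholder value stands in for your own key from the AssemblyAI dashboard):

```shell
ASSEMBLYAI_API_KEY=<your-assemblyai-api-key>
```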
Usage
Use AssemblyAI STT in an AgentSession or as a standalone transcription service. For example, you can use this STT in the Voice AI quickstart.
```python
from livekit.plugins import assemblyai, silero

session = AgentSession(
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    # ... llm, tts, etc.
)
```
Parameters
This section describes some of the available parameters. See the plugin reference for a complete list of all available parameters.
Shared parameters
These parameters apply to all AssemblyAI streaming models.
model (string, Optional, Default: universal-streaming-english): STT model to use. Accepted options are u3-rt-pro, universal-streaming-english, and universal-streaming-multilingual.
keyterms_prompt (list[str], Optional): List of terms to boost recognition for.
vad_threshold (float, Optional): AssemblyAI's internal Silero VAD onset threshold. Defaults to 0.3 for Universal-3 Pro and 0.4 for Universal-Streaming. For best results, align this with LiveKit's Silero activation_threshold.
language_detection (bool, Optional): Whether to include language_code and language_confidence in turn messages. Defaults to true for Universal-3 Pro and Universal-Streaming Multilingual, and false for Universal-Streaming English.
max_turn_silence (int, Optional): The maximum duration of silence (in milliseconds) allowed in a turn before end of turn is triggered. See the model-specific sections below for defaults.
Model-specific parameters
Universal-3 Pro
min_turn_silence (int, Optional, Default: 100): Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation (. ? !) to decide whether the turn has ended. If no terminal punctuation is found, a partial is emitted and the turn continues. This parameter replaces the now-deprecated min_end_of_turn_silence_when_confident.
max_turn_silence (int, Optional, Default: 100): Maximum milliseconds of silence before the turn is forced to end, regardless of punctuation. When using turn_detection="stt", set this to 1000 to match AssemblyAI's API default.
prompt (string, Optional): Custom transcription instructions for the model. When not provided, a default prompt optimized for turn detection is used automatically. Cannot be used with keyterms_prompt. Only supported with Universal-3 Pro. Note: Prompting is a beta feature for Universal-3 Pro. Start without a prompt to establish baseline performance.
Universal-Streaming
end_of_turn_confidence_threshold (float, Optional, Default: 0.4): The confidence threshold used to determine whether the end of a turn has been reached. Not applicable to Universal-3 Pro.
min_end_of_turn_silence_when_confident (int, Optional, Default: 400): The minimum duration of silence (in milliseconds) required to detect end of turn when confident. Deprecated: this parameter has been renamed to min_turn_silence; use min_turn_silence instead.
max_turn_silence (int, Optional, Default: 1280): The maximum duration of silence (in milliseconds) allowed in a turn before end of turn is triggered.
format_turns (bool, Optional): Whether to return formatted final transcripts. Not applicable to Universal-3 Pro, which always returns formatted transcripts.
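The interplay of the Universal-Streaming endpointing parameters can be sketched in plain Python. This is a hypothetical illustration of the documented behavior, not AssemblyAI's actual implementation:

```python
def turn_ended(silence_ms: int, eot_confidence: float,
               end_of_turn_confidence_threshold: float = 0.4,
               min_turn_silence: int = 400,
               max_turn_silence: int = 1280) -> bool:
    """Illustrative confidence-based endpointing: end the turn early when
    the model is confident, otherwise wait for the maximum silence."""
    if silence_ms >= max_turn_silence:
        return True  # forced end, regardless of confidence
    if eot_confidence >= end_of_turn_confidence_threshold:
        return silence_ms >= min_turn_silence
    return False

assert turn_ended(500, 0.9)      # confident, and enough silence
assert not turn_ended(200, 0.9)  # confident, but silence too short
assert not turn_ended(900, 0.1)  # not confident, below max silence
assert turn_ended(1300, 0.1)     # forced end at max_turn_silence
```

Lowering end_of_turn_confidence_threshold makes turns end sooner on linguistic cues; max_turn_silence remains the hard backstop.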
Turn detection
Universal-3 Pro uses punctuation-based turn detection — it checks for terminal punctuation (. ? !) after periods of silence rather than using a confidence score. To use this for turn detection, set turn_detection="stt" in the AgentSession constructor.
Default parameter differences: The LiveKit plugin defaults to min_turn_silence=100 and max_turn_silence=100. The AssemblyAI API defaults are min_turn_silence=100 and max_turn_silence=1000. When using turn_detection="stt", explicitly set max_turn_silence=1000 to restore AssemblyAI's intended behavior.
min_endpointing_delay is additive in STT mode: LiveKit's min_endpointing_delay (default 0.5 seconds) is applied on top of AssemblyAI's own endpointing. Set min_endpointing_delay=0 to avoid extra latency — AssemblyAI's min_turn_silence and max_turn_silence already control the timing.
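As a back-of-the-envelope check of how these settings stack (an illustrative calculation based on the additive behavior described above, not a measured latency):

```python
def worst_case_endpoint_delay_ms(max_turn_silence_ms: int,
                                 min_endpointing_delay_s: float) -> float:
    """Rough upper bound on silence before the agent responds: AssemblyAI's
    forced turn end plus LiveKit's endpointing delay applied on top."""
    return max_turn_silence_ms + min_endpointing_delay_s * 1000

# LiveKit's default min_endpointing_delay adds 500 ms on top of the
# recommended max_turn_silence=1000:
assert worst_case_endpoint_delay_ms(1000, 0.5) == 1500
# Setting min_endpointing_delay=0 leaves AssemblyAI's timing in control:
assert worst_case_endpoint_delay_ms(1000, 0) == 1000
```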
VAD threshold alignment: Universal-3 Pro defaults to a vad_threshold of 0.3. Set LiveKit's Silero activation_threshold to 0.3 as well to ensure consistent barge-in behavior.
Tuning guidance: You will likely need to experiment with min_turn_silence and max_turn_silence depending on your use case. Increase min_turn_silence if brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking. Increase max_turn_silence if the forced turn end is cutting off users mid-thought.
```python
session = AgentSession(
    turn_detection="stt",
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    min_endpointing_delay=0,
    # ... llm, tts, etc.
)
```
You can also use LiveKit's MultilingualModel() turn detector instead of turn_detection="stt". The plugin defaults (min_turn_silence=100, max_turn_silence=100) are automatically tuned to provide transcripts to the turn detection model as fast as possible. However, raising these values (e.g., 200–300ms) may help by giving the model more time before finalizing transcripts, which can reduce over-segmentation.
For a detailed guide on configuring Universal-3 Pro with LiveKit — including entity splitting tradeoffs, VAD threshold alignment, and prompt engineering — see the AssemblyAI LiveKit guide.
AssemblyAI Universal-Streaming includes a custom phrase endpointing model that uses both audio and linguistic information to detect turn boundaries. To use this model for turn detection, set turn_detection="stt" in the AgentSession constructor. You should also provide a VAD plugin for responsive interruption handling.
```python
session = AgentSession(
    turn_detection="stt",
    stt=assemblyai.STT(
        end_of_turn_confidence_threshold=0.4,
        min_end_of_turn_silence_when_confident=400,
        max_turn_silence=1280,
    ),
    vad=silero.VAD.load(),  # Recommended for responsive interruption handling
    # ... llm, tts, etc.
)
```
Additional resources
The following resources provide more information about using AssemblyAI with LiveKit Agents.
Python package
The livekit-plugins-assemblyai package on PyPI.
Plugin reference
Reference for the AssemblyAI STT plugin.
GitHub repo
View the source or contribute to the LiveKit AssemblyAI STT plugin.
AssemblyAI docs
AssemblyAI's full docs for the Universal Streaming API.
Universal-3 Pro docs
AssemblyAI's docs for the Universal-3 Pro streaming model.
Voice AI quickstart
Get started with LiveKit Agents and AssemblyAI.
AssemblyAI LiveKit guide
Guide to using AssemblyAI Universal Streaming STT with LiveKit.