# AssemblyAI STT

> How to use AssemblyAI STT with LiveKit Agents.

- **[Use in Agent Builder](https://cloud.livekit.io/projects/p_/agents/builder/new?stt=assemblyai%2Funiversal-streaming)**: Create a new agent in your browser using assemblyai/universal-streaming

## Overview

AssemblyAI speech-to-text is available in LiveKit Agents through [LiveKit Inference](https://docs.livekit.io/agents/models/inference.md) and the [AssemblyAI plugin](#plugin). With LiveKit Inference, your agent runs on LiveKit's infrastructure to minimize latency. No separate provider API key is required, and usage and rate limits are managed through LiveKit Cloud. Use the plugin instead if you want to manage your own billing and rate limits. Pricing for LiveKit Inference is available on the [pricing page](https://livekit.com/pricing/inference#stt).

## LiveKit Inference

Use [LiveKit Inference](https://docs.livekit.io/agents/models/inference.md) to access AssemblyAI STT without a separate AssemblyAI API key.

| Model name | Model ID | Languages |
| -------- | -------- | --------- |
| Universal-3 Pro Streaming | `assemblyai/u3-rt-pro` | `en`, `en-US`, `en-GB`, `en-AU`, `en-CA`, `en-IN`, `en-NZ`, `es`, `es-ES`, `es-MX`, `es-AR`, `es-CO`, `es-CL`, `es-PE`, `es-VE`, `es-EC`, `es-GT`, `es-CU`, `es-BO`, `es-DO`, `es-HN`, `es-PY`, `es-SV`, `es-NI`, `es-CR`, `es-PA`, `es-UY`, `es-PR`, `fr`, `fr-FR`, `fr-CA`, `fr-BE`, `fr-CH`, `de`, `de-DE`, `de-AT`, `de-CH`, `it`, `it-IT`, `it-CH`, `pt`, `pt-BR`, `pt-PT` |
| Universal-Streaming | `assemblyai/universal-streaming` | `en`, `en-US` |
| Universal-Streaming-Multilingual | `assemblyai/universal-streaming-multilingual` | `multi`, `en`, `en-US`, `en-GB`, `en-AU`, `en-CA`, `en-IN`, `en-NZ`, `es`, `es-ES`, `es-MX`, `es-AR`, `es-CO`, `es-CL`, `es-PE`, `es-VE`, `es-EC`, `es-GT`, `es-CU`, `es-BO`, `es-DO`, `es-HN`, `es-PY`, `es-SV`, `es-NI`, `es-CR`, `es-PA`, `es-UY`, `es-PR`, `fr`, `fr-FR`, `fr-CA`, `fr-BE`, `fr-CH`, `de`, `de-DE`, `de-AT`, `de-CH`, `it`, `it-IT`, `it-CH`, `pt`, `pt-BR`, `pt-PT` |

### Usage

To use AssemblyAI, use the `STT` class from the `inference` module:

**Python**:

```python
from livekit.agents import AgentSession, inference

session = AgentSession(
    stt=inference.STT(
        model="assemblyai/u3-rt-pro", 
        language="en"
    ),
    # ... llm, tts, vad, turn_handling, etc.
)

```

---

**Node.js**:

```typescript
import { AgentSession, inference } from '@livekit/agents';

const session = new AgentSession({
    stt: new inference.STT({ 
        model: "assemblyai/u3-rt-pro", 
        language: "en" 
    }),
    // ... llm, tts, vad, turnHandling, etc.
});

```

### Parameters

- **`model`** _(string)_: The model to use for the STT. Available models: `assemblyai/u3-rt-pro`, `assemblyai/universal-streaming`, `assemblyai/universal-streaming-multilingual`.

- **`language`** _(LanguageCode)_ (optional): [Language code](https://docs.livekit.io/agents/models/stt.md#language-codes) for the transcription. If not set, the provider default applies. Universal-3 Pro and Universal-Streaming Multilingual automatically detect between English, Spanish, German, French, Portuguese, and Italian.

- **`extra_kwargs`** _(dict)_ (optional): Additional parameters to pass to the AssemblyAI streaming API. Supported fields depend on the selected model. See [model parameters](#model-parameters) for supported fields.

In Node.js this parameter is called `modelOptions`.

#### Model parameters

Pass the following parameters inside `extra_kwargs` (Python) or `modelOptions` (Node.js).

**All models:**

| Parameter | Type | Default | Notes |
| --------- | ---- | ------- | ----- |
| `keyterms_prompt` | `list[str]` |  | List of terms to boost recognition accuracy for. |
| `language_detection` | `bool` |  | Whether to include `language_code` and `language_confidence` in turn messages. Defaults to `True` for Universal-3 Pro and Universal-Streaming Multilingual; `False` for Universal-Streaming. |
| `inactivity_timeout` | `float` |  | Duration of inactivity in seconds before the session closes. |
| `min_turn_silence` | `int` |  | Minimum duration of silence in milliseconds before the model checks for end of turn. Universal-3 Pro defaults to `100` ms (triggers the punctuation-based EOT check); Universal-Streaming uses it as the confident-EOT silence floor. Replaces the deprecated `min_end_of_turn_silence_when_confident`. |
| `max_turn_silence` | `int` |  | Maximum duration of silence in milliseconds allowed in a turn before end of turn is triggered. |
| `vad_threshold` | `float` |  | Confidence threshold for classifying audio frames as silence. Frames below this value are considered silent. Increase in noisy environments. Server-side defaults: `0.3` (Universal-3 Pro), `0.4` (Universal-Streaming). Valid range: `0.0`–`1.0`. |
| `domain` | `string` |  | Enables domain-specific recognition. Set to `medical-v1` to use AssemblyAI's [Medical Mode](https://www.assemblyai.com/docs/streaming/medical-mode). Works with all three streaming models. Supported languages: English, Spanish, German, French. Other languages are ignored with a warning. |
| `speaker_labels` | `bool` | `False` | Set to `True` to enable [speaker diarization](#speaker-diarization). |

**Model-specific parameters:**

**Universal-3 Pro**:

| Parameter | Type | Default | Notes |
| --------- | ---- | ------- | ----- |
| `prompt` | `str` |  | Custom transcription instructions for the model. When not set, a default prompt optimized for turn detection is used. |
| `continuous_partials` | `bool` | `false` | Emit a non-final partial transcript approximately every 3 seconds while speech continues, regardless of silence. Useful for long, uninterrupted turns. The first partial still arrives at the early-partial timing controlled by `interruption_delay`. |
| `interruption_delay` | `int` | `500` | Milliseconds before the first early partial is emitted. Lower values produce a faster time-to-first-token for barge-in; higher values produce more confident first partials. Valid range: `0`–`1000`. |

> ℹ️ **Prompt and Keyterms Prompt**
> 
> You can use `prompt` and `keyterms_prompt` together in the same streaming request. When you use `keyterms_prompt`, your boosted words are appended to the default prompt (or your custom `prompt` if provided) automatically.

---

**Universal-Streaming**:

| Parameter | Type | Default | Notes |
| --------- | ---- | ------- | ----- |
| `format_turns` | `bool` | `False` | Whether to return formatted final transcripts. |
| `end_of_turn_confidence_threshold` | `float` | `0.01` | Confidence threshold for determining the end of a turn. |
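
The constraints in the tables above can be checked before the options reach the API. The following is a minimal sketch, not part of LiveKit or AssemblyAI: `validate_extra_kwargs` is a hypothetical helper that enforces only what is stated above (the `vad_threshold` and `interruption_delay` ranges, and which parameters are model-specific). It assumes the Universal-Streaming table applies to both Universal-Streaming model IDs.

```python
# Hypothetical helper (not part of LiveKit): checks an extra_kwargs dict
# against the ranges and model-specific parameters documented above.
U3_PRO_ONLY = {"prompt", "continuous_partials", "interruption_delay"}
UNIVERSAL_STREAMING_ONLY = {"format_turns", "end_of_turn_confidence_threshold"}


def validate_extra_kwargs(model: str, extra_kwargs: dict) -> dict:
    """Return extra_kwargs unchanged, or raise ValueError on a documented violation."""
    is_u3_pro = model == "assemblyai/u3-rt-pro"
    # Assumption: every model other than u3-rt-pro is a Universal-Streaming variant.
    disallowed = UNIVERSAL_STREAMING_ONLY if is_u3_pro else U3_PRO_ONLY
    misplaced = sorted(disallowed.intersection(extra_kwargs))
    if misplaced:
        raise ValueError(f"{misplaced} not supported by {model}")

    vad = extra_kwargs.get("vad_threshold")
    if vad is not None and not 0.0 <= vad <= 1.0:
        raise ValueError("vad_threshold must be within 0.0-1.0")

    delay = extra_kwargs.get("interruption_delay")
    if delay is not None and not 0 <= delay <= 1000:
        raise ValueError("interruption_delay must be within 0-1000 ms")

    return extra_kwargs


stt_options = {
    "model": "assemblyai/u3-rt-pro",
    "extra_kwargs": validate_extra_kwargs(
        "assemblyai/u3-rt-pro",
        {"keyterms_prompt": ["LiveKit", "AssemblyAI"], "vad_threshold": 0.3},
    ),
}
```

The resulting `stt_options` dictionary uses the same keyword names as the usage example, so it could be passed as `inference.STT(**stt_options)`.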

### String descriptors

As a shortcut, you can also pass a [model ID](#inference) string directly to the `stt` argument in your `AgentSession`:

**Python**:

```python
from livekit.agents import AgentSession

session = AgentSession(
    stt="assemblyai/u3-rt-pro:en",
    # ... llm, tts, vad, turn_handling, etc.
)

```

---

**Node.js**:

```typescript
import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
    stt: "assemblyai/u3-rt-pro:en",
    // ... llm, tts, vad, turnHandling, etc.
});

```
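
The descriptor in these examples appears to combine a model ID and a language code, separated by a colon. As an illustration of that shape only, the sketch below splits a descriptor into its two parts. `split_descriptor` is a hypothetical name, and this is not LiveKit's own parsing code.

```python
from typing import Optional, Tuple


# Hypothetical illustration of the "<model ID>:<language>" descriptor shape.
# This is not LiveKit's parser; it only mirrors the format shown above.
def split_descriptor(descriptor: str) -> Tuple[str, Optional[str]]:
    """Split 'provider/model:lang' into (model_id, language or None)."""
    model_id, sep, language = descriptor.partition(":")
    return model_id, (language if sep and language else None)


model_id, language = split_descriptor("assemblyai/u3-rt-pro:en")
# Presumed explicit form: inference.STT(model=model_id, language=language)
```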

### Turn detection

**Universal-3 Pro**:

Universal-3 Pro uses **punctuation-based turn detection**. It checks for terminal punctuation (`.` `?` `!`) after periods of silence rather than using a confidence score. To use this for [turn detection](https://docs.livekit.io/agents/logic/turns.md), set `turn_detection="stt"` in the turn handling options.

**Default parameter differences:** The LiveKit plugin defaults to `min_turn_silence=100` and `max_turn_silence=100`. The AssemblyAI API defaults are `min_turn_silence=100` and `max_turn_silence=1000`. When using `turn_detection="stt"`, explicitly set `max_turn_silence=1000` to restore AssemblyAI's intended behavior.

**Endpointing delay is additive in STT mode:** LiveKit's default `min_delay` (0.5 seconds) in the turn handling endpointing options is applied on top of AssemblyAI's own endpointing. Set `endpointing.min_delay` to `0` in the turn handling options to avoid extra latency — AssemblyAI's `min_turn_silence` and `max_turn_silence` already control the timing.

**VAD threshold alignment:** Universal-3 Pro defaults to a `vad_threshold` of `0.3`. Set LiveKit's Silero `activation_threshold` to `0.3` as well to ensure consistent barge-in behavior.

**Tuning guidance:** Experiment with `min_turn_silence` and `max_turn_silence`. Settings can vary depending on your use case. Increase `min_turn_silence` if brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking. Increase `max_turn_silence` if the forced turn end is cutting off users mid-thought.

For a detailed guide on configuring Universal-3 Pro with LiveKit — including entity splitting tradeoffs, VAD threshold alignment, and prompt engineering — see the [AssemblyAI LiveKit guide](https://www.assemblyai.com/docs/voice-agents/livekit-u3-rt-pro).

```python
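from livekit.agents import AgentSession, TurnHandlingOptions, inference
from livekit.plugins import silero

session = AgentSession(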
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
        endpointing={"min_delay": 0},
    ),
    stt=inference.STT(
        model="assemblyai/u3-rt-pro",
        extra_kwargs={
            "min_turn_silence": 100,
            "max_turn_silence": 1000,
            "vad_threshold": 0.3,
        }
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    # ... llm, tts, etc.
)

```
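
To see why the `max_turn_silence` default matters, it can help to model the decision described above. The following is a simplified sketch under stated assumptions: the real decision is made server-side by AssemblyAI, and `end_of_turn_decision` is a hypothetical function that encodes only the rules in this section (a speculative punctuation check after `min_turn_silence`, a forced end after `max_turn_silence`).

```python
# Simplified model of the punctuation-based end-of-turn logic described above.
# Hypothetical helper for reasoning about settings; the real decision is made
# server-side by AssemblyAI.
TERMINAL_PUNCTUATION = (".", "?", "!")


def end_of_turn_decision(
    transcript: str,
    silence_ms: int,
    min_turn_silence: int = 100,
    max_turn_silence: int = 1000,
) -> str:
    """Return 'keep_listening', 'emit_partial', or 'end_of_turn'."""
    if silence_ms >= max_turn_silence:
        return "end_of_turn"  # forced end, regardless of punctuation
    if silence_ms >= min_turn_silence:
        if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
            return "end_of_turn"  # speculative check found terminal punctuation
        return "emit_partial"  # no terminal punctuation: the turn continues
    return "keep_listening"


# With max_turn_silence=100, a 100 ms pause ends the turn mid-sentence.
print(end_of_turn_decision("I'd like to book a", 100, max_turn_silence=100))   # end_of_turn
# With max_turn_silence=1000, the same pause only produces a partial.
print(end_of_turn_decision("I'd like to book a", 100, max_turn_silence=1000))  # emit_partial
```

In this model, setting both values to `100` makes the punctuation check irrelevant, because every pause that reaches the minimum also reaches the maximum.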

---

**Universal-Streaming**:

AssemblyAI includes a custom phrase endpointing model that uses both audio and linguistic information to detect turn boundaries. To use this model for [turn detection](https://docs.livekit.io/agents/logic/turns.md), set `turn_detection="stt"` in the turn handling options. You should also provide a VAD plugin for responsive interruption handling.

```python
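from livekit.agents import AgentSession, TurnHandlingOptions, inference
from livekit.plugins import silero

session = AgentSession(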
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
    ),
    stt=inference.STT(
        model="assemblyai/universal-streaming", 
        language="en"
    ),
    vad=silero.VAD.load(),  # Recommended for responsive interruption handling
    # ... llm, tts, etc.
)

```

## Plugin

LiveKit's plugin support for AssemblyAI lets you connect directly to AssemblyAI's API with your own API key. For Node.js, use [LiveKit Inference](#inference).

Available in:
- [ ] Node.js
- [x] Python

### Installation

Install the plugin from PyPI:

```shell
uv add "livekit-agents[assemblyai]~=1.5"

```

### Authentication

The AssemblyAI plugin requires an [AssemblyAI API key](https://www.assemblyai.com/docs/api-reference/overview#authorization).

Set `ASSEMBLYAI_API_KEY` in your `.env` file.
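
A missing key is easier to diagnose at startup than on the first transcription request. The following is a minimal sketch of such a check, assuming your `.env` file has already been loaded into the process environment by whichever loader you use. `require_assemblyai_key` is a hypothetical helper, not part of the plugin.

```python
import os
from collections.abc import Mapping


def require_assemblyai_key(env: Mapping = os.environ) -> str:
    """Return the AssemblyAI API key, or raise a clear error if it is missing."""
    key = env.get("ASSEMBLYAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ASSEMBLYAI_API_KEY is not set; add it to your .env file")
    return key
```

Calling `require_assemblyai_key()` once before you build the `AgentSession` surfaces the problem early.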

### Usage

Use AssemblyAI STT in an `AgentSession` or as a standalone transcription service. For example, you can use this STT in the [Voice AI quickstart](https://docs.livekit.io/agents/start/voice-ai.md).

```python
from livekit.agents import AgentSession
from livekit.plugins import assemblyai, silero

session = AgentSession(
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    # ... llm, tts, etc.
)

```

### Parameters

This section describes some of the available parameters. See the [plugin reference](https://docs.livekit.io/reference/python/livekit/plugins/assemblyai/stt.html.md) for a complete list of all available parameters.

#### Shared parameters

These parameters apply to all AssemblyAI streaming models.

- **`model`** _(string)_ (optional) - Default: `universal-streaming-english`: STT model to use. Accepted options are `u3-rt-pro`, `universal-streaming-english`, and `universal-streaming-multilingual`.

- **`keyterms_prompt`** _(list[str])_ (optional): List of terms to boost recognition for.

- **`vad_threshold`** _(float)_ (optional): AssemblyAI's internal Silero VAD onset threshold. Defaults to `0.3` for Universal-3 Pro and `0.4` for Universal-Streaming. For best results, align this with LiveKit's Silero `activation_threshold`.

- **`language_detection`** _(bool)_ (optional): Whether to include `language_code` and `language_confidence` in turn messages. Defaults to `true` for Universal-3 Pro and Universal-Streaming Multilingual, `false` for Universal-Streaming English.

- **`min_turn_silence`** _(int)_ (optional) - Default: `100`: The minimum duration of silence (in milliseconds) before the model checks for end of turn. The LiveKit plugin defaults this to `100` for **all** streaming models. Replaces the deprecated `min_end_of_turn_silence_when_confident`. See the model-specific sections below for how each model uses this parameter.

- **`max_turn_silence`** _(int)_ (optional): The maximum duration of silence (in milliseconds) allowed in a turn before end of turn is triggered. See the model-specific sections below for defaults.

- **`speaker_labels`** _(bool)_ (optional): Enable speaker diarization. When set to `True`, each transcript event includes a `speaker_id` identifying the speaker (`"A"`, `"B"`, etc.). Short utterances under ~1 second return `speaker_id=None`. Use with [`MultiSpeakerAdapter`](https://docs.livekit.io/agents/models/stt.md#speaker-diarization) to detect the primary speaker or format transcripts by speaker.

- **`max_speakers`** _(int)_ (optional): Maximum number of speakers to detect. If not set, AssemblyAI detects the number of speakers automatically.

- **`domain`** _(string)_ (optional): Enables domain-specific recognition. Set to `medical-v1` to use AssemblyAI's [Medical Mode](https://www.assemblyai.com/docs/streaming/medical-mode) for improved accuracy on medical terminology such as medication names, procedures, conditions, and dosages. Works with all three streaming models.

#### Model-specific parameters

**Universal-3 Pro**:

- **`min_turn_silence`** _(int)_ (optional) - Default: `100`: Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation (`.` `?` `!`) to decide whether the turn has ended. If no terminal punctuation is found, a partial is emitted and the turn continues.

This parameter replaces the now deprecated `min_end_of_turn_silence_when_confident`.

- **`max_turn_silence`** _(int)_ (optional) - Default: `100`: Maximum milliseconds of silence before the turn is forced to end, regardless of punctuation. The LiveKit plugin defaults to `100`. When using `turn_detection="stt"`, set this to `1000` to match AssemblyAI's API default.

- **`prompt`** _(string)_ (optional): Custom transcription instructions for the model. When not provided, a default prompt optimized for turn detection is used automatically. This parameter is only supported with Universal-3 Pro.

**Note:** Prompting is a beta feature for Universal-3 Pro. Start without a prompt to establish baseline performance.

- **`continuous_partials`** _(bool)_ (optional) - Default: `True`: LiveKit plugin default is `True` — AssemblyAI's server default is `False`. When `True`, the model emits additional partial transcripts at a steady ~3 second cadence during long turns, on top of the baseline partials emitted at the first-partial point (`interruption_delay`) and at each `min_turn_silence` silence period. Useful for long, uninterrupted turns where silence-based partials don't fire often enough for downstream consumers. Can be updated mid-session via `update_options()`. Only supported with Universal-3 Pro (`u3-rt-pro`); passing it with any other model raises a `ValueError`.

- **`interruption_delay`** _(int)_ (optional) - Default: `500`: How soon (in milliseconds) the first early partial is emitted. Lower values produce a faster time-to-first-token for barge-in; higher values produce more confident first partials. Set at construction only — it cannot be changed mid-session via `update_options()`. Only supported with Universal-3 Pro (`u3-rt-pro`); passing it with any other model raises a `ValueError`.

Valid range: `0`–`1000`.

> ℹ️ **Prompt and Keyterms Prompt**
> 
> You can use `prompt` and `keyterms_prompt` together in the same streaming request. When you use `keyterms_prompt`, your boosted words are appended to the default prompt (or your custom `prompt` if provided) automatically.

---

**Universal-Streaming**:

- **`end_of_turn_confidence_threshold`** _(float)_ (optional) - Default: `0.4`: The confidence threshold to use when determining if the end of a turn has been reached. Not applicable to Universal-3 Pro.

- **`min_end_of_turn_silence_when_confident`** _(int)_ (optional): The minimum duration of silence (in milliseconds) required to detect end of turn when confident.

**Deprecated:** This parameter has been renamed to `min_turn_silence`. Use `min_turn_silence` instead. Note that the LiveKit plugin defaults `min_turn_silence` to `100` for **all** streaming models (not just Universal-3 Pro), so the effective default is `100` ms.

- **`max_turn_silence`** _(int)_ (optional) - Default: `1280`: The maximum duration of silence (in milliseconds) allowed in a turn before end of turn is triggered.

- **`format_turns`** _(bool)_ (optional): Whether to return formatted final transcripts. Not applicable to Universal-3 Pro (always returns formatted transcripts).

### Turn detection

**Universal-3 Pro**:

Universal-3 Pro uses **punctuation-based turn detection** — it checks for terminal punctuation (`.` `?` `!`) after periods of silence rather than using a confidence score. To use this for [turn detection](https://docs.livekit.io/agents/logic/turns.md), set `turn_detection="stt"` in the turn handling options.

**Default parameter differences:** The LiveKit plugin defaults to `min_turn_silence=100` and `max_turn_silence=100`. The AssemblyAI API defaults are `min_turn_silence=100` and `max_turn_silence=1000`. When using `turn_detection="stt"`, explicitly set `max_turn_silence=1000` to restore AssemblyAI's intended behavior.

**Endpointing delay is additive in STT mode:** LiveKit's default `min_delay` (0.5 seconds) in the turn handling endpointing options is applied on top of AssemblyAI's own endpointing. Set `endpointing.min_delay` to `0` in the turn handling options to avoid extra latency — AssemblyAI's `min_turn_silence` and `max_turn_silence` already control the timing.

**VAD threshold alignment:** Universal-3 Pro defaults to a `vad_threshold` of `0.3`. Set LiveKit's Silero `activation_threshold` to `0.3` as well to ensure consistent barge-in behavior.

**Tuning guidance:** Experiment with `min_turn_silence` and `max_turn_silence`. Settings can vary depending on your use case. Increase `min_turn_silence` if brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking. Increase `max_turn_silence` if the forced turn end is cutting off users mid-thought.

```python
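from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import assemblyai, silero

session = AgentSession(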
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
        endpointing={"min_delay": 0},
    ),
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    # ... llm, tts, etc.
)

```
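
The additive endpointing delay described above can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, not a measurement: it assumes the total delay is AssemblyAI's silence window plus LiveKit's `min_delay`, and it ignores network and model processing time. `end_of_turn_delay_ms` is a hypothetical helper.

```python
# Back-of-the-envelope model (an assumption, not a measured figure): the delay
# after the user stops speaking is AssemblyAI's silence window plus LiveKit's
# endpointing min_delay. Network and model processing time are ignored.
def end_of_turn_delay_ms(silence_window_ms: int, min_delay_s: float) -> float:
    return silence_window_ms + min_delay_s * 1000


# Punctuated sentence (min_turn_silence=100) and forced end (max_turn_silence=1000)
for label, window in (("punctuated", 100), ("forced", 1000)):
    default = end_of_turn_delay_ms(window, min_delay_s=0.5)
    tuned = end_of_turn_delay_ms(window, min_delay_s=0)
    print(f"{label}: {default:.0f} ms with default min_delay, {tuned:.0f} ms with min_delay=0")
```

Under this model, the default `min_delay` adds 500 ms to every turn, which is most noticeable on punctuated turns that would otherwise end after about 100 ms.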

You can also use LiveKit's [`MultilingualModel()`](https://docs.livekit.io/agents/logic/turns/turn-detector.md) turn detector instead of `turn_detection="stt"`. The plugin defaults (`min_turn_silence=100`, `max_turn_silence=100`) are tuned to deliver transcripts to the turn detection model as quickly as possible. If transcripts are over-segmented, raising these values (for example, to 200–300 ms) gives AssemblyAI more time before it finalizes each transcript.

For a detailed guide on configuring Universal-3 Pro with LiveKit — including entity splitting tradeoffs, VAD threshold alignment, and prompt engineering — see the [AssemblyAI LiveKit guide](https://www.assemblyai.com/docs/voice-agents/livekit-u3-rt-pro).

---

**Universal-Streaming**:

AssemblyAI Universal-Streaming includes a custom phrase endpointing model that uses both audio and linguistic information to detect turn boundaries. To use this model for [turn detection](https://docs.livekit.io/agents/logic/turns.md), set `turn_detection="stt"` in the turn handling options. You should also provide a VAD plugin for responsive interruption handling.

```python
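from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import assemblyai, silero

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
    ),
    stt=assemblyai.STT(
        end_of_turn_confidence_threshold=0.4,
        min_turn_silence=400,  # replaces the deprecated min_end_of_turn_silence_when_confident
        max_turn_silence=1280,
    ),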
    vad=silero.VAD.load(),  # Recommended for responsive interruption handling
    # ... llm, tts, etc.
)

```

### Session information

When a WebSocket session starts, AssemblyAI sends a `Begin` event that includes a session ID and expiry timestamp. The plugin exposes the following information on the `SpeechStream` object:

| Field | Description |
| ----- | ----------- |
| `session_id` | UUID string identifying the transcription session. The session ID is also logged automatically at INFO level. Share it with AssemblyAI support when troubleshooting transcription issues. |
| `expires_at` | Unix timestamp indicating when the session expires. |

```python
stream = stt.stream()
async for event in stream:
    # session_id is set before any speech events arrive
    print(stream.session_id)   # e.g. "676d673c-83fc-4d8a-bd95-bfe23b1c5a50"
    print(stream.expires_at)   # e.g. 1773775624

```

These properties are `None` until the `Begin` event is received from AssemblyAI, which happens shortly after the stream starts.

The session ID is also automatically logged:

```bash
AssemblyAI session started id=676d673c-83fc-4d8a-bd95-bfe23b1c5a50 expires_at=1773775624

```
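
Because `expires_at` is a Unix timestamp, you can convert it into a remaining-time value, for example to decide when to reconnect. The following is a minimal sketch that assumes the timestamp is in seconds, as the example value above suggests. `seconds_until_expiry` is a hypothetical helper, not part of the plugin.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_until_expiry(
    expires_at: Optional[int], now: Optional[datetime] = None
) -> Optional[float]:
    """Return seconds left in the session, or None before the Begin event arrives."""
    if expires_at is None:
        return None
    now = now or datetime.now(timezone.utc)
    expiry = datetime.fromtimestamp(expires_at, tz=timezone.utc)
    return (expiry - now).total_seconds()
```

With a stream as in the earlier example, `seconds_until_expiry(stream.expires_at)` returns `None` until the `Begin` event has been received.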

## Speaker diarization

Enable speaker diarization so the STT assigns a speaker identifier to each word or segment. When enabled, transcript events include a `speaker_id`, and the STT reports `capabilities.diarization = True`.

With diarization enabled, you can wrap the AssemblyAI STT with [`MultiSpeakerAdapter`](https://docs.livekit.io/agents/models/stt.md#speaker-diarization) for primary speaker detection and transcript formatting.

Enable speaker diarization in the `STT` constructor:

**LiveKit Inference**:

```python
stt = inference.STT(
    model="assemblyai/u3-rt-pro",
    extra_kwargs={
        "speaker_labels": True,
    },
)

```

---

**Plugin**:

```python
stt = assemblyai.STT(
    model="u3-rt-pro",
    speaker_labels=True,
)

```

Speaker labels are assigned alphabetically (`"A"`, `"B"`, etc.) per session. Short utterances under ~1 second return `speaker_id=None`.
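
For a quick look at diarized output without the adapter, you can group transcript text by speaker yourself. The sketch below is illustrative only: `Utterance` is a hypothetical container, not a LiveKit type, and `MultiSpeakerAdapter` remains the documented way to format transcripts by speaker. It labels utterances that have no `speaker_id` as `unknown`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Utterance:
    # Hypothetical container for illustration; not a LiveKit type.
    speaker_id: Optional[str]
    text: str


def format_by_speaker(utterances: Iterable[Utterance]) -> List[str]:
    """Prefix each utterance with its speaker label, merging consecutive ones."""
    lines: List[str] = []
    previous: Optional[str] = None
    for u in utterances:
        label = u.speaker_id or "unknown"  # short utterances may have no speaker
        if lines and label == previous:
            lines[-1] += " " + u.text
        else:
            lines.append(f"Speaker {label}: {u.text}")
        previous = label
    return lines
```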

## Additional resources

The following resources provide more information about using AssemblyAI with LiveKit Agents.

- **[Python package](https://pypi.org/project/livekit-plugins-assemblyai/)**: The `livekit-plugins-assemblyai` package on PyPI.

- **[Plugin reference](https://docs.livekit.io/reference/python/livekit/plugins/assemblyai/stt.html.md)**: Reference for the AssemblyAI STT plugin.

- **[GitHub repo](https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-assemblyai)**: View the source or contribute to the LiveKit AssemblyAI STT plugin.

- **[AssemblyAI docs](https://www.assemblyai.com/docs/speech-to-text/universal-streaming)**: AssemblyAI's full docs for the Universal Streaming API.

- **[Universal-3 Pro docs](https://www.assemblyai.com/docs/streaming/universal-3-pro)**: AssemblyAI's docs for the Universal-3 Pro streaming model.

- **[Voice AI quickstart](https://docs.livekit.io/agents/start/voice-ai.md)**: Get started with LiveKit Agents and AssemblyAI.

- **[AssemblyAI LiveKit guide](https://www.assemblyai.com/docs/voice-agents/livekit-u3-rt-pro)**: Guide to using AssemblyAI Universal Streaming STT with LiveKit.

---

This document was rendered at 2026-06-07T11:35:39.327Z.
For the latest version of this document, see [https://docs.livekit.io/agents/models/stt/assemblyai.md](https://docs.livekit.io/agents/models/stt/assemblyai.md).