Overview
AssemblyAI speech-to-text is available in LiveKit Agents through LiveKit Inference and the AssemblyAI plugin. Pricing for LiveKit Inference is available on the pricing page.
| Model name | Model ID | Languages |
|---|---|---|
| Universal-3 Pro Streaming | assemblyai/u3-rt-pro | en, en-US, en-GB, en-AU, en-CA, en-IN, en-NZ, es, es-ES, es-MX, es-AR, es-CO, es-CL, es-PE, es-VE, es-EC, es-GT, es-CU, es-BO, es-DO, es-HN, es-PY, es-SV, es-NI, es-CR, es-PA, es-UY, es-PR, fr, fr-FR, fr-CA, fr-BE, fr-CH, de, de-DE, de-AT, de-CH, it, it-IT, it-CH, pt, pt-BR, pt-PT |
| Universal-Streaming | assemblyai/universal-streaming | en, en-US |
| Universal-Streaming-Multilingual | assemblyai/universal-streaming-multilingual | en, en-US, en-GB, en-AU, en-CA, en-IN, en-NZ, es, es-ES, es-MX, es-AR, es-CO, es-CL, es-PE, es-VE, es-EC, es-GT, es-CU, es-BO, es-DO, es-HN, es-PY, es-SV, es-NI, es-CR, es-PA, es-UY, es-PR, fr, fr-FR, fr-CA, fr-BE, fr-CH, de, de-DE, de-AT, de-CH, it, it-IT, it-CH, pt, pt-BR, pt-PT |
LiveKit Inference
Use LiveKit Inference to access AssemblyAI STT without a separate AssemblyAI API key.
Usage
To use AssemblyAI, use the STT class from the inference module:
```python
from livekit.agents import AgentSession, inference

session = AgentSession(
    stt=inference.STT(
        model="assemblyai/u3-rt-pro",
        language="en",
    ),
    # ... llm, tts, vad, turn_detection, etc.
)
```
```typescript
import { AgentSession, inference } from '@livekit/agents';

const session = new AgentSession({
  stt: new inference.STT({
    model: "assemblyai/u3-rt-pro",
    language: "en",
  }),
  // ... llm, tts, vad, turn_detection, etc.
});
```
Parameters
model (string, Required): The model to use for STT. Available models: assemblyai/u3-rt-pro, assemblyai/universal-streaming, assemblyai/universal-streaming-multilingual.
language (string, Optional): Language code for the transcription. If not set, the provider default applies. Universal-3 Pro and Universal-Streaming Multilingual automatically detect between English, Spanish, German, French, Portuguese, and Italian.
extra_kwargs (dict, Optional): Additional parameters to pass to the AssemblyAI streaming API. Available parameters depend on the model:
- All models: keyterms_prompt, vad_threshold, language_detection, max_turn_silence, min_turn_silence
- Universal-3 Pro: prompt
- Universal-Streaming: format_turns, end_of_turn_confidence_threshold, min_end_of_turn_silence_when_confident (deprecated; use min_turn_silence)
See the provider's documentation for more information.
In Node.js this parameter is called modelOptions.
String descriptors
As a shortcut, you can also pass a model descriptor string directly to the stt argument in your AgentSession:
```python
from livekit.agents import AgentSession

session = AgentSession(
    stt="assemblyai/u3-rt-pro:en",
    # ... llm, tts, vad, turn_detection, etc.
)
```
```typescript
import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
  stt: "assemblyai/u3-rt-pro:en",
  // ... llm, tts, vad, turn_detection, etc.
});
```
Turn detection
Universal-3 Pro uses punctuation-based turn detection — it checks for terminal punctuation (. ? !) after periods of silence rather than using a confidence score. To use this for turn detection, set turn_detection="stt" in the AgentSession constructor.
Default parameter differences: The LiveKit plugin defaults to min_turn_silence=100 and max_turn_silence=100. The AssemblyAI API defaults are min_turn_silence=100 and max_turn_silence=1000. When using turn_detection="stt", explicitly set max_turn_silence=1000 to restore AssemblyAI's intended behavior.
min_endpointing_delay is additive in STT mode: LiveKit's min_endpointing_delay (default 0.5 seconds) is applied on top of AssemblyAI's own endpointing. Set min_endpointing_delay=0 to avoid extra latency — AssemblyAI's min_turn_silence and max_turn_silence already control the timing.
VAD threshold alignment: Universal-3 Pro defaults to a vad_threshold of 0.3. Set LiveKit's Silero activation_threshold to 0.3 as well to ensure consistent barge-in behavior.
Tuning guidance: You will likely need to experiment with min_turn_silence and max_turn_silence depending on your use case. Increase min_turn_silence if brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking. Increase max_turn_silence if the forced turn end is cutting off users mid-thought.
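To build intuition for how min_turn_silence and max_turn_silence interact, the two-stage check described above can be sketched in plain Python. This is a hypothetical illustration of the documented behavior, not AssemblyAI's actual implementation:

```python
def turn_ended(silence_ms: int, transcript: str,
               min_turn_silence: int = 100,
               max_turn_silence: int = 1000) -> bool:
    """Illustrative two-stage endpointing: a speculative punctuation check
    at min_turn_silence, then a forced turn end at max_turn_silence."""
    if silence_ms >= max_turn_silence:
        return True  # forced end, regardless of punctuation
    if silence_ms >= min_turn_silence:
        # speculative check: end only on terminal punctuation
        return transcript.rstrip().endswith(('.', '?', '!'))
    return False

# A brief pause after a complete sentence ends the turn:
assert turn_ended(150, "How can I help you today?")
# The same pause mid-sentence does not:
assert not turn_ended(150, "My account number is")
# A long enough silence forces the turn to end anyway:
assert turn_ended(1200, "My account number is")
```

Under this model, raising min_turn_silence delays the punctuation check (helping with brief pauses), while raising max_turn_silence gives users longer before a forced cutoff.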
For a detailed guide on configuring Universal-3 Pro with LiveKit — including entity splitting tradeoffs, VAD threshold alignment, and prompt engineering — see the AssemblyAI LiveKit guide.
```python
session = AgentSession(
    turn_detection="stt",
    stt=inference.STT(
        model="assemblyai/u3-rt-pro",
        extra_kwargs={
            "min_turn_silence": 100,
            "max_turn_silence": 1000,
            "vad_threshold": 0.3,
        },
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    min_endpointing_delay=0,
    # ... llm, tts, etc.
)
```
AssemblyAI Universal-Streaming includes a custom phrase endpointing model that uses both audio and linguistic information to detect turn boundaries. To use this model for turn detection, set turn_detection="stt" in the AgentSession constructor. You should also provide a VAD plugin for responsive interruption handling.
```python
session = AgentSession(
    turn_detection="stt",
    stt=inference.STT(
        model="assemblyai/universal-streaming",
        language="en",
    ),
    vad=silero.VAD.load(),  # Recommended for responsive interruption handling
    # ... llm, tts, etc.
)
```
Plugin
Use the AssemblyAI plugin to connect directly to AssemblyAI's API with your own API key.
Installation
Install the plugin from PyPI:
uv add "livekit-agents[assemblyai]~=1.4"
Authentication
The AssemblyAI plugin requires an AssemblyAI API key.
Set ASSEMBLYAI_API_KEY in your .env file.
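For example, a minimal .env entry (the placeholder value stands in for your own key from the AssemblyAI dashboard):

```shell
ASSEMBLYAI_API_KEY=<your-assemblyai-api-key>
```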
Usage
Use AssemblyAI STT in an AgentSession or as a standalone transcription service. For example, you can use this STT in the Voice AI quickstart.
```python
from livekit.plugins import assemblyai, silero

session = AgentSession(
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    # ... llm, tts, etc.
)
```
Parameters
This section describes some of the available parameters. See the plugin reference for a complete list of all available parameters.
Shared parameters
These parameters apply to all AssemblyAI streaming models.
model (string, Optional, Default: universal-streaming-english): STT model to use. Accepted options are u3-rt-pro, universal-streaming-english, and universal-streaming-multilingual.
keyterms_prompt (list[str], Optional): List of terms to boost recognition for.
vad_threshold (float, Optional): AssemblyAI's internal Silero VAD onset threshold. Defaults to 0.3 for Universal-3 Pro and 0.4 for Universal-Streaming. For best results, align this with LiveKit's Silero activation_threshold.
language_detection (bool, Optional): Whether to include language_code and language_confidence in turn messages. Defaults to true for Universal-3 Pro and Universal-Streaming Multilingual, and false for Universal-Streaming English.
max_turn_silence (int, Optional): The maximum duration of silence (in milliseconds) allowed in a turn before end of turn is triggered. See the model-specific sections below for defaults.
Model-specific parameters
Universal-3 Pro
min_turn_silence (int, Optional, Default: 100): Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation (. ? !) to decide whether the turn has ended. If no terminal punctuation is found, a partial is emitted and the turn continues. This parameter replaces the now-deprecated min_end_of_turn_silence_when_confident.
max_turn_silence (int, Optional, Default: 100): Maximum milliseconds of silence before the turn is forced to end, regardless of punctuation. When using turn_detection="stt", set this to 1000 to match AssemblyAI's API default.
prompt (string, Optional): Custom transcription instructions for the model. When not provided, a default prompt optimized for turn detection is used automatically. Cannot be used with keyterms_prompt. Only supported with Universal-3 Pro. Note: Prompting is a beta feature for Universal-3 Pro. Start without a prompt to establish baseline performance.
Universal-Streaming
end_of_turn_confidence_threshold (float, Optional, Default: 0.4): The confidence threshold used to determine whether the end of a turn has been reached. Not applicable to Universal-3 Pro.
min_end_of_turn_silence_when_confident (int, Optional, Default: 400): The minimum duration of silence (in milliseconds) required to detect end of turn when confident. Deprecated: this parameter has been renamed to min_turn_silence; use min_turn_silence instead.
max_turn_silence (int, Optional, Default: 1280): The maximum duration of silence (in milliseconds) allowed in a turn before end of turn is triggered.
format_turns (bool, Optional): Whether to return formatted final transcripts. Not applicable to Universal-3 Pro, which always returns formatted transcripts.
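The interplay of the Universal-Streaming endpointing parameters can be sketched in plain Python. This is a hypothetical illustration of the documented behavior, not AssemblyAI's actual implementation:

```python
def turn_ended(silence_ms: int, eot_confidence: float,
               end_of_turn_confidence_threshold: float = 0.4,
               min_turn_silence: int = 400,
               max_turn_silence: int = 1280) -> bool:
    """Illustrative confidence-based endpointing: end the turn early when
    the model is confident, otherwise wait for the maximum silence."""
    if silence_ms >= max_turn_silence:
        return True  # forced end, regardless of confidence
    if eot_confidence >= end_of_turn_confidence_threshold:
        return silence_ms >= min_turn_silence
    return False

assert turn_ended(500, 0.9)      # confident, and enough silence
assert not turn_ended(200, 0.9)  # confident, but silence too short
assert not turn_ended(900, 0.1)  # not confident, below max silence
assert turn_ended(1300, 0.1)     # forced end at max_turn_silence
```

Lowering end_of_turn_confidence_threshold makes turns end sooner on linguistic cues; max_turn_silence remains the hard backstop.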
Turn detection
Universal-3 Pro uses punctuation-based turn detection — it checks for terminal punctuation (. ? !) after periods of silence rather than using a confidence score. To use this for turn detection, set turn_detection="stt" in the AgentSession constructor.
Default parameter differences: The LiveKit plugin defaults to min_turn_silence=100 and max_turn_silence=100. The AssemblyAI API defaults are min_turn_silence=100 and max_turn_silence=1000. When using turn_detection="stt", explicitly set max_turn_silence=1000 to restore AssemblyAI's intended behavior.
min_endpointing_delay is additive in STT mode: LiveKit's min_endpointing_delay (default 0.5 seconds) is applied on top of AssemblyAI's own endpointing. Set min_endpointing_delay=0 to avoid extra latency — AssemblyAI's min_turn_silence and max_turn_silence already control the timing.
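As a back-of-the-envelope check of how these settings stack (an illustrative calculation based on the additive behavior described above, not a measured latency):

```python
def worst_case_endpoint_delay_ms(max_turn_silence_ms: int,
                                 min_endpointing_delay_s: float) -> float:
    """Rough upper bound on silence before the agent responds: AssemblyAI's
    forced turn end plus LiveKit's endpointing delay applied on top."""
    return max_turn_silence_ms + min_endpointing_delay_s * 1000

# LiveKit's default min_endpointing_delay adds 500 ms on top of the
# recommended max_turn_silence=1000:
assert worst_case_endpoint_delay_ms(1000, 0.5) == 1500
# Setting min_endpointing_delay=0 leaves AssemblyAI's timing in control:
assert worst_case_endpoint_delay_ms(1000, 0) == 1000
```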
VAD threshold alignment: Universal-3 Pro defaults to a vad_threshold of 0.3. Set LiveKit's Silero activation_threshold to 0.3 as well to ensure consistent barge-in behavior.
Tuning guidance: You will likely need to experiment with min_turn_silence and max_turn_silence depending on your use case. Increase min_turn_silence if brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking. Increase max_turn_silence if the forced turn end is cutting off users mid-thought.
```python
session = AgentSession(
    turn_detection="stt",
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(activation_threshold=0.3),
    min_endpointing_delay=0,
    # ... llm, tts, etc.
)
```
You can also use LiveKit's MultilingualModel() turn detector instead of turn_detection="stt". The plugin defaults (min_turn_silence=100, max_turn_silence=100) are automatically tuned to provide transcripts to the turn detection model as fast as possible. However, raising these values (e.g., 200–300ms) may help by giving the model more time before finalizing transcripts, which can reduce over-segmentation.
For a detailed guide on configuring Universal-3 Pro with LiveKit — including entity splitting tradeoffs, VAD threshold alignment, and prompt engineering — see the AssemblyAI LiveKit guide.
AssemblyAI Universal-Streaming includes a custom phrase endpointing model that uses both audio and linguistic information to detect turn boundaries. To use this model for turn detection, set turn_detection="stt" in the AgentSession constructor. You should also provide a VAD plugin for responsive interruption handling.
```python
session = AgentSession(
    turn_detection="stt",
    stt=assemblyai.STT(
        end_of_turn_confidence_threshold=0.4,
        min_end_of_turn_silence_when_confident=400,
        max_turn_silence=1280,
    ),
    vad=silero.VAD.load(),  # Recommended for responsive interruption handling
    # ... llm, tts, etc.
)
```
Additional resources
The following resources provide more information about using AssemblyAI with LiveKit Agents.
Python package
The livekit-plugins-assemblyai package on PyPI.
Plugin reference
Reference for the AssemblyAI STT plugin.
GitHub repo
View the source or contribute to the LiveKit AssemblyAI STT plugin.
AssemblyAI docs
AssemblyAI's full docs for the Universal Streaming API.
Universal-3 Pro docs
AssemblyAI's docs for the Universal-3 Pro streaming model.
Voice AI quickstart
Get started with LiveKit Agents and AssemblyAI.
AssemblyAI LiveKit guide
Guide to using AssemblyAI Universal Streaming STT with LiveKit.