
Speech-to-text (STT) models overview

Models and plugins for realtime transcription in your voice agents.

Overview

STT models, also known as Automatic Speech Recognition (ASR) models, perform realtime transcription or translation of spoken audio. In voice AI, they form the first of the three models in the core pipeline: speech is transcribed by an STT model, processed by an LLM to generate a response, and that response is turned back into speech by a TTS model.

You can choose a model served through LiveKit Inference, which is included in LiveKit Cloud, or you can use a plugin to connect directly to a wider range of model providers with your own account.

LiveKit Inference

The following models are available in LiveKit Inference. Refer to the guide for each model for more details on additional configuration options.

| Provider | Model name | Model ID | Languages |
| --- | --- | --- | --- |
| AssemblyAI | Universal-3 Pro Streaming | assemblyai/u3-rt-pro | 6 languages |
| AssemblyAI | Universal-Streaming | assemblyai/universal-streaming | English only |
| AssemblyAI | Universal-Streaming-Multilingual | assemblyai/universal-streaming-multilingual | 6 languages |
| Cartesia | Ink Whisper | cartesia/ink-whisper | 100 languages |
| Deepgram | Flux | deepgram/flux-general | English only |
| Deepgram | Nova-3 | deepgram/nova-3 | Multilingual, 9 languages |
| Deepgram | Nova-3 Medical | deepgram/nova-3-medical | English only |
| Deepgram | Nova-2 | deepgram/nova-2 | Multilingual, 33 languages |
| Deepgram | Nova-2 Medical | deepgram/nova-2-medical | English only |
| Deepgram | Nova-2 Conversational AI | deepgram/nova-2-conversationalai | English only |
| Deepgram | Nova-2 Phonecall | deepgram/nova-2-phonecall | English only |
| ElevenLabs | Scribe V2 Realtime | elevenlabs/scribe_v2_realtime | 41 languages |

Plugins

The LiveKit Agents framework also includes a variety of open source plugins for a wide range of STT providers. These plugins require you to authenticate with the provider directly, usually via an API key. You are responsible for setting up your own account and managing your own billing and credentials. The plugins are listed below, along with their availability for Python and Node.js.

| Provider | Python | Node.js |
| --- | --- | --- |

Have another provider in mind? LiveKit is open source and welcomes new plugin contributions.

Usage

To set up STT in an AgentSession, provide a descriptor with both the desired model and language. LiveKit Inference manages the connection to the model automatically. Consult the models list for available models and languages.

from livekit.agents import AgentSession

session = AgentSession(
    # Deepgram Nova-3 in English
    stt="deepgram/nova-3:en",
    # ... llm, tts, etc.
)

import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
  // Deepgram Nova-3 in English
  stt: "deepgram/nova-3:en",
  // ... llm, tts, etc.
});

Multilingual transcription

If you don't know the language of the input audio, or expect multiple languages to be used simultaneously, use Deepgram Nova-3 with the language set to multi. This model supports multilingual transcription.

from livekit.agents import AgentSession

session = AgentSession(
    stt="deepgram/nova-3:multi",
    # ... llm, tts, etc.
)

import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
  stt: "deepgram/nova-3:multi",
  // ... llm, tts, etc.
});

Additional parameters

More configuration options, such as custom vocabulary, are available for each model. To set additional parameters, use the STT class from the inference module. Consult each model reference for examples and available parameters.
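As a sketch of this pattern, an inference.STT instance can be passed in place of the descriptor string. The keys shown in extra_kwargs are assumptions for illustration; consult the model reference for the options each provider actually supports.

```python
from livekit.agents import AgentSession, inference

session = AgentSession(
    stt=inference.STT(
        model="deepgram/nova-3",
        language="en",
        # Assumption: provider-specific options pass through extra_kwargs;
        # see the Deepgram model reference for the supported keys.
        extra_kwargs={"keyterm": ["LiveKit"]},
    ),
    # ... llm, tts, etc.
)
```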

Advanced features

The following sections cover more advanced topics common to all STT providers. For more detailed reference on individual provider configuration, consult the model reference or plugin documentation for that provider.

Automatic model selection

If you don't need to use any specific model features, and are only interested in the best model available for a given language, you can specify the language alone with the special model id auto. LiveKit Inference will choose the best model for the given language automatically.

from livekit.agents import AgentSession

session = AgentSession(
    # Use the best available model for Spanish
    stt="auto:es",
)

import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
  // Use the best available model for Spanish
  stt: "auto:es",
});

LiveKit Inference supports the following languages:

en, en-AU, en-CA, en-GB, en-IE, en-IN, en-NZ, en-US, af, am, ar, ar-AE, ar-BH, ar-DJ, ar-DZ, ar-EG, ar-ER, ar-IQ, ar-IR, ar-JO, ar-KM, ar-KW, ar-LB, ar-LY, ar-MA, ar-MR, ar-OM, ar-PS, ar-QA, ar-SA, ar-SD, ar-SO, ar-SY, ar-TD, ar-TN, ar-YE, as, az, ba, be, bg, bg-BG, bn, bo, br, bs, ca, cs, cs-CZ, cy, cy-GB, da, da-DK, de, de-AT, de-CH, de-DE, el, el-GR, es, es-419, es-AR, es-BO, es-CL, es-CO, es-CR, es-CU, es-DO, es-EC, es-ES, es-GT, es-HN, es-MX, es-NI, es-PA, es-PE, es-PR, es-PY, es-SV, es-UY, es-VE, et, et-EE, eu, fa, fi, fi-FI, fo, fr, fr-BE, fr-CA, fr-CH, fr-FR, ga, ga-IE, gl, gu, ha, haw, he, he-IL, hi, hi-IN, hr, hr-HR, ht, hu, hu-HU, hy, id, id-ID, is, is-IS, it, it-CH, it-IT, ja, ja-JP, jw, ka, kk, km, kn, ko, ko-KR, la, lb, ln, lo, lt, lt-LT, lv, lv-LV, mg, mi, mk, ml, mn, mr, ms, ms-MY, mt, mt-MT, multi, my, ne, nl, nl-BE, nl-NL, nn, no, no-NO, oc, pa, pl, pl-PL, ps, pt, pt-BR, pt-PT, ro, ro-RO, ru, ru-RU, sa, sd, si, sk, sk-SK, sl, sl-SI, sn, so, sq, sr, sr-RS, su, sv, sv-SE, sw, ta, te, tg, th, th-TH, tk, tl, tr, tr-TR, tt, uk, uk-UA, ur, uz, vi, vi-VN, yi, yo, yue, zh, zh-CN, zh-Hans, zh-Hant, zh-HK, zh-TW

Custom STT

To create an entirely custom STT, implement the STT node in your agent.

Standalone usage

You can use an STT instance in a standalone fashion, without an AgentSession, using the streaming interface. Use push_frame to add realtime audio frames to the stream, and then consume a stream of SpeechEvent events as output.

Here is an example of a standalone STT app:

import asyncio
from dotenv import load_dotenv
from livekit import agents, rtc
from livekit.agents import AgentServer
from livekit.agents.stt import SpeechEventType, SpeechEvent
from typing import AsyncIterable
from livekit.plugins import deepgram

load_dotenv()

server = AgentServer()

@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    @ctx.room.on("track_subscribed")
    def on_track_subscribed(track: rtc.RemoteTrack):
        print(f"Subscribed to track: {track.name}")
        asyncio.create_task(process_track(track))

async def process_track(track: rtc.RemoteTrack):
    stt = deepgram.STT(model="nova-2")
    stt_stream = stt.stream()
    audio_stream = rtc.AudioStream(track)

    async with asyncio.TaskGroup() as tg:
        # Create task for processing STT stream
        stt_task = tg.create_task(process_stt_stream(stt_stream))

        # Process audio stream
        async for audio_event in audio_stream:
            stt_stream.push_frame(audio_event.frame)

        # Indicates the end of the audio stream
        stt_stream.end_input()

        # Wait for STT processing to complete
        await stt_task

async def process_stt_stream(stream: AsyncIterable[SpeechEvent]):
    try:
        async for event in stream:
            if event.type == SpeechEventType.FINAL_TRANSCRIPT:
                print(f"Final transcript: {event.alternatives[0].text}")
            elif event.type == SpeechEventType.INTERIM_TRANSCRIPT:
                print(f"Interim transcript: {event.alternatives[0].text}")
            elif event.type == SpeechEventType.START_OF_SPEECH:
                print("Start of speech")
            elif event.type == SpeechEventType.END_OF_SPEECH:
                print("End of speech")
    finally:
        await stream.aclose()

if __name__ == "__main__":
    agents.cli.run_app(server)

VAD and StreamAdapter

Some STT providers or models, such as Whisper, don't support streaming input. In these cases, your app must determine when a chunk of audio represents a complete segment of speech. You can do this using VAD together with the StreamAdapter class.

The following example modifies the previous example to use VAD and StreamAdapter to buffer user speech until VAD detects the end of speech:

from livekit import agents, rtc
from livekit.plugins import openai, silero

async def process_track(ctx: agents.JobContext, track: rtc.Track):
    whisper_stt = openai.STT()
    vad = silero.VAD.load(
        min_speech_duration=0.1,
        min_silence_duration=0.5,
    )
    vad_stream = vad.stream()
    # StreamAdapter will buffer audio until VAD emits END_SPEAKING event
    stt = agents.stt.StreamAdapter(whisper_stt, vad_stream)
    stt_stream = stt.stream()
    ...

Speaker diarization and primary speaker detection

Available in Python only.

Speaker diarization identifies who said what in multi-speaker audio. STT providers that support diarization label segments of speech with a speaker identifier. When enabled, you can wrap the STT with MultiSpeakerAdapter to detect the primary speaker and format the transcripts by speaker. It supports the following features:

  • Identifies the primary speaker based on audio level (RMS). The loudest active speaker is treated as primary.
  • Formats transcripts differently for primary and background speakers.
  • Optionally suppresses background speakers so only the primary speaker's transcript is sent to the LLM.

Use MultiSpeakerAdapter when you want the agent to focus on a single speaker or differentiate transcripts by speaker. It operates on a single mixed audio track (for example, a room microphone) and requires an STT provider that supports diarization.
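The loudest-active-speaker heuristic can be sketched in isolation. This is a conceptual illustration of RMS-based primary speaker selection, not MultiSpeakerAdapter's actual implementation; the rms and primary_speaker helpers are hypothetical names.

```python
import math

def rms(samples: list[int]) -> float:
    """Root-mean-square level of a chunk of 16-bit PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def primary_speaker(frames_by_speaker: dict[str, list[int]]) -> str:
    """Pick the speaker whose recent audio has the highest RMS level."""
    return max(frames_by_speaker, key=lambda sid: rms(frames_by_speaker[sid]))

# A loud speaker (larger amplitudes) wins over a quiet background speaker
levels = {
    "speaker_0": [1200, -1100, 1300, -1250],  # loud, near the microphone
    "speaker_1": [150, -140, 160, -155],      # quiet background speaker
}
print(primary_speaker(levels))  # speaker_0
```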

Supported STT providers

The following STT provider plugins support diarization and can be used with MultiSpeakerAdapter. Diarization must be enabled explicitly; see the documentation for each provider for details.

You can confirm diarization support by checking if the stt.capabilities.diarization property is set to True.

MultiSpeakerAdapter usage

You can format the primary and background transcripts differently using the primary_format and background_format parameters and the placeholders {text} and {speaker_id}.

The following example detects the primary speaker and formats the transcripts by speaker:

from livekit import agents
from livekit.agents import AgentSession
from livekit.plugins import deepgram

# Deepgram STT with diarization enabled
base_stt = deepgram.STT(model="nova-3", language="en", enable_diarization=True)

# Wrap with MultiSpeakerAdapter to detect primary speaker and format or suppress background
stt = agents.stt.MultiSpeakerAdapter(
    stt=base_stt,
    detect_primary_speaker=True,
    suppress_background_speaker=False,  # set True to send only primary speaker to the LLM
    primary_format="{text}",
    background_format="[Speaker {speaker_id}] {text}",
)

session = AgentSession(
    stt=stt,
    # ... llm, tts, etc.
)

The following resources provide more information about using MultiSpeakerAdapter.

Additional resources

The following resources cover related topics that may be useful for your application.