Overview
STT models, also known as Automatic Speech Recognition (ASR) models, are used for realtime transcription or translation of spoken audio. In voice AI, they form the first of the three models in the core pipeline: speech is transcribed by an STT model, then processed by an LLM to generate a response, which is turned back into speech by a TTS model.
You can choose a model served through LiveKit Inference, which is included in LiveKit Cloud, or you can use a plugin to connect directly to a wider range of model providers with your own account.
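For illustration, here is a minimal sketch of how the three models come together in an AgentSession. The STT descriptor matches the LiveKit Inference examples later on this page, while the LLM and TTS choices are placeholders that assume the OpenAI plugin is installed and configured.

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    stt="assemblyai/universal-streaming:en",  # speech-to-text (LiveKit Inference)
    llm=openai.LLM(model="gpt-4o-mini"),      # response generation (placeholder choice)
    tts=openai.TTS(),                         # text-to-speech (placeholder choice)
)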
LiveKit Inference
The following models are available in LiveKit Inference. Refer to the guide for each model for more details on additional configuration options.
Provider | Model name | Languages |
---|---|---|
AssemblyAI | Universal-Streaming | English only |
Cartesia | Ink Whisper | 98 languages |
Deepgram | Nova-3 | Multilingual, 8 languages |
Deepgram | Nova-3 Medical | English only |
Deepgram | Nova-2 | Multilingual, 33 languages |
Deepgram | Nova-2 Medical | English only |
Deepgram | Nova-2 Conversational AI | English only |
Deepgram | Nova-2 Phonecall | English only |
Usage
To set up STT in an AgentSession, provide a descriptor with both the desired model and language. LiveKit Inference manages the connection to the model automatically. Consult the models list for available models and languages.
from livekit.agents import AgentSession

session = AgentSession(
    # AssemblyAI STT in English
    stt="assemblyai/universal-streaming:en",
    # ... llm, tts, etc.
)
import { AgentSession, inference } from '@livekit/agents';

const session = new AgentSession({
  // AssemblyAI STT in English
  stt: "assemblyai/universal-streaming:en",
  // ... llm, tts, etc.
});
Multilingual transcription
If you don't know the language of the input audio, or expect multiple languages to be used simultaneously, use deepgram/nova-3 with the language set to multi. This model supports multilingual transcription.
from livekit.agents import AgentSession

session = AgentSession(
    stt="deepgram/nova-3:multi",
    # ... llm, tts, etc.
)
import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
  stt: "deepgram/nova-3:multi",
  // ... llm, tts, etc.
});
Additional parameters
More configuration options, such as custom vocabulary, are available for each model. To set additional parameters, use the STT class from the inference module. Consult each model reference for examples and available parameters.
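For example, here is a minimal sketch of the explicit form in Python. It assumes inference.STT accepts model and language keyword arguments that mirror the string descriptor; provider-specific options (such as custom vocabulary) are passed as the additional keyword arguments documented in each model reference.

from livekit.agents import AgentSession, inference

session = AgentSession(
    # Explicit STT configuration instead of a "model:language" string.
    # `model` and `language` are assumed to mirror the string descriptor;
    # provider-specific options come from the model reference.
    stt=inference.STT(
        model="deepgram/nova-3",
        language="multi",
    ),
    # ... llm, tts, etc.
)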
Plugins
The LiveKit Agents framework also includes a variety of open source plugins for a wide range of STT providers, available for Python, Node.js, or both. These plugins require you to authenticate with the provider directly, usually via an API key; you are responsible for setting up your own account and managing your own billing and credentials.
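For example, here is a brief sketch of swapping in the Deepgram plugin in place of a LiveKit Inference descriptor. It assumes the livekit-plugins-deepgram package is installed and that your Deepgram API key is available in the DEEPGRAM_API_KEY environment variable.

from livekit.agents import AgentSession
from livekit.plugins import deepgram

session = AgentSession(
    # The plugin connects directly to Deepgram using your own credentials,
    # read from the DEEPGRAM_API_KEY environment variable.
    stt=deepgram.STT(model="nova-3", language="multi"),
    # ... llm, tts, etc.
)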
Have another provider in mind? LiveKit is open source and welcomes new plugin contributions.
Advanced features
The following sections cover more advanced topics common to all STT providers. For more detailed reference on individual provider configuration, consult the model reference or plugin documentation for that provider.
Automatic model selection
If you don't need to use any specific model features, and are only interested in the best model available for a given language, you can specify the language alone with the special model id auto. LiveKit Inference will choose the best model for the given language automatically.
from livekit.agents import AgentSession

session = AgentSession(
    # Use the best available model for Spanish
    stt="auto:es",
)
import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
  // Use the best available model for Spanish
  stt: "auto:es",
});
LiveKit Inference supports the following languages:
Custom STT
To create an entirely custom STT, implement the STT node in your agent.
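As a rough sketch, one way to customize the node is to subclass Agent, override stt_node, and delegate to the default implementation. The signature and the Agent.default.stt_node delegation shown here are assumptions based on the framework's node interface; consult the agent node documentation for the authoritative form.

from typing import AsyncIterable

from livekit import rtc
from livekit.agents import Agent, ModelSettings, stt


class CustomSTTAgent(Agent):
    # NOTE: the signature and default-delegation below are assumptions;
    # verify them against the agent node reference before relying on them.
    async def stt_node(
        self,
        audio: AsyncIterable[rtc.AudioFrame],
        model_settings: ModelSettings,
    ) -> AsyncIterable[stt.SpeechEvent]:
        # Pass audio through to the default STT pipeline unchanged;
        # add custom pre- or post-processing here as needed.
        async for event in Agent.default.stt_node(self, audio, model_settings):
            yield event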
Standalone usage
You can use an STT instance in a standalone fashion, without an AgentSession, using the streaming interface. Use push_frame to add realtime audio frames to the stream, and then consume a stream of SpeechEvent events as output.
Here is an example of a standalone STT app:
import asyncio
from typing import AsyncIterable

from dotenv import load_dotenv
from livekit import agents, rtc
from livekit.agents.stt import SpeechEventType, SpeechEvent
from livekit.plugins import (
    deepgram,
)

load_dotenv()


async def entrypoint(ctx: agents.JobContext):
    @ctx.room.on("track_subscribed")
    def on_track_subscribed(track: rtc.RemoteTrack):
        print(f"Subscribed to track: {track.name}")
        asyncio.create_task(process_track(track))

    # Connect to the room so the agent starts receiving tracks
    await ctx.connect()


async def process_track(track: rtc.RemoteTrack):
    stt = deepgram.STT(model="nova-2")
    stt_stream = stt.stream()
    audio_stream = rtc.AudioStream(track)

    async with asyncio.TaskGroup() as tg:
        # Create task for processing STT stream
        stt_task = tg.create_task(process_stt_stream(stt_stream))

        # Process audio stream
        async for audio_event in audio_stream:
            stt_stream.push_frame(audio_event.frame)

        # Indicates the end of the audio stream
        stt_stream.end_input()

        # Wait for STT processing to complete
        await stt_task


async def process_stt_stream(stream: AsyncIterable[SpeechEvent]):
    try:
        async for event in stream:
            if event.type == SpeechEventType.FINAL_TRANSCRIPT:
                print(f"Final transcript: {event.alternatives[0].text}")
            elif event.type == SpeechEventType.INTERIM_TRANSCRIPT:
                print(f"Interim transcript: {event.alternatives[0].text}")
            elif event.type == SpeechEventType.START_OF_SPEECH:
                print("Start of speech")
            elif event.type == SpeechEventType.END_OF_SPEECH:
                print("End of speech")
    finally:
        await stream.aclose()


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
VAD and StreamAdapter
Some STT providers or models, such as Whisper, don't support streaming input. In these cases, your app must determine when a chunk of audio represents a complete segment of speech. You can do this using VAD together with the StreamAdapter class.

The following example modifies the previous example to use VAD and StreamAdapter to buffer user speech until VAD detects the end of speech:
from livekit import agents, rtc
from livekit.plugins import openai, silero


async def process_track(ctx: agents.JobContext, track: rtc.Track):
    whisper_stt = openai.STT()
    vad = silero.VAD.load(
        min_speech_duration=0.1,
        min_silence_duration=0.5,
    )
    vad_stream = vad.stream()
    # StreamAdapter will buffer audio until VAD emits END_SPEAKING event
    stt = agents.stt.StreamAdapter(whisper_stt, vad_stream)
    stt_stream = stt.stream()
    ...
Additional resources
The following resources cover related topics that may be useful for your application.