
# Speech-to-text (STT) models overview

> Models and plugins for realtime transcription in your voice agents.

## Overview

STT models, also known as Automatic Speech Recognition (ASR) models, perform realtime transcription or translation of spoken audio. In voice AI, they form the first of three models in the core pipeline: speech is transcribed by an STT model, then processed by an [LLM](https://docs.livekit.io/agents/models/llm.md) to generate a response, which is turned back into speech using a [TTS](https://docs.livekit.io/agents/models/tts.md) model.

You can choose a model served through LiveKit Inference, included with LiveKit Cloud. With LiveKit Inference, models are served from LiveKit's infrastructure, close to your agent, to minimize latency. No separate provider API key is required, and usage and rate limits are managed through LiveKit Cloud. Use a plugin instead if you prefer to manage billing and rate limits yourself, or need access to a provider not currently available through LiveKit Inference.

### LiveKit Inference

The following models are available in [LiveKit Inference](https://docs.livekit.io/agents/models.md#inference). Refer to the guide for each model for more details on additional configuration options.

| Provider | Model name | Languages |
| -------- | -------- | --------- |
| [AssemblyAI](https://docs.livekit.io/agents/models/stt/assemblyai.md) | Universal-3 Pro Streaming | 6 languages |
|   | Universal-Streaming | English only |
|   | Universal-Streaming-Multilingual | 6 languages |
| [Cartesia](https://docs.livekit.io/agents/models/stt/cartesia.md) | Ink Whisper | 100 languages |
| [Deepgram](https://docs.livekit.io/agents/models/stt/deepgram.md) | Flux | English only |
|   | Flux (Multilingual) | 10 languages |
|   | Nova-2 | 33 languages |
|   | Nova-2 Conversational AI | English only |
|   | Nova-2 Medical | English only |
|   | Nova-2 Phone Call | English only |
|   | Nova-3 (Monolingual) | 44 languages |
|   | Nova-3 Medical | English only |
|   | Nova-3 (Multilingual) | Multilingual |
| [ElevenLabs](https://docs.livekit.io/agents/models/stt/elevenlabs.md) | Scribe v2 Realtime | 190 languages |

### Plugins

The LiveKit Agents framework also includes a variety of open source [plugins](https://docs.livekit.io/agents/models.md#plugins) for a wide range of STT providers. These plugins require you to authenticate with the provider directly, usually via an API key. You are responsible for setting up your own account and managing your own billing and credentials. The plugins are listed below, along with their availability for Python and Node.js.

| Provider | Python | Node.js |
| -------- | ------ | ------- |
| [Amazon Transcribe](https://docs.livekit.io/agents/models/stt/plugins/aws.md) | ✓ | — |
| [AssemblyAI](https://docs.livekit.io/agents/models/stt/plugins/assemblyai.md) | ✓ | — |
| [Azure AI Speech](https://docs.livekit.io/agents/models/stt/plugins/azure.md) | ✓ | — |
| [Azure OpenAI](https://docs.livekit.io/agents/models/stt/plugins/azure-openai.md) | ✓ | — |
| [Baseten](https://docs.livekit.io/agents/models/stt/plugins/baseten.md) | ✓ | — |
| [Cartesia](https://docs.livekit.io/agents/models/stt/plugins/cartesia.md) | ✓ | — |
| [Clova](https://docs.livekit.io/agents/models/stt/plugins/clova.md) | ✓ | — |
| [Deepgram](https://docs.livekit.io/agents/models/stt/plugins/deepgram.md) | ✓ | ✓ |
| [ElevenLabs](https://docs.livekit.io/agents/models/stt/plugins/elevenlabs.md) | ✓ | — |
| [fal](https://docs.livekit.io/agents/models/stt/plugins/fal.md) | ✓ | — |
| [Gladia](https://docs.livekit.io/agents/models/stt/plugins/gladia.md) | ✓ | — |
| [Google Cloud](https://docs.livekit.io/agents/models/stt/plugins/google.md) | ✓ | — |
| [Gradium](https://docs.livekit.io/agents/models/stt/plugins/gradium.md) | ✓ | — |
| [Groq](https://docs.livekit.io/agents/models/stt/plugins/groq.md) | ✓ | — |
| [Mistral AI](https://docs.livekit.io/agents/models/stt/plugins/mistralai.md) | ✓ | — |
| [Nvidia](https://docs.livekit.io/agents/models/stt/plugins/nvidia.md) | ✓ | — |
| [OpenAI](https://docs.livekit.io/agents/models/stt/plugins/openai.md) | ✓ | ✓ |
| [OVHCloud](https://docs.livekit.io/agents/models/stt/plugins/ovhcloud.md) | ✓ | ✓ |
| [Sarvam](https://docs.livekit.io/agents/models/stt/plugins/sarvam.md) | ✓ | ✓ |
| [Simplismart](https://docs.livekit.io/agents/models/stt/plugins/simplismart.md) | ✓ | — |
| [Soniox](https://docs.livekit.io/agents/models/stt/plugins/soniox.md) | ✓ | — |
| [Speechmatics](https://docs.livekit.io/agents/models/stt/plugins/speechmatics.md) | ✓ | — |
| [Spitch](https://docs.livekit.io/agents/models/stt/plugins/spitch.md) | ✓ | — |

Have another provider in mind? LiveKit is open source and welcomes [new plugin contributions](https://docs.livekit.io/agents/models.md#contribute).

## Usage

To set up STT in an `AgentSession`, use the `STT` class from the `inference` module. Consult the [models list](#inference) for available models and languages.

**Python**:

```python
from livekit.agents import AgentSession, inference

session = AgentSession(
    stt=inference.STT(
        model="deepgram/nova-3",
        language="en"
    ),
    # ... llm, tts, etc.
)

```

---

**Node.js**:

```typescript
import { AgentSession, inference } from '@livekit/agents';

const session = new AgentSession({
    stt: new inference.STT({
        model: "deepgram/nova-3",
        language: "en"
    }),
    // ... llm, tts, etc.
})

```

### Multilingual transcription

If you don't know the language of the input audio, or expect multiple languages to be used simultaneously, use Deepgram Nova-3 with the language set to `multi`. This model supports multilingual transcription.

**Python**:

```python
from livekit.agents import AgentSession

session = AgentSession(
    stt="deepgram/nova-3:multi",
    # ... llm, tts, etc.
)

```

---

**Node.js**:

```typescript
import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
    stt: "deepgram/nova-3:multi",
    // ... llm, tts, etc.
})

```

### Additional parameters

More configuration options, such as custom vocabulary, are available for each model. To set additional parameters, use the `STT` class from the `inference` module. Consult each model reference for examples and available parameters.
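As a hedged sketch of what this looks like: the `extra_kwargs` parameter and the keys passed inside it are assumptions here, used only for illustration. Consult the model reference for the options your chosen model actually supports.

```python
from livekit.agents import AgentSession, inference

# Sketch only: `extra_kwargs` and the key names below are assumptions;
# check the model reference for the parameters your model supports.
session = AgentSession(
    stt=inference.STT(
        model="deepgram/nova-3",
        language="en",
        extra_kwargs={"keyterm": ["LiveKit", "Deepgram"]},  # e.g. custom vocabulary
    ),
    # ... llm, tts, etc.
)
```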

## Advanced features

The following sections cover more advanced topics common to all STT providers. For more detailed reference on individual provider configuration, consult the model reference or plugin documentation for that provider.

### Automatic model selection

If you don't need to use any specific model features, and are only interested in the best model available for a given language, you can specify the language alone with the special model ID `auto`. LiveKit Inference chooses the best model for the given language automatically.

**Python**:

```python
from livekit.agents import AgentSession

session = AgentSession(
    # Use the best available model for Spanish
    stt="auto:es",
)

```

---

**Node.js**:

```typescript
import { AgentSession } from '@livekit/agents';

const session = new AgentSession({
    // Use the best available model for Spanish
    stt: "auto:es",
})

```

LiveKit Inference supports the following languages. You can pass any of these values using the [`LanguageCode`](#language-codes) type, which accepts ISO 639-1, BCP-47, ISO 639-3, language names, and other common formats.

- `ar-AE`: Arabic (United Arab Emirates)
- `ar-BH`: Arabic (Bahrain)
- `ar-DJ`: Arabic (Djibouti)
- `ar-DZ`: Arabic (Algeria)
- `ar-EG`: Arabic (Egypt)
- `ar-ER`: Arabic (Eritrea)
- `ar-IQ`: Arabic (Iraq)
- `ar-JO`: Arabic (Jordan)
- `ar-KM`: Arabic (Comoros)
- `ar-KW`: Arabic (Kuwait)
- `ar-LB`: Arabic (Lebanon)
- `ar-LY`: Arabic (Libya)
- `ar-MA`: Arabic (Morocco)
- `ar-MR`: Arabic (Mauritania)
- `ar-OM`: Arabic (Oman)
- `ar-QA`: Arabic (Qatar)
- `ar-SA`: Arabic (Saudi Arabia)
- `ar-SD`: Arabic (Sudan)
- `ar-SO`: Arabic (Somalia)
- `ar-SY`: Arabic (Syria)
- `ar-TD`: Arabic (Chad)
- `ar-TN`: Arabic (Tunisia)
- `ar-YE`: Arabic (Yemen)
- `bg-BG`: Bulgarian (Bulgaria)
- `cs-CZ`: Czech (Czech Republic)
- `cy-GB`: Welsh (United Kingdom)
- `da-DK`: Danish (Denmark)
- `de-AT`: German (Austria)
- `de-CH`: German (Switzerland)
- `de-DE`: German (Germany)
- `el-GR`: Greek (Greece)
- `en-AU`: English (Australia)
- `en-CA`: English (Canada)
- `en-GB`: English (United Kingdom)
- `en-IE`: English (Ireland)
- `en-IN`: English (India)
- `en-NZ`: English (New Zealand)
- `en-US`: English (United States)
- `es-419`: Spanish (Latin America)
- `es-AR`: Spanish (Argentina)
- `es-BO`: Spanish (Bolivia)
- `es-CL`: Spanish (Chile)
- `es-CO`: Spanish (Colombia)
- `es-CR`: Spanish (Costa Rica)
- `es-CU`: Spanish (Cuba)
- `es-DO`: Spanish (Dominican Republic)
- `es-EC`: Spanish (Ecuador)
- `es-ES`: Spanish (Spain)
- `es-GT`: Spanish (Guatemala)
- `es-HN`: Spanish (Honduras)
- `es-MX`: Spanish (Mexico)
- `es-NI`: Spanish (Nicaragua)
- `es-PA`: Spanish (Panama)
- `es-PE`: Spanish (Peru)
- `es-PR`: Spanish (Puerto Rico)
- `es-PY`: Spanish (Paraguay)
- `es-SV`: Spanish (El Salvador)
- `es-UY`: Spanish (Uruguay)
- `es-VE`: Spanish (Venezuela)
- `et-EE`: Estonian (Estonia)
- `fi-FI`: Finnish (Finland)
- `fr-BE`: French (Belgium)
- `fr-CA`: French (Canada)
- `fr-CH`: French (Switzerland)
- `fr-FR`: French (France)
- `ga-IE`: Irish (Ireland)
- `he-IL`: Hebrew (Israel)
- `hi-IN`: Hindi (India)
- `hr-HR`: Croatian (Croatia)
- `hu-HU`: Hungarian (Hungary)
- `id-ID`: Indonesian (Indonesia)
- `is-IS`: Icelandic (Iceland)
- `it-CH`: Italian (Switzerland)
- `it-IT`: Italian (Italy)
- `ja-JP`: Japanese (Japan)
- `ko-KR`: Korean (South Korea)
- `lt-LT`: Lithuanian (Lithuania)
- `lv-LV`: Latvian (Latvia)
- `ms-MY`: Malay (Malaysia)
- `mt-MT`: Maltese (Malta)
- `nl-BE`: Dutch (Belgium)
- `nl-NL`: Dutch (Netherlands)
- `no-NO`: Norwegian (Norway)
- `pl-PL`: Polish (Poland)
- `pt-BR`: Portuguese (Brazil)
- `pt-PT`: Portuguese (Portugal)
- `ro-RO`: Romanian (Romania)
- `ru-RU`: Russian (Russia)
- `sk-SK`: Slovak (Slovakia)
- `sl-SI`: Slovenian (Slovenia)
- `sr-RS`: Serbian (Serbia)
- `sv-SE`: Swedish (Sweden)
- `th-TH`: Thai (Thailand)
- `tr-TR`: Turkish (Turkey)
- `uk-UA`: Ukrainian (Ukraine)
- `vi-VN`: Vietnamese (Vietnam)
- `zh-CN`: Simplified Chinese (China)
- `zh-HK`: Traditional Chinese (Hong Kong)
- `zh-Hans`: Simplified Chinese
- `zh-Hant`: Traditional Chinese
- `zh-TW`: Traditional Chinese (Taiwan)
- `af`: Afrikaans
- `am`: Amharic
- `ar`: Arabic
- `as`: Assamese
- `auto`: Multilingual (automatic)
- `az`: Azerbaijani
- `ba`: Bashkir
- `be`: Belarusian
- `bg`: Bulgarian
- `bn`: Bengali
- `bo`: Tibetan
- `br`: Breton
- `bs`: Bosnian
- `ca`: Catalan
- `cs`: Czech
- `cy`: Welsh
- `da`: Danish
- `de`: German
- `el`: Greek
- `en`: English
- `es`: Spanish
- `et`: Estonian
- `eu`: Basque
- `fa`: Farsi
- `fi`: Finnish
- `fil`: Filipino
- `fo`: Faroese
- `fr`: French
- `ga`: Irish
- `gl`: Galician
- `gu`: Gujarati
- `ha`: Hausa
- `haw`: Hawaiian
- `he`: Hebrew
- `hi`: Hindi
- `hr`: Croatian
- `ht`: Haitian
- `hu`: Hungarian
- `hy`: Armenian
- `id`: Indonesian
- `is`: Icelandic
- `it`: Italian
- `ja`: Japanese
- `jw`: Javanese
- `ka`: Georgian
- `kk`: Kazakh
- `km`: Khmer
- `kn`: Kannada
- `ko`: Korean
- `la`: Latin
- `lb`: Luxembourgish
- `ln`: Lingala
- `lo`: Lao
- `lt`: Lithuanian
- `lv`: Latvian
- `mg`: Malagasy
- `mi`: Maori
- `mk`: Macedonian
- `ml`: Malayalam
- `mn`: Mongolian
- `mr`: Marathi
- `ms`: Malay
- `mt`: Maltese
- `multi`: Multilingual (automatic)
- `my`: Myanmar
- `ne`: Nepali
- `nl`: Dutch
- `nn`: Norwegian Nynorsk
- `no`: Norwegian
- `oc`: Occitan
- `pa`: Punjabi
- `pl`: Polish
- `ps`: Pashto
- `pt`: Portuguese
- `ro`: Romanian
- `ru`: Russian
- `sa`: Sanskrit
- `sd`: Sindhi
- `si`: Sinhala
- `sk`: Slovak
- `sl`: Slovenian
- `sn`: Shona
- `so`: Somali
- `sq`: Albanian
- `sr`: Serbian
- `su`: Sundanese
- `sv`: Swedish
- `sw`: Swahili
- `ta`: Tamil
- `te`: Telugu
- `tg`: Tajik
- `th`: Thai
- `tk`: Turkmen
- `tl`: Tagalog
- `tr`: Turkish
- `tt`: Tatar
- `uk`: Ukrainian
- `ur`: Urdu
- `uz`: Uzbek
- `vi`: Vietnamese
- `yi`: Yiddish
- `yo`: Yoruba
- `yue`: Cantonese
- `zh`: Chinese

### Custom STT

To create an entirely custom STT, implement the [STT node](https://docs.livekit.io/agents/build/nodes.md#stt_node) in your agent.
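As a minimal sketch, an agent can override `stt_node` and delegate to the default implementation while adding its own pre- or post-processing. The exact signature may vary by framework version, so consult the nodes reference linked above.

```python
from collections.abc import AsyncIterable

from livekit import rtc
from livekit.agents import Agent, ModelSettings, stt


class MyAgent(Agent):
    async def stt_node(
        self,
        audio: AsyncIterable[rtc.AudioFrame],
        model_settings: ModelSettings,
    ) -> AsyncIterable[stt.SpeechEvent]:
        # Insert custom pre-processing of audio frames or post-processing of
        # events here. This sketch simply delegates to the default pipeline.
        async for event in Agent.default.stt_node(self, audio, model_settings):
            yield event
```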

### Standalone usage

You can use an `STT` instance in a standalone fashion, without an `AgentSession`, using the streaming interface. Use `push_frame` to add [realtime audio frames](https://docs.livekit.io/transport/media.md) to the stream, and then consume a stream of `SpeechEvent` events as output.

Here is an example of a standalone STT app:

**Filename: `agent.py`**

```python
import asyncio

from dotenv import load_dotenv

from livekit import agents, rtc
from livekit.agents import AgentServer
from livekit.agents.stt import SpeechEventType, SpeechEvent
from typing import AsyncIterable
from livekit.plugins import (
    deepgram,
)

load_dotenv()

server = AgentServer()

@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    @ctx.room.on("track_subscribed")
    def on_track_subscribed(track: rtc.RemoteTrack):
        print(f"Subscribed to track: {track.name}")

        asyncio.create_task(process_track(track))

    async def process_track(track: rtc.RemoteTrack):
        stt = deepgram.STT(model="nova-2")
        stt_stream = stt.stream()
        audio_stream = rtc.AudioStream(track)

        async with asyncio.TaskGroup() as tg:
            # Create task for processing STT stream
            stt_task = tg.create_task(process_stt_stream(stt_stream))

            # Process audio stream
            async for audio_event in audio_stream:
                stt_stream.push_frame(audio_event.frame)

            # Indicates the end of the audio stream
            stt_stream.end_input()

            # Wait for STT processing to complete
            await stt_task

    async def process_stt_stream(stream: AsyncIterable[SpeechEvent]):
        try:
            async for event in stream:
                if event.type == SpeechEventType.FINAL_TRANSCRIPT:
                    print(f"Final transcript: {event.alternatives[0].text}")
                elif event.type == SpeechEventType.INTERIM_TRANSCRIPT:
                    print(f"Interim transcript: {event.alternatives[0].text}")
                elif event.type == SpeechEventType.START_OF_SPEECH:
                    print("Start of speech")
                elif event.type == SpeechEventType.END_OF_SPEECH:
                    print("End of speech")
        finally:
            await stream.aclose()


if __name__ == "__main__":
    agents.cli.run_app(server)


```

### VAD and StreamAdapter

Some STT providers or models, such as [Whisper](https://github.com/openai/whisper), don't support streaming input. In these cases, your app must determine when a chunk of audio represents a complete segment of speech. You can do this using voice activity detection (VAD) together with the `StreamAdapter` class.

The following example modifies the previous example to use VAD and `StreamAdapter` to buffer user speech until VAD detects the end of speech:

```python
from livekit import agents, rtc
from livekit.plugins import openai, silero

async def process_track(ctx: agents.JobContext, track: rtc.Track):
  whisper_stt = openai.STT()
  vad = silero.VAD.load(
    min_speech_duration=0.1,
    min_silence_duration=0.5,
  )
  vad_stream = vad.stream()
  # StreamAdapter buffers audio until VAD emits END_SPEAKING event
  stt = agents.stt.StreamAdapter(whisper_stt, vad_stream)
  stt_stream = stt.stream()
  ...

```

### Speaker diarization and primary speaker detection

Available in:
- [ ] Node.js
- [x] Python

Speaker diarization identifies who said what in multi-speaker audio. STT providers that support diarization label segments of speech with a speaker identifier. When enabled, you can wrap the STT with `MultiSpeakerAdapter` to detect the primary speaker and format the transcripts by speaker. It supports the following features:

- Identifies the primary speaker based on audio level (RMS). The loudest active speaker is treated as primary.
- Formats transcripts differently for primary and background speakers.
- Optionally suppresses background speakers so only the primary speaker's transcript is sent to the LLM.

Use `MultiSpeakerAdapter` when you want the agent to focus on a single speaker or differentiate transcripts by speaker. It operates on a single mixed audio track (for example, a room microphone) and requires an STT provider that supports diarization.

#### Supported STT providers

The following STT provider plugins support diarization and can be used with `MultiSpeakerAdapter`. You must explicitly enable diarization. See the documentation for each provider for details:

- [AssemblyAI](https://docs.livekit.io/agents/models/stt/assemblyai.md#speaker-diarization)
- [Deepgram](https://docs.livekit.io/agents/models/stt/deepgram.md#speaker-diarization)
- [Speechmatics](https://docs.livekit.io/agents/models/stt/speechmatics.md#speaker-diarization)
- [Soniox](https://docs.livekit.io/agents/models/stt/soniox.md#speaker-diarization)

You can confirm diarization support by checking if the `stt.capabilities.diarization` property is set to `True`.
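For example, a configuration sketch that fails fast when the selected STT can't label speakers (assumes the Deepgram plugin is installed and configured, as in the usage example below):

```python
from livekit.plugins import deepgram

stt = deepgram.STT(model="nova-3", enable_diarization=True)

# Fail fast if the configured STT can't label speakers
if not stt.capabilities.diarization:
    raise RuntimeError("Selected STT does not support speaker diarization")
```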

#### MultiSpeakerAdapter usage

You can format the primary and background transcripts differently using the `primary_format` and `background_format` parameters and the placeholders `{text}` and `{speaker_id}`.
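The placeholders are standard Python `str.format` fields. For example, with a hypothetical diarized segment from speaker 2:

```python
background_format = "[Speaker {speaker_id}] {text}"

# Hypothetical diarized segment: speaker 2 speaking in the background
rendered = background_format.format(speaker_id=2, text="I think we should wait.")
print(rendered)  # → [Speaker 2] I think we should wait.
```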

The following example detects the primary speaker and formats the transcripts by speaker:

```python
from livekit import agents
from livekit.agents import AgentSession
from livekit.plugins import deepgram

# Deepgram STT with diarization enabled
base_stt = deepgram.STT(model="nova-3", language="en", enable_diarization=True)

# Wrap with MultiSpeakerAdapter to detect primary speaker and format or suppress background
stt = agents.stt.MultiSpeakerAdapter(
    stt=base_stt,
    detect_primary_speaker=True,
    suppress_background_speaker=False,  # set True to send only primary speaker to the LLM
    primary_format="{text}",
    background_format="[Speaker {speaker_id}] {text}",
)

session = AgentSession(
    stt=stt,
    # ... llm, tts, etc.
)

```

The following resources provide more information about using `MultiSpeakerAdapter`.

- **[Speaker diarization](https://github.com/livekit/agents/blob/main/examples/voice_agents/speaker_id_multi_speaker.py)**: Example of using `MultiSpeakerAdapter`.

- **[Python reference](https://docs.livekit.io/reference/python/livekit/agents/stt.md#livekit.agents.stt.MultiSpeakerAdapter)**: Reference for the `MultiSpeakerAdapter` class.

### Language codes

All STT plugins and LiveKit Inference use the `LanguageCode` type for the `language` parameter. `LanguageCode` accepts any common language format and normalizes it automatically to [BCP-47](https://www.rfc-editor.org/info/bcp47). You don't need to look up the specific format each provider expects. Pass any of the following and the framework handles the conversion:

- ISO 639-1: `"en"`, `"es"`, `"fr"`
- BCP-47 with region: `"en-US"`, `"zh-Hans-CN"`
- ISO 639-3: `"eng"`, `"spa"`
- Language names: `"english"`, `"spanish"`
- Underscored variants: `"en_us"` (normalized to `"en-US"`)

For example, all of the following are equivalent:

**Python**:

```python
from livekit.agents import LanguageCode

LanguageCode("english")  # → "en"
LanguageCode("eng")      # → "en"
LanguageCode("en")       # → "en"
LanguageCode("en-US")    # → "en-US"
LanguageCode("en_us")    # → "en-US"

```

`LanguageCode` is a `str` subclass, so you can use it anywhere a string is expected. It also provides properties for extracting parts of the code:

- `.language`: Base ISO 639-1 code (for example, `"en"` from `"en-US"`).
- `.region`: Region subtag, if present (for example, `"US"` from `"en-US"`).
- `.iso`: ISO 639-1 tag with region (for example, `"zh-CN"` from `"cmn-Hans-CN"`).

---

**Node.js**:

```typescript
import { normalizeLanguage } from '@livekit/agents';

normalizeLanguage("english")  // → "en"
normalizeLanguage("eng")      // → "en"
normalizeLanguage("en")       // → "en"
normalizeLanguage("en-US")    // → "en-US"
normalizeLanguage("en_us")    // → "en-US"

```

In Node.js, `LanguageCode` is a branded `string` type. Use `normalizeLanguage()` to convert a plain string to a `LanguageCode`, and the standalone helper functions to extract parts of the code:

- `getBaseLanguage(lang)`: Base ISO 639-1 code (for example, `"en"` from `"en-US"`).
- `getLanguageRegion(lang)`: Region subtag, if present (for example, `"US"` from `"en-US"`).
- `getIsoLanguage(lang)`: ISO 639-1 tag with region (for example, `"zh-CN"` from `"cmn-Hans-CN"`).

## Additional resources

The following resources cover related topics that might be useful for your app.

- **[Text and transcriptions](https://docs.livekit.io/agents/build/text.md)**: Integrate realtime text features into your agent.

- **[Pipeline nodes](https://docs.livekit.io/agents/build/nodes.md)**: Learn how to customize the behavior of your agent by overriding nodes in the voice pipeline.

- **[Inference pricing](https://livekit.io/pricing/inference#stt)**: The latest pricing information for STT models in LiveKit Inference.

---

For the latest version of this document, see [https://docs.livekit.io/agents/models/stt.md](https://docs.livekit.io/agents/models/stt.md).