Speechmatics STT | LiveKit Documentation

Overview

Speechmatics speech-to-text is available in LiveKit Agents through LiveKit Inference and the Speechmatics plugin. With LiveKit Inference, your agent runs on LiveKit's infrastructure to minimize latency. No separate provider API key is required, and usage and rate limits are managed through LiveKit Cloud. Use the plugin instead if you want to manage your own billing and rate limits. Pricing for LiveKit Inference is available on the pricing page.

LiveKit Inference

Use LiveKit Inference to access Speechmatics STT without a separate Speechmatics API key.

Model name	Model ID	Languages
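Speechmatics Enhanced	speechmatics/enhanced	ar, ar_en, ba, be, bg, bn, ca, cmn, cmn_en, cmn_en_ms_ta, cs, cy, da, de, el, en, en_ms, en_ta, eo, es, et, eu, fa, fi, fr, ga, gl, he, hi, hr, hu, ia, id, it, ja, ko, lt, lv, mn, mr, ms, mt, nl, no, pl, pt, ro, ru, sk, sl, sv, sw, ta, th, tl, tr, ug, uk, ur, vi, yue
Speechmatics Standard	speechmatics/standard	ar, ar_en, ba, be, bg, bn, ca, cmn, cmn_en, cmn_en_ms_ta, cs, cy, da, de, el, en, en_ms, en_ta, eo, es, et, eu, fa, fi, fr, ga, gl, he, hi, hr, hu, ia, id, it, ja, ko, lt, lv, mn, mr, ms, mt, nl, no, pl, pt, ro, ru, sk, sl, sv, sw, ta, th, tl, tr, ug, uk, ur, vi, yue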

Usage

To use Speechmatics, use the STT class from the inference module:

from livekit.agents import AgentSession, inference

session = AgentSession(
    stt=inference.STT(
        model="speechmatics/enhanced",
        language="en"
    ),
    # ... llm, tts, vad, turn_handling, etc.
)

import { AgentSession, inference } from '@livekit/agents';

const session = new AgentSession({
    stt: new inference.STT({
        model: "speechmatics/enhanced",
        language: "en"
    }),
    // ... llm, tts, vad, turnHandling, etc.
});

Voice activity detection

Speechmatics doesn't detect end of speech server-side, so LiveKit Inference relies on a VAD running in your agent. The SDK runs the VAD on incoming audio locally, and when the speaker stops, it signals the LiveKit Inference gateway to flush the final transcript.

By default, the framework loads a Silero VAD for you. To use a different VAD, pass one with the vad parameter:

from livekit.agents import AgentSession, inference
from livekit.plugins import silero

session = AgentSession(
    stt=inference.STT(
        model="speechmatics/enhanced",
        vad=silero.VAD.load(),  # optional; the framework loads one if omitted
    ),
    # ... llm, tts, etc.
)

import { AgentSession, inference } from '@livekit/agents';
import * as silero from '@livekit/agents-plugin-silero';

const session = new AgentSession({
    stt: new inference.STT({
        model: "speechmatics/enhanced",
        vad: await silero.VAD.load(),  // optional; the framework loads one if omitted
    }),
    // ... llm, tts, etc.
});

Parameters

model (string, required)

The model to use for the STT. See model IDs for available models.

language (LanguageCode)

Language code for the transcription. If not set, the provider default applies.

vad (VAD)

Voice activity detector used to detect end of speech. Set this parameter to override the default Silero VAD. See Voice activity detection.

extra_kwargs (dict)

Additional parameters to pass to the Speechmatics STT API. See model parameters for supported fields.

In Node.js this parameter is called modelOptions.

Model parameters

Pass the following parameters inside extra_kwargs (Python) or modelOptions (Node.js):

Parameter	Type	Default	Notes
`domain`	`string`		Domain-specific language pack for improved accuracy in a vertical, for example `finance`.
`output_locale`	`string`		BCP-47 locale that controls output formatting conventions, such as spelling and number formats.
`max_delay`	`float`	`1.0`	Maximum delay in seconds between the end of a spoken word and the final transcript. Valid range `0.7`–`4.0`. Lower values reduce latency but can reduce accuracy.
`max_delay_mode`	`string`		How `max_delay` is applied. `flexible` lets the model exceed the delay to finish recognizing entities such as numbers; `fixed` enforces the delay strictly.
`diarization`	`string`	`none`	Speaker diarization mode: `none`, `speaker`, `channel`, or `channel_and_speaker`. Any value other than `none` enables diarization. See Speaker diarization.
`speaker_sensitivity`	`float`		Sensitivity of speaker detection, from `0.0` to `1.0`. Higher values help distinguish speakers with similar voices. Applies when `diarization` is set to `speaker`.
`max_speakers`	`int`		Maximum number of speakers to detect when diarization is enabled.
`prefer_current_speaker`	`bool`	`false`	When diarization is enabled, reduces the likelihood of switching between similar-sounding speakers.
`enable_partials`	`bool`	`true`	Emit interim transcript results before the final transcript.
`enable_entities`	`bool`		Enable entity detection and formatting, such as numbers, dates, and currencies. See the Speechmatics entities docs.
`punctuation_overrides`	`dict`		Override default punctuation behavior, such as the set of permitted punctuation marks.
`additional_vocab`	`list[dict]`		Custom vocabulary to boost recognition of domain-specific words. Each entry is a dict with a `content` field and optional `sounds_like` pronunciation variants.
`end_of_utterance_silence_trigger`	`float`		Duration of silence in seconds that triggers the end of an utterance and emits a final transcript.
`audio_filtering_config`	`dict`		Advanced audio filtering configuration passed through to Speechmatics, such as a volume threshold.
`transcript_filtering_config`	`dict`		Advanced transcript filtering configuration passed through to Speechmatics.

For more details on these options, see the Speechmatics real-time API reference.
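As a rough illustration of how the options in the table fit together, the sketch below assembles an extra_kwargs dict and checks a few of the documented constraints (the max_delay range, the diarization modes, the speaker_sensitivity range, and the content field of additional_vocab entries). The helper name build_speechmatics_kwargs is hypothetical and is not part of LiveKit or Speechmatics. Whether the service performs the same checks is not stated here, so treat the validation as a local convenience only.

```python
from typing import Any, Dict, List, Optional

# Diarization modes listed in the model parameters table.
DIARIZATION_MODES = {"none", "speaker", "channel", "channel_and_speaker"}


def build_speechmatics_kwargs(
    max_delay: float = 1.0,
    diarization: str = "none",
    speaker_sensitivity: Optional[float] = None,
    additional_vocab: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate a few options and return an extra_kwargs dict."""
    if not 0.7 <= max_delay <= 4.0:
        raise ValueError("max_delay must be between 0.7 and 4.0 seconds")
    if diarization not in DIARIZATION_MODES:
        raise ValueError(f"unknown diarization mode: {diarization!r}")

    kwargs: Dict[str, Any] = {"max_delay": max_delay, "diarization": diarization}

    if speaker_sensitivity is not None:
        if not 0.0 <= speaker_sensitivity <= 1.0:
            raise ValueError("speaker_sensitivity must be between 0.0 and 1.0")
        kwargs["speaker_sensitivity"] = speaker_sensitivity

    if additional_vocab:
        for entry in additional_vocab:
            # Each entry needs a `content` field; `sounds_like` is optional.
            if "content" not in entry:
                raise ValueError("each additional_vocab entry needs a 'content' field")
        kwargs["additional_vocab"] = additional_vocab

    return kwargs


options = build_speechmatics_kwargs(
    max_delay=0.8,
    diarization="speaker",
    speaker_sensitivity=0.6,
    additional_vocab=[{"content": "LiveKit", "sounds_like": ["live kit"]}],
)
# The resulting dict would then be passed as
# inference.STT(model="speechmatics/enhanced", extra_kwargs=options).
```

In Node.js, the same dict shape would go in modelOptions.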

Plugin

The Speechmatics plugin connects directly to the Speechmatics API with your own API key. The plugin is available for Python only; for Node.js, use LiveKit Inference.

Available in Python

Installation

Install the plugin from PyPI:

uv add "livekit-agents[speechmatics,silero]~=1.5"

Silero VAD dependency

The default turn detection mode (EXTERNAL) uses an external VAD, so this command includes the silero extra to install Silero VAD. For details and other modes, see Turn detection.

Authentication

The Speechmatics plugin requires an API key.

Set SPEECHMATICS_API_KEY in your .env file.
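A missing key is easier to diagnose at startup than on the first transcription request. The sketch below is one way to fail early. It assumes the .env file has already been loaded into the process environment (for example by a tool such as python-dotenv), and the helper name require_api_key is hypothetical.

```python
import os
from typing import Mapping


def require_api_key(
    env: Mapping[str, str] = os.environ,
    name: str = "SPEECHMATICS_API_KEY",
) -> str:
    """Hypothetical helper: return the key, or raise a clear error if it is missing."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value


# Demonstration with an explicit mapping instead of the real environment.
demo_key = require_api_key({"SPEECHMATICS_API_KEY": "example-key"})
```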

Usage

Use Speechmatics STT in an AgentSession or as a standalone transcription service. For example, you can use this STT in the Voice AI quickstart.

from livekit.agents import AgentSession
from livekit.plugins import speechmatics

session = AgentSession(
    stt=speechmatics.STT(),
    # ... llm, tts, etc.
)

Turn detection

The turn_detection_mode parameter sets which component detects the end of a turn. It supports four modes:

Mode	Detection method
`EXTERNAL` (default)	An external VAD. Works with LiveKit's turn detection.
`ADAPTIVE`	Speechmatics' own voice activity detection and the pace of speech.
`SMART_TURN`	A Speechmatics machine learning model.
`FIXED`	A fixed period of silence, set by the `end_of_utterance_silence_trigger` parameter.

The default EXTERNAL mode needs no extra configuration, as shown in the Usage example.

Silero VAD in EXTERNAL mode

In EXTERNAL mode, the plugin loads Silero VAD automatically. To use a different VAD, pass it as the vad parameter. To provide the end-of-turn signal yourself, pass vad=None.

To have Speechmatics detect the end of a turn instead, set turn_detection_mode to ADAPTIVE, SMART_TURN, or FIXED, and set turn_detection="stt" in the turn handling options:

from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import speechmatics

session = AgentSession(
    stt=speechmatics.STT(
        turn_detection_mode=speechmatics.TurnDetectionMode.ADAPTIVE,
    ),
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
    ),
    # ... llm, tts, etc.
)
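The pairing rule described above (EXTERNAL relies on a VAD, while the other three modes need turn_detection="stt") can be captured in a small lookup. This is only a sketch: the mode names are written as plain strings rather than the plugin's TurnDetectionMode enum, the helper name session_turn_detection is hypothetical, and returning None for EXTERNAL is an assumption meaning "leave the session's turn detection at its default".

```python
from typing import Optional

# Mode names from the turn detection table, as plain strings for illustration.
STT_SIDE_MODES = {"ADAPTIVE", "SMART_TURN", "FIXED"}
ALL_MODES = STT_SIDE_MODES | {"EXTERNAL"}


def session_turn_detection(mode: str) -> Optional[str]:
    """Hypothetical helper: pick the turn_detection value that matches a mode."""
    if mode not in ALL_MODES:
        raise ValueError(f"unknown turn detection mode: {mode!r}")
    # Speechmatics decides end of turn in these modes, so the session
    # should follow the STT's signal.
    if mode in STT_SIDE_MODES:
        return "stt"
    # EXTERNAL: a VAD decides, so no STT-based turn detection is requested.
    return None


adaptive_setting = session_turn_detection("ADAPTIVE")
external_setting = session_turn_detection("EXTERNAL")
```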

Parameters

This section describes the key parameters for the Speechmatics STT plugin. See the plugin reference for a complete list of all available parameters.

operating_point (string, default: enhanced)

Operating point to use for the transcription. This parameter balances accuracy, speed, and resource usage. To learn more, see Operating points.

language (LanguageCode, default: en)

Language code for the input audio. All languages are global, meaning that regardless of which language you select, the system can recognize different dialects and accents. For the full list, see Supported Languages.

include_partials (bool, default: true)

Include partial transcripts in the output. Partial transcripts give you preliminary text that is updated as more context arrives, until the higher-accuracy final transcript is returned. Partials are returned faster but without post-processing such as formatting. When enabled, the STT service emits INTERIM_TRANSCRIPT events.

enable_diarization (bool, default: true)

Enable speaker diarization. When enabled, spoken words are attributed to unique speakers. You can use the speaker_sensitivity parameter to adjust the sensitivity of diarization. To learn more, see Speaker diarization.

max_delay (number, default: 1.0)

The maximum delay in seconds between the end of a spoken word and returning the final transcript results. Lower values reduce latency but can reduce accuracy.

end_of_utterance_silence_trigger (float)

The duration of silence in seconds that triggers end of utterance. The STT waits this long for any further transcribed words before emitting FINAL_TRANSCRIPT events.

turn_detection_mode (TurnDetectionMode, default: TurnDetectionMode.EXTERNAL)

Sets which component detects the end of a turn. Defaults to TurnDetectionMode.EXTERNAL, which uses an external VAD and works with LiveKit's turn detection. See Turn detection for the available modes and examples.

speaker_active_format (string)

Formatter for speaker identification in transcription output. The following attributes are available:

{speaker_id}: The ID of the speaker.
{text}: The text spoken by the speaker.

By default, if speaker diarization is enabled and this parameter is not set, the transcription output is not formatted for speaker identification.

You might need to add instructions to the language model's system prompt so that it can interpret this formatting. To learn more, see Speaker diarization.

speaker_sensitivity (float, default: 0.5)

Sensitivity of speaker detection. Valid values are between 0 and 1. Higher values increase sensitivity and can help when two or more speakers have similar voices. To learn more, see Speaker sensitivity.

The enable_diarization parameter must be set to True for this parameter to take effect.

prefer_current_speaker (bool, default: false)

When speaker diarization is enabled and this parameter is set to True, the STT is less likely to switch between similar-sounding speakers. To learn more, see Prefer current speaker.

Speaker diarization

You can enable speaker diarization to attribute speech to individual speakers. When enabled, the STT reports capabilities.diarization = True, and the transcription output can include speaker information through the speaker_id and text attributes.

With diarization enabled, wrap the Speechmatics STT with MultiSpeakerAdapter to detect the primary speaker and format transcripts by speaker.

Enable diarization in the STT constructor. LiveKit Inference uses a string diarization mode, while the plugin uses the boolean enable_diarization:

# LiveKit Inference
from livekit.agents import inference

stt = inference.STT(
    model="speechmatics/enhanced",
    extra_kwargs={
        # "none", "speaker", "channel", or "channel_and_speaker"
        "diarization": "speaker",
    },
)

# Plugin
from livekit.plugins import speechmatics

stt = speechmatics.STT(
    enable_diarization=True,
    speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
)

With the plugin, use speaker_active_format to format speaker output. For example:

<{speaker_id}>{text}</{speaker_id}> produces <S1>Hello</S1>.
[Speaker {speaker_id}] {text} produces [Speaker S1] Hello.
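To preview what a template produces before passing it to the plugin, you can render it locally. The sketch below uses Python's built-in str.format, on the assumption that the plugin substitutes {speaker_id} and {text} in the same way. The helper name render_speaker is hypothetical.

```python
def render_speaker(template: str, speaker_id: str, text: str) -> str:
    """Hypothetical helper: fill a speaker_active_format-style template locally."""
    return template.format(speaker_id=speaker_id, text=text)


xml_style = render_speaker("<{speaker_id}>{text}</{speaker_id}>", "S1", "Hello")
label_style = render_speaker("[Speaker {speaker_id}] {text}", "S1", "Hello")

print(xml_style)    # <S1>Hello</S1>
print(label_style)  # [Speaker S1] Hello
```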

Additional resources

The following resources provide more information about using Speechmatics with LiveKit Agents.

Python package

The livekit-plugins-speechmatics package on PyPI.

Plugin reference

Reference for the Speechmatics STT plugin.

GitHub repo

View the source or contribute to the LiveKit Speechmatics STT plugin.

Speechmatics docs

Speechmatics STT docs.

Voice AI quickstart

Get started with LiveKit Agents and Speechmatics STT.
