Overview
Speechmatics provides AI speech recognition technology. Their advanced speech models deliver highly accurate transcriptions across diverse languages, dialects, and accents. With LiveKit’s Speechmatics integration and the Agents framework, you can build voice AI agents that provide reliable, real-time transcriptions.
If you're looking to build an AI voice assistant with Speechmatics, check out our Voice Agent Quickstart guide and use the Speechmatics STT module as demonstrated below.
Quick reference
Environment variables
SPEECHMATICS_API_KEY=<your-speechmatics-api-key>
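You can set this key in your shell or, as a convenience, load it from a local .env file at startup. A minimal sketch, assuming the python-dotenv package is installed:

from dotenv import load_dotenv

# The Speechmatics plugin reads SPEECHMATICS_API_KEY from the environment.
load_dotenv()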
STT
LiveKit's Speechmatics integration provides a speech-to-text (STT) interface that can be used as the first stage in a VoicePipelineAgent or as a standalone transcription service. For a complete reference of all available parameters, see the plugin reference for Python.
The Speechmatics STT plugin is currently only supported for Python.
Usage
from livekit.plugins import speechmatics
from livekit.plugins.speechmatics.types import TranscriptionConfig, AudioSettings

speechmatics_stt = speechmatics.STT(
    transcription_config=TranscriptionConfig(
        operating_point="enhanced",
        enable_partials=True,
        language="en",
        output_locale="en-US",
        diarization="speaker",
        enable_entities=True,
        additional_vocab=[
            {"content": "financial crisis"},
            {"content": "gnocchi", "sounds_like": ["nyohki", "nokey", "nochi"]},
            {"content": "CEO", "sounds_like": ["C.E.O."]},
        ],
        max_delay=0.7,
        max_delay_mode="flexible",
    ),
    audio_settings=AudioSettings(
        encoding="pcm_s16le",
        sample_rate=16000,
    ),
)
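The configured STT instance can then be passed as the first stage of a VoicePipelineAgent. The following is a minimal sketch, assuming a job entrypoint where ctx.room and participant are available, and with the Silero VAD and OpenAI plugins as illustrative choices for the other stages:

from livekit.agents import llm
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import openai, silero

# speechmatics_stt is the instance configured above; the VAD, LLM, and TTS
# stages shown here are example choices, not requirements.
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=speechmatics_stt,
    llm=openai.LLM(),
    tts=openai.TTS(),
    chat_ctx=llm.ChatContext(),
)
agent.start(ctx.room, participant)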
Parameters
operating_point
Operating point to use for the transcription, per the required accuracy and complexity. To learn more, see Accuracy Reference.
enable_partials
Partial transcripts let you receive preliminary transcriptions that are updated as more context becomes available, until the higher-accuracy final transcript is returned. Partials arrive faster but without any post-processing such as formatting. (A sketch that consumes both partial and final results follows this list.)
language
ISO 639-1 language code. All languages are global and cover different dialects and accents. For the list of all supported languages, see Supported Languages.
output_locale
RFC 5646 language code for the transcription output. For supported locales, see Output Locale.
diarization
Setting this to speaker enables labeling of the different speakers detected in the transcribed output, e.g. S1, S2. For more information, see Speaker Diarization.
additional_vocab
Adds custom words for each transcription job. To learn more, see Custom Dictionary.
enable_entities
Outputs the written form of entities such as phone numbers, email addresses, and currency amounts in the transcript. To learn more about the supported entities, see Entities.
max_delay
The delay in seconds between the end of a spoken word and the return of the final transcript results.
max_delay_mode
If set to flexible, the final transcript is delayed until numeral formatting is complete. To learn more, see Numeral Formatting.
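To use the plugin as a standalone transcription service, you can push audio frames into an STT stream and read back partial and final results. The following is a minimal sketch, assuming an rtc.Track received from a room and enable_partials=True; the event types come from the Agents framework's stt module:

import asyncio

from livekit import rtc
from livekit.agents import stt
from livekit.plugins import speechmatics


async def transcribe_track(track: rtc.Track):
    stt_stream = speechmatics.STT().stream()
    audio_stream = rtc.AudioStream(track)

    async def push_audio():
        # Forward raw audio frames from the track into the STT stream.
        async for event in audio_stream:
            stt_stream.push_frame(event.frame)
        stt_stream.end_input()

    push_task = asyncio.create_task(push_audio())

    async for event in stt_stream:
        if event.type == stt.SpeechEventType.INTERIM_TRANSCRIPT:
            # Preliminary text; revised as more context arrives.
            print("partial:", event.alternatives[0].text)
        elif event.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
            # Higher-accuracy text with post-processing applied.
            print("final:", event.alternatives[0].text)

    await push_task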