Overview
Speechmatics provides AI speech recognition technology. Their advanced speech models deliver highly accurate transcriptions across diverse languages, dialects, and accents. With LiveKit’s Speechmatics integration and the Agents framework, you can build voice AI agents that provide reliable, real-time transcriptions.
If you're looking to build an AI voice assistant with Speechmatics, check out our Voice Agent Quickstart guide and use the Speechmatics STT module as demonstrated below.
Quick reference
Environment variables
SPEECHMATICS_API_KEY=<your-speechmatics-api-key>
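You can set this key in your shell or, as a convenience, load it from a local .env file at startup. A minimal sketch, assuming the python-dotenv package is installed:

from dotenv import load_dotenv

# The Speechmatics plugin reads SPEECHMATICS_API_KEY from the environment.
load_dotenv()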
STT
LiveKit's Speechmatics integration provides a speech-to-text (STT) interface that can be used as the first stage in a VoicePipelineAgent or as a standalone transcription service. For a complete reference of all available parameters, see the plugin reference for Python.
The Speechmatics STT plugin is currently only supported for Python.
Usage
from livekit.plugins import speechmatics
from livekit.plugins.speechmatics.types import TranscriptionConfig, AudioSettings

speechmatics_stt = speechmatics.STT(
    transcription_config=TranscriptionConfig(
        operating_point="enhanced",
        enable_partials=True,
        language="en",
        output_locale="en-US",
        diarization="speaker",
        enable_entities=True,
        additional_vocab=[
            {"content": "financial crisis"},
            {"content": "gnocchi", "sounds_like": ["nyohki", "nokey", "nochi"]},
            {"content": "CEO", "sounds_like": ["C.E.O."]},
        ],
        max_delay=0.7,
        max_delay_mode="flexible",
    ),
    audio_settings=AudioSettings(
        encoding="pcm_s16le",
        sample_rate=16000,
    ),
)
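The configured STT instance can then be passed as the first stage of a VoicePipelineAgent. The following is a minimal sketch, assuming a job entrypoint where ctx.room and participant are available, and with the Silero VAD and OpenAI plugins as illustrative choices for the other stages:

from livekit.agents import llm
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import openai, silero

# speechmatics_stt is the instance configured above; the VAD, LLM, and TTS
# stages shown here are example choices, not requirements.
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=speechmatics_stt,
    llm=openai.LLM(),
    tts=openai.TTS(),
    chat_ctx=llm.ChatContext(),
)
agent.start(ctx.room, participant)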
Parameters
operating_point
Operating point to use for the transcription, per the required accuracy and complexity. To learn more, see Accuracy Reference.
enable_partials
Partial transcripts let you receive preliminary transcriptions that are updated as more context becomes available, until the higher-accuracy final transcript is returned. Partials arrive faster but without any post-processing such as formatting. (A sketch that consumes both partial and final results follows this list.)
language
ISO 639-1 language code. All languages are global and cover different dialects and accents. For the list of all supported languages, see Supported Languages.
output_locale
RFC 5646 language code for the transcription output. For supported locales, see Output Locale.
diarization
Setting this to speaker enables labeling of the different speakers detected in the transcribed output, e.g. S1, S2. For more information, see Speaker Diarization.
additional_vocab
Adds custom words for each transcription job. To learn more, see Custom Dictionary.
enable_entities
Outputs the written form of entities such as phone numbers, email addresses, and currency amounts in the transcript. To learn more about the supported entities, see Entities.
max_delay
The delay in seconds between the end of a spoken word and the return of the final transcript results.
max_delay_mode
If set to flexible, the final transcript is delayed until numeral formatting is complete. To learn more, see Numeral Formatting.
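To use the plugin as a standalone transcription service, you can push audio frames into an STT stream and read back partial and final results. The following is a minimal sketch, assuming an rtc.Track received from a room and enable_partials=True; the event types come from the Agents framework's stt module:

import asyncio

from livekit import rtc
from livekit.agents import stt
from livekit.plugins import speechmatics


async def transcribe_track(track: rtc.Track):
    stt_stream = speechmatics.STT().stream()
    audio_stream = rtc.AudioStream(track)

    async def push_audio():
        # Forward raw audio frames from the track into the STT stream.
        async for event in audio_stream:
            stt_stream.push_frame(event.frame)
        stt_stream.end_input()

    push_task = asyncio.create_task(push_audio())

    async for event in stt_stream:
        if event.type == stt.SpeechEventType.INTERIM_TRANSCRIPT:
            # Preliminary text; revised as more context arrives.
            print("partial:", event.alternatives[0].text)
        elif event.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
            # Higher-accuracy text with post-processing applied.
            print("final:", event.alternatives[0].text)

    await push_task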