Speech-to-text (STT) integrations

Guides for adding STT integrations to your agents.

Overview

Speech-to-text (STT) models process incoming audio and convert it to text in realtime. In voice AI, this text is then processed by an LLM to generate a response, which is turned back into speech using a TTS model.
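A typical pipeline wires an STT plugin into an AgentSession alongside LLM and TTS plugins. The following is a minimal sketch, assuming the Deepgram, OpenAI, and Cartesia plugins are installed; any supported providers can be swapped in:

from livekit.agents import AgentSession
from livekit.plugins import cartesia, deepgram, openai

# STT -> LLM -> TTS pipeline: transcribe speech, generate a reply, speak it back
session = AgentSession(
    stt=deepgram.STT(model="nova-2"),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=cartesia.TTS(),
)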

The agents framework includes plugins for popular STT providers out of the box. You can also implement the STT node to provide custom behavior or use an alternative provider.
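For example, an Agent subclass can override its STT node to pre-process audio before it reaches the configured STT plugin. The sketch below is illustrative only: it assumes the stt_node override pattern and delegation to the default node, and the exact signature may differ between framework versions.

from typing import AsyncIterable

from livekit import rtc
from livekit.agents import Agent

class FilteredSTTAgent(Agent):
    # Hypothetical subclass for illustration; check your framework version
    # for the exact stt_node signature.
    async def stt_node(self, audio: AsyncIterable[rtc.AudioFrame], model_settings):
        async def filtered_audio():
            async for frame in audio:
                # Apply custom preprocessing (e.g. filtering or gain control) here
                yield frame

        # Delegate to the default STT node, which uses the session's STT plugin
        async for event in Agent.default.stt_node(self, filtered_audio(), model_settings):
            yield event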

LiveKit is open source and welcomes new plugin contributions.

How to use

The following sections describe high-level usage only.

For more detailed information about installing and using plugins, see the plugins overview.

Usage in AgentSession

Construct an AgentSession or Agent with an STT instance created by your desired plugin:

from livekit.agents import AgentSession
from livekit.plugins import deepgram

session = AgentSession(
    stt=deepgram.STT(model="nova-2")
)

AgentSession automatically integrates with VAD to detect user turns, so it knows when to start and stop STT.
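For example, you can pass a VAD instance alongside the STT plugin. A minimal sketch, assuming the Silero VAD plugin is installed:

from livekit.agents import AgentSession
from livekit.plugins import deepgram, silero

# VAD detects the start and end of speech so the session knows when to run STT
session = AgentSession(
    vad=silero.VAD.load(),
    stt=deepgram.STT(model="nova-2"),
)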

Standalone usage

You can also use an STT instance in a standalone fashion by creating a stream. Use push_frame to add realtime audio frames to the stream, then consume a stream of SpeechEvent as output.

Here is an example of a standalone STT app:

import asyncio

from dotenv import load_dotenv

from livekit import agents, rtc
from livekit.agents.stt import SpeechEventType, SpeechEvent
from typing import AsyncIterable
from livekit.plugins import (
    deepgram,
)

load_dotenv()

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    @ctx.room.on("track_subscribed")
    def on_track_subscribed(track: rtc.RemoteTrack):
        print(f"Subscribed to track: {track.name}")
        asyncio.create_task(process_track(track))

async def process_track(track: rtc.RemoteTrack):
    stt = deepgram.STT(model="nova-2")
    stt_stream = stt.stream()
    audio_stream = rtc.AudioStream(track)

    async with asyncio.TaskGroup() as tg:
        # Create task for processing STT stream
        stt_task = tg.create_task(process_stt_stream(stt_stream))

        # Process audio stream
        async for audio_event in audio_stream:
            stt_stream.push_frame(audio_event.frame)

        # Indicates the end of the audio stream
        stt_stream.end_input()

        # Wait for STT processing to complete
        await stt_task

async def process_stt_stream(stream: AsyncIterable[SpeechEvent]):
    try:
        async for event in stream:
            if event.type == SpeechEventType.FINAL_TRANSCRIPT:
                print(f"Final transcript: {event.alternatives[0].text}")
            elif event.type == SpeechEventType.INTERIM_TRANSCRIPT:
                print(f"Interim transcript: {event.alternatives[0].text}")
            elif event.type == SpeechEventType.START_OF_SPEECH:
                print("Start of speech")
            elif event.type == SpeechEventType.END_OF_SPEECH:
                print("End of speech")
    finally:
        await stream.aclose()

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

VAD and StreamAdapter

Some STT providers or models, such as Whisper, don't support streaming input. In these cases, your app must determine when a chunk of audio represents a complete segment of speech. You can do this using VAD together with the StreamAdapter class.

The following example modifies the previous example to use VAD and StreamAdapter to buffer user speech until VAD detects the end of speech:

from livekit import agents, rtc
from livekit.plugins import openai, silero

async def process_track(ctx: agents.JobContext, track: rtc.Track):
    whisper_stt = openai.STT()

    vad = silero.VAD.load(
        min_speech_duration=0.1,
        min_silence_duration=0.5,
    )
    vad_stream = vad.stream()

    # StreamAdapter will buffer audio until VAD emits END_SPEAKING event
    stt = agents.stt.StreamAdapter(whisper_stt, vad_stream)
    stt_stream = stt.stream()
    ...

Available providers

The following table lists the available STT providers for LiveKit Agents.

Further reading
