Overview
Speech-to-text (STT) models process incoming audio and convert it to text in realtime. In voice AI, this text is then processed by an LLM to generate a response, which is then turned back into speech using a TTS model.
The agents framework includes plugins for popular STT providers out of the box. You can also implement the STT node to provide custom behavior or use an alternative provider.
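As a sketch of the second approach, the example below overrides the STT node on a custom agent. It assumes an Agents version in which Agent.stt_node can be overridden and the built-in pipeline is exposed as Agent.default.stt_node; the transcript trimming is purely illustrative.

```python
from typing import AsyncIterable

from livekit import rtc
from livekit.agents import Agent, ModelSettings, stt

class PostProcessingAgent(Agent):
    # Delegate recognition to the default STT node, then lightly post-process
    # final transcripts before they reach the LLM (illustrative only).
    async def stt_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ) -> AsyncIterable[stt.SpeechEvent]:
        async for event in Agent.default.stt_node(self, audio, model_settings):
            if event.type == stt.SpeechEventType.FINAL_TRANSCRIPT and event.alternatives:
                event.alternatives[0].text = event.alternatives[0].text.strip()
            yield event
```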
LiveKit is open source and welcomes new plugin contributions.
How to use
The following sections describe high-level usage only.
For more detailed information about installing and using plugins, see the plugins overview.
Usage in AgentSession
Construct an AgentSession or Agent with an STT instance created by your desired plugin:
```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram

session = AgentSession(
    stt=deepgram.STT(model="nova-2"),
)
```
AgentSession automatically integrates with VAD to detect user turns and know when to start and stop STT.
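For example, a session can be given an explicit VAD instance alongside the STT plugin. A minimal sketch, assuming the Silero VAD plugin is installed and that your AgentSession version accepts a vad argument:

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, silero

session = AgentSession(
    stt=deepgram.STT(model="nova-2"),
    vad=silero.VAD.load(),  # detects user turns so STT is started and stopped appropriately
)
```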
Standalone usage
You can also use an STT instance in a standalone fashion by creating a stream. Use push_frame to add realtime audio frames to the stream, then consume a stream of SpeechEvent as output.
Here is an example of a standalone STT app:
```python
import asyncio

from dotenv import load_dotenv
from livekit import agents, rtc
from livekit.agents.stt import SpeechEventType, SpeechEvent
from typing import AsyncIterable
from livekit.plugins import (
    deepgram,
)

load_dotenv()

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    @ctx.room.on("track_subscribed")
    def on_track_subscribed(track: rtc.RemoteTrack):
        print(f"Subscribed to track: {track.name}")
        asyncio.create_task(process_track(track))

async def process_track(track: rtc.RemoteTrack):
    stt = deepgram.STT(model="nova-2")
    stt_stream = stt.stream()
    audio_stream = rtc.AudioStream(track)

    async with asyncio.TaskGroup() as tg:
        # Create task for processing STT stream
        stt_task = tg.create_task(process_stt_stream(stt_stream))

        # Process audio stream
        async for audio_event in audio_stream:
            stt_stream.push_frame(audio_event.frame)

        # Indicates the end of the audio stream
        stt_stream.end_input()

        # Wait for STT processing to complete
        await stt_task

async def process_stt_stream(stream: AsyncIterable[SpeechEvent]):
    try:
        async for event in stream:
            if event.type == SpeechEventType.FINAL_TRANSCRIPT:
                print(f"Final transcript: {event.alternatives[0].text}")
            elif event.type == SpeechEventType.INTERIM_TRANSCRIPT:
                print(f"Interim transcript: {event.alternatives[0].text}")
            elif event.type == SpeechEventType.START_OF_SPEECH:
                print("Start of speech")
            elif event.type == SpeechEventType.END_OF_SPEECH:
                print("End of speech")
    finally:
        await stream.aclose()

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```
VAD and StreamAdapter
Some STT providers or models, such as Whisper, don't support streaming input. In these cases, your app must determine when a chunk of audio represents a complete segment of speech. You can do this using VAD together with the StreamAdapter class.
The following example modifies the previous example to use VAD and StreamAdapter to buffer user speech until VAD detects the end of speech:
```python
from livekit import agents, rtc
from livekit.plugins import openai, silero

async def process_track(ctx: agents.JobContext, track: rtc.Track):
    whisper_stt = openai.STT()
    vad = silero.VAD.load(
        min_speech_duration=0.1,
        min_silence_duration=0.5,
    )
    vad_stream = vad.stream()
    # StreamAdapter will buffer audio until VAD emits END_SPEAKING event
    stt = agents.stt.StreamAdapter(whisper_stt, vad_stream)
    stt_stream = stt.stream()
    ...
```
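From here the adapter's stream is consumed like any other STT stream. The following continuation is a sketch rather than part of the original snippet: it assumes asyncio is imported and reuses the process_stt_stream helper from the standalone example above.

```python
    # Continuing process_track: push audio frames into the adapter's stream;
    # transcripts are consumed by the same helper used in the standalone example.
    audio_stream = rtc.AudioStream(track)

    async with asyncio.TaskGroup() as tg:
        stt_task = tg.create_task(process_stt_stream(stt_stream))

        async for audio_event in audio_stream:
            stt_stream.push_frame(audio_event.frame)

        stt_stream.end_input()
        await stt_task
```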
Available providers
The following table lists the available STT providers for LiveKit Agents.
Provider | Plugin
---|---
Amazon Transcribe | aws
AssemblyAI | assemblyai
Azure AI Speech | azure
Clova | clova
Deepgram | deepgram
fal | fal
Gladia | gladia
Google Cloud | google
Groq | groq
OpenAI | openai
Speechmatics | speechmatics
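Because every plugin implements the same STT interface, switching providers usually amounts to changing the constructor. A sketch, assuming the AssemblyAI plugin is installed and its API key is available in the environment:

```python
from livekit.agents import AgentSession
from livekit.plugins import assemblyai

# Same session setup as before, with a different STT provider plugged in.
# Constructor options vary by plugin; see each plugin's reference for details.
session = AgentSession(
    stt=assemblyai.STT(),
)
```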