Working with plugins

How to work with Agents framework plugins to compose agents.

The Agents framework includes a set of prebuilt plugins that make it easier to build an AI agent. These plugins cover common tasks like converting speech to text (STT) or text to speech (TTS), running inference on a generative AI model, and more.

The API for plugins is standardized to make it easy to switch between different providers. Having a consistent interface also makes it simpler for anyone to extend the framework and build new plugins for other providers.
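The value of a consistent interface can be sketched in plain Python (an illustration of the idea, not the actual LiveKit base classes): because agent code depends only on a shared set of methods, swapping one provider for another is a one-line change.

```python
# Two hypothetical provider classes exposing the same interface.
class DeepgramLikeSTT:
    def stream(self) -> str:
        return "deepgram stream"


class WhisperLikeSTT:
    def stream(self) -> str:
        return "whisper stream"


def build_agent(stt) -> str:
    # Agent code relies only on the shared interface, so either
    # provider can be passed in without any other changes.
    return stt.stream()


print(build_agent(DeepgramLikeSTT()))  # provider is a one-line swap
print(build_agent(WhisperLikeSTT()))
```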

STT

STT converts audio frames to a stream of text. The following example uses Deepgram STT to process an audio stream:

from livekit import agents, rtc
from livekit.agents.stt import SpeechEventType, SpeechEvent
from livekit.plugins import deepgram
from typing import AsyncIterable


async def process_track(ctx: agents.JobContext, track: rtc.Track):
    stt = deepgram.STT()
    stt_stream = stt.stream()
    audio_stream = rtc.AudioStream(track)

    ctx.create_task(process_text_from_speech(stt_stream))
    async for audio_event in audio_stream:
        stt_stream.push_frame(audio_event.frame)

    stt_stream.end_input()


async def process_text_from_speech(stream: AsyncIterable[SpeechEvent]):
    async for event in stream:
        if event.type == SpeechEventType.FINAL_TRANSCRIPT:
            text = event.alternatives[0].text
            # Do something with text
        elif event.type == SpeechEventType.INTERIM_TRANSCRIPT:
            # Interim results may change as more audio arrives
            pass
        elif event.type == SpeechEventType.START_OF_SPEECH:
            pass
        elif event.type == SpeechEventType.END_OF_SPEECH:
            pass

    await stream.aclose()

Voice activity detector (VAD) and StreamAdapter

Some providers or models, such as Whisper, do not support streaming input. In these cases, the application must determine when a chunk of audio represents a complete segment of speech. This can be accomplished using a VAD together with the StreamAdapter class.

The following example modifies the example above to use VAD and StreamAdapter:

from livekit import agents, rtc
from livekit.plugins import openai, silero


async def process_track(ctx: agents.JobContext, track: rtc.Track):
    whisper_stt = openai.STT()
    vad = silero.VAD.load(
        min_speech_duration=0.1,
        min_silence_duration=0.5,
    )
    vad_stream = vad.stream()
    # StreamAdapter will buffer audio until VAD emits an END_OF_SPEECH event
    stt = agents.stt.StreamAdapter(whisper_stt, vad_stream)
    stt_stream = stt.stream()
    ...
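The buffering behavior that StreamAdapter provides can be sketched in plain Python (a simplified illustration, not the actual LiveKit implementation): frames accumulate until the VAD signals end of speech, at which point the complete segment is handed to a non-streaming recognizer in one batch.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentBuffer:
    """Accumulates audio frames until an end-of-speech signal."""
    frames: list = field(default_factory=list)
    segments: list = field(default_factory=list)

    def push_frame(self, frame: bytes) -> None:
        # Streaming input: buffer each incoming frame
        self.frames.append(frame)

    def on_end_of_speech(self) -> None:
        # VAD detected silence: flush the buffered utterance as one
        # segment, which a batch STT such as Whisper could transcribe
        if self.frames:
            self.segments.append(b"".join(self.frames))
            self.frames.clear()


buf = SegmentBuffer()
for frame in [b"he", b"llo"]:
    buf.push_frame(frame)
buf.on_end_of_speech()  # one complete segment: b"hello"
```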

TTS

TTS synthesizes text into audio frames. The following example uses ElevenLabs TTS to convert text input into audio and plays the audio stream:

from livekit import agents, rtc
from livekit.agents.tts import SynthesizedAudio
from livekit.plugins import elevenlabs
from typing import AsyncIterable

ctx: agents.JobContext = ...
text_stream: AsyncIterable[str] = ...

audio_source = rtc.AudioSource(44100, 1)
track = rtc.LocalAudioTrack.create_audio_track("agent-audio", audio_source)
await ctx.room.local_participant.publish_track(track)

tts = elevenlabs.TTS(model_id="eleven_turbo_v2")
tts_stream = tts.stream()

# create a task to consume and publish audio frames
ctx.create_task(send_audio(tts_stream))

# push text into the stream; the TTS stream will emit audio frames along with
# events indicating sentence (or segment) boundaries
async for text in text_stream:
    tts_stream.push_text(text)

tts_stream.end_input()


async def send_audio(audio_stream: AsyncIterable[SynthesizedAudio]):
    async for a in audio_stream:
        await audio_source.capture_frame(a.frame)
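The push/consume pattern above can be sketched with a plain asyncio queue (illustrative only, not the LiveKit API): one side pushes text segments while a separate task consumes the resulting items as they become available, with a sentinel playing the role of `end_input()`.

```python
import asyncio


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    played = []

    async def send_audio():
        # Consumer task: drain items as they arrive
        while True:
            item = await queue.get()
            if item is None:  # sentinel mirrors end_input()
                break
            played.append(f"audio({item})")

    consumer = asyncio.create_task(send_audio())

    # Producer: push text segments into the stream
    for text in ["Hello", "world"]:
        await queue.put(text)
    await queue.put(None)

    await consumer
    return played


print(asyncio.run(main()))  # ['audio(Hello)', 'audio(world)']
```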

Building your own

The plugin framework is designed to be extensible, allowing anyone to build their own plugin. Your plugin can integrate with various providers or directly load models for local inference.

By adopting the standard STT or TTS interfaces, you can abstract away implementation specifics and simplify switching between different providers in your agent code.
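The shape of such an abstraction can be sketched in plain Python (a minimal illustration of the pattern, not the actual LiveKit base classes; `MyProviderTTS` is a hypothetical integration): a custom plugin implements the abstract methods, and agent code depends only on the base class.

```python
from abc import ABC, abstractmethod


class BaseTTS(ABC):
    """A provider-agnostic TTS interface (illustrative only)."""

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return raw audio for the given text."""


class MyProviderTTS(BaseTTS):
    # Hypothetical provider integration; a real plugin would call the
    # provider's SDK or run a local model here.
    def synthesize(self, text: str) -> bytes:
        return f"<audio:{text}>".encode()


def speak(tts: BaseTTS, text: str) -> bytes:
    # Agent code only sees the abstraction, never the provider
    return tts.synthesize(text)


print(speak(MyProviderTTS(), "hi"))  # b'<audio:hi>'
```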

Code contributions to plugins are always welcome. To learn more, see the guidelines for contributions to the Python repository or the Node.js repository.

LiveKit plugins

The following plugins provide utilities for LiveKit agents. For a list of plugins for providers of LLM, STT, and TTS, see Integration guides for LiveKit Agents.

| Plugin | SDK | Feature |
| --- | --- | --- |
| livekit-plugins-browser | Python | Chrome browser. |
| livekit-plugins-llama-index | Python | Support for LlamaIndex query engine and chat engine. Query engine is used primarily for RAG. Chat engine can be used as an LLM in a pipeline agent. |
| livekit-plugins-nltk | Python | Utilities for working with text using NLTK. |
| livekit-plugins-rag | Python | Vector retrieval with Annoy. |
| livekit-plugins-silero | Python, Node.js | Silero VAD. |
| livekit-plugins-turn-detector | Python | LiveKit turn detector. |