Transcriptions

Overview

The Agents framework enables realtime transcriptions of both user speech and LLM-generated text.

VoicePipelineAgent and MultimodalAgent generate and deliver transcriptions automatically. If you're building your own assistant or another transcription use case, you can use STTSegmentsForwarder directly within your agent.

Transcriptions are delivered in segments, each associated with a particular Participant and Track. A segment has a unique id but may be sent progressively as it is generated; check its final property to determine when the segment is complete and will no longer change.
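For reference, a segment carries roughly the following shape. This sketch reflects the TranscriptionSegment type in the livekit-client TypeScript SDK (other SDKs expose equivalent fields); consult your SDK's type definitions for the authoritative version:

interface TranscriptionSegment {
  id: string;                // stable across updates to the same segment
  text: string;              // may grow as the segment is updated
  language: string;
  startTime: number;
  endTime: number;
  final: boolean;            // true once the segment will no longer change
  firstReceivedTime: number; // set client-side on first receipt
  lastReceivedTime: number;  // set client-side on the latest update
}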

Client Integration

Use the Realtime Client SDKs to receive transcription events in your application.

This example uses React with TypeScript, but the principles are the same for other frameworks.

Collect TranscriptionSegment objects by listening for RoomEvent.TranscriptionReceived:

import { useEffect, useState } from 'react';
import { useMaybeRoomContext } from '@livekit/components-react';
import {
  Participant,
  RoomEvent,
  TrackPublication,
  TranscriptionSegment,
} from 'livekit-client';

const room = useMaybeRoomContext();
const [transcriptions, setTranscriptions] = useState<{ [id: string]: TranscriptionSegment }>({});

useEffect(() => {
  if (!room) {
    return;
  }

  const updateTranscriptions = (
    segments: TranscriptionSegment[],
    participant?: Participant,
    publication?: TrackPublication,
  ) => {
    setTranscriptions((prev) => {
      const newTranscriptions = { ...prev };
      // Segments keep the same id across updates, so a newer version
      // of a segment overwrites the older one
      for (const segment of segments) {
        newTranscriptions[segment.id] = segment;
      }
      return newTranscriptions;
    });
  };

  room.on(RoomEvent.TranscriptionReceived, updateTranscriptions);
  return () => {
    room.off(RoomEvent.TranscriptionReceived, updateTranscriptions);
  };
}, [room]);

Then present them in your view:

<ul>
  {Object.values(transcriptions)
    .sort((a, b) => a.firstReceivedTime - b.firstReceivedTime)
    .map((segment) => (
      <li key={segment.id}>{segment.text}</li>
    ))}
</ul>
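Because interim segments arrive before they are finalized, you may want to render them differently. A minimal variation of the list item above, dimming any segment whose final flag is still false:

<li key={segment.id} style={{ opacity: segment.final ? 1.0 : 0.5 }}>
  {segment.text}
</li>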

STTSegmentsForwarder

The STTSegmentsForwarder class provides an interface for delivering transcriptions to clients in realtime. Here's an example that transcribes each subscribed audio track and forwards the segments to the client (the STT plugin shown is one example; substitute your preferred provider):

import asyncio

from livekit import rtc
from livekit.agents import AutoSubscribe, JobContext, stt, transcription

# Any STT plugin can be used here; Deepgram is shown as one example
from livekit.plugins.deepgram import STT


async def _forward_transcription(
    stt_stream: stt.SpeechStream,
    stt_forwarder: transcription.STTSegmentsForwarder,
):
    """Forward the transcription to the client and log the transcript in the console"""
    async for ev in stt_stream:
        # Relay every interim and final speech event to the client
        stt_forwarder.update(ev)
        if ev.type == stt.SpeechEventType.INTERIM_TRANSCRIPT:
            print(ev.alternatives[0].text, end="")
        elif ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
            print("\n")
            print(" -> ", ev.alternatives[0].text)


async def entrypoint(job: JobContext):
    # Named stt_impl to avoid shadowing the stt module imported above
    stt_impl = STT()
    tasks = []

    async def transcribe_track(participant: rtc.RemoteParticipant, track: rtc.Track):
        audio_stream = rtc.AudioStream(track)
        stt_forwarder = transcription.STTSegmentsForwarder(
            room=job.room, participant=participant, track=track
        )
        stt_stream = stt_impl.stream()
        stt_task = asyncio.create_task(
            _forward_transcription(stt_stream, stt_forwarder)
        )
        tasks.append(stt_task)

        # Feed incoming audio frames into the STT stream
        async for ev in audio_stream:
            stt_stream.push_frame(ev.frame)

    @job.room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.TrackPublication,
        participant: rtc.RemoteParticipant,
    ):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            tasks.append(asyncio.create_task(transcribe_track(participant, track)))

    # Connect after registering the handler so no audio tracks are missed
    await job.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)