Overview
The Agents framework enables realtime transcription of both user speech and LLM-generated text. VoicePipelineAgent and MultimodalAgent generate and deliver transcriptions automatically. If you're building your own assistant or another transcription use case, you can use STTSegmentsForwarder directly within your agent.
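For example, here's a minimal sketch of an agent that gets automatic transcriptions from VoicePipelineAgent. The Silero, Deepgram, and OpenAI plugins are illustrative choices, not requirements; any supported VAD, STT, LLM, and TTS plugins behave the same way:

```python
from livekit.agents import JobContext
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
    await ctx.connect()
    participant = await ctx.wait_for_participant()

    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
    )

    # Once started, the agent publishes transcriptions of both the user's
    # speech and its own responses to the room; no extra wiring is needed.
    agent.start(ctx.room, participant)
```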
Transcriptions are delivered in segments, each associated with a particular Participant and Track. Each segment has a unique id, but may be sent progressively as it is generated. You can monitor the final property to determine when a segment is complete and will no longer change.
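To make the segment lifecycle concrete, here's a sketch that publishes interim and final versions of a single segment by hand using the rtc API; this is essentially what STTSegmentsForwarder (described below) automates for you. The publish_segment helper and its hard-coded text are hypothetical, for illustration only:

```python
from livekit import rtc


async def publish_segment(room: rtc.Room, track_sid: str, seg_id: str):
    # Interim updates reuse the same id; only the last one is marked final.
    for text, final in [("Hello", False), ("Hello wor", False), ("Hello world.", True)]:
        segment = rtc.TranscriptionSegment(
            id=seg_id,      # unchanged across updates to the same utterance
            text=text,
            start_time=0,
            end_time=0,
            final=final,    # True means this segment will no longer change
            language="en",
        )
        await room.local_participant.publish_transcription(
            rtc.Transcription(
                participant_identity=room.local_participant.identity,
                track_sid=track_sid,
                segments=[segment],
            )
        )
```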
Frontend integration
Use the LiveKit SDKs to receive transcription events in your frontend.
This example uses React with TypeScript, but the principles are the same for other frameworks.
Collect TranscriptionSegment objects by listening to RoomEvent.TranscriptionReceived:
```tsx
import { useEffect, useState } from 'react';
import { Participant, RoomEvent, TrackPublication, TranscriptionSegment } from 'livekit-client';
import { useMaybeRoomContext } from '@livekit/components-react';

// Inside your component:
const room = useMaybeRoomContext();
const [transcriptions, setTranscriptions] = useState<{ [id: string]: TranscriptionSegment }>({});

useEffect(() => {
  if (!room) {
    return;
  }

  const updateTranscriptions = (
    segments: TranscriptionSegment[],
    participant?: Participant,
    publication?: TrackPublication,
  ) => {
    setTranscriptions((prev) => {
      const newTranscriptions = { ...prev };
      for (const segment of segments) {
        // Later versions of a segment replace earlier ones, keyed by id
        newTranscriptions[segment.id] = segment;
      }
      return newTranscriptions;
    });
  };

  room.on(RoomEvent.TranscriptionReceived, updateTranscriptions);
  return () => {
    room.off(RoomEvent.TranscriptionReceived, updateTranscriptions);
  };
}, [room]);
```
Then present them in your view:
```tsx
<ul>
  {Object.values(transcriptions)
    .sort((a, b) => a.firstReceivedTime - b.firstReceivedTime)
    .map((segment) => (
      <li key={segment.id}>{segment.text}</li>
    ))}
</ul>
```
STTSegmentsForwarder
The STTSegmentsForwarder class provides an interface for delivering transcriptions to the frontend in realtime. Here's a sample implementation:
```python
import asyncio

from livekit import rtc
from livekit.agents import JobContext, stt, transcription


async def _forward_transcription(
    stt_stream: stt.SpeechStream,
    stt_forwarder: transcription.STTSegmentsForwarder,
):
    """Forward the transcription and log the transcript in the console"""
    async for ev in stt_stream:
        # Relay every event to the frontend, interim and final alike
        stt_forwarder.update(ev)
        if ev.type == stt.SpeechEventType.INTERIM_TRANSCRIPT:
            print(ev.alternatives[0].text, end="")
        elif ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
            print("\n")
            print(" -> ", ev.alternatives[0].text)


async def entrypoint(job: JobContext):
    # STT() stands in for your chosen STT plugin, e.g. livekit.plugins.deepgram.STT
    stt_impl = STT()
    tasks = []

    async def transcribe_track(participant: rtc.RemoteParticipant, track: rtc.Track):
        audio_stream = rtc.AudioStream(track)
        stt_forwarder = transcription.STTSegmentsForwarder(
            room=job.room, participant=participant, track=track
        )
        stt_stream = stt_impl.stream()
        stt_task = asyncio.create_task(_forward_transcription(stt_stream, stt_forwarder))
        tasks.append(stt_task)

        async for ev in audio_stream:
            stt_stream.push_frame(ev.frame)

    @job.room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.TrackPublication,
        participant: rtc.RemoteParticipant,
    ):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            tasks.append(asyncio.create_task(transcribe_track(participant, track)))
```