Transcriptions

Overview

The Agents framework enables realtime transcriptions of both user speech and LLM-generated text.

VoicePipelineAgent and MultimodalAgent generate and deliver transcriptions automatically. If you're building your own assistant or another transcription use case, you can use STTSegmentsForwarder directly within your agent.

Transcriptions are delivered in segments, each associated with a particular Participant and Track. A segment has a unique id but may be sent progressively as it is generated; check its final property to determine when the segment is complete and will no longer change.
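For reference, a segment carries roughly the following shape. This sketch reflects the TranscriptionSegment type in the livekit-client TypeScript SDK (other SDKs expose equivalent fields); consult your SDK's type definitions for the authoritative version:

interface TranscriptionSegment {
  id: string;                // stable across updates to the same segment
  text: string;              // may grow as the segment is updated
  language: string;
  startTime: number;
  endTime: number;
  final: boolean;            // true once the segment will no longer change
  firstReceivedTime: number; // set client-side on first receipt
  lastReceivedTime: number;  // set client-side on the latest update
}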

Client Integration

Use the Realtime Client SDKs to receive transcription events in your application.

This example uses React with TypeScript, but the principles are the same for other frameworks.

Collect TranscriptionSegment objects by listening for RoomEvent.TranscriptionReceived:

import { useEffect, useState } from 'react';
import { useMaybeRoomContext } from '@livekit/components-react';
import {
  Participant,
  RoomEvent,
  TrackPublication,
  TranscriptionSegment,
} from 'livekit-client';

const room = useMaybeRoomContext();
const [transcriptions, setTranscriptions] = useState<{ [id: string]: TranscriptionSegment }>({});

useEffect(() => {
  if (!room) {
    return;
  }

  const updateTranscriptions = (
    segments: TranscriptionSegment[],
    participant?: Participant,
    publication?: TrackPublication,
  ) => {
    setTranscriptions((prev) => {
      const newTranscriptions = { ...prev };
      // Segments keep the same id across updates, so a newer version
      // of a segment overwrites the older one
      for (const segment of segments) {
        newTranscriptions[segment.id] = segment;
      }
      return newTranscriptions;
    });
  };

  room.on(RoomEvent.TranscriptionReceived, updateTranscriptions);
  return () => {
    room.off(RoomEvent.TranscriptionReceived, updateTranscriptions);
  };
}, [room]);

Then present them in your view:

<ul>
  {Object.values(transcriptions)
    .sort((a, b) => a.firstReceivedTime - b.firstReceivedTime)
    .map((segment) => (
      <li key={segment.id}>{segment.text}</li>
    ))}
</ul>
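Because interim segments arrive before they are finalized, you may want to render them differently. A minimal variation of the list item above, dimming any segment whose final flag is still false:

<li key={segment.id} style={{ opacity: segment.final ? 1.0 : 0.5 }}>
  {segment.text}
</li>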

STTSegmentsForwarder

The STTSegmentsForwarder class provides an interface for delivering transcriptions to clients in realtime. Here's an example that transcribes each subscribed audio track and forwards the segments to the client (the STT plugin shown is one example; substitute your preferred provider):

import asyncio

from livekit import rtc
from livekit.agents import AutoSubscribe, JobContext, stt, transcription

# Any STT plugin can be used here; Deepgram is shown as one example
from livekit.plugins.deepgram import STT


async def _forward_transcription(
    stt_stream: stt.SpeechStream,
    stt_forwarder: transcription.STTSegmentsForwarder,
):
    """Forward the transcription to the client and log the transcript in the console"""
    async for ev in stt_stream:
        # Relay every interim and final speech event to the client
        stt_forwarder.update(ev)
        if ev.type == stt.SpeechEventType.INTERIM_TRANSCRIPT:
            print(ev.alternatives[0].text, end="")
        elif ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
            print("\n")
            print(" -> ", ev.alternatives[0].text)


async def entrypoint(job: JobContext):
    # Named stt_impl to avoid shadowing the stt module imported above
    stt_impl = STT()
    tasks = []

    async def transcribe_track(participant: rtc.RemoteParticipant, track: rtc.Track):
        audio_stream = rtc.AudioStream(track)
        stt_forwarder = transcription.STTSegmentsForwarder(
            room=job.room, participant=participant, track=track
        )
        stt_stream = stt_impl.stream()
        stt_task = asyncio.create_task(
            _forward_transcription(stt_stream, stt_forwarder)
        )
        tasks.append(stt_task)

        # Feed incoming audio frames into the STT stream
        async for ev in audio_stream:
            stt_stream.push_frame(ev.frame)

    @job.room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.TrackPublication,
        participant: rtc.RemoteParticipant,
    ):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            tasks.append(asyncio.create_task(transcribe_track(participant, track)))

    # Connect after registering the handler so no audio tracks are missed
    await job.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)