Speechify TTS integration guide

How to use the Speechify TTS plugin for LiveKit Agents.

Overview

Speechify provides an ultra-low-latency, human-quality, and affordable text-to-speech (TTS) API with voice cloning features. You can use the Speechify TTS plugin for LiveKit Agents to build high-quality voice AI applications.

Quick reference

This section includes a brief overview of the Speechify TTS plugin. For more information, see Additional resources.

Installation

Install the plugin from PyPI:

pip install "livekit-agents[speechify]~=1.0"

Authentication

The Speechify plugin requires a Speechify API key.

Set SPEECHIFY_API_KEY in your .env file.
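
As a minimal sketch, an application might confirm that the key is present before starting a session, so that a missing key fails early with a clear message. The helper name require_speechify_key is hypothetical and not part of the plugin. It assumes the .env file has already been loaded into the process environment by whatever mechanism your project uses.

```python
import os
from typing import Mapping, Optional


def require_speechify_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return SPEECHIFY_API_KEY, or raise if it is missing or blank.

    Hypothetical helper for illustration; not part of the Speechify plugin.
    """
    # Fall back to the real process environment when no mapping is passed in.
    source = os.environ if env is None else env
    key = source.get("SPEECHIFY_API_KEY", "").strip()
    if not key:
        raise RuntimeError("SPEECHIFY_API_KEY is not set; add it to your .env file")
    return key
```

Passing a mapping explicitly keeps the check easy to exercise in tests without touching the real environment.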

Usage

Use Speechify TTS within an AgentSession or as a standalone speech generator. For example, you can use this TTS in the Voice AI quickstart.

from livekit.agents import AgentSession
from livekit.plugins import speechify

session = AgentSession(
    tts=speechify.TTS(
        model="simba-english",
        voice_id="jack",
    ),
    # ... llm, stt, etc.
)

Parameters

This section describes some of the available parameters. See the plugin reference for a complete list of all available parameters.

voice_id (string, required, default: jack)

ID of the voice to use for synthesizing speech. To list the available voices, refer to the list_voices() method in the plugin reference.

model (string, optional)

ID of the model to use for generation. Use simba-english or simba-multilingual. To learn more, see supported models.

language (string, optional)

Language of the input text in ISO 639-1 format. See the supported languages.

encoding (string, optional, default: wav_48000)

Audio encoding to use. Choose between wav_48000, mp3_24000, ogg_24000 or aac_24000.
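
The encoding names above appear to follow a container_samplerate pattern. The sketch below relies on that assumption to validate a value and split it into its two parts before it reaches the plugin. The parse_encoding helper and the SUPPORTED_ENCODINGS tuple are hypothetical names for illustration, not part of the plugin.

```python
from typing import Tuple

# The four values listed in this guide.
SUPPORTED_ENCODINGS = ("wav_48000", "mp3_24000", "ogg_24000", "aac_24000")


def parse_encoding(encoding: str) -> Tuple[str, int]:
    """Split an encoding such as "wav_48000" into ("wav", 48000).

    Assumes the name has the form <container>_<sample rate in Hz>.
    """
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(
            f"unsupported encoding {encoding!r}; choose one of {SUPPORTED_ENCODINGS}"
        )
    container, sample_rate = encoding.split("_")
    return container, int(sample_rate)
```

Rejecting unknown values up front turns a typo into an immediate local error rather than a failed API request.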

loudness_normalization (boolean, optional)

Determines whether to normalize the audio loudness to a standard level. When enabled, loudness normalization aligns the audio output to the following standards: integrated loudness of -14 LUFS, true peak of -2 dBTP, and loudness range of 7 LU.

If disabled, the audio loudness matches the original loudness of the selected voice, which can vary significantly and be either too quiet or too loud.

Enabling loudness normalization can increase latency because of the additional processing required for audio level adjustments.

text_normalization (boolean, optional)

Determines whether to normalize the text. If enabled, it will transform numbers, dates, etc. into words. For example, "55" is normalized into "fifty five". This can increase latency due to additional processing required for text normalization.
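
Both normalization options trade latency for more consistent output. The sketch below gathers the parameters described in this section into keyword arguments and switches both normalization steps off when latency matters most. The build_tts_options function and its low_latency flag are hypothetical. The dictionary keys are the parameter names listed above, and the sketch assumes the TTS constructor accepts them as keyword arguments in the same way as model and voice_id in the usage example.

```python
from typing import Any, Dict


def build_tts_options(low_latency: bool) -> Dict[str, Any]:
    """Collect Speechify TTS keyword arguments for a given latency preference.

    Hypothetical helper for illustration; not part of the Speechify plugin.
    """
    return {
        "model": "simba-english",
        "voice_id": "jack",
        "encoding": "wav_48000",
        # Each normalization step adds processing, so skip both for low latency.
        "loudness_normalization": not low_latency,
        "text_normalization": not low_latency,
    }


# Assumed usage: speechify.TTS(**build_tts_options(low_latency=True))
```

If you disable text normalization, consider spelling out numbers and dates in the text you send, since they are no longer expanded into words.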

Customizing pronunciation

Speechify supports custom pronunciation with Speech Synthesis Markup Language (SSML), an XML-based markup language that gives you granular control over speech output. With SSML, you can leverage XML tags to craft audio content that delivers a more natural and engaging listening experience. To learn more, see SSML.
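
As a minimal sketch of building SSML input, the helper below wraps text in a speak element and replaces one written term with a spoken alias. The speak and sub elements come from the W3C SSML specification. Whether Speechify honors the sub element is an assumption to check against its SSML documentation. The ssml_with_alias function is a hypothetical name for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_with_alias(before: str, written: str, spoken: str, after: str) -> str:
    """Build an SSML string that reads `written` aloud as `spoken`.

    Plain text is XML-escaped so characters such as & and < stay valid.
    """
    return (
        "<speak>"
        + escape(before)
        + "<sub alias=" + quoteattr(spoken) + ">" + escape(written) + "</sub>"
        + escape(after)
        + "</speak>"
    )


ssml = ssml_with_alias(
    "Standards from the ", "W3C", "World Wide Web Consortium", " & others."
)
```

Escaping matters because SSML is XML: an unescaped ampersand or angle bracket in the input text produces an invalid document.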

Additional resources

The following resources provide more information about using Speechify with LiveKit Agents.