
Custom voices

Create voice clones from audio samples and use them with supported TTS providers in LiveKit Inference.

Overview

The custom voices feature lets you create a voice clone from a short audio clip. Upload or record a sample, and LiveKit clones it to all supported TTS providers on your plan. You can then use the clone in your agent sessions with any of those providers.

Custom voices are available on paid LiveKit Cloud plans. You create and manage voice clones in the LiveKit Cloud dashboard. Once created, use them in your agent code with LiveKit Inference. No separate provider API keys are required.

Supported providers

Voice clones are automatically created on the following TTS providers:

| Provider | Noise removal | Plan required |
|----------|---------------|---------------|
| Cartesia | Yes (audio enhancement) | Ship or higher |
| Inworld | Yes | Ship or higher |

When you create a voice clone, LiveKit clones it to all providers your plan supports. You can then use the voice with any TTS model from those providers. For complete limits by plan, see Quotas and limits.

Supported languages

The following languages are supported for voice clone input audio. Select the language that matches the speech in your audio sample:

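| Code | Language | Code | Language |
|------|----------|------|----------|
| en | English | hi | Hindi |
| fr | French | it | Italian |
| de | German | ko | Korean |
| es | Spanish | nl | Dutch |
| pt | Portuguese | pl | Polish |
| zh | Chinese | ru | Russian |
| ja | Japanese | ar | Arabic |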
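
If you keep voice clone metadata in your own tooling, a small lookup can catch unsupported languages before you open the dashboard. The sketch below is illustrative only: `SUPPORTED_LANGUAGES` mirrors the table above, while `normalize_language` and its handling of locale-style input such as `en-US` are assumptions, not part of any LiveKit API.

```python
# Language codes copied from the supported languages table.
SUPPORTED_LANGUAGES = {
    "en": "English", "fr": "French", "de": "German", "es": "Spanish",
    "pt": "Portuguese", "zh": "Chinese", "ja": "Japanese", "hi": "Hindi",
    "it": "Italian", "ko": "Korean", "nl": "Dutch", "pl": "Polish",
    "ru": "Russian", "ar": "Arabic",
}


def normalize_language(code: str) -> str:
    """Return the bare language code, or raise ValueError if unsupported.

    Hypothetical helper. Accepting locale-style input ("en-US", "pt_BR")
    is an assumption made for convenience; the dashboard only asks you
    to pick a language.
    """
    base = code.strip().lower().replace("_", "-").split("-")[0]
    if base not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported voice clone language: {code!r}")
    return base


print(normalize_language("en-US"))  # en
```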

Create a voice clone

Create voice clones from the LiveKit Cloud dashboard:

  1. Open your project in the dashboard.
  2. Navigate to Voices > Custom voices in the sidebar.
  3. Click Create voice clone.
  4. Choose Upload file or Record audio:
    • Upload: Drag and drop or browse for an audio file. Supported formats: MP3, WAV, OGG, or WEBM. Maximum file size: 4 MB.
    • Record: Click Start recording and speak clearly for about 10 seconds. A sample script is provided in the dialog.
  5. Optionally trim the audio using the waveform trimmer.
  6. Enter a voice name and select the language spoken in the audio.
  7. Optionally enable Remove background noise if your audio has ambient noise. This may slightly affect voice quality.
  8. Click Upload and clone voice.
  9. Review the consent items, check I provide consent to the above items, and click Continue.

The voice is cloned to all supported providers in parallel. Processing typically takes under a minute. Once ready, the clone appears in your voices list with its status and a unique voice ID (for example, v_RT5PsNhXvMaB).

Tip

For tips on getting the best results, see Audio requirements.

Use a voice clone

Once a voice clone is ready, use its voice ID (the v_* identifier) in your agent session, just like any other voice. LiveKit Inference automatically routes the request to the correct provider.

In Agent Builder, open the Models & Voice tab, set voice type to Custom, and pick your cloned voice and TTS model.

Python:

    from livekit.agents import AgentSession, inference

    session = AgentSession(
        tts=inference.TTS(
            model="cartesia/sonic-3",
            voice="v_RT5PsNhXvMaB",
        ),
        # ... llm, stt, etc.
    )

Node.js:

    import { AgentSession, inference } from '@livekit/agents';

    const session = new AgentSession({
      tts: new inference.TTS({
        model: "cartesia/sonic-3",
        voice: "v_RT5PsNhXvMaB",
      }),
      // ... llm, stt, etc.
    });

You can also use the string descriptor shortcut:

Python:

    from livekit.agents import AgentSession

    session = AgentSession(
        tts="cartesia/sonic-3:v_RT5PsNhXvMaB",
        # ... llm, stt, etc.
    )

Node.js:

    import { AgentSession } from '@livekit/agents';

    const session = new AgentSession({
      tts: "cartesia/sonic-3:v_RT5PsNhXvMaB",
      // ... llm, stt, etc.
    });
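
The descriptor is the model name and the voice ID joined by a colon. If you assemble it from configuration values, a pair of small helpers keeps the format in one place. This is a minimal sketch: `tts_descriptor` and `split_descriptor` are hypothetical names, and the checks assume only what the examples above show (a `provider/model` name and a voice ID starting with `v_`).

```python
from typing import Tuple


def tts_descriptor(model: str, voice_id: str) -> str:
    """Build a "<provider>/<model>:<voice>" descriptor string."""
    if "/" not in model:
        raise ValueError(f"expected '<provider>/<model>', got {model!r}")
    if not voice_id.startswith("v_"):
        raise ValueError(f"custom voice IDs start with 'v_', got {voice_id!r}")
    return f"{model}:{voice_id}"


def split_descriptor(descriptor: str) -> Tuple[str, str]:
    """Split a descriptor back into (model, voice).

    Splits on the last colon, assuming voice IDs never contain one.
    """
    model, sep, voice = descriptor.rpartition(":")
    if not sep or not model or not voice:
        raise ValueError(f"expected '<model>:<voice>', got {descriptor!r}")
    return model, voice


print(tts_descriptor("cartesia/sonic-3", "v_RT5PsNhXvMaB"))
# cartesia/sonic-3:v_RT5PsNhXvMaB
```

The resulting string can be passed as the `tts` argument in the same way as the literal descriptor shown above.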

Using different TTS models

A voice clone works with any TTS model from a provider it was cloned to. For example, if your voice was cloned to both Cartesia and Inworld, you can switch between models:

Python:

    from livekit.agents import inference

    # Use with Cartesia
    tts_cartesia = inference.TTS(
        model="cartesia/sonic-3",
        voice="v_RT5PsNhXvMaB",
    )

    # Use with Inworld
    tts_inworld = inference.TTS(
        model="inworld/inworld-tts-1.5-max",
        voice="v_RT5PsNhXvMaB",
    )

Node.js:

    import { inference } from '@livekit/agents';

    // Use with Cartesia
    const ttsCartesia = new inference.TTS({
      model: "cartesia/sonic-3",
      voice: "v_RT5PsNhXvMaB",
    });

    // Use with Inworld
    const ttsInworld = new inference.TTS({
      model: "inworld/inworld-tts-1.5-max",
      voice: "v_RT5PsNhXvMaB",
    });

LiveKit Inference automatically resolves the voice ID to the correct provider-specific voice for the model you selected.

Automatic fallback

Because each voice is cloned to every provider your plan supports, LiveKit Inference automatically falls back to another provider if the primary one is unavailable. No configuration is required. If the provider for the model you selected fails, the session continues on another provider where the clone is ready. Each provider's clone is created from the same audio sample, so the voice stays recognizable across providers. The output isn't identical, though, because each provider's model has its own characteristics.
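
To make the fallback rule concrete, the sketch below models the selection described above: prefer the provider of the selected model, otherwise use another provider where the clone is ready. It is a conceptual illustration only. LiveKit Inference performs this routing for you, and `pick_provider`, its arguments, and the tie-breaking order are assumptions rather than the actual implementation.

```python
from typing import Dict, Optional, Set


def pick_provider(
    primary: str,
    clone_ready: Dict[str, bool],
    available: Set[str],
) -> Optional[str]:
    """Return the provider to synthesize with, preferring `primary`.

    clone_ready: provider name -> whether the clone is ready there.
    available:   providers that are currently reachable.
    Hypothetical model of the behavior, not LiveKit's routing code.
    """

    def usable(provider: str) -> bool:
        return clone_ready.get(provider, False) and provider in available

    if usable(primary):
        return primary
    # Fall back to the first other provider (in dict order) that has a
    # ready clone and is reachable. The ordering is an assumption.
    for provider in clone_ready:
        if usable(provider):
            return provider
    return None


ready = {"cartesia": True, "inworld": True}
print(pick_provider("cartesia", ready, available={"inworld"}))  # inworld
```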

Manage voice clones

You can preview, re-clone, and delete voice clones from the voice detail page in the dashboard. Voice clones are scoped to a single project and can't be shared across projects.

Preview a clone

On the voice detail page, select a TTS model and enter custom text to hear how the clone sounds. Each provider's model produces a slightly different rendition of the same voice, so use the preview to pick the one you prefer.

Voice status

Each voice clone has an overall status and per-provider status:

| Status | Description |
|--------|-------------|
| Active | Voice is ready and available for use. |
| Processing | Voice is being cloned. This typically takes under a minute. |
| Partial | Voice is ready on some providers but failed on others. The voice is still usable with the providers where it succeeded. |
| Failed | Cloning failed on all providers. |
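
The relationship between the per-provider statuses and the overall status can be sketched as follows. This is an illustration under stated assumptions: the `Status` enum and both functions are hypothetical, and treating any in-progress provider as an overall Processing state is a guess about precedence that the table does not spell out.

```python
from enum import Enum
from typing import Dict, List


class Status(str, Enum):
    ACTIVE = "Active"
    PROCESSING = "Processing"
    PARTIAL = "Partial"
    FAILED = "Failed"


def overall_status(per_provider: Dict[str, Status]) -> Status:
    """Derive the overall status from per-provider statuses."""
    states = set(per_provider.values())
    if not states:
        raise ValueError("no providers to summarize")
    if Status.PROCESSING in states:
        return Status.PROCESSING  # assumption: still cloning somewhere
    if states == {Status.ACTIVE}:
        return Status.ACTIVE
    if states == {Status.FAILED}:
        return Status.FAILED
    return Status.PARTIAL  # ready on some providers, failed on others


def usable_providers(per_provider: Dict[str, Status]) -> List[str]:
    """Providers where the clone succeeded and can be used."""
    return [p for p, s in per_provider.items() if s is Status.ACTIVE]


statuses = {"cartesia": Status.ACTIVE, "inworld": Status.FAILED}
print(overall_status(statuses).value)  # Partial
print(usable_providers(statuses))      # ['cartesia']
```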

Re-clone a voice

If a voice failed to clone on a specific provider, or if new providers become available, you can re-clone the voice. On the voice detail page in the dashboard, open the provider menu and select Re-clone voice with provider.

Delete a clone

To delete a voice clone, open the voice detail page in the dashboard and click Delete voice clone. This permanently removes the voice from all TTS providers and cannot be undone.

Audio requirements

For the best results when creating a voice clone:

  • Duration: About 10 seconds of speech. The audio trimmer in the dashboard lets you adjust the selection.
  • Quality: Use a clear recording with minimal background noise. A quiet room or headset microphone works well.
  • Content: Speak naturally with your normal pace and intonation. Avoid whispering or exaggerated expression.
  • Format: MP3, WAV, OGG, or WEBM. Maximum file size: 4 MB.

Note

The Remove background noise option can help with noisy recordings, but may slightly alter the voice characteristics. For the best results, start with a clean recording.
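
The format and size limits are easy to check locally before uploading. The following is a minimal sketch, not an official validator: `check_sample` and `check_sample_file` are hypothetical helpers, the check looks only at the file extension rather than the actual audio container, and reading "4 MB" as 4 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path
from typing import List

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".ogg", ".webm"}
MAX_BYTES = 4 * 1024 * 1024  # assumption: "4 MB" interpreted as 4 MiB


def check_sample(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks OK."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format {suffix or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems


def check_sample_file(path: str) -> List[str]:
    """Same check, reading the name and size from a file on disk."""
    p = Path(path)
    return check_sample(p.name, p.stat().st_size)


print(check_sample("sample.WAV", 1_000_000))  # []
```

Duration and recording quality are not covered here. The dashboard trimmer and preview remain the practical way to judge those.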

Audio retention

LiveKit stores your audio sample so the voice can be re-cloned to new providers and models as they're added. Recordings are deleted 12 months after the voice clone was last used, or immediately if you delete the clone.
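
As a worked example of the retention window, the sketch below estimates when a stored recording would become eligible for deletion. It assumes "12 months" means calendar months counted from the last-use date, with the day clamped to the end of shorter months. The source does not define this precisely, so treat the result as an approximation; `add_months` and `scheduled_deletion` are hypothetical helpers.

```python
import calendar
from datetime import date

RETENTION_MONTHS = 12  # from the retention policy described above


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def scheduled_deletion(last_used: date) -> date:
    """Approximate date after which the audio sample is deleted."""
    return add_months(last_used, RETENTION_MONTHS)


print(scheduled_deletion(date(2025, 3, 15)))  # 2026-03-15
print(scheduled_deletion(date(2024, 2, 29)))  # 2025-02-28
```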

Billing and limits

Creating a voice clone is free. Synthesis is billed at the standard LiveKit Inference TTS rate, the same as any other voice. The number of voice clones you can create and the available providers depend on your plan. For details on limits by plan, see Quotas and limits.
