microsoft/speecht5_tts · How to change voice character [ANSWERED]

Hey y'all, Micah here. A lot of people are asking this question, so I figured I'd share the answer I came up with. To change the voice character of this TTS model, you have to change the speaker embeddings. In my work I used SpeechBrain's spkrec-xvect-voxceleb model to calculate the embeddings, after first running the source audio through the audio_denoiser package.

The code for the above looks like this:

from transformers import pipeline
import os
import soundfile as sf
import torch
import torchaudio
from audio_denoiser.AudioDenoiser import AudioDenoiser
from speechbrain.inference.speaker import EncoderClassifier

# device and denoiser must exist before embed() is called.
# This setup is an assumption (it wasn't part of the snippet), so adjust as needed.
device = "cuda" if torch.cuda.is_available() else "cpu"
denoiser = AudioDenoiser(device=torch.device(device))

def embed(source, target):
  # First, denoise audio. (Optional but improves quality)
  signal, fs = torchaudio.load(source)
  auto_scale = True # Recommended for low-volume input audio
  # Pass the file's real sample rate instead of hard-coding 16000.
  signal = denoiser.process_waveform(waveform=signal, sample_rate=fs, auto_scale=auto_scale)

  # Calculate speech embeddings.
  classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb", savedir="pretrained_models/spkrec-xvect-voxceleb", run_opts={"device":device})
  embeddings = classifier.encode_batch(signal)

  # For mono input, encode_batch returns shape [1, 1, 512]; normalize it and drop the middle dimension.
  embeddings = torch.nn.functional.normalize(embeddings, dim=-1).squeeze(1).cpu() # [1, 1, 512] -> [1, 512]

  # Write embeddings to a file.
  print(embeddings.size())
  # Resolve the final path before opening the file, so the returned path is where the data actually went.
  if not ("Voice Embeddings" in target):
    target = "./Voice Embeddings/" + target
  os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
  with open(target, "wb") as f:
    torch.save(embeddings, f)
  return target

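def GetEmbedding(location):
  # Loads the saved embedding; map_location keeps this working on CPU-only machines.
  with open(location, "rb") as f:
    # squeeze(1) is a no-op on a [1, 512] tensor but also handles a [1, 1, 512] one.
    return torch.load(f, map_location="cpu").squeeze(1)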

# Making an embedding:
embed("whatever.wav", "whatever.bin") # target can have any extension; I just use .bin
# Note that the audio must be 16 kHz mono in WAV format. I use ffmpeg in my project to convert the audio.

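# Then, when you go to generate speech:
# (Text and PATH_TO_EMBEDDING are placeholders for your input string and the saved embedding file.)
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
speech = synthesiser(Text, forward_params={"speaker_embeddings": GetEmbedding(PATH_TO_EMBEDDING)})
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"]) # save the generated audio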
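
Since the 16 kHz mono requirement is easy to miss, here's a rough sketch of how the check and conversion could look. It is not the exact code from the project: the helper names (is_16k_mono_wav, build_ffmpeg_command, convert_to_16k_mono) are made up for illustration, and it assumes ffmpeg is installed and on your PATH.

```python
import subprocess
import wave

def is_16k_mono_wav(path):
    """Return True if path is a PCM WAV file that is already 16 kHz mono."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getframerate() == 16000 and wav.getnchannels() == 1
    except (wave.Error, EOFError):
        # Not a PCM WAV file that the stdlib reader understands.
        return False

def build_ffmpeg_command(source, target):
    """Build an ffmpeg call that downmixes to mono (-ac 1) and resamples to 16 kHz (-ar 16000)."""
    return ["ffmpeg", "-y", "-i", source, "-ac", "1", "-ar", "16000", target]

def convert_to_16k_mono(source, target):
    """Return a path to 16 kHz mono audio, converting with ffmpeg only when needed."""
    if is_16k_mono_wav(source):
        return source
    subprocess.run(build_ffmpeg_command(source, target), check=True)
    return target
```

The returned path can then be passed to embed() as the source. The -y flag overwrites an existing target file, so drop it if that is not what you want.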

You can find the relevant source code for my bot's TTS component here
Keep in mind that it's a toy project, so the code isn't super organized or easy to read :)

Here's a demo:
I used this sample audio of Joe Biden speaking:

And generated these:

I include the second sample so you can see that the model still suffers when generating longer passages of voiced text. This is especially apparent with slower speakers, like Joe Biden.
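
One workaround that may help with longer passages is to split the text at sentence boundaries and synthesise each piece separately. This is only a minimal sketch of that idea, not something from the code above: split_into_chunks is a hypothetical helper, and max_chars=200 is an arbitrary starting value rather than a tuned one.

```python
import re

def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then go through synthesiser(...) with the same speaker embedding, and the resulting speech["audio"] arrays joined with numpy.concatenate before writing the file. How much this helps will depend on the speaker and the text.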
Here's the same text read by a faster speaker:

If you need help, feel free to drop me a line on my Discord, micahb.dev!