In this project, I built a simple voice cloning pipeline that generates realistic speech from text using a reference speaker's voice. The core idea is to condition a text-to-speech (TTS) model on a custom speaker embedding extracted by my own LSTM-based speaker encoder trained with GE2E loss.
This project combines:
- A custom speaker encoder for extracting speaker embeddings from reference audio
- Tacotron2 for converting text into mel-spectrograms
- WaveGlow as the vocoder to convert mel-spectrograms into raw audio
1. Custom Speaker Encoder (LSTM + GE2E)
I trained a speaker encoder using an LSTM model with GE2E loss to learn speaker-discriminative embeddings. The encoder takes in MFCC features extracted from a reference .wav file and outputs a fixed-dimensional embedding vector representing the speaker's identity.
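Below is a minimal sketch of what such an encoder can look like. The layer sizes, MFCC settings, and the embed_utterance helper are illustrative assumptions rather than the exact configuration of my trained model:

import torch
import torch.nn as nn
import torchaudio

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mfcc=40, hidden_size=256, embedding_dim=256, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden_size, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_size, embedding_dim)

    def forward(self, mfcc):
        # mfcc: (batch, time, n_mfcc)
        _, (hidden, _) = self.lstm(mfcc)
        embedding = self.proj(hidden[-1])  # final hidden state of the last LSTM layer
        return embedding / embedding.norm(dim=1, keepdim=True)  # L2-normalised, as GE2E expects

def embed_utterance(encoder, wav_path, sample_rate=16000, n_mfcc=40):
    # Load a reference .wav, extract MFCCs, and return a (1, embedding_dim) speaker embedding
    waveform, sr = torchaudio.load(wav_path)
    if waveform.size(0) > 1:  # mix down to mono if needed
        waveform = waveform.mean(dim=0, keepdim=True)
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=n_mfcc)(waveform)
    with torch.no_grad():
        return encoder(mfcc.transpose(1, 2))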
2. Tacotron2 (Text → Mel-Spectrogram)
NVIDIA's Tacotron2 was used as the backbone for converting text into mel-spectrograms. I modified its inference method to accept a custom speaker embedding, enabling it to mimic the reference speaker's voice.
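One straightforward way to condition the decoder, sketched below as an illustration rather than the exact code from my fork, is to project the speaker embedding to the encoder's output dimension and add it to every encoder time step. The sketch follows the structure of NVIDIA's original inference method; the speaker_proj layer is an assumed addition to the model:

def inference(self, inputs, speaker_embedding):
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder.inference(embedded_inputs)

    # Broadcast the projected speaker embedding across all encoder time steps
    speaker = self.speaker_proj(speaker_embedding)            # (batch, encoder_dim), assumed nn.Linear
    encoder_outputs = encoder_outputs + speaker.unsqueeze(1)  # (batch, time, encoder_dim)

    mel_outputs, gate_outputs, alignments = self.decoder.inference(encoder_outputs)
    mel_outputs_postnet = mel_outputs + self.postnet(mel_outputs)
    return self.parse_output([mel_outputs, mel_outputs_postnet, gate_outputs, alignments])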
3. WaveGlow Vocoder
WaveGlow takes the mel-spectrogram from Tacotron2 and generates the final speech waveform. I used the pretrained waveglow_256channels_ljs_v3.pt model provided by NVIDIA.
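Loading the checkpoint follows NVIDIA's own inference scripts; the file and repo paths below are assumptions about the local setup:

import sys
import torch

sys.path.append("waveglow/")     # the checkpoint pickles classes from the WaveGlow repo
from denoiser import Denoiser    # denoiser.py ships with the WaveGlow repo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Path to the pretrained checkpoint is assumed; point it at your download location
waveglow = torch.load("waveglow_256channels_ljs_v3.pt", map_location=device)["model"]
waveglow = waveglow.remove_weightnorm(waveglow)  # as in NVIDIA's inference script
waveglow.to(device).eval()

denoiser = Denoiser(waveglow).to(device)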
Here's what happens inside clean_audio() in my script:
import torch
import soundfile as sf
from text import text_to_sequence  # text module from NVIDIA's Tacotron2 repo

def clean_audio():
    # encoder, tacotron2, waveglow, denoiser and device are loaded earlier in the script
    reference_audio = "clean.wav"
    text = "Hello, how are you?"
    speaker_embedding = encoder(reference_audio)
    # Convert the input text to a sequence of symbol IDs
    text_sequence = text_to_sequence(text, ['english_cleaners'])
    text_sequence = torch.LongTensor(text_sequence).unsqueeze(0).to(device)
    # Speaker-conditioned Tacotron2 inference: text + embedding -> mel-spectrogram
    mel_outputs, mel_postnet, _, _ = tacotron2.inference(text_sequence, speaker_embedding)
    mel_spectrogram = mel_postnet
    # WaveGlow vocoder: mel-spectrogram -> waveform, followed by light denoising
    audio = waveglow.infer(mel_spectrogram, sigma=0.8)
    audio = denoiser(audio, 0.01).squeeze(1).cpu().numpy()
    sf.write("output.wav", audio[0], 22050)
    print("Output saved to output.wav")
You can set this up by installing the required models and dependencies:
pip install torch torchaudio soundfile
pip install git+https://github.com/NVIDIA/tacotron2.git
pip install git+https://github.com/NVIDIA/waveglow.git
Download: the pretrained waveglow_256channels_ljs_v3.pt checkpoint from NVIDIA (the speaker encoder weights are linked at the end of this post).
Result
The output audio in output.wav speaks the input sentence in the voice of the reference audio used to generate the speaker embedding. This lets us clone any speaker's voice as long as we have a short audio sample of them.
Conclusion
This project demonstrates how to build a modular, customizable voice cloning system using deep learning. By separating speaker identity extraction from the text-to-speech generation pipeline, we can build flexible architectures that allow speaker adaptation with just a single audio clip.
If you're interested in trying this out or contributing, feel free to check out my GitHub repo: https://github.com/jeevvanth/VoiceCloneTacoGlow_with_custom_speaker_embedding.git
If you're interested in my custom speaker encoder, you can access the .pt checkpoint here:
Drive: https://drive.google.com/drive/folders/1P9nHBmyAjbdc1dV-Wx6slehkKLkwLYz9?usp=drive_link