Voice Cloning and a Custom Speaker Encoder with Tacotron2 and WaveGlow | by Jeevanthbheeman | Jul, 2025

In this project, I built a simple voice cloning pipeline that generates realistic speech from text using a reference speaker’s voice. The core idea is to condition a text-to-speech (TTS) model on a custom speaker embedding, extracted with my own LSTM-based speaker encoder trained with GE2E loss.

This project combines:

A custom Speaker Encoder for extracting speaker embeddings from reference audio
Tacotron2 for converting text into mel-spectrograms
WaveGlow as the vocoder to convert mel-spectrograms into raw audio

1. Custom Speaker Encoder (LSTM + GE2E)

I trained a speaker encoder using an LSTM model with GE2E loss to learn speaker-discriminative embeddings. The encoder takes MFCC features extracted from a reference .wav file and outputs a fixed-dimensional embedding vector that represents the speaker’s identity.
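To make the GE2E idea concrete, here is a minimal NumPy sketch of the softmax variant of the loss. It is not the training code from this project: the function names, the toy batch, and the fixed values of w and b (learnable scalars in real training) are assumptions for illustration. Each utterance embedding is compared with every speaker centroid, and the loss is low when it is closest to its own speaker's centroid.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
    """Mean GE2E softmax loss for embeddings shaped (speakers, utterances, dims).

    w and b are learnable scalars during training; they are fixed here
    (assumption) to keep the sketch short. Needs at least 2 utterances per speaker.
    """
    e = l2_normalize(np.asarray(embeddings, dtype=np.float64))
    n_spk, n_utt, _ = e.shape

    # One centroid per speaker, then cosine similarity of every utterance
    # to every centroid: sim[j, i, k] = cos(e[j, i], centroid[k]).
    centroids = l2_normalize(e.mean(axis=1))
    sim = np.einsum("jid,kd->jik", e, centroids)

    # For an utterance's own speaker, use a centroid that leaves that
    # utterance out, so it cannot trivially match itself.
    exclusive = l2_normalize((e.sum(axis=1, keepdims=True) - e) / (n_utt - 1))
    own_sim = np.sum(e * exclusive, axis=-1)
    for j in range(n_spk):
        sim[j, :, j] = own_sim[j]

    # Softmax cross-entropy over speakers, with the true speaker as target.
    logits = w * sim + b
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    spk = np.arange(n_spk)[:, None]
    utt = np.arange(n_utt)[None, :]
    return float(-log_prob[spk, utt, spk].mean())


# Toy batch: 3 speakers x 4 utterances x 8 dims, tightly clustered per speaker.
rng = np.random.default_rng(0)
clustered = np.eye(3, 8)[:, None, :] + 0.01 * rng.normal(size=(3, 4, 8))
# Same vectors, regrouped so every "speaker" mixes several clusters.
order = np.arange(12).reshape(4, 3).T.reshape(-1)
mixed = clustered.reshape(12, 8)[order].reshape(3, 4, 8)

print("clustered loss:", ge2e_softmax_loss(clustered))
print("mixed loss:    ", ge2e_softmax_loss(mixed))
```

The clustered batch should score a much lower loss than the mixed one, which is the behavior the encoder is trained toward.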

2. Tacotron2 (Text → Mel-Spectrogram)

NVIDIA’s Tacotron2 was used as the backbone for converting text into mel-spectrograms. I modified its inference method to accept a custom speaker embedding, enabling it to mimic the reference speaker’s voice.
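This post does not show how the inference method was changed. One common way to condition a Tacotron2-style model is to repeat the speaker embedding at every encoder time step and concatenate it with the encoder outputs before attention. The sketch below illustrates only that step, with NumPy arrays standing in for tensors. The function name and the dimensions are assumptions, and this may differ from what the repository actually does.

```python
import numpy as np


def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Append the same speaker vector to every encoder time step.

    encoder_outputs:   (time_steps, encoder_dim)
    speaker_embedding: (speaker_dim,)
    returns:           (time_steps, encoder_dim + speaker_dim)
    """
    time_steps = encoder_outputs.shape[0]
    tiled = np.tile(speaker_embedding[None, :], (time_steps, 1))
    return np.concatenate([encoder_outputs, tiled], axis=-1)


# Illustrative sizes only: 12 text positions, 512-dim encoder states,
# 256-dim speaker embedding.
encoder_outputs = np.zeros((12, 512))
speaker_embedding = np.ones(256)
conditioned = condition_on_speaker(encoder_outputs, speaker_embedding)
print(conditioned.shape)  # (12, 768)
```

With this approach, the attention and decoder layers that read the encoder outputs have to be sized for the wider input.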

3. WaveGlow Vocoder

WaveGlow takes the mel-spectrogram from Tacotron2 and generates the final speech waveform. I used the pretrained waveglow_256channels_ljs_v3.pt model provided by NVIDIA.
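A quick way to sanity-check the vocoder output is to relate mel frames to audio length. The sketch below assumes the LJSpeech-style configuration, where each mel frame covers 256 audio samples, and the 22050 Hz rate that the script uses when writing output.wav. The helper names are hypothetical.

```python
HOP_LENGTH = 256      # assumed: audio samples per mel frame (LJSpeech-style config)
SAMPLE_RATE = 22050   # Hz, the rate used when output.wav is written


def mel_frames_to_samples(n_frames, hop_length=HOP_LENGTH):
    """Number of audio samples the vocoder is expected to produce."""
    return n_frames * hop_length


def mel_frames_to_seconds(n_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Expected duration of the generated audio in seconds."""
    return mel_frames_to_samples(n_frames, hop_length) / sample_rate


print(mel_frames_to_samples(100))              # 25600
print(round(mel_frames_to_seconds(100), 3))    # 1.161
```

If the saved file is much shorter or longer than this estimate, the mel-spectrogram and the vocoder are probably using different hop lengths or sample rates.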

Here’s what happens inside clean_audio() in my script:

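import torch
import soundfile as sf
from text import text_to_sequence  # assumed: the text package from NVIDIA's tacotron2 repo

# encoder, tacotron2, waveglow, denoiser and device are created elsewhere in the script.
def clean_audio():
    reference_audio = "clear.wav"
    text = "Hello, how are you?"
    with torch.no_grad():  # inference only, so no gradients are needed
        speaker_embedding = encoder(reference_audio)  # embedding of the reference voice
        text_sequence = text_to_sequence(text, ['english_cleaners'])
        text_sequence = torch.LongTensor(text_sequence).unsqueeze(0).to(device)
        # Modified inference call that also takes the speaker embedding
        mel_outputs, mel_postnet, _, _ = tacotron2.inference(text_sequence, speaker_embedding)
        mel_spectrogram = mel_postnet
        audio = waveglow.infer(mel_spectrogram, sigma=0.8)
        audio = denoiser(audio, 0.01).squeeze(1).cpu().numpy()
    sf.write("output.wav", audio[0], 22050)
    print("Output saved to output.wav")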

You can set this up by installing the required models and dependencies:

pip install torch torchaudio soundfile
pip install git+https://github.com/NVIDIA/tacotron2.git
pip install git+https://github.com/NVIDIA/waveglow.git

Download:

Result

The output audio in output.wav speaks the input sentence in the voice of the reference audio used to generate the speaker embedding. This allows us to clone any speaker’s voice as long as we have a short audio sample of them.
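One way to check the result beyond listening is to run both the reference clip and the generated clip through the speaker encoder and compare the two embeddings with cosine similarity. The sketch below shows only the comparison. The vectors are made-up placeholders, not measurements from this project, and the function name is an assumption.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


# Placeholder embeddings (hypothetical values, 4 dims for readability).
reference_embedding = np.array([0.2, 0.9, 0.1, 0.4])
cloned_embedding = np.array([0.25, 0.85, 0.05, 0.45])
other_embedding = np.array([0.9, -0.1, 0.4, -0.2])

print("reference vs cloned:", cosine_similarity(reference_embedding, cloned_embedding))
print("reference vs other: ", cosine_similarity(reference_embedding, other_embedding))
```

A cloned clip that lands closer to the reference than to unrelated speakers suggests the conditioning is carrying speaker identity through the pipeline.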

Conclusion

This project demonstrates how to build a modular, customizable voice cloning system using deep learning. By separating speaker identity extraction from the text-to-speech generation pipeline, we can build flexible architectures that allow speaker adaptation with just a single audio clip.

If you’re interested in trying this or contributing, feel free to check out my GitHub repo: https://github.com/jeevvanth/VoiceCloneTacoGlow_with_custom_speaker_embedding.git

If you are interested in my custom speaker encoder, you can access the .pt checkpoint here:

Drive: https://drive.google.com/drive/folders/1P9nHBmyAjbdc1dV-Wx6slehkKLkwLYz9?usp=drive_link
