In this project, I built a simple voice cloning pipeline that generates realistic speech from text using a reference speaker's voice. The core idea is to condition a text-to-speech (TTS) model on a custom speaker embedding extracted by my own LSTM-based speaker encoder trained with GE2E loss.
This project combines:
- A custom speaker encoder for extracting speaker embeddings from reference audio
- Tacotron2 for converting text into mel-spectrograms
- WaveGlow as the vocoder to convert mel-spectrograms into raw audio
1. Custom Speaker Encoder (LSTM + GE2E)
I trained a speaker encoder using an LSTM model with GE2E loss to learn speaker-discriminative embeddings. The encoder takes in MFCC features extracted from a reference .wav file and outputs a fixed-dimensional embedding vector representing the speaker's identity.
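Below is a minimal sketch of what such an encoder can look like. The layer sizes, MFCC settings, and the embed_utterance helper are illustrative assumptions rather than the exact configuration of my trained model:

import torch
import torch.nn as nn
import torchaudio

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mfcc=40, hidden_size=256, embedding_dim=256, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden_size, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_size, embedding_dim)

    def forward(self, mfcc):
        # mfcc: (batch, time, n_mfcc)
        _, (hidden, _) = self.lstm(mfcc)
        embedding = self.proj(hidden[-1])  # final hidden state of the last LSTM layer
        return embedding / embedding.norm(dim=1, keepdim=True)  # L2-normalised, as GE2E expects

def embed_utterance(encoder, wav_path, sample_rate=16000, n_mfcc=40):
    # Load a reference .wav, extract MFCCs, and return a (1, embedding_dim) speaker embedding
    waveform, sr = torchaudio.load(wav_path)
    if waveform.size(0) > 1:  # mix down to mono if needed
        waveform = waveform.mean(dim=0, keepdim=True)
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=n_mfcc)(waveform)
    with torch.no_grad():
        return encoder(mfcc.transpose(1, 2))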
2. Tacotron2 (Text → Mel-Spectrogram)
NVIDIA's Tacotron2 was used as the backbone for converting text into mel-spectrograms. I modified its inference method to accept a custom speaker embedding, enabling it to mimic the reference speaker's voice.
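One straightforward way to condition the decoder, sketched below as an illustration rather than the exact code from my fork, is to project the speaker embedding to the encoder's output dimension and add it to every encoder time step. The sketch follows the structure of NVIDIA's original inference method; the speaker_proj layer is an assumed addition to the model:

def inference(self, inputs, speaker_embedding):
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder.inference(embedded_inputs)

    # Broadcast the projected speaker embedding across all encoder time steps
    speaker = self.speaker_proj(speaker_embedding)            # (batch, encoder_dim), assumed nn.Linear
    encoder_outputs = encoder_outputs + speaker.unsqueeze(1)  # (batch, time, encoder_dim)

    mel_outputs, gate_outputs, alignments = self.decoder.inference(encoder_outputs)
    mel_outputs_postnet = mel_outputs + self.postnet(mel_outputs)
    return self.parse_output([mel_outputs, mel_outputs_postnet, gate_outputs, alignments])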
3. WaveGlow Vocoder
WaveGlow takes the mel-spectrogram from Tacotron2 and generates the final speech waveform. I used the pretrained waveglow_256channels_ljs_v3.pt model provided by NVIDIA.
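Loading the checkpoint follows NVIDIA's own inference scripts; the file and repo paths below are assumptions about the local setup:

import sys
import torch

sys.path.append("waveglow/")     # the checkpoint pickles classes from the WaveGlow repo
from denoiser import Denoiser    # denoiser.py ships with the WaveGlow repo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Path to the pretrained checkpoint is assumed; point it at your download location
waveglow = torch.load("waveglow_256channels_ljs_v3.pt", map_location=device)["model"]
waveglow = waveglow.remove_weightnorm(waveglow)  # as in NVIDIA's inference script
waveglow.to(device).eval()

denoiser = Denoiser(waveglow).to(device)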
Here's what happens inside clean_audio() in my script:
import torch
import soundfile as sf
from text import text_to_sequence  # text module from NVIDIA's Tacotron2 repo

def clean_audio():
    # encoder, tacotron2, waveglow, denoiser and device are loaded earlier in the script
    reference_audio = "clean.wav"
    text = "Hello, how are you?"
    speaker_embedding = encoder(reference_audio)
    # Convert the input text to a sequence of symbol IDs
    text_sequence = text_to_sequence(text, ['english_cleaners'])
    text_sequence = torch.LongTensor(text_sequence).unsqueeze(0).to(device)
    # Speaker-conditioned Tacotron2 inference: text + embedding -> mel-spectrogram
    mel_outputs, mel_postnet, _, _ = tacotron2.inference(text_sequence, speaker_embedding)
    mel_spectrogram = mel_postnet
    # WaveGlow vocoder: mel-spectrogram -> waveform, followed by light denoising
    audio = waveglow.infer(mel_spectrogram, sigma=0.8)
    audio = denoiser(audio, 0.01).squeeze(1).cpu().numpy()
    sf.write("output.wav", audio[0], 22050)
    print("Output saved to output.wav")
You can set this up by installing the required models and dependencies:
pip install torch torchaudio soundfile
pip install git+https://github.com/NVIDIA/tacotron2.git
pip install git+https://github.com/NVIDIA/waveglow.git
Download: the pretrained waveglow_256channels_ljs_v3.pt checkpoint from NVIDIA (the speaker encoder weights are linked at the end of this post).
Result
The output audio in output.wav speaks the input sentence in the voice of the reference audio used to generate the speaker embedding. This lets us clone any speaker's voice as long as we have a short audio sample of them.
Conclusion
This project demonstrates how to build a modular, customizable voice cloning system using deep learning. By separating speaker identity extraction from the text-to-speech generation pipeline, we can build flexible architectures that allow speaker adaptation with just a single audio clip.
If you're interested in trying this out or contributing, feel free to check out my GitHub repo: https://github.com/jeevvanth/VoiceCloneTacoGlow_with_custom_speaker_embedding.git
If you're interested in my custom speaker encoder, you can access the .pt checkpoint here:
Drive: https://drive.google.com/drive/folders/1P9nHBmyAjbdc1dV-Wx6slehkKLkwLYz9?usp=drive_link