
    Voice Cloning and Custom Speaker Encoder with Tacotron, WaveGlow | by Jeevanthbheeman | Jul, 2025

    By Team_AIBS News, July 10, 2025


    In this project, I built a simple voice cloning pipeline that generates realistic speech from text using a reference speaker’s voice. The core idea is to condition a text-to-speech (TTS) model on a custom speaker embedding extracted using my own LSTM-based speaker encoder trained with GE2E loss.

    This project combines:

    • A custom Speaker Encoder for extracting speaker embeddings from reference audio
    • Tacotron2 for converting text into mel-spectrograms
    • WaveGlow as the vocoder to convert mel-spectrograms into raw audio

    1. Custom Speaker Encoder (LSTM + GE2E)

    I trained a speaker encoder using an LSTM model with GE2E loss to learn speaker-discriminative embeddings. The encoder takes in MFCC features extracted from a reference .wav file and outputs a fixed-dimensional embedding vector representing the speaker’s identity.
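To make the idea concrete, here is a minimal sketch of an LSTM speaker encoder and a simplified GE2E softmax loss in PyTorch. It is not the code from my repo: the class name `SpeakerEncoder`, the function `ge2e_softmax_loss`, the layer sizes (40 MFCCs, 256 hidden units, 3 layers), and the fixed scale `w` and bias `b` are all assumptions chosen for illustration. In real GE2E training, `w` and `b` are learned along with the network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerEncoder(nn.Module):
    """Hypothetical LSTM encoder: MFCC frames in, one unit-length embedding out."""

    def __init__(self, n_mfcc=40, hidden_size=256, embedding_size=256, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden_size, num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_size, embedding_size)

    def forward(self, mfcc):
        # mfcc: (batch, frames, n_mfcc)
        _, (hidden, _) = self.lstm(mfcc)
        embedding = self.proj(hidden[-1])  # final hidden state of the top LSTM layer
        return F.normalize(embedding, p=2, dim=1)


def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
    """Simplified GE2E softmax loss (w and b fixed here; an assumption).

    embeddings: (n_speakers, n_utterances, dim), L2-normalised, n_utterances >= 2.
    """
    n_spk, n_utt, _ = embeddings.shape
    # One centroid per speaker
    centroids = F.normalize(embeddings.mean(dim=1), dim=1)
    # Leave-one-out centroids: exclude each utterance from its own speaker's centroid
    sums = embeddings.sum(dim=1, keepdim=True)
    loo_centroids = F.normalize((sums - embeddings) / (n_utt - 1), dim=2)

    # sim[j, i, k] = cosine similarity of utterance i of speaker j to centroid k
    sim = torch.einsum("jid,kd->jik", embeddings, centroids)
    own = (embeddings * loo_centroids).sum(dim=2)
    same_speaker = torch.eye(n_spk, dtype=torch.bool).unsqueeze(1)
    sim = torch.where(same_speaker, own.unsqueeze(2), sim)

    # Each utterance should be most similar to its own speaker's centroid
    logits = (w * sim + b).reshape(n_spk * n_utt, n_spk)
    targets = torch.arange(n_spk).repeat_interleave(n_utt)
    return F.cross_entropy(logits, targets)


torch.manual_seed(0)
encoder = SpeakerEncoder()
mfcc_batch = torch.randn(4 * 5, 120, 40)  # 4 speakers x 5 utterances, 120 frames each
emb = encoder(mfcc_batch).reshape(4, 5, -1)
loss = ge2e_softmax_loss(emb)
print(emb.shape, loss.item())
```

The loss pulls each utterance toward its own speaker's centroid and away from the others, which is what makes the resulting embedding usable as a speaker identity vector.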

    2. Tacotron2 (Text → Mel-Spectrogram)

    NVIDIA’s Tacotron2 was used as the backbone for converting text into mel-spectrograms. I modified its inference method to accept a custom speaker embedding, enabling it to mimic the reference speaker’s voice.
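One common way to condition a Tacotron2-style model on a speaker is to append the speaker embedding to every time step of the text encoder's output before the decoder attends over it. The sketch below shows only that tensor operation. The helper name `condition_on_speaker` is hypothetical and the 256-dimensional embedding is an assumed size; this is an illustration of the general technique rather than the exact change in my modified inference method.

```python
import torch


def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Append the same speaker embedding to every encoder time step.

    encoder_outputs:   (batch, text_len, enc_dim)
    speaker_embedding: (batch, spk_dim)
    returns:           (batch, text_len, enc_dim + spk_dim)
    """
    text_len = encoder_outputs.size(1)
    tiled = speaker_embedding.unsqueeze(1).expand(-1, text_len, -1)
    return torch.cat([encoder_outputs, tiled], dim=2)


encoder_outputs = torch.zeros(1, 17, 512)  # 512 is Tacotron2's default encoder size
speaker_embedding = torch.ones(1, 256)     # 256 is an assumed embedding size
conditioned = condition_on_speaker(encoder_outputs, speaker_embedding)
print(conditioned.shape)  # torch.Size([1, 17, 768])
```

With this approach, the decoder layers that consume the encoder output need their input size increased by the embedding dimension, so a pretrained Tacotron2 checkpoint would need fine-tuning after the change.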

    3. WaveGlow Vocoder

    WaveGlow takes the mel-spectrogram from Tacotron2 and generates the final speech waveform. I used the pretrained waveglow_256channels_ljs_v3.pt model provided by NVIDIA.
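A quick sanity check on the vocoder output is to estimate how long the audio should be from the number of mel frames. The sketch below assumes a hop length of 256 samples per frame (the default in NVIDIA's Tacotron2 hyperparameters) and the 22,050 Hz sample rate used when writing the output file. The function name is hypothetical, and the result is approximate because the vocoder may trim a few samples at the edges.

```python
SAMPLE_RATE = 22050  # Hz, the rate used when saving output.wav
HOP_LENGTH = 256     # samples per mel frame (assumed default)


def estimated_duration_seconds(n_mel_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Approximate audio duration implied by a mel-spectrogram's frame count."""
    return n_mel_frames * hop_length / sample_rate


print(round(estimated_duration_seconds(200), 2))  # 2.32
```

If the generated file is much longer than this estimate for a short sentence, the decoder has likely failed to predict a stop token and kept generating frames.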

    Here’s what happens inside clean_audio() in my script:

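    import torch
    import soundfile as sf
    from text import text_to_sequence  # assumed import path, as in NVIDIA's tacotron2 repo

    # encoder, tacotron2, waveglow, denoiser, and device are loaded earlier in the script.
    def clean_audio():
        reference_audio = "clear.wav"
        text = "Hello, how are you?"

        # Speaker embedding from the reference audio
        speaker_embedding = encoder(reference_audio)
        # Text -> tensor of symbol IDs with shape (1, seq_len)
        text_sequence = text_to_sequence(text, ['english_cleaners'])
        text_sequence = torch.LongTensor(text_sequence).unsqueeze(0).to(device)
        # Modified inference that also takes the speaker embedding
        mel_outputs, mel_postnet, _, _ = tacotron2.inference(text_sequence, speaker_embedding)
        mel_spectrogram = mel_postnet
        # Mel-spectrogram -> waveform, then remove vocoder noise
        audio = waveglow.infer(mel_spectrogram, sigma=0.8)
        audio = denoiser(audio, 0.01).squeeze(1).cpu().numpy()

        sf.write("output.wav", audio[0], 22050)
        print("Output saved to output.wav")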

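    You can set this up by installing the required dependencies and cloning NVIDIA’s Tacotron2 and WaveGlow repositories (they are used as source checkouts, so their directories need to be on your Python path):

    pip install torch torchaudio soundfile
    git clone https://github.com/NVIDIA/tacotron2.git
    git clone https://github.com/NVIDIA/waveglow.git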

    Download:

    Result

    The output audio in output.wav speaks the input sentence in the voice of the reference audio used to generate the speaker embedding. This allows us to clone any speaker’s voice as long as we have a short audio sample of them.

    Conclusion

    This project demonstrates how to build a modular, customizable voice cloning system using deep learning. By separating speaker identity extraction from the text-to-speech generation pipeline, we can build flexible architectures that allow speaker adaptation with just a single audio clip.

    If you’re interested in trying this or contributing, feel free to check out my GitHub repo: https://github.com/jeevvanth/VoiceCloneTacoGlow_with_custom_speaker_embedding.git

    If you are interested in my custom speaker encoder, you can access the .pt file here:

    Drive: https://drive.google.com/drive/folders/1P9nHBmyAjbdc1dV-Wx6slehkKLkwLYz9?usp=drive_link


