Updated | Near-Automated Voice Cloning | Whisper STT + Coqui TTS | Fine Tune a VITS Model on Colab

Channel: NanoNomad
Subscribers: 2,970
Published on 2023-02-04
Video Link: https://www.youtube.com/watch?v=dfmlyXHQOwE



Duration: 11:40
6,771 views


This is about as close to automated as I can make things. I've put together a Colab notebook that uses a bunch of spaghetti code, rnnoise, OpenAI's Whisper Speech to Text, and Coqui Text to Speech to train a VITS model.

Upload audio files, split and process clips, denoise clips, transcribe clips with Whisper, then use that dataset to fine tune a VITS model. The Colab script has been revised to add toggles for freezing layers, plus some (possibly broken) audio processing options.
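To make the "transcribe clips, then use that dataset" step concrete, here is a minimal sketch of turning a folder of clips into a transcript file. It is not the notebook's actual code. It assumes an LJSpeech-style `metadata.csv` layout (`clip_id|text|normalized_text`), and the function name `build_metadata` and the `transcribe` callable are hypothetical. In practice the callable could wrap Whisper, for example `lambda p: model.transcribe(str(p))["text"]` with `model = whisper.load_model("base")`.

```python
from pathlib import Path
from typing import Callable, Iterable


def build_metadata(
    clips: Iterable[Path],
    transcribe: Callable[[Path], str],
    out_path: Path,
) -> int:
    """Write one 'clip_id|text|normalized_text' line per transcribed clip.

    `transcribe` is any function that maps a clip path to its text
    (assumption: a speech-to-text model such as Whisper sits behind it).
    Returns the number of rows written.
    """
    rows = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for clip in sorted(clips):
            # Pipes would break the delimiter, so replace them, then
            # collapse runs of whitespace into single spaces.
            text = " ".join(transcribe(clip).replace("|", " ").split())
            if not text:
                continue  # skip clips where no speech was recognized
            f.write(f"{clip.stem}|{text}|{text}\n")
            rows += 1
    return rows
```

Passing the transcriber in as a function keeps the file-writing logic testable without a GPU or a model download. Clips that come back empty are skipped so they don't end up in the training set.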

This is for fine tuning English voices; things are hardcoded for English. Adjusting this will take some work on your part, and fine tuning across languages is hit and miss.

The first part of the video covers using Audacity and the VST3 port of rnnoise to clip samples more accurately on your PC. The second half is the Colab run-through.
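The video does the clipping by hand in Audacity. If you want a rough scripted first pass, the sketch below shows one simple approach: amplitude-threshold silence splitting. This is a swapped-in illustration, not what Audacity or the notebook does. The function name `split_on_silence` and all default thresholds are assumptions, and it expects mono samples already scaled to the -1.0 to 1.0 range.

```python
from typing import List, Sequence, Tuple


def split_on_silence(
    samples: Sequence[float],
    sample_rate: int,
    threshold: float = 0.02,
    min_silence_s: float = 0.3,
    min_clip_s: float = 1.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample index pairs for the non-silent regions.

    A region ends once the signal has stayed below `threshold` for at
    least `min_silence_s` seconds. Regions shorter than `min_clip_s`
    seconds are dropped. `end` is exclusive.
    """
    min_gap = int(min_silence_s * sample_rate)
    min_len = int(min_clip_s * sample_rate)
    regions: List[Tuple[int, int]] = []
    start = None       # first sample of the region being tracked
    last_loud = None   # most recent sample at or above the threshold
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_gap:
            # Enough quiet samples in a row: close the current region.
            if last_loud + 1 - start >= min_len:
                regions.append((start, last_loud + 1))
            start = None
    # Close a region that runs to the end of the audio.
    if start is not None and last_loud + 1 - start >= min_len:
        regions.append((start, last_loud + 1))
    return regions
```

Running this on denoised audio should give cleaner cut points, since background noise can otherwise sit above the threshold. Treat the output as a starting point and check the cuts by ear.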

Real time noise suppression plugin:
https://github.com/werman/noise-suppression-for-voice

Colab script (r4):
https://colab.research.google.com/drive/1Swo0GH_PjjAMqYYV6He9uFaq5TQsJ7ZH?usp=sharing

Audacity:
https://www.audacityteam.org/

Coqui's Dataset Guide:
https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset

rnnoise:
https://github.com/xiph/rnnoise




Other Videos By NanoNomad


2023-05-01 AI Voice Swap and Lip Sync using Wav2Lip-HQ-Updated
2023-04-22 Voice Cloning with Tortoise TTS and Model Training Using the AI Voice Cloning WebUI
2023-04-07 Locally Hosted Chatbots with RWKV through ChatRWKV and the Text-Generation-WebUI | 14B Model on 3GB!
2023-03-29 Create Datasets for Voice Model Training on Google Colab | Updated Tools for Coqui TTS Training
2023-03-22 Train a VITS Speech Model using Coqui TTS | Updated Script and Audio Processing Tools
2023-03-15 Training or Fine Tuning a Hindi Language VITS TTS Voice Model with Coqui TTS on Google Colab
2023-03-05 Install and Configure Retroarch for PS Vita with Thumbnails, Overlays and Shaders
2023-03-03 Fallout 1 on the PS Vita is the Best Way to Play
2023-02-24 Train or Fine Tune VITS on (theoretically) Any Language | Train Multi-Speaker Model | Train YourTTS
2023-02-12 Even more Voice Cloning | Train a Multi-Speaker VITS model using Google Colab and a Custom Dataset
2023-02-04 Updated | Near-Automated Voice Cloning | Whisper STT + Coqui TTS | Fine Tune a VITS Model on Colab
2023-01-30 YourTTS Training Discussion | Experiences, Multistage Training, Demos, Prior Training Preservation
2023-01-27 Updated | Fine-Tuning YourTTS with Automated STT Datasets on Google Colab for AI Voice Cloning
2023-01-13 Fine-Tune YourTTS with Near-Automated Datasets on Google Colab for AI Voice Cloning
2022-12-22 Near-Automated Voice Cloning | Whisper STT + Coqui TTS | Fine Tune a VITS Model on Colab or Linux
2022-12-09 Dreambooth and Fine Tuning for Stable Diffusion 1.5 and 2 with this Versatile Script
2022-11-30 If Bill Gates could rap? AI Synthesized Voice, AI Upsampled Video | Deltron 3030's Virus
2022-11-14 Training Stable Diffusion Dreambooth on Multiple Subjects for Combined Image Generation
2022-10-31 Locally Train Stable Diffusion with Dreambooth using WSL Ubuntu
2022-10-25 Animated Stable Diffusion and Synthesized Voice Demo with Facial Movements
2022-10-24 Stable Diffusion Image to Video, Synthesized Lauretta Young 1930s voice, Wav2Lip Demo



Tags:
voice cloning
vits
coqui
tts
ai voice
voice synthesis