Near-Automated Voice Cloning | Whisper STT + Coqui TTS | Fine Tune a VITS Model on Colab or Linux

Channel: NanoNomad
Subscribers: 2,550
Published on: 2022-12-22
Video Link: https://www.youtube.com/watch?v=e_DCb1XPWS0



Duration: 12:19
Views: 7,636
Likes: 216


This is about as close to automated as I can make things. I've put together a Colab notebook that uses a bunch of spaghetti code, rnnoise, OpenAI's Whisper Speech to Text, and Coqui Text to Speech to train a VITS model.

Upload audio files, split and process clips, denoise clips, transcribe clips with Whisper, then use that dataset to fine tune a VITS model.
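
Very roughly, the last step of that pipeline comes down to restoring a pretrained VITS checkpoint and continuing training on the new dataset. A minimal sketch of that step only (the model choice, config name, and paths here are placeholders, not the exact commands from the notebook):
# Pull a pretrained English VITS checkpoint once (the tts CLI caches models under ~/.local/share/tts/)
tts --model_name tts_models/en/ljspeech/vits --text "test" --out_path /tmp/test.wav
# Fine-tune: config.json points at the transcribed dataset, restore_path at the cached checkpoint
python3 TTS/bin/train_tts.py --config_path config.json --restore_path /path/to/pretrained_vits.pth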

The second part of the video is a quick look at installing the same thing on WSL2 Ubuntu 20.04 Linux on Windows 10. A copy-paste command list and the scripts (rnnoise denoise and Whisper transcription) are linked down below.
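
For reference, the install itself is mostly stock packages; it looks something like this (see the pastebin below for the exact, tested list):
sudo apt update && sudo apt install -y ffmpeg sox autoconf automake libtool build-essential
pip install TTS                                          # Coqui TTS trainer and tts CLI
pip install git+https://github.com/openai/whisper.git    # OpenAI Whisper STT
git clone https://github.com/xiph/rnnoise
cd rnnoise && ./autogen.sh && ./configure && make        # builds examples/rnnoise_demo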

Use single-speaker, clear audio samples. rnnoise can't work miracles.

Whisper STT + Coqui TTS Colab Notebook
https://colab.research.google.com/drive/1xy0qmej_G3skZL2BpY1sBm_k3BTTkf7V?usp=sharing

Linux command list:
https://pastebin.com/9MeCYi4p

rnnoise voice clip denoise script:
https://pastebin.com/5wrAt1UG
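
The script above is the tested version; the gist of it is that the rnnoise_demo example binary only takes raw 48 kHz 16-bit mono PCM, so each clip gets converted with sox, denoised, and converted back. An illustrative loop (paths and filenames are placeholders):
mkdir -p denoised
for FILE in splits/*.wav; do
    sox "$FILE" -r 48000 -c 1 -b 16 -e signed-integer -t raw tmp.raw       # wav -> raw PCM
    ./rnnoise/examples/rnnoise_demo tmp.raw tmp_denoised.raw               # denoise
    sox -r 48000 -c 1 -b 16 -e signed-integer -t raw tmp_denoised.raw denoised/"$(basename "$FILE")"
done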

Whisper STT voice clip transcription script:
https://pastebin.com/Q4VSsktk
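
As with the denoise step, the pastebin has the real script; conceptually it loops over the denoised clips, asks Whisper for a transcript, and appends a pipe-separated line in the LJSpeech metadata format that Coqui's dataset loader expects. A minimal sketch (model size, language, and paths are placeholders):
mkdir -p transcripts
for FILE in denoised/*.wav; do
    whisper "$FILE" --model small --language en --output_format txt --output_dir transcripts
    NAME="$(basename "$FILE" .wav)"
    TEXT="$(tr '\n' ' ' < transcripts/"$NAME".txt)"
    echo "$NAME|$TEXT|$TEXT" >> metadata.csv             # LJSpeech format: id|text|normalized text
done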

Alternate command to split files into 8-second chunks instead of on silence (the newfile pseudo-effect is what makes sox write each chunk to its own numbered file):
for FILE in *.wav; do sox "$FILE" splits/"$FILE" --show-progress trim 0 8 : newfile : restart ; done
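
For completeness, splitting on silence can also be done with sox alone; an illustrative version (not necessarily what the notebook uses, and the duration/threshold values will need tuning per recording):
mkdir -p splits
for FILE in *.wav; do sox "$FILE" splits/"$FILE" silence 1 0.2 1% 1 0.5 1% : newfile : restart ; done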




Other Videos By NanoNomad


2023-03-22  Train a VITS Speech Model using Coqui TTS | Updated Script and Audio Processing Tools
2023-03-15  Training or Fine Tuning a Hindi Language VITS TTS Voice Model with Coqui TTS on Google Colab
2023-03-05  Install and Configure Retroarch for PS Vita with Thumbnails, Overlays and Shaders
2023-03-03  Fallout 1 on the PS Vita is the Best Way to Play
2023-02-24  Train or Fine Tune VITS on (theoretically) Any Language | Train Multi-Speaker Model | Train YourTTS
2023-02-12  Even more Voice Cloning | Train a Multi-Speaker VITS model using Google Colab and a Custom Dataset
2023-02-04  Updated | Near-Automated Voice Cloning | Whisper STT + Coqui TTS | Fine Tune a VITS Model on Colab
2023-01-30  YourTTS Training Discussion | Experiences, Multistage Training, Demos, Prior Training Preservation
2023-01-27  Updated | Fine-Tuning YourTTS with Automated STT Datasets on Google Colab for AI Voice Cloning
2023-01-13  Fine-Tune YourTTS with Near-Automated Datasets on Google Colab for AI Voice Cloning
2022-12-22  Near-Automated Voice Cloning | Whisper STT + Coqui TTS | Fine Tune a VITS Model on Colab or Linux
2022-12-09  Dreambooth and Fine Tuning for Stable Diffusion 1.5 and 2 with this Versatile Script
2022-11-30  If Bill Gates could rap? AI Synthesized Voice, AI Upsampled Video | Deltron 3030's Virus
2022-11-14  Training Stable Diffusion Dreambooth on Multiple Subjects for Combined Image Generation
2022-10-31  Locally Train Stable Diffusion with Dreambooth using WSL Ubuntu
2022-10-25  Animated Stable Diffusion and Synthesized Voice Demo with Facial Movements
2022-10-24  Stable Diffusion Image to Video, Synthesized Lauretta Young 1930s voice, Wav2Lip Demo
2022-10-16  Animate Images using AI with Frame Interpolation for Large Motion
2022-10-14  Animated Stable Diffusion Images using Google's FILM Frame Interpolation for Large Motion demo
2022-10-07  Training Textual Inversion for Stable Diffusion | Customizable AI Image Generation
2022-09-26  How to Download All Styles and Objects from the Stable Diffusion Concepts Library | AI Images



Tags:
Voice Cloning
OpenAI Whisper
Coqui TTS
Voice Synthesis
Vocaloid
VITS Fine-Tuning