Updated | Fine-Tuning YourTTS with Automated STT Datasets on Google Colab for AI Voice Cloning
A follow-up to the YourTTS video. Here you can fine-tune a multi-speaker YourTTS model on your own voice samples. The samples are split, converted, denoised with rnnoise, transcribed with OpenAI's Whisper STT, assembled into a VCTK-format dataset, and then used to fine-tune the YourTTS model with Coqui TTS.
The script is currently configured for English voices only; other languages require separate datasets, and several settings are hardcoded for English for ease of use.
It probably mostly works if you have a good dataset. Probably. No promises.
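The dataset-assembly steps above can be sketched in Python. The layout here (`wav48/<speaker>/` and `txt/<speaker>/` directories with three-digit clip indices) follows the public VCTK corpus convention and is an assumption about what the Colab script produces, and the helper names are hypothetical:

```python
from pathlib import Path

def vctk_paths(root: str, speaker: str, clip_idx: int):
    """Map a clip to its wav/transcript locations in a VCTK-style tree:
    <root>/wav48/<speaker>/<speaker>_<idx>.wav and
    <root>/txt/<speaker>/<speaker>_<idx>.txt (layout assumed from VCTK)."""
    base = f"{speaker}_{clip_idx:03d}"
    wav = Path(root) / "wav48" / speaker / f"{base}.wav"
    txt = Path(root) / "txt" / speaker / f"{base}.txt"
    return wav, txt

def add_clip(root: str, speaker: str, clip_idx: int, transcript: str) -> Path:
    """Write one Whisper transcript into the tree and return the path where
    the caller should place the matching denoised (rnnoise) wav."""
    wav, txt = vctk_paths(root, speaker, clip_idx)
    wav.parent.mkdir(parents=True, exist_ok=True)
    txt.parent.mkdir(parents=True, exist_ok=True)
    txt.write_text(transcript.strip() + "\n", encoding="utf-8")
    return wav
```

Splitting, conversion, denoising, and transcription themselves are handled inside the Colab script; this sketch only shows how the resulting clips would slot into a VCTK-format dataset.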
Updated Whisper STT+Coqui YourTTS Colab Script:
https://colab.research.google.com/drive/1GsOL7pwCrECagRxmOgJoKOHtlaGhUCma?usp=sharing
WaveShop:
https://waveshop.sourceforge.net/download.html
Sonic Visualiser:
https://www.sonicvisualiser.org/
Coqui's Dataset Guide:
https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset
rnnoise:
https://github.com/xiph/rnnoise
Generate text with the CLI:
tts --text "text" --out_path outfile.wav --model_path "multivoice/traineroutput/run path/best_model.pth" --config_path "multivoice/traineroutput/run path/config.json" --speakers_file_path multivoice/speakers.pth --speaker_idx VCTK_speaker
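Since the run directory in the example contains a space, calling the CLI from Python with an argument list avoids shell-quoting mistakes entirely. This is just a wrapper around the same `tts` command; the directory and file names mirror the example above and are placeholders for your own training output:

```python
import subprocess

def build_tts_cmd(text: str, out_path: str, run_dir: str,
                  speakers_file: str, speaker: str) -> list[str]:
    """Assemble the Coqui `tts` invocation as an argument list, so paths
    with spaces (like 'run path') need no quoting."""
    return [
        "tts",
        "--text", text,
        "--out_path", out_path,
        "--model_path", f"{run_dir}/best_model.pth",
        "--config_path", f"{run_dir}/config.json",
        "--speakers_file_path", speakers_file,
        "--speaker_idx", speaker,
    ]

cmd = build_tts_cmd("text", "outfile.wav",
                    "multivoice/traineroutput/run path",
                    "multivoice/speakers.pth", "VCTK_speaker")
# subprocess.run(cmd, check=True)  # uncomment once Coqui TTS is installed
```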