Tortoise TTS DEMO: G-Man performs Gilbert and Sullivan's 'The Major-General's Song'
Half-Life's G-Man performs Gilbert and Sullivan's 'The Major-General's Song', through the magic of Tortoise TTS.
Generation settings: Sampler: 2, Iterations: 128, Cond free, length penalty 0.2
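For reference, the settings above can be expressed roughly as keyword arguments to Tortoise's `api.tts()` call. This is a sketch, not the exact invocation used; in particular, what "Sampler: 2" maps to depends on the UI/fork, so it is kept as a symbolic index here.

```python
# Hypothetical mapping of the listed generation settings to Tortoise-style
# keyword arguments. Names follow tortoise-tts's api.tts(); "sampler" is a
# UI index whose concrete meaning is fork-specific.
generation_settings = {
    "diffusion_iterations": 128,  # "Iterations: 128"
    "cond_free": True,            # "Cond free" (conditioning-free diffusion)
    "length_penalty": 0.2,        # "length penalty 0.2"
    "sampler": 2,                 # "Sampler: 2" — left symbolic
}
```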
Download one of my fine-tuned Tortoise TTS models:
Base model: https://huggingface.co/AOLCDROM/Tortoise-TTS-MSFT-VCTK-4V-En
Requires the custom tokenizer file pt-t.json (place it in ./models/tokenizers and switch the tokenizer in the settings menu).
Training a multi-voice English model. Testing one of the voices.
Training so far:
2 epochs to establish the base language, LR 1e-5, text ratio 1
1 epoch on a single voice (6-hour dataset), LR 1e-5, text ratio 0.1
several (I lost track) 1-2 epoch sessions on a 6-voice dataset (~2 hours), LR 1e-5, text ratio 0.1
23 epochs thus far on a 30-voice dataset (unknown total duration, approx. 13,000 samples) across multiple sessions, LR 1e-5, text ratio 0.1
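The schedule above, laid out as data (a sketch for readability; the 6-voice phase's session count is approximate, as noted):

```python
# The fine-tuning schedule described above, as a list of phases.
# epochs=None marks the phase whose exact session count was lost.
schedule = [
    {"phase": "base language",        "epochs": 2,    "lr": 1e-5, "text_ratio": 1.0},
    {"phase": "single voice, 6 h",    "epochs": 1,    "lr": 1e-5, "text_ratio": 0.1},
    {"phase": "6 voices, ~2 h",       "epochs": None, "lr": 1e-5, "text_ratio": 0.1},
    {"phase": "30 voices, ~13k clips","epochs": 23,   "lr": 1e-5, "text_ratio": 0.1},
]
```

Note that the learning rate is held at 1e-5 throughout; only the text ratio changes after the base-language phase.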
Once an epoch ends, I let the next one begin. If training reaches a minimum and stalls a third to halfway through an epoch, I terminate the run, test the model, and restart using that last checkpoint as the starting model.
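The stall check I use amounts to something like the following sketch: if the best loss seen recently is no better than the best loss seen before, the run has stalled. The window and tolerance values here are illustrative, not the exact thresholds I apply.

```python
def training_stalled(losses, window=200, tolerance=1e-3):
    """Return True if the loss has not improved by more than `tolerance`
    over the last `window` recorded steps. Illustrative heuristic only."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    recent_best = min(losses[-window:])
    earlier_best = min(losses[:-window])
    # Stalled if the recent window failed to beat the earlier best.
    return earlier_best - recent_best <= tolerance
```

In practice I eyeball the loss curve rather than automate this, but the logic is the same: no meaningful improvement mid-epoch means terminate, test, restart from the checkpoint.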
Loss targeting across datasets/checkpoints doesn't seem to matter much right now, because the minima are wildly different between sessions/models/datasets.
During successful sessions I typically see a relatively normal-looking loss curve, then a sharp drop after each epoch, which feels a little counterintuitive.
Batch size for all sessions: 32; gradient accumulation for all: 16.
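A note on what those two numbers imply: in some trainers (e.g. the DLAS-based fine-tuning forks commonly used with Tortoise), the gradient accumulation value divides the batch into micro-batches rather than multiplying it. Assuming that convention, the arithmetic is:

```python
# Batch arithmetic under the "accumulation divides the batch" convention.
# Whether this matches a given trainer is an assumption — check its docs.
batch_size = 32
grad_accumulation = 16
samples_per_micro_batch = batch_size // grad_accumulation  # 2 samples per forward pass
```

So each optimizer step still sees 32 samples, but only 2 at a time fit through the model per forward/backward pass, which keeps VRAM usage low.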
All training samples are recording-booth quality, with very low noise unless the audio has effects applied.
This voice worked well because it is dissimilar from the others being trained. There is some bleed-over: I think the same voice actor performs one of the scientists, and at one point the model slips into that intonation and speech pattern.