I tried redoing it with commit 0231550287 from about 2 weeks ago, and the output was much better, close to the dataset voice. The training ran much faster too.
Did redoing it include…
Weird, that sounds like just about ideal. Are there any complications like reverb or background music?
How big is your dataset, and how different is it from "standard" English speech?
I'd be comfortable renting a GPU for bigger training runs (or caving and buying a 4090, since somehow the prospect of renting for pennies sounds worse than just splurging $1500 on another GPU).
[The…
Have you tried training a model with a single voice, for comparison?
How closely does the transcription in train.txt match the content of the audio clips?
Trained a new model with the Japanese tokenizer, and after ~55 epochs (~825,000 samples processed), I have a better Japanese model:
Was it with VALL-E or DLAS?
What effect does…
- I'm starting to hit the limitations of finetuning the base TorToiSe model.
- For non-English, a replaced tokenizer vocab is practically required for accuracy (see the tokenizer sketch below), and I have had terrible luck…
no BitsAndBytes to save my hide, so it's quite the VRAM hog.
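Since replacing the tokenizer vocab keeps coming up, here's a minimal sketch of how one might train a replacement BPE vocab on the dataset's own transcriptions using the HuggingFace `tokenizers` library. The file paths, vocab size, and special tokens are my assumptions, not anything confirmed in this thread.

```python
# Minimal sketch: train a replacement BPE tokenizer vocab on the dataset's
# transcriptions. Paths, vocab size, and special tokens are assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=256,  # a guess; a kanji-heavy language may need a larger vocab
    special_tokens=["[STOP]", "[UNK]", " "],
)
# transcripts.txt: one transcription per line, extracted from train.txt
tokenizer.train(["transcripts.txt"], trainer)
tokenizer.save("ja_tokenizer.json")
```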
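For contrast, the BitsAndBytes saving being missed here is normally just an optimizer swap, 8-bit optimizer states in place of fp32 AdamW. Whether the VALL-E trainer exposes its optimizer this cleanly is an assumption on my part; the model below is a placeholder.

```python
# Sketch of the usual BitsAndBytes swap: 8-bit optimizer states instead of
# fp32 AdamW. The model here is a stand-in, not the actual VALL-E network.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096)  # placeholder module

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # VRAM-hungry
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)  # much smaller optimizer state
```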
How bad is it? Is it still something that could run on HEDT graphics cards, or should I be pricing out refurbished P40s on eBay?
My gut feeling is that you'd want at least 100-200 epochs, but if your training set is close to a "standard" US English accent then you should be able to get away with less. How's the quality…
Maybe 50 epochs is enough?
Hard to say without knowing your batch size and how many steps per epoch you have.
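To make that bookkeeping concrete, here's the arithmetic relating those quantities. The dataset size is back-solved from the Japanese run above (~825,000 samples over ~55 epochs implies roughly 15,000 clips per epoch); the batch size is a placeholder I made up.

```python
# Back-of-the-envelope for epochs vs. steps vs. samples processed.
# Dataset size is inferred from the run above; batch size is a placeholder.
dataset_size = 825_000 // 55                      # ~15,000 clips per epoch
batch_size = 128                                  # placeholder, not from the thread
steps_per_epoch = -(-dataset_size // batch_size)  # ceiling division -> ~118

for epochs in (50, 100, 200):
    print(f"{epochs} epochs ~ {epochs * steps_per_epoch} steps, "
          f"{epochs * dataset_size:,} samples processed")
```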