[Discussion] What is a large dataset? #435
Hey,
I have a pretty simple question: what counts as a large dataset for training? I've seen somewhere in this repo that a small dataset would be around 17 to 50 clips. As for large datasets, I've seen one post talking about 40 hours of speech and another talking about 820,000 samples.
I currently have a dataset of around 2,000 clips, which represents roughly 10 hours of speech, maybe less. As I'm struggling with overfitting problems, I am mostly wondering whether this can be considered a "large enough" dataset for training?
Thank you,
Depends on what you are trying to achieve. When I finetune to a new language I use a dataset of circa 500k clips (about 1,000 hours of audio). James Betker (the original creator of tortoise-tts) said that he trained the model on 50k hours of audio, but he also said it would probably be possible to achieve similar results even with 10k hours. From that perspective my dataset with 'just' 1k hours is quite small, but I can still get decent results.
If you are just doing English finetuning, then 20+ hours should be more than enough to achieve very good results. There is even a finetuned model (again from James Betker) finetuned on the LJSpeech dataset: https://huggingface.co/jbetker/tortoise-tts-finetuned-lj From that you can probably guess what is possible with just finetuning, or even better, you can compare the finetuned model against the original model using LJSpeech as the voice-cloning input.
Hey! Thank you, yes indeed I'm just trying to refine an English-speaking model for a game character.
Thank you for the finetuned model. If I understand correctly, James Betker took the female voice from the LJSpeech dataset and finetuned the Tortoise base model, exactly like what we're doing with your tool?
I'm really far from 13k clips though; I don't think I'll be able to reach that kind of size, even with the audiobooks I've managed to get that are read by my character. And I still don't know where that overfitting problem comes from, it drives me insane!
Anyway, thank you for your kind answer, as usual!
Yes, exactly. But it's not my tool. It's made by @mrq
Try to use a really low learning rate, somewhere between 0.000001 (1e-6) and 0.00000025 (2.5e-7). That's where a learning rate finder would be nice to have: it would give a good range of learning rates for each specific dataset to prevent overfitting...
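For reference, here is roughly what such a learning rate range test looks like in plain PyTorch. It's just a sketch, not something built into this repo; `model`, `loss_fn` and `loader` are placeholders for whatever you are actually training with:

```python
# Minimal LR range test sketch: sweep the learning rate geometrically over a
# few hundred steps, record the loss, and pick a range where the loss is
# still falling. Not part of ai-voice-cloning; all names are placeholders.
import torch

def lr_range_test(model, loss_fn, loader, lr_min=1e-8, lr_max=1e-3, steps=200):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1.0 / steps)   # multiplicative step per batch
    history = []
    data_iter = iter(loader)
    for step in range(steps):
        try:
            inputs, targets = next(data_iter)    # assumes (inputs, targets) batches
        except StopIteration:
            data_iter = iter(loader)
            inputs, targets = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        lr = lr_min * gamma ** step
        history.append((lr, loss.item()))
        for group in optimizer.param_groups:     # raise the lr for the next step
            group["lr"] = lr * gamma
    return history  # plot loss vs. lr and choose a value before the loss blows up
```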
And more importantly, definitely use validation during training, ideally every epoch. Set aside 10-20 percent of your dataset for validation. If the validation loss starts to go up, it signals overfitting.
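If it helps, the split can be as simple as shuffling the filelist and cutting it 80/20. The file names below are placeholders, not the actual paths this tool expects:

```python
# Rough 80/20 split of an LJSpeech-style filelist (one "path|transcript" per line).
# File names here are made up for the example.
import random

with open("train.txt", encoding="utf-8") as f:
    lines = [l.strip() for l in f if l.strip()]

random.seed(0)                      # reproducible split
random.shuffle(lines)
cut = int(len(lines) * 0.8)         # 80% train, 20% validation

with open("train_split.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[:cut]))
with open("validation_split.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[cut:]))
```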
You can also use cosine annealing instead of just plain lr halving. It's really hard to tell, but theoretically it introduces more variance into the training...
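In plain PyTorch the warm-restart variant is set up like this; the base lr and restart period are just illustrative numbers, and `model`, `train_one_epoch` and `num_epochs` are placeholders:

```python
# Cosine annealing with warm restarts; hyperparameters are illustrative,
# not the repo's actual settings.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)   # model is a placeholder
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,        # epochs until the first restart
    T_mult=2,      # each cycle is twice as long as the previous one
    eta_min=1e-8,  # lr floor at the bottom of each cosine curve
)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # hypothetical training helper
    scheduler.step()                    # step once per epoch
```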
> Yes, exactly. But it's not my tool. It's made by @mrq
I'm so sorry, I read the message too quickly and assumed it was mrq who had answered ^^'. My apologies!
> Try to use a really low learning rate, somewhere between 0.000001 (1e-6) and 0.00000025 (2.5e-7). That's where a learning rate finder would be nice to have: it would give a good range of learning rates for each specific dataset to prevent overfitting...
I agree, an LR finder would be awesome. I did try very low learning rates, but the model remains stuck at pretty high loss values (around 2) and doesn't go down anymore (I've read somewhere that this has something to do with the model getting stuck in a local minimum? I'm still quite new to ML), so I don't know what else I can do :'(
> And more importantly, definitely use validation during training, ideally every epoch. Set aside 10-20 percent of your dataset for validation. If the validation loss starts to go up, it signals overfitting.
Yep, I'm using validation alright, with 20% of the data held out for validation, and I even tried k-fold cross-validation. One of my datasets seems to be working not too badly, no sign of overfitting over about 100 epochs, but the mel loss goes down soooo slowly. The dataset is about 445 clips, all coming from an audiobook, so I guess the voice is pretty even and that's why it works better. I'll try and run it through the night to see whether I can get the loss below 1, but that's going to take forever, if it happens at all...
> You can also use cosine annealing instead of just plain lr halving. It's really hard to tell, but theoretically it introduces more variance into the training...
I have tried cosine annealing too, but without much success. As a matter of fact, I struggle to understand how to use it. And what do you mean by lr halving?
Thanks a million :)
Yes, the low learning rate could mean that you are stuck in a local minimum. That's why it's probably better to use cosine annealing: the warm restart resets the learning rate to a higher value at the start of the next cycle, which could potentially help escape local minima. Lr halving (the multistep method), on the other hand, only ever decreases the lr and never goes back up, so it can't help with that kind of escape. But that's just theory; all of it probably depends on the dataset, its variance, etc.
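You can see the difference by just printing the two schedules on a throwaway parameter. The milestones and cycle length below are made up, but the shape is the point: halving only ever goes down, while the warm restarts periodically jump back up:

```python
# Compare lr per epoch for MultiStepLR (halving) vs CosineAnnealingWarmRestarts
# on a dummy parameter; only the latter ever climbs back up after a restart.
import torch

def lr_trace(make_scheduler, epochs=40, base_lr=1e-6):
    param = torch.nn.Parameter(torch.zeros(1))
    opt = torch.optim.SGD([param], lr=base_lr)
    sched = make_scheduler(opt)
    trace = []
    for _ in range(epochs):
        trace.append(opt.param_groups[0]["lr"])
        opt.step()
        sched.step()
    return trace

halving = lr_trace(lambda o: torch.optim.lr_scheduler.MultiStepLR(
    o, milestones=[10, 20, 30], gamma=0.5))
restarts = lr_trace(lambda o: torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    o, T_0=10, eta_min=1e-8))

print(max(halving), min(halving))    # only decreases: 1e-06 down to 1.25e-07
print(max(restarts), min(restarts))  # returns to ~1e-06 at every restart
```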
That is just really small; LJSpeech probably has 5-10k clips (circa 20 hours of audio). The Tortoise model is quite big, so it would be quite prone to overfitting with a small number of examples. The bigger the model, the more examples you need to feed it to prevent overfitting. That's one of the reasons why there are more models for image recognition, etc.
Ahh okay, well I was starting to figure that was why I had so many problems. I have different datasets; this one is mostly for testing. The biggest I have right now is 3,000 clips, but it is still overfitting. Would you say that the minimum would need to be around 5,000 to avoid overfitting then?
That is really hard to say. It really depends on the dataset's entropy and variance... I usually don't go below 50k+ samples for a single speaker, even for English ones... Therefore I don't have much experience with small datasets.
Oh wow, 50k! Yeah, I definitely won't be able to produce that many clips. All I have are 4 audiobooks and 600 game character lines, and once everything is processed I guess I can try and reach 30k, but not further.
The voice actor has a pretty wide range of voices, so that definitely argues for needing as much data as possible to cover all possibilities... Anyway, I'll just try to produce as many clips as I can with what I have.
I could also artificially augment it with Eleven Labs, but there are a few differences between the Eleven Labs output and the actual voice, so I'm kind of afraid that would pollute the dataset.
Anyway, we'll see, thank you for your advice!