Overfitting with large datasets #248

Open
opened 2023-05-22 19:11:00 +00:00 by arrivederci · 8 comments

Recently I tried to train a model on MoistCritikal's voice with 40 hours of speech. However, I'm noticing that I can only get in about 15 epochs before the validation loss starts going up. I've tried various learning rates, but to no avail.

It's not that the model isn't good; I just feel like there is a way to get the training and validation loss lower (right now, 1.89 is the lowest mel_loss I can get on the validation set).

Any advice?

Owner

I honestly can't quite recall the intricacies of TorToiSe finetuning (most of that knowledge got emptied out for VALL-E). But I do sort of remember that my Japanese dataset (which I finally checked: it caps at 22 hours untrimmed, though the source was already pretty tightly trimmed to begin with) ended at a relatively higher-than-typical loss. During my VALL-E escapades, the loss it would tend towards went up the more I kept adding to my dataset.

In other words, the bigger your dataset, the higher the loss will tend to be after a while. I think I noted this on the wiki rather naively later in my TorToiSe finetuning adventures, as most of my testing was on a few hundred lines, not a (relatively) astronomical amount.

I wouldn't sweat over trying to make the number smaller, as the real metric is always how the actual output sounds to the human ear, rather than how statistically close it is. I could probably throw accuracy reporting into DLAS, but eh.


Hello!

Allow me to share my own experience with this matter. I, too, have been struggling for about a week with the problem of the validation curve going up almost immediately, as soon as the end of epoch 1.

Here’s what I have tried so far.

  1. I have read (and a data scientist friend of mine advised) that the validation dataset should be made from about 20% of the training dataset. So I made a little script that automatically culls 20% of train.txt and creates validation.txt. I then tried with datasets ranging from 200 to 600 clips, but to no avail.

  2. I thought that maybe the validation dataset was not representative enough of the training dataset. Indeed, the character I am trying to clone has very different voice tonalities. So I went ahead and classified all my clips by what I call "voice color": medium, high, and deep. I put them all in the training dataset and designed another script to make sure it would still cull 20% of the training dataset for validation, AND that each voice color (medium, high, and deep) would be equally represented in the validation set (a rough sketch of such a split is shown after this list). Again, immediate overfitting.

  3. So then I tried reducing the dataset to only specific voice colors: training on only 200 clips of deep, or 200 of medium and 200 of high. Same result.
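
For reference, a minimal sketch of what such a stratified 80/20 split could look like. The file names, the one-clip-per-line format, and the assumption that the voice color is the clip's folder name are all placeholders for illustration, not my actual script:

```python
import random
from collections import defaultdict

# Placeholder format: one "path|transcription" line per clip, where the clip's
# folder name carries the "voice color" (deep/medium/high), e.g. "deep/clip_001.wav|...".
random.seed(0)

with open("train_full.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f if line.strip()]

# Group clips by voice color so each color ends up represented in validation.
by_color = defaultdict(list)
for line in lines:
    color = line.split("/", 1)[0]  # placeholder assumption about where the color lives
    by_color[color].append(line)

train, val = [], []
for color, clips in by_color.items():
    random.shuffle(clips)
    n_val = max(1, int(0.2 * len(clips)))  # cull ~20% of each color for validation
    val.extend(clips[:n_val])
    train.extend(clips[n_val:])

with open("train.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(train) + "\n")
with open("validation.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(val) + "\n")

print(f"{len(train)} training clips, {len(val)} validation clips")
```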

In the end, all the models I have trained don't perform too badly, honestly, but they are still overfitted, and I would very much like to understand why, and most of all, why they overfit immediately.

  1. Is it a problem of dataset size? Maybe I should have 2,000 clips rather than 600? I can't invent new voice lines (I can use Eleven Labs to some extent, but the generated voice, although very good, remains slightly different and I'm afraid of polluting my dataset), but maybe I can re-slice all 600 clips 3 or 4 times?

  2. On the contrary, should I train the model on very small datasets?

  3. Are there maybe parameters I should look at more closely? I admit I haven't touched the learning rate, for example.

Thank you for any advice ☺️ and for this amazing tool


Great news!! I've managed to get a very good validation curve over almost 200 epochs, just by changing the learning rate from 1e-05 to 1e-06. The training mel loss is still around 2.0, but I think the results are quite good.

I'll keep investigating in this direction and share my findings here.
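
As an aside, the easiest way I know to see exactly where the validation curve turns up, and which checkpoint is worth keeping, is to plot both curves from the logged values. A minimal sketch, assuming a hypothetical losses.csv with epoch / training mel loss / validation mel loss columns rather than any particular trainer's actual log format:

```python
import csv
import matplotlib.pyplot as plt

# Hypothetical log format: a CSV with columns "epoch", "train_mel_loss" and
# "val_mel_loss"; adapt the parsing to however your trainer actually logs losses.
epochs, train_loss, val_loss = [], [], []
with open("losses.csv", newline="") as f:
    for row in csv.DictReader(f):
        epochs.append(int(row["epoch"]))
        train_loss.append(float(row["train_mel_loss"]))
        val_loss.append(float(row["val_mel_loss"]))

# The checkpoint worth keeping is the one at the lowest *validation* mel loss.
best = min(range(len(val_loss)), key=val_loss.__getitem__)
print(f"Lowest validation mel loss {val_loss[best]:.3f} at epoch {epochs[best]}")

plt.plot(epochs, train_loss, label="training mel loss")
plt.plot(epochs, val_loss, label="validation mel loss")
plt.axvline(epochs[best], linestyle="--", color="gray", label="best checkpoint")
plt.xlabel("epoch")
plt.ylabel("mel loss")
plt.legend()
plt.show()
```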


@DoctorPopi
How does it sound, though? I am not sure how VALL-E sounds; I have stuck with using the Tortoise TTS engine. My datasets are about 30 to 40 samples.

It seems that, for as well as Tortoise matches the pace, accent, and general speed of talking... it's not so good at matching pitch. That is frustrating. It gets about 80 percent of the way there. I feel like there is a flaw, error, or offset intentionally induced in the pitch, and this keeps it from getting 90 to 95 percent there.


Hey! Well, it does sound pretty good, even though of course it's a bit random; I'd say it outputs good results about 40% of the time. What is a bit frustrating, though, is that the results are NOT good when generating voice lines with the data that was used to train the model, which is pretty odd. I have to use another batch to get good results, and a very specific one. Clearly it's not perfect; you have to generate a lot to get a good result. I'd also advise playing around with both mrq's and Fast Tortoise for generation, because you can get pretty different results from each, with one sometimes better than the other.

On Mrq, the settings that seem to work best for me are:

  • no voice fixing
  • iterations: 200
  • conditioning-free: activated
  • temperature: 1
  • repetition penalty: 8
  • length penalty: 6
  • diffusion sampler: P

Currently my training set has about 400 clips, with about 120 for validation. I'm currently running another test with a different learning rate; I'm at 0.000005 (5e-06), and the validation and training curves look pretty good so far (15 epochs).

A few questions to try to understand your problem, with my limited experience so far:

  • are there background sounds or noises?
  • how does your validation batch look?
  • how many epochs do you run?
  • have you checked the Whisper transcription in the .json, as well as train.txt?

@DoctorPopi
Re: are there background sounds or noises?
Nope. I have found that running samples through a music vocal remover actually helps clarify my audio samples. I already troubleshot this issue: one time I had 40 samples and realized one had bad background noise, and the fine-tuned model was practically unintelligible.

Re: how does your validation batch look ?
I don't use one. Should I? I can pull samples from different sources. Does it help with pitch correction?

Re: how many epochs do you run?
200 to 250

Re: have you checked the whisper transcription of the json, as well as the train.txt?
Yes. I am not sure that train.txt is actually anything more than an intermediate file for the .json.

Like I said originally in my comment: I feel that this program gets to about 80% accuracy, which is awesome. I just feel that, for voice cloning, the pitch/sound should be the easiest thing to match. "Frustrating" is probably the wrong word...


Hey!

Mmmmh, indeed, it's very weird that you get a different pitch but otherwise good results. In my experience, I had to use another batch of the same voice (one not trained upon) to generate the results, otherwise the pitch would indeed be different. Maybe something to dig into there?

About the validation dataset: it wouldn't hurt to try; at least you could see whether your model is behaving, overfitting, or something like that. Funnily enough, though, I had overall better results with overfitted models than with non-overfitted ones lol... Still can't explain it.


Hello, I'm following up on this issue because I'm still struggling with overfitting problems. Only, I've realized they occur not only with large datasets, but also with smaller ones (200 clips compared to 1,500, though I'm wondering what counts as a small or large dataset?).

At first, I thought the problem could be solved by dropping the learning rate from 1e-05 to 1e-06. At that point the curves (validation and training) looked better, in the sense that the validation curve went down and eventually stagnated instead of increasing. However, the training curve also tends to get stuck at around 1.5, and the results are not particularly better than the ones I get with clearly overfitting models (results that are surprisingly good for overfitted models...).

As I explained above, I also designed a script that automatically cooks up a validation batch from 20% of the training set (if people are interested, I'll gladly share the script). But that doesn't help. I also tried playing around with the learning rate schedulers: MultiStep, Cosine Annealing... There's no way to get a decent validation curve. It actually happened only once, and I have been completely incapable of reproducing it, even after recovering the seed. It was on a batch of about 1,000 clips, with the validation set being 20% of the training data.
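
For what it's worth, here is a standalone PyTorch sketch (nothing to do with the actual DLAS config keys, which I'd have to double-check) just to visualize how the MultiStep and Cosine Annealing schedules move the learning rate over 200 epochs:

```python
import torch

# Standalone PyTorch illustration of the two schedules mentioned above, NOT the
# actual DLAS/ai-voice-cloning config; it only prints the learning rate per epoch.
model = torch.nn.Linear(10, 10)  # dummy parameters, just so the optimizers have something to hold

opt_a = torch.optim.AdamW(model.parameters(), lr=1e-5)
multistep = torch.optim.lr_scheduler.MultiStepLR(opt_a, milestones=[50, 100, 150], gamma=0.5)

opt_b = torch.optim.AdamW(model.parameters(), lr=1e-5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt_b, T_max=200, eta_min=1e-7)

for epoch in range(200):
    opt_a.step()  # in a real loop, the backward pass and optimizer steps happen here
    opt_b.step()
    multistep.step()
    cosine.step()
    if epoch % 25 == 0:
        print(f"epoch {epoch:3d}  multistep lr={multistep.get_last_lr()[0]:.2e}  "
              f"cosine lr={cosine.get_last_lr()[0]:.2e}")
```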

Currently, I'm thinking about cross-validation to test whether the validation data is the problem or not, but it's quite puzzling that every dataset, at LR 1e-05, no matter its size, overfits almost immediately. Maybe 1,000 or even 2,000 clips is nowhere near enough yet? Maybe I should try to reach 5,000 or 10,000? I have indeed read that one of the main reasons for overfitting is not having enough data.
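
If anyone wants to try the same thing, a minimal sketch of how I'd cut the k folds; the file names are placeholders, and each train_foldN.txt / validation_foldN.txt pair would then be fine-tuned and validated as a separate run:

```python
import random

# Minimal k-fold sketch over the clip list, to check whether one particular
# validation split is the culprit; file names here are placeholders.
random.seed(0)
K = 5

with open("train_full.txt", encoding="utf-8") as f:
    clips = [line.rstrip("\n") for line in f if line.strip()]
random.shuffle(clips)

folds = [clips[i::K] for i in range(K)]  # K roughly equal folds

for k in range(K):
    val = folds[k]
    train = [clip for i, fold in enumerate(folds) if i != k for clip in fold]
    with open(f"train_fold{k}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(train) + "\n")
    with open(f"validation_fold{k}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(val) + "\n")
    print(f"fold {k}: {len(train)} train / {len(val)} validation clips")
```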

If anyone cares to share their experience or thoughts on the subject, I'd be grateful!
