Training never produces usable voices #307

Open
opened 2023-07-11 20:26:35 +00:00 by ryokoseigo · 1 comment

I feel like I'm doing something wrong, but every tutorial seems to gloss over major aspects of how to do the training in its entirety.
I downloaded someone else's voices off HF and they work, so I am clearly not doing something wrong on that end, like using bad settings. I have tried pretty much all combos of settings. But my results are highly metallic, and often you can't even tell what some of the words are.

I purposefully grabbed a copy of audio from https://www.sounds-resource.com and it is clearly high quality. The split sound clips also all appear good. In this instance, I had over 200 clips and used 200 epochs.

Now the part that's confusing me. Everyone seems to be using crazy high batch sizes, but anything above about 6 runs like garbage and isn't worth doing. I'm unsure how people are doing 64/128 batch sizes. I am using a 3080 Ti, so 12GB VRAM, but with 4-6 being my max, I don't get how anyone is going that high regardless. Is there something wrong on my end?

My only guess atm is that I'm horribly undertraining because of this discrepancy, but I dunno; the loss dropped to practically nothing, so you'd think it got it.

So the part that confuses me is, what do I actually do when I'm done? Do I change "Autoregressive Model" to my new model, and then the voice to my new voice? Do I do only one of these? Should I have audio in the /marie folder? Should it be all the audio, or some of it?

Owner

> Everyone seems to be using crazy high batch sizes, but anything above about 6 runs like garbage and isn't worth doing. I'm unsure how people are doing 64/128 batch sizes.

You can increase your batch size if you also increase the gradient accumulation amount (up to half your batch size, because of how the training script was written). That's probably what everyone else is doing. So a batch size of 128 with a gradient accumulation factor of 64 gives an effective batch size of 128, but processes it in micro-batches of 2.
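
To make that concrete, here's a minimal PyTorch sketch of the idea (the model, data, and dimensions are placeholders for illustration, not anything from this repo's actual training script):

```python
import torch

model = torch.nn.Linear(64, 1)  # stand-in for the real autoregressive model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

effective_batch = 128                        # the "batch size" you configure
grad_accum = 64                              # gradient accumulation factor
micro_batch = effective_batch // grad_accum  # only 2 samples on the GPU at once

for step in range(100):
    optimizer.zero_grad()
    for _ in range(grad_accum):
        x = torch.randn(micro_batch, 64)     # dummy inputs
        y = torch.randn(micro_batch, 1)      # dummy targets
        loss = torch.nn.functional.mse_loss(model(x), y)
        # scale so the accumulated gradient matches one big 128-sample batch
        (loss / grad_accum).backward()
    optimizer.step()                         # one weight update per 128 samples
```

VRAM usage is driven by the micro-batch that's actually resident on the GPU, which is why a 12GB card can still hit an effective batch size of 128.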

There's some conflicting wisdom I've read, about what's best:

  • have a high raw batch size for the truest, most accurate training
  • supplement it with gradient accumulation, which gets you close to the above (and fewer gradient updates also makes training "faster" over time)
  • don't bother with gradient accumulation at all; a noisy gradient norm helps "jostle" out any weights stuck in local minima

But it's effectively minmaxing something that, for the purpose of finetuning, isn't that substantial. So it's not necessary.

> My only guess atm is that I'm horribly undertraining because of this discrepancy, but I dunno; the loss dropped to practically nothing, so you'd think it got it.

Mmm. It's been a long while since I looked at finetuning graphs, but it sounds fine. The only thing I can think of is that it's actually overtrained, which you can check by loading an older checkpoint instead.
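
If you want to try that, checkpoints are just files on disk; here's a quick sketch for listing them so you can pick an earlier one (the `./training/marie/finetune/models` path and the `*_gpt.pth` naming are assumptions about a typical setup, so adjust to whatever your run actually writes):

```python
from pathlib import Path

# assumed checkpoint directory; adjust to your actual training output folder
ckpt_dir = Path("./training/marie/finetune/models")

# assumed naming scheme: checkpoints prefixed by step count, e.g. 500_gpt.pth
checkpoints = sorted(ckpt_dir.glob("*_gpt.pth"),
                     key=lambda p: int(p.stem.split("_")[0]))
for ckpt in checkpoints:
    print(ckpt)  # point the UI at one from before the loss flatlined
```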

> But my results are highly metallic, and often you can't even tell what some of the words are.

I vaguely remember a past issue mentioning this being a problem. I think the first time it was a model trained on bad slices, then it was bad conditioning latents, and then it cropped up again. But you mentioned your slices are fine, and the conditioning latents problem has long been solved.

> Do I change "Autoregressive Model" to my new model, and then the voice to my new voice? Do I do only one of these?

Either set the `Autoregressive Model` to `auto`, or, to err on the cautious side, set it to the finetuned model directly, and then select Marie as the voice. If you use a voice the finetune was not trained on, it'll sound terrible at worst, and at best *maybe* transfer a lot of that voice's traits to the voice you're using.

> Should I have audio in the /marie folder? Should it be all the audio, or some of it?

It Depends™. I don't remember having to do anything specific after the finetunes I've done.

If I remember right, if you still have your voice under `./training/`, then the scripts should pull from there instead and do some magic to calculate the best conditioning latents (if you set the voice latents chunk size to 0, which I think it defaults to when everything is set up right).
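
If you'd rather compute the latents by hand to rule them out, upstream TorToiSe exposes this directly. A rough sketch assuming a plain tortoise-tts install (the `./voices/marie` folder and output filename are placeholders, and this repo's wrapper may do extra processing on top of this):

```python
from pathlib import Path

import torch
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()

# load every clip in the (assumed) voice folder at TorToiSe's expected 22.05 kHz
clips = [load_audio(str(p), 22050) for p in Path("./voices/marie").glob("*.wav")]

# distill the clips into the conditioning latents the models consume
conditioning_latents = tts.get_conditioning_latents(clips)
torch.save(conditioning_latents, "./voices/marie/cond_latents.pth")  # illustrative filename
```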

In theory, it should be "better" to only have a reference clip that's as close as possible to what you want the target to sound like, but it shouldn't be necessary.


That should be all I can think of at the moment.
