! RETRAIN YOUR MODELS ! #103

Closed
opened 2023-03-09 20:27:07 +00:00 by mrq · 13 comments
Owner

It seems I've made a grave mistake in not looking at [the other DLAS repo](https://github.com/152334H/DL-Art-School), as it contained a small tweak that fixes finetunes ending up sounding like total trash.

It's a big enough improvement from implementing [this](https://github.com/152334H/DL-Art-School/commit/ae80992817059acf6eef38a680efa5124cee570b) that I must bring attention to it somehow, although I don't have much of a good way to go about it.

If you've also been affected by models sounding like garbage (I'm not sure if there's a criterion for which voices cause it, though it seemed more likely to happen with non-male voices), please, please, *please*, retrain your finetunes after updating.

If you already finetuned with that repo, you're golden, and don't need to retrain.

I would suggest for smaller datasets (sub 100):

  • 100 epochs
  • LR 0.0001
  • MultiStepLR
  • schedule: [9, 18, 25, 33, 50, 59]

to quickly train something to decent output (a PyTorch sketch of this schedule follows).
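For reference, here's a minimal sketch of that schedule using PyTorch's `MultiStepLR`. The model and optimizer are stand-ins, and the decay factor `gamma=0.5` is an assumption, not necessarily what the training configs use:

```python
import torch

# Stand-ins for the actual AR model and optimizer.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # LR 0.0001

# Decay the LR at the listed epochs; gamma=0.5 is an assumed decay factor.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[9, 18, 25, 33, 50, 59], gamma=0.5
)

for epoch in range(100):  # 100 epochs
    # ... one epoch of training here ...
    scheduler.step()  # LR decays after epochs 9, 18, 25, 33, 50, and 59
```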

mrq added the news label 2023-03-09 20:27:07 +00:00

I am getting some decent results with LR 0.00009 over 100-200 epochs on smaller datasets of, say, 10 to 30 minutes. All other settings default, apart from validating the training settings.

Zapp Brannigan: https://vocaroo.com/1lT5i70dMj33 (roughly a 15 min dataset)


Unless I'm mistaken, you did not implement it the same way: you set `-sub` where he set `sub`.

`return text_logits[:, :sub]`
vs
`return text_logits[:, :-sub]`

Is this intended?
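To illustrate the difference (a toy example, not the project's actual code):

```python
import torch

# A fake (1, 10) logits tensor: positions 0 through 9 along the sequence axis.
text_logits = torch.arange(10).unsqueeze(0)
sub = 2

print(text_logits[:, :sub])   # tensor([[0, 1]]): keeps only the first `sub` positions
print(text_logits[:, :-sub])  # tensor([[0, 1, 2, 3, 4, 5, 6, 7]]): drops the last `sub`
```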

I still don't really understand that `tortoise_compat` setting either.

Author
Owner

If I did botch that then I'm going to scream, but that's what consistently staying up until 2AM gets you. I'll check when I get a moment.

Anyways, the compat fix simply makes the `unified_voice2` model more in line with its implementation in tortoise-tts, as it's, like, 80% similar in code.
I just lazily applied his fix to it last night rather than deriving it myself.

Author
Owner

So I did. I *suppose* I'll have to retrain what I've been training today, since I imagine that's a pretty big problem.


idk what I did to do this tbh... but I trained a model, switched to that model in settings, and then in the generate tab I selected the voice of the 5 min audio I had. I then clicked recompute voice latents, selected the standard preset, and hit generate... and now it's at "Generating autoregressive samples" and it's much slower than usual, so idk if I did something wrong with recomputing the voice latents or something...

Update: it generated no voice... nothing... after all the waiting time. Not sure what went wrong.

Update on the update: it now generates voice, though not very close to my voice; 50 steps was closer than 100 steps, hmm. It's still slow at generating though, idk why.

Author
Owner

> it's much slower than usual, so idk if I did something wrong with recomputing the voice latents or something...

I reverted my change to the routine that deduces sample batch sizes for generation (it seems to be haunted: it breaks whenever it gets touched), so you should be fine to update now.

A remedy for that is to manually set your sample batch size (which I heavily encourage you to do, as the default tiers are very conservative).
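To illustrate the tradeoff (a toy sketch; the function name and the 256-sample figure are assumptions loosely based on tortoise-tts's presets, not this project's exact internals):

```python
import math

def num_forward_passes(num_autoregressive_samples: int, sample_batch_size: int) -> int:
    """Forward passes needed to draw all candidate AR samples."""
    return math.ceil(num_autoregressive_samples / sample_batch_size)

# Assuming a preset that draws 256 candidate samples:
print(num_forward_passes(256, 16))  # 16 passes with a conservative batch size
print(num_forward_passes(256, 64))  # 4 passes, if your VRAM can fit the larger batch
```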


We could use a discussion tab for this git. A central place to compare notes and whatnot.

I noticed slowness too, so I upped the batch size, and that helped. I've got lots of VRAM.

Are we sure that large datasets are the way to go for training? To capture a character in the Kohya SD script I use 16 images, that's it, and it works well. Very large datasets can actually cause trouble, not to mention slow your training.

I get good resemblance out of a minute and a half of speech, which is what? 15 chunks?

Author
Owner

> We could use a discussion tab for this git.

Gitea doesn't have any feature like that. That's on me for using it over a GitLab instance, but oh well.

> Are we sure that large datasets are the way to go for training?

I've had great luck training against small datasets, sub-200 and even sub-100. I've just been having issues with a large dataset, since multi-GPU training is very particular when it comes to large datasets; instead, I've had my Japanese dataset training on a Paperspace A4000 and it's been training fine, but I haven't gotten a chance to test it.


> I've had my Japanese dataset training on a Paperspace A4000 and it's been training fine, but I haven't gotten a chance to test it.

What do you think is the best way to transcribe Japanese speech? Is there a Japanese Whisper model? Do you have to transcribe to katakana?

Author
Owner

> What do you think is the best way to transcribe Japanese speech? Is there a Japanese Whisper model? Do you have to transcribe to katakana?

All three Whisper implementations can transcribe Japanese; just set the Language field to `ja` (or leave it blank to auto-detect). I wouldn't use the default openai/whisper implementation for accuracy reasons: it'll trim the clips too liberally. WhisperX and WhisperCPP both work better than base Whisper.
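For reference, forcing the language looks like this with the base openai/whisper package (a minimal sketch; the model size and clip path are placeholders, and WhisperX exposes a similar `language` option):

```python
import whisper

model = whisper.load_model("large-v2")  # model size is a placeholder choice
# language="ja" forces Japanese; omit it to let Whisper auto-detect.
result = model.transcribe("clip.wav", language="ja")
print(result["text"])
```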

I didn't do any editing desu, since it would be a pain to curate 15k lines for what would amount to *maybe* replacing a wrong kanji that sounds the same anyway. I wouldn't bother coercing them into bare kana, since the kanji should help train the text side of the AR model.

I'm letting my Japanese finetune bake for a few more epochs before testing it, although it looks pretty ready anyhow, as my reported loss is nearing the de facto loss.

![image](/attachments/dc47abba-bde8-4fe9-88d3-90e0cb2bd846)


Not gonna lie, I don't really understand the graphs... what is good and what is bad lol?

Author
Owner

> Not gonna lie, I don't really understand the graphs... what is good and what is bad lol?

[#82 (comment)](https://git.ecker.tech/mrq/ai-voice-cloning/issues/82#issuecomment-772)

Contributor

> > We could use a discussion tab for this git.
>
> Gitea doesn't have any feature like that. That's on me for using it over a GitLab instance, but oh well.

If this project is gonna take off, and there are better features elsewhere, now is the perfect time to move. A discussion place would be nice, rather than having those conversations in issues.

mrq closed this issue 2023-03-13 17:39:13 +00:00