! RETRAIN YOUR MODELS ! #103
Reference: mrq/ai-voice-cloning#103
It seems I've made a grave mistake in not looking at the other DLAS repo, as it contained a small tweak that helps finetunes that otherwise end up sounding like total trash.
The improvement from implementing it is big enough that I have to bring attention to it somehow, although I don't have a good way to go about it.
If you've also been affected by models sounding like garbage (I'm not sure if there's a criterion for which voices cause it; it seemed more likely to happen with non-male voices), please, please, please retrain your finetunes after updating.
If you already finetuned with that repo, you're golden and don't need to retrain.
For smaller datasets (sub-100), I would suggest:
[9, 18, 25, 33, 50, 59]
to quickly train something with decent output.
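If that list is used as step-decay epoch milestones (as DLAS-style `gen_lr_steps` schedules are), the effective learning rate at any epoch can be sketched like this. This is a hedged illustration, not the repo's actual code; the decay factor `gamma=0.5` is my assumption.

```python
def lr_at_epoch(base_lr: float, epoch: int,
                milestones=(9, 18, 25, 33, 50, 59), gamma=0.5) -> float:
    # Step decay: multiply the base LR by gamma once for every
    # milestone epoch that has already been passed.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** passed)

print(lr_at_epoch(1e-4, 0))   # 0.0001 (no milestones passed yet)
print(lr_at_epoch(1e-4, 20))  # 2.5e-05 (epochs 9 and 18 passed)
```

With six milestones, a run that goes past epoch 59 ends at base_lr / 64, which is why a short milestone list like this suits quick runs on small datasets.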
I am getting decent results at LR 0.00009 over 100-200 epochs on smaller datasets of, say, 10 to 30 minutes. All other settings are default, apart from validating the training settings.
Zapp Brannigan: https://vocaroo.com/1lT5i70dMj33 (roughly a 15-minute dataset)
Unless I'm mistaken, you did not implement it the same way; you set -sub where he set sub:
```python
return text_logits[:, :sub]
```
vs
```python
return text_logits[:, :-sub]
```
Is this intended?
I still don't really understand the tortoise_compat setting either.
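For what it's worth, the two slices above are not equivalent. A minimal sketch, with a plain list standing in for one row of `text_logits` along its second dimension:

```python
# A list stands in for one row of text_logits; sub is the number of
# positions the fix is meant to trim.
row = [10, 20, 30, 40, 50]
sub = 2

keep_first_sub = row[:sub]    # [:sub]  -> keeps only the FIRST `sub` items
drop_last_sub  = row[:-sub]   # [:-sub] -> drops the LAST `sub` items

print(keep_first_sub)  # [10, 20]
print(drop_last_sub)   # [10, 20, 30]
```

So one variant keeps a short prefix while the other trims a suffix; for any sequence longer than 2*sub they produce different results.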
If I did botch that, then I'm going to scream, but that's what consistently staying up until 2AM gets you. I'll check when I get a moment.
Anyways, the compat fix simply makes the unified_voice2 model more in line with its implementation in tortoise-tts, as it's something like 80% similar in code.
I just lazily applied his fix to it last night rather than deriving it myself.
So I did. I suppose I'll have to retrain what I've been training today, since I imagine that's a pretty big problem.
I don't know what I did to cause this, to be honest. I trained a model, switched to it in settings, and then in the generation tab selected the voice from the 5 minutes of audio I had. I clicked the recompute voice latents button, selected the standard preset, and hit generate. Now it's sitting on "Generating autoregressive samples" and is much slower than usual, so I don't know if I did something wrong with recomputing the voice latents.
Update: it generated no voice, nothing, after all that waiting. Not sure what went wrong.
Update on the update: it now generates a voice, though not very close to mine; 50 steps was closer than 100 steps, hmm. It's still slow at generating, and I don't know why.
I reverted my change to the routine that deduces sample batch sizes for generation (it seems to be haunted: it breaks whenever it gets touched), so you should be fine to update now.
A remedy is to manually set your sample batch size (which I heavily encourage you to do, as the default tiers are very conservative).
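Since the actual deduction routine isn't shown here, this is only a hypothetical sketch of what such a heuristic could look like (the function name and the divisor strategy are mine, not the repo's): pick the largest batch size under some cap that divides the sample count evenly, so no batch runs partially full.

```python
def deduce_sample_batch_size(num_samples: int, max_batch: int) -> int:
    # Largest divisor of num_samples that does not exceed max_batch,
    # so every autoregressive batch is exactly full.
    for size in range(min(max_batch, num_samples), 0, -1):
        if num_samples % size == 0:
            return size
    return 1

print(deduce_sample_batch_size(16, 6))   # 4
print(deduce_sample_batch_size(50, 16))  # 10
```

The point of manually overriding such a value is simply that a conservative tier (small `max_batch`) leaves VRAM idle and makes sampling slower than it needs to be.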
We could use a discussion tab for this git. A central place to compare notes and whatnot.
I noticed slowness, too, so I upped the batch size, and that helped. I've got lots of VRAM.
Are we sure that large datasets are the way to go for training? To capture a character with the Kohya SD script I use 16 images, that's it, and it works well. Very large datasets can actually cause trouble, not to mention slow your training.
I get good resemblance out of a minute and a half of speech, which is what, 15 chunks?
Gitea doesn't have any feature like that. That's on me for using it over a GitLab instance, but oh well.
I've had great luck training against small datasets, sub-200 and even sub-100. I've just been having issues with a large dataset, since multi-GPU training is very particular when it comes to large datasets, so I've been training my Japanese dataset on a Paperspace A4000 instead. It's been training fine, but I haven't had a chance to test it.
What do you think is the best way to transcribe Japanese speech? Is there a Japanese Whisper model? Do you have to transcribe to katakana?
All three whisper implementations can transcribe Japanese; just set the Language field to ja (or leave it blank to auto-deduce). I wouldn't use the default openai/whisper implementation, for accuracy reasons: it trims the clips too liberally. WhisperX and WhisperCPP both work better than base Whisper.
I didn't do any editing desu, since it would be a pain to curate 15k lines for what would amount to maybe replacing a wrong kanji that sounds the same anyway. I wouldn't bother coercing them into bare kana, since the kanji should help train the text side of the AR model.
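The "blank means auto-deduce" behavior could be wired up like this small sketch; the function name is mine, not the repo's, and it only illustrates turning the UI field into keyword arguments for a transcribe call (Whisper implementations auto-detect language when none is given).

```python
def language_kwargs(language_field: str) -> dict:
    # Empty/whitespace field -> pass nothing, letting whisper auto-detect;
    # otherwise forward the ISO code (e.g. "ja") to transcribe(**kwargs).
    language = language_field.strip()
    return {"language": language} if language else {}

print(language_kwargs("ja"))  # {'language': 'ja'}
print(language_kwargs("  "))  # {}
```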
I'm waiting for a few more epochs of baking my Japanese finetune before testing it, although it looks pretty ready anyhow, as my reported loss is nearing the de facto target loss.
Not gonna lie, I don't really understand the graphs... what is good and what is bad, lol?
#82 (comment)
If this project is going to take off, and there are better features elsewhere, now is the perfect time to move. A discussion place would be nice, rather than holding these conversations in issues.