Are YouTube rips entirely unusable for finetuning? #359
Hello
I want to clone a voice from a YouTube commentary video that has no background music, but I ran the audio through UVR to extract the vocals anyway. I first tried training on 30 minutes of one clip, which turned out crap, then on 50 minutes of another clip, which turned out even worse. The transcription is perfect as far as I've seen (though each line starts with a space token; not sure if that's an issue). I cut the audio into several clips by hand, which were then segmented into 3-10 second clips when preparing the dataset. Listening to a few of the short clips, they seem to be cut mostly cleanly (some cut off mid-word, though) and match the transcript.
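(For reference, the automatic segmentation behaves roughly like the sketch below; a minimal silence-splitting approximation using pydub. The actual prepare-dataset step presumably slices on the Whisper timestamps instead, so this is just an illustration of the 3-10 second constraint.)

```python
# Minimal sketch: silence-split a long recording into 3-10 second clips.
# This only approximates the prepare-dataset step; the real pipeline
# presumably slices on Whisper segment timestamps instead.
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

MIN_MS, MAX_MS = 3_000, 10_000  # target clip length: 3-10 seconds

audio = AudioSegment.from_file("source.wav")
raw_chunks = split_on_silence(
    audio,
    min_silence_len=400,             # ms of silence that counts as a break
    silence_thresh=audio.dBFS - 16,  # threshold relative to overall loudness
    keep_silence=150,                # pad edges so words aren't clipped mid-phoneme
)

# Merge short chunks and hard-split long ones so clips land in [3 s, 10 s].
clips, buf = [], AudioSegment.empty()
for chunk in raw_chunks:
    buf += chunk
    while len(buf) >= MAX_MS:        # len() is in milliseconds
        clips.append(buf[:MAX_MS])
        buf = buf[MAX_MS:]
    if len(buf) >= MIN_MS:
        clips.append(buf)
        buf = AudioSegment.empty()
# (any trailing audio shorter than 3 s is dropped)

os.makedirs("clips", exist_ok=True)
for i, clip in enumerate(clips):
    clip.export(f"clips/{i:04d}.wav", format="wav")
```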
After 2300 steps of training it's worse than the base AR model with the provided 20 seconds of audio, which sounds alright but not ideal. Even at <0.1 loss the finetune still sounds like he's speaking out of a monkey ass and ending each line with throes of pain. Observe: https://vocaroo.com/1j86OcocBiFY
And a random automatically segmented sample: https://vocaroo.com/10IKJnyVFhGP
Is the dataset too big? Am I supposed to segment the clips differently?
I use YouTube all the time to train models. IMO, if there's no background music, don't bother with UVR.
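For what it's worth, here's roughly how I rip the audio (a sketch using yt-dlp's Python API; the URL is a placeholder):

```python
# Rip a video's audio track and have ffmpeg convert it to WAV via yt-dlp.
from yt_dlp import YoutubeDL

opts = {
    "format": "bestaudio/best",
    "outtmpl": "source.%(ext)s",
    "postprocessors": [{
        "key": "FFmpegExtractAudio",  # hand the download to ffmpeg
        "preferredcodec": "wav",
    }],
}
with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=XXXXXXXXXXX"])  # placeholder
```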
My gut says: the LR is too high and/or it just needs more training. Try turning the LR down and training for longer.
I was going to mention that your latents might be funny, but these days that's not an issue, as the latents are automatically loaded and tied to the hash of the model you're using, so you'd have to try very hard to make that a problem.
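(Illustratively, the idea is something like the sketch below; this is not the repo's actual code, and the file naming is hypothetical. Latents get stored under a key derived from the checkpoint's hash, so latents computed against one model can't silently be reused with another.)

```python
# Illustrative sketch (not the repo's actual code): key computed conditioning
# latents to a hash of the model checkpoint, so stale latents never get reused.
import hashlib
import os

def model_hash(path: str, block: int = 1 << 20) -> str:
    """Hash the checkpoint file in chunks to avoid loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(block):
            h.update(chunk)
    return h.hexdigest()[:16]

def latents_path(model_path: str, voice: str) -> str:
    # e.g. voices/myvoice/cond_latents_ab12cd34ef56aa00.pth (naming is hypothetical)
    return os.path.join("voices", voice, f"cond_latents_{model_hash(model_path)}.pth")
```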
I trained for 50 epochs; I just couldn't load the entire graph. It feels like it gets worse the more it trains, but I'll try training for longer, or turning down the LR as you said. With the default LR/scheduler, though, the loss drops fast in the first 10 epochs and then never looks like it will decrease below 1.
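In case it helps anyone later, lowering the LR would look something like this in plain PyTorch (a sketch only; the repo actually trains through its generated config files, I believe, and the model, data, and milestone values here are stand-ins):

```python
# Illustrative sketch in plain PyTorch (model/data/loss are stand-ins):
# start from a lower LR and decay it on a fixed epoch schedule.
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the AR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lower starting LR
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 35], gamma=0.5  # halve the LR at epochs 20 and 35
)

for epoch in range(50):
    x = torch.randn(8, 4)          # dummy batch standing in for real data
    loss = model(x).pow(2).mean()  # dummy loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()               # decay the LR on the epoch schedule
```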