Are YouTube rips entirely unusable for finetuning? #359

Open
opened 2023-08-30 13:22:35 +00:00 by loathsomedungeater · 3 comments

Hello
I want to clone a voice from a YouTube commentary video that has no background music; I ran the audio through UVR to extract the vocals anyway. First I tried training on 30 minutes of one clip, which turned out crap, then on 50 minutes of another clip, which turned out even crappier. As far as I can tell it's transcribed perfectly (though each line starts with a space token, not sure if that's an issue). I cut it into several clips by hand, which were then segmented into 3-10 second clips when preparing the dataset. Listening to a few of the short clips, they seem to be cut mostly cleanly (some are cut off mid-word, though) and match the transcript.
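For reference, the kind of splitting I mean looks roughly like this (a pydub sketch of silence-based 3-10 second segmentation; the filename and thresholds are placeholders, and this isn't necessarily what the dataset-preparation step actually does):

```python
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Thresholds are placeholder guesses; tune silence_thresh to the recording.
MIN_MS, MAX_MS = 3_000, 10_000
os.makedirs("clips", exist_ok=True)

audio = AudioSegment.from_wav("commentary.wav")
chunks = split_on_silence(
    audio,
    min_silence_len=500,                 # a pause this long (ms) counts as a cut point
    silence_thresh=audio.dBFS - 16,      # relative to the clip's average loudness
    keep_silence=200,                    # keep a little padding so words aren't clipped
)

kept = 0
for chunk in chunks:
    if len(chunk) < MIN_MS:              # drop fragments that are too short
        continue
    for start in range(0, len(chunk), MAX_MS):
        piece = chunk[start:start + MAX_MS]   # hard-cut anything over 10 s
        if len(piece) >= MIN_MS:
            piece.export(f"clips/{kept:04d}.wav", format="wav")
            kept += 1
```

The hard cut on long chunks is probably where the mid-word clipping I mentioned comes from.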

After 2300 steps of training it's worse than using the base AR model with the provided 20 seconds of audio, which sounds alright but not ideal. Even at <0.1 loss the finetune still sounds like he's speaking out of a monkey's ass and ending each line with throes of pain. Observe: https://vocaroo.com/1j86OcocBiFY

And a random automatically segmented sample: https://vocaroo.com/10IKJnyVFhGP

Is the dataset too big? Am I supposed to segment the clips in another way?


I use YouTube all the time to train models. IMO, if there is no background music, don't bother with UVR.
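For what it's worth, pulling the audio straight off YouTube can be as simple as the sketch below (yt-dlp's Python API; the URL and output name are placeholders):

```python
from yt_dlp import YoutubeDL

# Placeholder URL and output name -- swap in the actual video.
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "commentary.%(ext)s",
    "postprocessors": [{
        "key": "FFmpegExtractAudio",   # needs ffmpeg on PATH
        "preferredcodec": "wav",
    }],
}

with YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=XXXXXXXXXXX"])
```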

Owner

My gut says:

  • the finetune is being trained too fast, as your initial LR is too high / your LR is not decaying fast enough.
  • the finetune is also not being trained long enough. 2300 steps / ~13 epochs is nowhere near enough.

I was going to mention that your latents might be funny, but these days it's not an issue as the latents are automatically loaded and tied to the hash of the model you're using, so you'll have to try very hard for that to be a problem.
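To make the LR point concrete, here's a rough PyTorch-style sketch of "start lower, decay sooner and harder"; the numbers are illustrative guesses, not the trainer's actual scheduler settings:

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# All numbers here are illustrative, not the trainer's real config.
steps_per_epoch = 2300 // 13          # ~177 steps/epoch implied by the report above
target_epochs   = 100                 # "long enough" is well past 13 epochs
total_steps     = steps_per_epoch * target_epochs

model = torch.nn.Linear(8, 8)                          # stand-in for the AR model
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)   # start low for a finetune

# Decay early and hard so the LR is tiny for most of the run.
sched = MultiStepLR(opt, milestones=[total_steps // 4, total_steps // 2], gamma=0.1)

for step in range(total_steps):
    x = torch.randn(16, 8)
    loss = torch.nn.functional.mse_loss(model(x), x)   # dummy loss for the sketch
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```

At the ~177 steps/epoch implied above, 2300 steps doesn't even reach the first decay milestone of a schedule like this, which is the "too fast / not long enough" combination in one picture.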



I did train for 50 epochs, I just couldn't load the entire graph. It's like it gets worse the more it trains, but I'll try training for longer too, or turn down the LR as you said. With the default LR/scheduler, though, the loss drops fast in the first 10 epochs and then never seems to decrease below 1.
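If the trainer lets me, on the next run I might try something that reacts to the plateau directly, i.e. cut the LR when the loss stalls. A generic PyTorch sketch of the idea (illustrative only, not the actual config here):

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Illustrative only -- the trainer exposes its own scheduler settings.
model = torch.nn.Linear(8, 8)                           # stand-in for the AR model
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
# Halve the LR whenever the loss hasn't improved for 3 epochs.
sched = ReduceLROnPlateau(opt, mode="min", factor=0.5, patience=3)

for epoch in range(50):
    x = torch.randn(16, 8)
    loss = torch.nn.functional.mse_loss(model(x), x)    # dummy loss for the sketch
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step(loss.item())   # feed the epoch loss so a stall triggers a decay
```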
