Question: What is the meaning of the blue and orange lines in training? #82

Closed
opened 2023-03-07 13:25:18 +00:00 by st33lmouse · 3 comments

What are we looking at in the blue and orange lines in the training graph? What does a good graph look like, and what's a bad one look like?

Owner

Per the [wiki](https://git.ecker.tech/mrq/ai-voice-cloning/wiki/Training#run-training):

> This will update the graph below with the current loss rate. This is useful to see how "ready" your model/finetune is. However, there doesn't seem to be a "one-size-fits-all" value for the loss rate you should aim at. I've had some finetunes benefit a ton more from sub-0.01 loss rates, while others absolutely fried after 0.5 (although it depends entirely on how low your learning rate is, rather than haphazardly quick-training it).

I don't have a better answer than that to give you at the current moment in my fleeting free time.

mrq closed this issue 2023-03-07 19:46:22 +00:00
Owner

Alright, now that I'm in a slightly better headspace, I can try to explain what the loss curves mean, with a brief crash course on what the model does (to my understanding):

The autoregressive model predicts tokens as a `<speech conditioning>:<text tokens>:<MEL tokens>` string (see the sketch after this list), where:

  • speech conditioning is a vector representing a voice's latents
  • text tokens (I believe) represent phonemes, which can be compared against the CLVP for "most likely candidates"
  • MEL tokens represent the actual speech, and are later converted to a waveform
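As a rough illustration of that layout, here's a minimal sketch of how the three segments could be embedded and concatenated into one sequence. All names and dimensions here are assumptions for illustration, not the actual tortoise-tts internals:

```python
import torch
import torch.nn as nn

class ARSequenceBuilder(nn.Module):
    """Embeds each segment, then concatenates along the time axis so the
    transformer sees <conditioning>:<text>:<mel> as one sequence."""

    def __init__(self, n_text_tokens=256, n_mel_tokens=8194, dim=1024):
        super().__init__()
        self.text_emb = nn.Embedding(n_text_tokens, dim)
        self.mel_emb = nn.Embedding(n_mel_tokens, dim)

    def forward(self, cond_latents, text_ids, mel_ids):
        # cond_latents: (B, C, dim) precomputed voice latents
        # text_ids:     (B, T_text) integer text-token IDs
        # mel_ids:      (B, T_mel)  integer MEL-token IDs
        text = self.text_emb(text_ids)   # (B, T_text, dim)
        mel = self.mel_emb(mel_ids)      # (B, T_mel, dim)
        return torch.cat([cond_latents, text, mel], dim=1)
```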

Now, back to the scope of your question: each curve quantifies how accurate the model is.

  • the `text` loss quantifies how well the predicted text tokens match the source text. This doesn't necessarily need to be very low; in fact, training runs where it drops below the mel loss turn out unusable.
  • the `mel` loss quantifies how well the predicted speech tokens match the source audio. This definitely seems to benefit from low loss rates.
  • the `total` loss is a bit irrelevant, and I should probably hide it, since it almost always follows the `mel` loss due to how the `text` loss gets weighed (sketched below).
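To show why the total curve shadows the mel curve, here's a rough sketch of the weighted sum of the two cross-entropy losses. The 0.01/1.0 weights are placeholder assumptions, not the values the repo actually uses:

```python
import torch.nn.functional as F

# Placeholder weights for illustration; check the training YAML for the
# values actually used.
TEXT_LOSS_WEIGHT = 0.01
MEL_LOSS_WEIGHT = 1.0

def combined_loss(text_logits, text_targets, mel_logits, mel_targets):
    # Cross-entropy per token stream; logits are (B, vocab, T), targets (B, T).
    text_loss = F.cross_entropy(text_logits, text_targets)
    mel_loss = F.cross_entropy(mel_logits, mel_targets)
    # With a tiny text weight, the total curve just shadows the mel curve.
    return TEXT_LOSS_WEIGHT * text_loss + MEL_LOSS_WEIGHT * mel_loss
```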

There are also validation versions of the text and mel losses, which quantify the de facto similarity between the generated output and the source, since the validation dataset serves as outside data (as if you were generating something normally). If there's a large deviation between the reported losses and the validation losses, then your model has probably started to overfit on the source material.
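As a heuristic, that deviation check could look something like the sketch below; the tolerance value is an arbitrary assumption, and eyeballing the two curves works just as well:

```python
def looks_overfit(train_loss: float, val_loss: float,
                  tolerance: float = 0.1) -> bool:
    """Flag overfitting once the validation loss exceeds the reported
    training loss by more than `tolerance`. The threshold is a guess."""
    return (val_loss - train_loss) > tolerance
```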

Below all of that is the learning rate graph, which shows what the current learning rate is. It's not a strong indicator of how training is going, since the learning rate curve is deterministic: it just follows the schedule you configured.
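For example, with a stepped schedule the whole curve is known before training even starts. This is a generic PyTorch sketch with placeholder milestones and decay, not the repo's exact configuration:

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder milestones/gamma; the point is the curve is fixed up front.
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.AdamW(params, lr=1e-5)
sched = MultiStepLR(opt, milestones=[9, 18, 25, 33], gamma=0.5)

for step in range(40):
    opt.step()
    sched.step()
    # sched.get_last_lr() here reproduces the LR graph point by point
```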

![image](/attachments/79ebdede-ae03-44e3-bf1b-b5f60a0cf6ed)

Here's what a decent graph looks like for a small dataset. Here, you can see that it's probably at its "best" around epoch 20 (epoch, since my batch size = dataset size here), as past that point the de facto (validation) loss goes higher than the reported loss.
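For the record, the epoch/iteration equivalence above is just arithmetic, and picking the "best" point amounts to taking the minimum of the validation curve. The loss values below are made up purely for illustration:

```python
# Since batch size == dataset size above, one iteration == one epoch:
#   epochs = iterations * batch_size / dataset_size
iterations, batch_size, dataset_size = 20, 64, 64
epochs = iterations * batch_size // dataset_size  # -> 20

# The "best" checkpoint is the epoch with the lowest validation
# (de facto) loss, before it climbs back above the training loss.
val_losses = {5: 0.9, 10: 0.6, 15: 0.45, 20: 0.40, 25: 0.48}  # made-up values
best_epoch = min(val_losses, key=val_losses.get)  # -> 20
```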

Author

Good explanation. You should probably drop that in the wiki somewhere.
