Question: What is the meaning of the blue and orange lines in training? #82
Reference: mrq/ai-voice-cloning#82
What are we looking at in the blue and orange lines in the training graph? What does a good graph look like, and what does a bad one look like?
Per the wiki:
I don't have a better answer than that to give you at the current moment in my fleeting free time.
Alright, now that I'm in a slightly better headspace, I can try to explain what the loss curves mean, but first a brief crash course on what the model does (to my understanding):
The autoregressive model predicts tokens as a
<speech conditioning>:<text tokens>:<MEL tokens>
string.

Now back to the scope of answering your question. Each curve is responsible for quantifying how accurate the model is:

- `text` loss quantifies how well the predicted text tokens match the source text. This doesn't necessarily need to have too low of a loss; in fact, trainings where it falls below the `mel` loss turn out unusable.
- `mel` loss quantifies how well the predicted speech tokens match the source audio. This definitely seems to benefit from low loss rates.
- `total` loss is a bit irrelevant, and I should probably hide it, since it almost always follows the `mel` loss due to how the `text` loss gets weighed.

There are also the validation versions of the text and mel losses, which quantify the de facto similarity of the generated output to the source output, as the validation dataset serves as outside data (as if you were generating normally). If there's a large deviation between the reported losses and the validation losses, then your model has probably started to overfit on the source material.
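To make the relationship between the curves concrete, here's a minimal sketch of how the losses combine and how an overfit signal shows up. The weights are illustrative values I picked for the example, not the project's actual training config:

```python
# Hypothetical loss weights for illustration only: the text loss is weighed
# far lighter than the mel loss, which is why total ~= mel in the graphs.
TEXT_LOSS_WEIGHT = 0.01
MEL_LOSS_WEIGHT = 1.0

def total_loss(text_loss: float, mel_loss: float) -> float:
    """Weighted sum of the two curves; dominated by the mel term."""
    return TEXT_LOSS_WEIGHT * text_loss + MEL_LOSS_WEIGHT * mel_loss

def looks_overfit(reported_loss: float, validation_loss: float,
                  tolerance: float = 0.1) -> bool:
    """Flag a large deviation between the reported (training) loss and the
    validation loss, the overfitting symptom described above."""
    return (validation_loss - reported_loss) > tolerance
```

With a small text weight, a text loss of 2.0 and a mel loss of 1.0 give a total of about 1.02, i.e. the total curve tracks the mel curve almost exactly.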
Below all of that is the learning rate graph, which shows what the current learning rate is at. It's not a huge indicator of how training is going, as the learning rate curve is deterministic: it follows the schedule you configured, regardless of how the model is doing.
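For example, the learning rate at any step can be computed purely from the schedule. This sketch assumes a MultiStepLR-style decay (milestones and decay factor are made-up values; your config's scheduler may differ):

```python
def lr_at(step: int, base_lr: float = 1e-4,
          milestones: tuple = (9, 18, 25), gamma: float = 0.5) -> float:
    """Learning rate for a MultiStepLR-style schedule: the rate is cut by
    `gamma` at each milestone, independent of any loss value."""
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * (gamma ** drops)
```

Since the curve depends only on the step count, it tells you where you are in the schedule, not how well the model is learning.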
Here's what a decent graph looks like for a small dataset. Here, you can see that it's probably at its "best" around epoch 20 (epoch, as my batch size = dataset size here), as after that point the de facto (validation) loss goes higher than the reported loss.
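The "pick the epoch before validation overtakes training" heuristic above can be sketched as follows. This is a rough illustration of the reading of the graph, not logic from the project itself:

```python
def best_epoch(reported_losses, validation_losses):
    """Return the last epoch at which the validation loss is still at or
    below the reported (training) loss, per the heuristic described above."""
    best = 0
    for epoch, (reported, validation) in enumerate(
            zip(reported_losses, validation_losses)):
        if validation <= reported:
            best = epoch
    return best
```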
Good explanation. You should probably drop that in the wiki somewhere.