Training Error? Encountered 10 NaN losses in a row? #97

Closed
opened 2023-03-08 23:56:35 +07:00 by maki6003 · 6 comments

I was training once with the same dataset and managed to get to 150 epochs before I ran out of space because of all the saved models. I ran it again with the model-saving settings adjusted, but now when I'm close to 40-50 epochs it starts to break and gives this error:

[Training] [2023-03-08T23:49:22.882899] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.882941] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.882985] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883014] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883049] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883084] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883119] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883175] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883216] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883288] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883343] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883388] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883426] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883460] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883494] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883529] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883571] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883597] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883632] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883666] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883700] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883733] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883773] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883814] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883843] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883877] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883912] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883947] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883989] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.884015] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.884134]
[Training] [2023-03-08T23:49:23.710789] 0%| | 0/1 [00:00<?, ?it/s]
[Training] [2023-03-08T23:49:23.710912] 0%| | 0/1 [00:00<?, ?it/s]
[Training] [2023-03-08T23:49:23.711135] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:23.711179] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:23.711216] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:23.711254] Encountered 10 NaN losses in a row. Something is screwed up. Dumping model weights and exiting.
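For context, what this log reflects is a trainer-side guard: when a loss comes back NaN/Inf it skips the optimizer step, and after a streak of them it dumps weights and aborts. Below is a minimal sketch of that kind of guard, for illustration only; it is not the actual DLAS/ai-voice-cloning trainer code, and the model, losses, threshold, and dump path are stand-ins.

```python
# Illustrative sketch of a non-finite-loss guard (not the repo's real trainer).
import torch
import torch.nn as nn

MAX_CONSECUTIVE_NAN = 10              # assumed threshold matching the log message
model = nn.Linear(4, 1)               # stand-in for the real autoregressive model
optimizer = torch.optim.AdamW(model.parameters(), lr=9e-5)

nan_streak = 0
for step in range(100):
    x = torch.randn(8, 4)
    losses = {                         # stand-ins for the real text_ce / mel_ce losses
        "text_ce": (model(x) ** 2).mean(),
        "mel_ce": model(x).abs().mean(),
    }
    total = sum(losses.values())

    if not torch.isfinite(total):
        for name, value in losses.items():
            if not torch.isfinite(value):
                print(f"!!Detected non-finite loss {name}")
        print("Non-finite loss encountered. Skipping backwards step.")
        nan_streak += 1
        if nan_streak >= MAX_CONSECUTIVE_NAN:
            torch.save(model.state_dict(), "nan_dump.pth")
            raise RuntimeError("Encountered 10 NaN losses in a row. "
                               "Dumping model weights and exiting.")
        continue                       # no backward/step for this batch

    nan_streak = 0                     # streak only counts consecutive failures
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
```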

Sadly I didn't save the epoch-150 model, so I never got to know how good or bad it was.

But now on the re-runs:

tbh the finetuned model from epoch 32 wasn't even that bad; not perfect, but decent... I'm just not sure why I keep getting this error every time.

Can't help with your issue but... you were getting good results with only 32 epochs?

I'm currently on... 5990 epochs and it's still pretty much just a garbled mess.

Honestly it sounded decent, tbh... not perfect, and it was only a 5-minute recording of me reading random stuff. I managed to train again and it hit epoch 50 before breaking, but 50 sounded worse than 32.

mrq added the insufficient info label 2023-03-09 19:13:22 +07:00

I'm not sure what extra info I need to add? My dataset is literally 5 minutes long, and it creates about 70 audio and text entries after validation. I train it to 500 epochs, saving every 50, at a learning rate of 0.00009. If I'm lucky and it hits 50 I can get one model, but usually somewhere around the 30-50+ mark it starts giving out

" [Training] [2023-03-08T23:49:22.882899] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.882941] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.882985] Non-finite loss encountered. Skipping backwards step. "

and it keeps happening until it has happened 10 times in a row and training stops:

[Training] [2023-03-08T23:49:23.711254] Encountered 10 NaN losses in a row. Something is screwed up. Dumping model weights and exiting.

The first time I trained on this dataset I managed to get it to 150 epochs before it broke from running out of disk space, but now I don't even know.

> it creates about 70 audio and text entries after validation. I train it to 500 epochs, saving every 50, at a learning rate of 0.00009

Thank you. The graphs would be helpful too, but oh well; it's somewhat irrelevant given there was a problem with training for tortoise that was remedied last night.

I've had similar datasets fail after prolonged low learning rates, so I'd imagine it's just a matter of the scheduler decaying the learning rate to an even smaller value for a very, very long time.

I suggest updating and replicating my quick-train settings from #103, as it literally Worked On My Machine(tm).
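If you want to see whether that's what's happening, printing the learning rate the scheduler has decayed to is enough. A small sketch, assuming a MultiStepLR-style step decay; the milestones and gamma here are made up for illustration and are not necessarily what the repo configures:

```python
# Inspect how far a step-decay schedule pushes the LR over a long run.
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=9e-5)   # the LR from the report above
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[9, 18, 25, 33, 50], gamma=0.5
)

for epoch in range(60):
    optimizer.step()                              # dummy step; no real training here
    scheduler.step()
    if epoch % 10 == 0:
        print(f"epoch {epoch:3d}: lr = {optimizer.param_groups[0]['lr']:.2e}")
# By epoch 50 the LR has dropped to 9e-5 * 0.5**5 ≈ 2.8e-6.
```

At very small effective learning rates, especially under fp16, updates can underflow, which is one plausible route to instabilities like the NaN losses above.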

I'll try to re-train, but now, due to the latest notebook updates, I can't run the Google Colab version on Paperspace to do my ffmpeg work-around... I don't know why it won't install ffmpeg with the Paperspace notebook, but it did with the Google Colab notebook on Paperspace... makes no sense.

Is there anything I can share with you to figure out why it won't install / find it for preparing the dataset? Because everything else works.

Update: never mind, fixed it.
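For anyone else hitting the missing-ffmpeg problem in a notebook environment, a quick check like the one below (a hypothetical helper, not part of the repo) at least confirms whether the binary is visible on PATH before dataset preparation runs:

```python
# Check whether ffmpeg is reachable from the notebook's Python environment.
import shutil
import subprocess

path = shutil.which("ffmpeg")
if path is None:
    print("ffmpeg not found on PATH; try e.g. `apt-get install -y ffmpeg` "
          "or `conda install -c conda-forge ffmpeg` in the notebook environment.")
else:
    print(f"ffmpeg found at {path}")
    version = subprocess.run([path, "-version"], capture_output=True, text=True)
    print(version.stdout.splitlines()[0])
```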

mrq closed this issue 2023-03-13 17:38:56 +07:00