Training Error? Encountered 10 NaN losses in a row? #97

Closed
opened 2023-03-08 23:56:35 +07:00 by maki6003 · 6 comments

I was training once with the same dataset and managed to get to 150 epochs before I ran out of space because of all the saved models. I ran it again with the model-saving settings adjusted, but now when I'm close to 40-50 epochs it starts to break and gives this error:

[Training] [2023-03-08T23:49:22.882899] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.882941] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.882985] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883014] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883049] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883084] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883119] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883175] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883216] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883288] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883343] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883388] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883426] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883460] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883494] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883529] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883571] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883597] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883632] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883666] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883700] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883733] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883773] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883814] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883843] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883877] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.883912] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.883947] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.883989] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.884015] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:22.884134]
[Training] [2023-03-08T23:49:23.710789] 0%| | 0/1 [00:00<?, ?it/s]
[Training] [2023-03-08T23:49:23.710912] 0%| | 0/1 [00:00<?, ?it/s]
[Training] [2023-03-08T23:49:23.711135] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:23.711179] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:23.711216] Non-finite loss encountered. Skipping backwards step.
[Training] [2023-03-08T23:49:23.711254] Encountered 10 NaN losses in a row. Something is screwed up. Dumping model weights and exiting.
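For context, what this log reflects is a trainer-side guard: when a loss comes back NaN/Inf it skips the optimizer step, and after a streak of them it dumps weights and aborts. Below is a minimal sketch of that kind of guard, for illustration only; it is not the actual DLAS/ai-voice-cloning trainer code, and the model, losses, threshold, and dump path are stand-ins.

```python
# Illustrative sketch of a non-finite-loss guard (not the repo's real trainer).
import torch
import torch.nn as nn

MAX_CONSECUTIVE_NAN = 10              # assumed threshold matching the log message
model = nn.Linear(4, 1)               # stand-in for the real autoregressive model
optimizer = torch.optim.AdamW(model.parameters(), lr=9e-5)

nan_streak = 0
for step in range(100):
    x = torch.randn(8, 4)
    losses = {                         # stand-ins for the real text_ce / mel_ce losses
        "text_ce": (model(x) ** 2).mean(),
        "mel_ce": model(x).abs().mean(),
    }
    total = sum(losses.values())

    if not torch.isfinite(total):
        for name, value in losses.items():
            if not torch.isfinite(value):
                print(f"!!Detected non-finite loss {name}")
        print("Non-finite loss encountered. Skipping backwards step.")
        nan_streak += 1
        if nan_streak >= MAX_CONSECUTIVE_NAN:
            torch.save(model.state_dict(), "nan_dump.pth")
            raise RuntimeError("Encountered 10 NaN losses in a row. "
                               "Dumping model weights and exiting.")
        continue                       # no backward/step for this batch

    nan_streak = 0                     # streak only counts consecutive failures
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
```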

Sadly I didn't save the epoch-150 model, so I never got to know how good or bad it was.

But now on the re-runs:

tbh the finetuned model from epoch 32 wasn't even that bad; not perfect, but decent... I'm just not sure why I keep getting this error every time.

Can't help with your issue but... you were getting good results with only 32 epochs?

I'm currently on... 5990 epochs and it's still pretty much just a garbled mess.

Honestly it sounded decent, tbh... not perfect, and it was only a 5-minute recording of me reading random stuff. I managed to train again and it hit epoch 50 before breaking, but 50 sounded worse than 32.

mrq added the insufficient info label 2023-03-09 19:13:22 +07:00

I'm not sure what extra info I need to add? My dataset is literally 5 minutes long, and it creates about 70 audio and text entries after validation. I train it to 500 epochs, saving every 50, at a learning rate of 0.00009. If I'm lucky and it hits 50 I can get one model, but usually somewhere around the 30-50+ mark it starts giving out

" [Training] [2023-03-08T23:49:22.882899] !!Detected non-finite loss text_ce
[Training] [2023-03-08T23:49:22.882941] !!Detected non-finite loss mel_ce
[Training] [2023-03-08T23:49:22.882985] Non-finite loss encountered. Skipping backwards step. "

and it keeps happening until it has happened 10 times in a row and training stops:

[Training] [2023-03-08T23:49:23.711254] Encountered 10 NaN losses in a row. Something is screwed up. Dumping model weights and exiting.

The first time I trained on this dataset I managed to get it to 150 epochs before it broke from running out of disk space, but now I don't even know.

> it creates about 70 audio and text entries after validation. I train it to 500 epochs, saving every 50, at a learning rate of 0.00009

Thank you. The graphs would be helpful too, but oh well; it's somewhat irrelevant given there was a problem with training for tortoise that was remedied last night.

I've had similar datasets fail after prolonged low learning rates, so I'd imagine it's just a matter of the scheduler decaying the learning rate to an even smaller value for a very, very long time.

I suggest updating and replicating my quick-train settings from #103, as it literally Worked On My Machine(tm).
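If you want to see whether that's what's happening, printing the learning rate the scheduler has decayed to is enough. A small sketch, assuming a MultiStepLR-style step decay; the milestones and gamma here are made up for illustration and are not necessarily what the repo configures:

```python
# Inspect how far a step-decay schedule pushes the LR over a long run.
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=9e-5)   # the LR from the report above
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[9, 18, 25, 33, 50], gamma=0.5
)

for epoch in range(60):
    optimizer.step()                              # dummy step; no real training here
    scheduler.step()
    if epoch % 10 == 0:
        print(f"epoch {epoch:3d}: lr = {optimizer.param_groups[0]['lr']:.2e}")
# By epoch 50 the LR has dropped to 9e-5 * 0.5**5 ≈ 2.8e-6.
```

At very small effective learning rates, especially under fp16, updates can underflow, which is one plausible route to instabilities like the NaN losses above.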

I'll try to re-train, but now, due to the latest notebook updates, I can't run the Google Colab version on Paperspace to do my ffmpeg work-around... I don't know why it won't install ffmpeg with the Paperspace notebook, but it did with the Google Colab notebook on Paperspace... makes no sense.

Is there anything I can share with you to figure out why it won't install / find it for preparing the dataset? Because everything else works.

Update: never mind, fixed it.
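For anyone else hitting the missing-ffmpeg problem in a notebook environment, a quick check like the one below (a hypothetical helper, not part of the repo) at least confirms whether the binary is visible on PATH before dataset preparation runs:

```python
# Check whether ffmpeg is reachable from the notebook's Python environment.
import shutil
import subprocess

path = shutil.which("ffmpeg")
if path is None:
    print("ffmpeg not found on PATH; try e.g. `apt-get install -y ffmpeg` "
          "or `conda install -c conda-forge ffmpeg` in the notebook environment.")
else:
    print(f"ffmpeg found at {path}")
    version = subprocess.run([path, "-version"], capture_output=True, text=True)
    print(version.stdout.splitlines()[0])
```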

mrq closed this issue 2023-03-13 17:38:56 +07:00