Training randomly crashes after 358 epochs #163

Closed
opened 2023-03-21 11:25:59 +00:00 by mafiosnik · 3 comments

[Training] [2023-03-21T12:13:35.957141] 23-03-21 12:13:35.957 - INFO: Training Metrics: {"loss_text_ce": 4.026022911071777, "loss_mel_ce": 2.254241704940796, "loss_gpt_total": 2.294501781463623, "lr": 7.8125e-08, "it": 359, "step": 1, "steps": 1, "epoch": 358, "iteration_rate": 8.698551654815674}
[Training] [2023-03-21T12:13:47.880700] Using BitsAndBytes optimizations
[Training] [2023-03-21T12:13:47.880700] Disabled distributed training.
[Training] [2023-03-21T12:13:47.880700] Loading from ./models/tortoise/dvae.pth
[Training] [2023-03-21T12:13:47.880700] Traceback (most recent call last):
[Training] [2023-03-21T12:13:47.880700] File "C:\Users\Mafio\Documents\ai-voice-cloning\venv\lib\site-packages\torch\serialization.py", line 441, in save
[Training] [2023-03-21T12:13:47.883702] _save(obj, opened_zipfile, pickle_module, pickle_protocol)
[Training] [2023-03-21T12:13:47.883702] File "C:\Users\Mafio\Documents\ai-voice-cloning\venv\lib\site-packages\torch\serialization.py", line 668, in _save
[Training] [2023-03-21T12:13:47.883702] zip_file.write_record(name, storage.data_ptr(), num_bytes)
[Training] [2023-03-21T12:13:47.883702] RuntimeError: [enforce fail at ..\caffe2\serialize\inline_container.cc:476] . PytorchStreamWriter failed writing file data/239: file write failed
[Training] [2023-03-21T12:13:47.883702]
[Training] [2023-03-21T12:13:47.883702] During handling of the above exception, another exception occurred:
[Training] [2023-03-21T12:13:47.883702]
[Training] [2023-03-21T12:13:47.883702] Traceback (most recent call last):
[Training] [2023-03-21T12:13:47.883702] File "C:\Users\Mafio\Documents\ai-voice-cloning\src\train.py", line 68, in
[Training] [2023-03-21T12:13:47.883702] train(config_path, args.launcher)
[Training] [2023-03-21T12:13:47.883702] File "C:\Users\Mafio\Documents\ai-voice-cloning\src\train.py", line 35, in train
[Training] [2023-03-21T12:13:47.883702] trainer.do_training()
[Training] [2023-03-21T12:13:47.883702] File "C:\Users\Mafio\Documents\ai-voice-cloning./modules/dlas\codes\train.py", line 374, in do_training
[Training] [2023-03-21T12:13:47.884703] metric = self.do_step(train_data)
[Training] [2023-03-21T12:13:47.884703] File "C:\Users\Mafio\Documents\ai-voice-cloning./modules/dlas\codes\train.py", line 316, in do_step
[Training] [2023-03-21T12:13:47.884703] self.save()
[Training] [2023-03-21T12:13:47.884703] File "C:\Users\Mafio\Documents\ai-voice-cloning./modules/dlas\codes\train.py", line 219, in save
[Training] [2023-03-21T12:13:47.884703] self.model.save_training_state(state)
[Training] [2023-03-21T12:13:47.885704] File "C:\Users\Mafio\Documents\ai-voice-cloning./modules/dlas/codes\trainer\base_model.py", line 151, in save_training_state
[Training] [2023-03-21T12:13:47.885704] torch.save(map_to_device(state, 'cpu'), save_path)
[Training] [2023-03-21T12:13:47.885704] File "C:\Users\Mafio\Documents\ai-voice-cloning\venv\lib\site-packages\torch\serialization.py", line 440, in save
[Training] [2023-03-21T12:13:47.885704] with _open_zipfile_writer(f) as opened_zipfile:
[Training] [2023-03-21T12:13:47.885704] File "C:\Users\Mafio\Documents\ai-voice-cloning\venv\lib\site-packages\torch\serialization.py", line 291, in exit
[Training] [2023-03-21T12:13:47.885704] self.file_like.write_end_of_file()
[Training] [2023-03-21T12:13:47.885704] RuntimeError: [enforce fail at ..\caffe2\serialize\inline_container.cc:337] . unexpected pos 311521536 vs 311521488

Any idea what could cause this? Apparently it crashes while trying to save the model.

I'm using Torch 2.0 stable.


How much free disk space is there?

Author

Never fucking mind, yeah. I don't have any disk space left. Didn't expect each checkpoint to be 1.6 GB. Is there any way to change the default of saving a checkpoint every 5 epochs?
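An aside for anyone who lands here with the same pair of errors: `PytorchStreamWriter failed writing file ...: file write failed` followed by `unexpected pos ...` during `torch.save` is what a full disk (or any other failed write) looks like, and it can leave a truncated checkpoint behind. Below is a minimal sketch of a pre-save guard, not part of this repo's code; the function name and the 2 GiB headroom figure are arbitrary choices for illustration (each checkpoint in this run was about 1.6 GB).

```python
import os
import shutil

import torch


def save_checkpoint_safely(state, save_path, min_free_bytes=2 * 1024**3):
    """Refuse to write a checkpoint when the target drive looks too full.

    The 2 GiB default headroom is an arbitrary figure chosen because each
    checkpoint in this run was roughly 1.6 GB; adjust it to your own
    checkpoint size.
    """
    target_dir = os.path.dirname(os.path.abspath(save_path))
    free = shutil.disk_usage(target_dir).free
    if free < min_free_bytes:
        raise RuntimeError(
            f"Only {free / 1024**3:.1f} GiB free in {target_dir!r}; "
            "refusing to write a checkpoint that would likely be truncated."
        )
    torch.save(state, save_path)
```

If a save has already failed this way, the partially written checkpoint file is most likely corrupt and worth deleting before resuming.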

Author

Found it, closing lmao.
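For later readers: the save interval appears to live in the training configuration the UI generates. In DLAS-style YAMLs this is usually a `save_checkpoint_freq` entry under `logger`, but that key name and the config path in the sketch below are assumptions about this setup rather than something confirmed in the thread; check your own generated YAML (or the web UI's save-frequency setting) for the actual names.

```python
# Hypothetical sketch: write checkpoints less often by editing the
# generated training YAML. The path and the logger.save_checkpoint_freq
# key are assumptions about a DLAS-style config, not confirmed by this
# thread; verify against your own generated file.
import yaml  # PyYAML

config_path = "./training/my-voice/train.yaml"  # hypothetical path

with open(config_path, "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Raise the interval so checkpoints are written less frequently
# (whether the value counts steps or epochs depends on the config).
cfg.setdefault("logger", {})["save_checkpoint_freq"] = 25

with open(config_path, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```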
