Error involving zipfile upon attempting to resume training. #170

Open
opened 2023-03-24 18:43:10 +07:00 by sazandora · 1 comments

Python Ver: 3.10.6
OS: Windows 10
GPU: RTX 2070 Super
What I was trying to do: My training failed due to the massive number of extra models saved by default, so I switched it to save one every 50 epochs, then pointed the config back to the resume state, "./training/desco/finetune/training_state/300.state", and the partially trained model, "./training/desco/finetune/models/300_gpt.pth". Since the web UI starts training for the full number of epochs no matter what, even if you're resuming, I had to copy the archived fine-tuning data from its folder over to the main finetune folder. I'm not sure whether that would affect anything.

In any case, upon trying to resume training, I received an error that it failed to read some sort of zipfile. I'm unsure what zipfile it's trying to access, or why it is, but here's the full console output:

E:\ai-voice-cloning>call .\venv\Scripts\activate.bat
!WARNING! Automatically deduced sample batch size returned 1.
!WARNING! Automatically deduced sample batch size returned 1.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Spawning process:  train.bat ./training/desco/train.yaml
[Training] [2023-03-24T14:32:18.849721]
[Training] [2023-03-24T14:32:18.852712] (venv) E:\ai-voice-cloning>call .\venv\Scripts\activate.bat
[Training] [2023-03-24T14:32:20.887345] NOTE: Redirects are currently not supported in Windows or MacOs.
[Training] [2023-03-24T14:32:23.317031] Disabled distributed training.
[Training] [2023-03-24T14:32:23.321019] Traceback (most recent call last):
[Training] [2023-03-24T14:32:23.325008]   File "E:\ai-voice-cloning\src\train.py", line 64, in <module>
[Training] [2023-03-24T14:32:23.329011]     train(config_path, args.launcher)
[Training] [2023-03-24T14:32:23.333984]   File "E:\ai-voice-cloning\src\train.py", line 30, in train
[Training] [2023-03-24T14:32:23.337981]     trainer.init(config_path, opt, launcher, '')
[Training] [2023-03-24T14:32:23.341964]   File "e:\ai-voice-cloning\modules\dlas\dlas\train.py", line 78, in init
[Training] [2023-03-24T14:32:23.344957]     resume_state = torch.load(
[Training] [2023-03-24T14:32:23.347949]   File "E:\ai-voice-cloning\venv\lib\site-packages\torch\serialization.py", line 797, in load
[Training] [2023-03-24T14:32:23.351944]     with _open_zipfile_reader(opened_file) as opened_zipfile:
[Training] [2023-03-24T14:32:23.355925]   File "E:\ai-voice-cloning\venv\lib\site-packages\torch\serialization.py", line 283, in __init__
[Training] [2023-03-24T14:32:23.359915]     super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
[Training] [2023-03-24T14:32:23.363904] RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

If anyone has any ideas what it could be or why this is happening, please let me know. It'd be a real shame to lose 300 epochs.

The .pth files are actually zips. See if you can open your 300_gpt.pth in 7z or a similar archive program. If it's corrupted, you might be out of luck.
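The same check can be done from Python instead of 7z. Note that the traceback above fails at `resume_state = torch.load(`, so the 300.state file is probably worth checking as well as 300_gpt.pth. The following is only a rough sketch using the standard-library `zipfile` module: `check_checkpoint` is a made-up helper name, the paths are the ones from the post, and it assumes both files were written in PyTorch's zip-based format. A "failed finding central directory" error usually suggests the file was cut off before the save finished, which is the case this check is meant to catch.

```python
import zipfile
from pathlib import Path


def check_checkpoint(path):
    """Hypothetical helper: report whether `path` looks like an intact zip archive."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    # is_zipfile() looks for the end-of-central-directory record,
    # which is what a truncated save would be missing.
    if not zipfile.is_zipfile(str(p)):
        return "not a readable zip"
    try:
        with zipfile.ZipFile(str(p)) as zf:
            bad = zf.testzip()  # reads every member and checks its CRC
    except zipfile.BadZipFile as exc:
        return f"bad zip: {exc}"
    return "ok" if bad is None else f"corrupt member: {bad}"


if __name__ == "__main__":
    # Paths taken from the post above; adjust to your own layout.
    for name in (
        "./training/desco/finetune/training_state/300.state",
        "./training/desco/finetune/models/300_gpt.pth",
    ):
        print(name, "->", check_checkpoint(name))
```

If either file comes back as anything other than "ok", an earlier checkpoint that passes the check may be the safer thing to resume from, assuming one was kept.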

Reference: mrq/ai-voice-cloning#170