Google Colab Training Issue #63

Closed
opened 2023-03-05 16:22:27 +00:00 by aniweeb1 · 5 comments
[Training] [2023-03-05T15:56:00.897640] Traceback (most recent call last):
[Training] [2023-03-05T15:56:00.897703] File "./src/train.py", line 93, in <module>
[Training] [2023-03-05T15:56:00.897755] train(args.opt, args.launcher)
[Training] [2023-03-05T15:56:00.897804] File "./src/train.py", line 80, in train
[Training] [2023-03-05T15:56:00.897852] trainer.do_training()
[Training] [2023-03-05T15:56:00.897901] File "/content/ai-voice-cloning/./dlas/codes/train.py", line 331, in do_training
[Training] [2023-03-05T15:56:00.897948] self.do_step(train_data)
[Training] [2023-03-05T15:56:00.897995] File "/content/ai-voice-cloning/./dlas/codes/train.py", line 212, in do_step
[Training] [2023-03-05T15:56:00.898043] gradient_norms_dict = self.model.optimize_parameters(self.current_step, return_grad_norms=will_log)
[Training] [2023-03-05T15:56:00.898091] File "/content/ai-voice-cloning/./dlas/codes/trainer/ExtensibleTrainer.py", line 303, in optimize_parameters
[Training] [2023-03-05T15:56:00.898188] ns = step.do_forward_backward(state, m, step_num, train=train_step, no_ddp_sync=(m+1 < self.batch_factor))
[Training] [2023-03-05T15:56:00.898249] File "/content/ai-voice-cloning/./dlas/codes/trainer/steps.py", line 220, in do_forward_backward
[Training] [2023-03-05T15:56:00.898298] local_state[k] = v[grad_accum_step]
[Training] [2023-03-05T15:56:00.898347] IndexError: list index out of range
[Training] [2023-03-05T15:56:14.001221] ./train.sh: line 13: deactivate: command not found

Not quite sure what happened here, but I can do everything up until I try training a voice set I have.

Owner

Same as #62: you should validate your training settings, because your gradient accumulation size is too large for the given batch size.
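For anyone hitting the same `IndexError`, here is a minimal sketch of why it can happen. It assumes the trainer splits each batch into one chunk per gradient accumulation step and then indexes the chunks by step number, which is what `v[grad_accum_step]` in the traceback suggests. `split_batch` and `validate_settings` are hypothetical helpers written for illustration, not functions from the repo.

```python
def split_batch(batch, accum_steps):
    """Split a batch into at most `accum_steps` chunks (hypothetical helper).

    Uses ceiling division for the chunk size, so a batch with fewer items
    than `accum_steps` produces fewer chunks than requested.
    """
    if not batch:
        return []
    chunk_size = -(-len(batch) // accum_steps)  # ceiling division
    return [batch[i:i + chunk_size] for i in range(0, len(batch), chunk_size)]


def validate_settings(batch_size, accum_steps):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if batch_size < 1 or accum_steps < 1:
        problems.append("batch size and gradient accumulation size must both be at least 1")
    elif accum_steps > batch_size:
        problems.append(
            f"gradient accumulation size ({accum_steps}) is larger than "
            f"the batch size ({batch_size})"
        )
    return problems


# A batch of 4 items split for 8 accumulation steps only yields 4 chunks,
# so looking up the chunk for step 4 or later would raise IndexError.
chunks = split_batch(list(range(4)), accum_steps=8)
print(len(chunks))
print(validate_settings(batch_size=4, accum_steps=8))
```

Running a check like this before launching training reports the mismatch up front instead of failing partway into the first step.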

Author
[Training] [2023-03-05T17:10:11.149405] Using BitsAndBytes ADAMW optimizations
[Training] [2023-03-05T17:10:11.156231] Disabled distributed training.
[Training] [2023-03-05T17:10:11.163203] Path already exists. Rename it to [/content/ai-voice-cloning/training/Kronii-finetune_archived_230305-170950]
[Training] [2023-03-05T17:10:11.170417] Loading from ./models/tortoise/dvae.pth
[Training] [2023-03-05T17:10:11.177689] Traceback (most recent call last):
[Training] [2023-03-05T17:10:11.184834]   File "./src/train.py", line 93, in <module>
[Training] [2023-03-05T17:10:11.192396]     train(args.opt, args.launcher)
[Training] [2023-03-05T17:10:11.199261]   File "./src/train.py", line 79, in train
[Training] [2023-03-05T17:10:11.206257]     trainer.init(yaml, opt, launcher)
[Training] [2023-03-05T17:10:11.213198]   File "/content/ai-voice-cloning/./dlas/codes/train.py", line 146, in init
[Training] [2023-03-05T17:10:11.224356]     self.model = ExtensibleTrainer(opt)
[Training] [2023-03-05T17:10:11.231864]   File "/content/ai-voice-cloning/./dlas/codes/trainer/ExtensibleTrainer.py", line 102, in __init__
[Training] [2023-03-05T17:10:11.238928]     step = ConfigurableStep(step, self.env)
[Training] [2023-03-05T17:10:11.246406]   File "/content/ai-voice-cloning/./dlas/codes/trainer/steps.py", line 48, in __init__
[Training] [2023-03-05T17:10:11.253248]     self.injectors.append(create_injector(injector, env))
[Training] [2023-03-05T17:10:11.260123]   File "/content/ai-voice-cloning/./dlas/codes/trainer/inject.py", line 68, in create_injector
[Training] [2023-03-05T17:10:11.267270]     return injectors[opt_inject['type']](opt_inject, env)
[Training] [2023-03-05T17:10:11.274104]   File "/content/ai-voice-cloning/./dlas/codes/trainer/injectors/audio_injectors.py", line 178, in __init__
[Training] [2023-03-05T17:10:11.281003]     self.dvae = load_model_from_config(cfg, dvae_name, device=f'cuda:{env["device"]}').eval()
[Training] [2023-03-05T17:10:11.288091]   File "/content/ai-voice-cloning/./dlas/codes/utils/util.py", line 497, in load_model_from_config
[Training] [2023-03-05T17:10:11.301624]     sd = torch.load(load_path, map_location=device)
[Training] [2023-03-05T17:10:11.316921]   File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 771, in load
[Training] [2023-03-05T17:10:11.325254]     with _open_file_like(f, 'rb') as opened_file:
[Training] [2023-03-05T17:10:11.333811]   File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 270, in _open_file_like
[Training] [2023-03-05T17:10:11.342065]     return _open_file(name_or_buffer, mode)
[Training] [2023-03-05T17:10:11.351105]   File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 251, in __init__
[Training] [2023-03-05T17:10:11.359960]     super(_open_file, self).__init__(open(name, mode))
[Training] [2023-03-05T17:10:11.368858] FileNotFoundError: [Errno 2] No such file or directory: './models/tortoise/dvae.pth'
[Training] [2023-03-05T17:10:12.887094] ./train.sh: line 13: deactivate: command not found

Now I'm just running into an infinite stall after hitting the line 13 error in train.sh. I let it run for 25 minutes and it just hangs there.

Owner

> [Errno 2] No such file or directory: './models/tortoise/dvae.pth'

To put it bluntly, you're doing something really, really wrong if you managed to start the web UI without it downloading the required files on startup. Ensure TTS has loaded once before training.
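As a quick sanity check before starting training, something like the sketch below can confirm the file is actually in place. Only `./models/tortoise/dvae.pth` comes from the traceback; the `missing_model_files` helper and the idea of keeping a list of required paths are assumptions for illustration, and other files may be needed as well.

```python
from pathlib import Path

# Only dvae.pth is named in the traceback above. Any other entries you add
# here are your own assumption about what your setup needs.
REQUIRED_FILES = ["./models/tortoise/dvae.pth"]


def missing_model_files(paths=REQUIRED_FILES, root="."):
    """Return the paths (relative to `root`) that are not existing files."""
    return [p for p in paths if not (Path(root) / p).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing required files, let TTS load once before training:")
        for path in missing:
            print("  " + path)
    else:
        print("All listed model files are present.")
```

Run it from the repo root (the traceback shows `/content/ai-voice-cloning` on Colab), since the path is relative to the working directory.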

Author

I ended up solving the issue, so you were right. It was related to dvae not being downloaded as a required file at startup. However, there's also an annotated comment in the Colab notebook preset in the wiki that says to "disable loading TTS on startup before training", and it doesn't mention that doing so skips the initial file download that would otherwise run during the repo download process.

Even after importing the actual files from my own personal computer and running it locally, it wouldn't recognize the files when they were imported into the Colab environment unless they were downloaded along with the rest of the repo during the initial process. Sorry to be really silly about it, appreciate the help.


> I ended up solving the issue, so you were right. It was related to dvae not being downloaded as a required file at startup. However, there's also an annotated comment in the Colab notebook preset in the wiki that says to "disable loading TTS on startup before training", and it doesn't mention that doing so skips the initial file download that would otherwise run during the repo download process.
>
> Even after importing the actual files from my own personal computer and running it locally, it wouldn't recognize the files when they were imported into the Colab environment unless they were downloaded along with the rest of the repo during the initial process. Sorry to be really silly about it, appreciate the help.

I think I'm facing the same issue. What changes did you make to the notebook to fix it?

Reference: mrq/ai-voice-cloning#63