Training IndexError: list index out of range #62

gasthemall · 2023-03-05T12:42:01Z

gasthemall commented

2023-03-05 12:42:01 +00:00

Python 3.9
RTX 3090
Fresh install

Just trying to train. I was able to successfully train in a previous version.

[Training] [2023-03-05T19:28:38.502912] File "C:\Users\PC\Desktop\ai-voice-cloning\src\train.py", line 80, in train
[Training] [2023-03-05T19:28:38.503908] trainer.do_training()
[Training] [2023-03-05T19:28:38.503908] File "C:\Users\PC\Desktop\ai-voice-cloning\./dlas\codes\train.py", line 331, in do_training
[Training] [2023-03-05T19:28:38.503908] self.do_step(train_data)
[Training] [2023-03-05T19:28:38.504905] File "C:\Users\PC\Desktop\ai-voice-cloning\./dlas\codes\train.py", line 212, in do_step
[Training] [2023-03-05T19:28:38.504905] gradient_norms_dict = self.model.optimize_parameters(self.current_step, return_grad_norms=will_log)
[Training] [2023-03-05T19:28:38.504905] File "C:\Users\PC\Desktop\ai-voice-cloning\./dlas/codes\trainer\ExtensibleTrainer.py", line 303, in optimize_parameters
[Training] [2023-03-05T19:28:38.505902] ns = step.do_forward_backward(state, m, step_num, train=train_step, no_ddp_sync=(m+1 < self.batch_factor))
[Training] [2023-03-05T19:28:38.505902] File "C:\Users\PC\Desktop\ai-voice-cloning\./dlas/codes\trainer\steps.py", line 220, in do_forward_backward
[Training] [2023-03-05T19:28:38.505902] local_state[k] = v[grad_accum_step]
[Training] [2023-03-05T19:28:38.506897] IndexError: list index out of range

mrq commented

2023-03-05 13:42:14 +00:00

Make sure you click `Validate Training Configuration` for your given settings before saving. To be honest, I'm pretty sure this is specifically because `Batch Size / Gradient Accumulation Size > 2`, so validation will clamp it down.
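For anyone hitting the same traceback, here is a minimal sketch of why the index can run out. It assumes the trainer splits each batch into `Gradient Accumulation Size` pieces with `torch.chunk`-style semantics, which can return fewer pieces than requested when the batch does not divide evenly. `chunk` and `clamp_grad_accum` are hypothetical helpers written for illustration, not the project's actual validation code.

```python
import math


def chunk(items, chunks):
    """Split `items` the way torch.chunk-style splitting would (assumption).

    Every piece has ceil(len(items) / chunks) elements except possibly the
    last one, so fewer than `chunks` pieces may come back.
    """
    size = math.ceil(len(items) / chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]


def clamp_grad_accum(batch_size, grad_accum):
    """Hypothetical validation step: lower `grad_accum` until the batch
    really splits into that many pieces."""
    if batch_size < 1 or grad_accum < 1:
        raise ValueError("batch_size and grad_accum must be positive")
    grad_accum = min(grad_accum, batch_size)
    batch = list(range(batch_size))
    while len(chunk(batch, grad_accum)) < grad_accum:
        grad_accum -= 1
    return grad_accum


# A batch of 5 split 4 ways yields only 3 pieces, so pieces[3] would raise
# "IndexError: list index out of range", like v[grad_accum_step] above.
pieces = chunk(list(range(5)), 4)
safe_grad_accum = clamp_grad_accum(5, 4)
```

Under these assumptions, indexing by accumulation step is only safe once the accumulation size has been clamped to a value the batch can actually be split into.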

gasthemall commented

2023-03-05 14:05:35 +00:00

Ah, it's working now. I must have forgotten to click it.

gasthemall commented

2023-03-05 14:12:58 +00:00

Different issue
Traceback (most recent call last):
File "C:\Users\PC\Desktop\ai-voice-cloning\venv\lib\site-packages\gradio\routes.py", line 384, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\PC\Desktop\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 1032, in process_api
result = await self.call_function(
File "C:\Users\PC\Desktop\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 858, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\PC\Desktop\ai-voice-cloning\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\PC\Desktop\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "C:\Users\PC\Desktop\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
result = context.run(func, *args)
File "C:\Users\PC\Desktop\ai-voice-cloning\venv\lib\site-packages\gradio\utils.py", line 448, in async_iteration
return next(iterator)
File "C:\Users\PC\Desktop\ai-voice-cloning\src\utils.py", line 877, in run_training
result, percent, message = training_state.parse( line=line, verbose=verbose, keep_x_past_datasets=keep_x_past_datasets, progress=progress )
File "C:\Users\PC\Desktop\ai-voice-cloning\src\utils.py", line 770, in parse
self.epoch_rate = f'{"{:.3f}".format(self.epoch_time_delta)}s/epoch' if self.epoch_time_delta >= 1 else f'{"{:.3f}".format(1/self.epoch_time_delta)}epoch/s' # I doubt anyone will have it/s rates, but its here
ZeroDivisionError: float division by zero

mrq commented

2023-03-05 14:20:37 +00:00

Lazily wrapped in a try/except block, plus one extra `== 0` check, in commit 35225a35daa5b346a132629a2efc0b1a0a7de067.
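As a rough illustration of that kind of guard (not the contents of the commit itself), the rate formatting from the traceback could be protected like this. `format_epoch_rate` and the `"n/a"` placeholder are hypothetical names chosen for the sketch.

```python
def format_epoch_rate(epoch_time_delta):
    """Format an epoch duration as s/epoch or epoch/s without dividing by zero."""
    # Extra check: a zero delta means no time has been measured yet.
    if epoch_time_delta == 0:
        return "n/a"
    try:
        if epoch_time_delta >= 1:
            return f"{epoch_time_delta:.3f}s/epoch"
        return f"{1 / epoch_time_delta:.3f}epoch/s"
    except ZeroDivisionError:
        # Belt and braces: never let a bad delta take down the UI callback.
        return "n/a"


slow = format_epoch_rate(2.0)
fast = format_epoch_rate(0.5)
unknown = format_epoch_rate(0.0)
```

The explicit zero check handles the case seen in the traceback, and the `try`/`except` is only a fallback so the progress parser keeps running.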

gasthemall commented

2023-03-05 15:27:04 +00:00

Error gone.
The training can exceed its epoch settings.

![image](/attachments/85bc81c5-a2f5-4a11-88e0-9377227464f4)

mrq referenced this issue

2023-03-05 16:45:43 +00:00

Google Collab Training Issue #63

mrq commented

2023-03-05 20:32:15 +00:00

Got it to replicate: the epoch counter desyncs when an epoch completes in a single step.

I'll try to probe why that happens, despite there being a line to re-sync.
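One way to avoid this class of desync, sketched under the assumption that both steps and epochs are counted from 1, is to derive the epoch from the global step instead of incrementing a separate counter. `epoch_from_step` and `should_stop` are hypothetical helpers, not the project's code.

```python
def epoch_from_step(step, steps_per_epoch):
    """Return the 1-based epoch that a 1-based global step belongs to."""
    if step < 1 or steps_per_epoch < 1:
        raise ValueError("step and steps_per_epoch must be positive")
    return (step - 1) // steps_per_epoch + 1


def should_stop(step, steps_per_epoch, max_epochs):
    """True once the last step of the final epoch has been reached."""
    return step >= steps_per_epoch * max_epochs


# With one step per epoch, the epoch number always equals the step number,
# so training configured for 5 epochs stops at step 5.
epochs_seen = [epoch_from_step(s, 1) for s in range(1, 6)]
done = should_stop(5, 1, 5)
```

Because the epoch is a pure function of the step here, there is nothing to re-sync, including in the one-step-per-epoch case.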

mrq commented

2023-03-05 20:51:50 +00:00

Think I fixed it.

mrq closed this issue

2023-03-05 20:51:50 +00:00
