Attempting to run training -- libcudart.so not found #377

Closed
opened 2023-09-12 06:25:44 +00:00 by FergasunFergie · 3 comments

I was able to iterate and train on one voice. Now I am unable to train on another voice. I don't understand how this program, which I have been running for a few days now and actually training a voice with, just breaks down. So much for the predictability of software.

CUDA is enabled on start-up. When I attempt training I get the following error. It appears to be some issue with bitsandbytes?

[Training] [2023-09-11T23:19:02.221146] warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
[Training] [2023-09-11T23:20:33.280553]
[Training] [2023-09-11T23:20:33.280553] ===================================BUG REPORT===================================
[Training] [2023-09-11T23:20:33.281550] Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
[Training] [2023-09-11T23:20:33.295062] For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
[Training] [2023-09-11T23:20:33.295062] ================================================================================
[Training] [2023-09-11T23:20:33.310080] CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
[Training] [2023-09-11T23:20:33.311079] WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
[Training] [2023-09-11T23:20:33.311079] CUDA SETUP: Loading binary D:\applications\ai-voice-cloning\ai-voice-cloning\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...
[Training] [2023-09-11T23:20:33.312089] Disabled distributed training.
[Training] [2023-09-11T23:20:33.313082] Path already exists. Rename it to [./training\black_male1a\finetune_archived_230911-231829]
[Training] [2023-09-11T23:20:33.313082] Loading from ./models/tortoise/dvae.pth
[Training] [2023-09-11T23:20:33.316092] Traceback (most recent call last):
[Training] [2023-09-11T23:20:33.316092] File "D:\applications\ai-voice-cloning\ai-voice-cloning\src\train.py", line 64, in <module>
[Training] [2023-09-11T23:20:33.317093] train(config_path, args.launcher)
[Training] [2023-09-11T23:20:33.317093] File "D:\applications\ai-voice-cloning\ai-voice-cloning\src\train.py", line 31, in train
[Training] [2023-09-11T23:20:33.318093] trainer.do_training()
[Training] [2023-09-11T23:20:33.318093] File "d:\applications\ai-voice-cloning\ai-voice-cloning\modules\dlas\dlas\train.py", line 408, in do_training
[Training] [2023-09-11T23:20:33.319095] metric = self.do_step(train_data)
[Training] [2023-09-11T23:20:33.319095] File "d:\applications\ai-voice-cloning\ai-voice-cloning\modules\dlas\dlas\train.py", line 271, in do_step
[Training] [2023-09-11T23:20:33.320096] gradient_norms_dict = self.model.optimize_parameters(
[Training] [2023-09-11T23:20:33.321097] File "d:\applications\ai-voice-cloning\ai-voice-cloning\modules\dlas\dlas\trainer\ExtensibleTrainer.py", line 321, in optimize_parameters
[Training] [2023-09-11T23:20:33.321097] ns = step.do_forward_backward(
[Training] [2023-09-11T23:20:33.322098] File "d:\applications\ai-voice-cloning\ai-voice-cloning\modules\dlas\dlas\trainer\steps.py", line 242, in do_forward_backward
[Training] [2023-09-11T23:20:33.324092] local_state[k] = v[grad_accum_step]
[Training] [2023-09-11T23:20:33.324092] IndexError: list index out of range
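
Since the log shows bitsandbytes falling back to its CPU binary, a quick sanity check of whether the venv's PyTorch itself sees CUDA (standard torch calls; the libcudart.so warning comes from bitsandbytes' own library search, which can fail even when PyTorch runs on the GPU) looks roughly like this:

# Run from inside the same venv. These are standard PyTorch calls and only
# confirm PyTorch's CUDA build; they don't fix the bitsandbytes lookup.
import torch

print(torch.__version__)           # e.g. a +cuXXX suffix indicates a CUDA build
print(torch.cuda.is_available())   # True if a usable GPU is visible
print(torch.version.cuda)          # CUDA version PyTorch was built against
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))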

Author

This is probably another artifact of my inelegant install on top of the base tortoise-tts environment. I am going to reinstall, now that I think I understand a bit more.

Owner

[Training] [2023-09-11T23:20:33.324092] local_state[k] = v[grad_accum_step]
[Training] [2023-09-11T23:20:33.324092] IndexError: list index out of range

Your batch size isn't evenly divisible by your gradient accumulation size. Stick to even numbers for both values.
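
A minimal sketch of the failure mode (illustrative numbers; torch.chunk stands in here for the actual splitting logic in dlas/trainer/steps.py):

# Minimal illustration (not the actual DLAS code) of why an uneven
# batch size / gradient accumulation pair ends in an IndexError.
import torch

batch_size = 6        # hypothetical "Batch Size" setting
grad_accum_steps = 4  # hypothetical "Gradient Accumulation Size" setting

batch = torch.randn(batch_size, 80)

# Split the batch into one micro-batch per accumulation step.
# torch.chunk returns FEWER than grad_accum_steps pieces when the batch
# size doesn't divide evenly: here it yields 3 chunks of size 2.
chunks = list(torch.chunk(batch, grad_accum_steps))
print(len(chunks))  # 3, not 4

# The training loop then indexes one chunk per accumulation step, so the
# last step reaches past the end of the list -- the same
# "IndexError: list index out of range" seen in steps.py above.
for grad_accum_step in range(grad_accum_steps):
    micro_batch = chunks[grad_accum_step]   # raises IndexError at step 3

With batch_size = 8 and grad_accum_steps = 4 (evenly divisible), torch.chunk yields exactly 4 chunks and the loop runs cleanly.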

Author

Reinstalled under a fresh environment. I'm not quite sure the error was precisely related to not having even numbers; I'll never know, but I have a clean install this time. A red flag should have been flashing when I originally had so many errors to "power through" at pretty much every training step.

I followed the install steps (albeit with a modification to the commands since I'm using PowerShell), and I have both generated a voice and am in the process of training, with no errors except user ones.
