Attempting to run training -- libcudart.so not found #377
I was able to iterate and train on one voice. Now I am unable to train on another. I don't understand how this program, which I have been running for a few days now and have actually trained a voice with, just breaks down. So much for the predictability of software.
CUDA is enabled on start-up. When I attempt training I get the following error. It appears to be some issue with bitsandbytes?
[Training] [2023-09-11T23:19:02.221146] warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[Training] [2023-09-11T23:20:33.280553]
[Training] [2023-09-11T23:20:33.280553] ===================================BUG REPORT===================================
[Training] [2023-09-11T23:20:33.281550] Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
[Training] [2023-09-11T23:20:33.295062] For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
[Training] [2023-09-11T23:20:33.295062] ================================================================================
[Training] [2023-09-11T23:20:33.310080] CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
[Training] [2023-09-11T23:20:33.311079] WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
[Training] [2023-09-11T23:20:33.311079] CUDA SETUP: Loading binary D:\applications\ai-voice-cloning\ai-voice-cloning\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...
[Training] [2023-09-11T23:20:33.312089] Disabled distributed training.
[Training] [2023-09-11T23:20:33.313082] Path already exists. Rename it to [./training\black_male1a\finetune_archived_230911-231829]
[Training] [2023-09-11T23:20:33.313082] Loading from ./models/tortoise/dvae.pth
[Training] [2023-09-11T23:20:33.316092] Traceback (most recent call last):
[Training] [2023-09-11T23:20:33.316092] File "D:\applications\ai-voice-cloning\ai-voice-cloning\src\train.py", line 64, in
[Training] [2023-09-11T23:20:33.317093] train(config_path, args.launcher)
[Training] [2023-09-11T23:20:33.317093] File "D:\applications\ai-voice-cloning\ai-voice-cloning\src\train.py", line 31, in train
[Training] [2023-09-11T23:20:33.318093] trainer.do_training()
[Training] [2023-09-11T23:20:33.318093] File "d:\applications\ai-voice-cloning\ai-voice-cloning\modules\dlas\dlas\train.py", line 408, in do_training
[Training] [2023-09-11T23:20:33.319095] metric = self.do_step(train_data)
[Training] [2023-09-11T23:20:33.319095] File "d:\applications\ai-voice-cloning\ai-voice-cloning\modules\dlas\dlas\train.py", line 271, in do_step
[Training] [2023-09-11T23:20:33.320096] gradient_norms_dict = self.model.optimize_parameters(
[Training] [2023-09-11T23:20:33.321097] File "d:\applications\ai-voice-cloning\ai-voice-cloning\modules\dlas\dlas\trainer\ExtensibleTrainer.py", line 321, in optimize_parameters
[Training] [2023-09-11T23:20:33.321097] ns = step.do_forward_backward(
[Training] [2023-09-11T23:20:33.322098] File "d:\applications\ai-voice-cloning\ai-voice-cloning\modules\dlas\dlas\trainer\steps.py", line 242, in do_forward_backward
[Training] [2023-09-11T23:20:33.324092] local_state[k] = v[grad_accum_step]
[Training] [2023-09-11T23:20:33.324092] IndexError: list index out of range
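(Side note on the libcudart.so warnings above: they mean bitsandbytes fell back to loading its CPU binary rather than a CUDA one, which is common on Windows since stock bitsandbytes searches for Linux .so libraries. A minimal sanity check, run inside the same venv the trainer uses, to confirm whether PyTorch itself sees CUDA:

# Minimal CUDA visibility check; run inside the training venv.
import torch

print(torch.__version__)          # installed PyTorch build
print(torch.version.cuda)         # CUDA version PyTorch was built with (None on CPU-only builds)
print(torch.cuda.is_available())  # True if a CUDA device is actually visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU

If is_available() prints True, the bitsandbytes warning is a library-lookup quirk rather than a missing CUDA install.)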
This is probably another artifact of my inelegant install over the base tortoise-tts environment. I am going to reinstall, now that I think I understand things a bit better.
Your batch size isn't evenly divisible by your gradient accumulation size. Stick to even numbers for both values.
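For anyone hitting the same IndexError: torch.chunk returns fewer pieces than requested when the tensor is too small to fill them, so an accumulation loop that assumes one chunk per step can run off the end of the list. A minimal sketch of that failure mode (illustrative values only, not the trainer's actual splitting code in dlas/trainer/steps.py):

# Illustrative sketch of the failure mode, not DLAS's actual code.
import torch

batch = torch.arange(2)   # an effective batch of 2 samples
grad_accum_steps = 4      # 2 is not evenly divisible by 4

chunks = torch.chunk(batch, grad_accum_steps)
print(len(chunks))        # 2 -- torch.chunk silently returns fewer chunks

for step in range(grad_accum_steps):
    x = chunks[step]      # IndexError once step >= len(chunks)

Keeping the batch size an exact multiple of the gradient accumulation size avoids the mismatch entirely.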
Reinstalled under a fresh environment. I'm not quite sure whether the error was precisely related to the numbers not dividing evenly; I'll never know, but I have a clean install this time. Red flags should have been flashing when I had so many errors to "power through" at pretty much every step of the original training setup.
Following the install steps (albeit with some commands modified, since I'm using PowerShell), I have both generated a voice and am now training, with no errors except user ones.