Getting "RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR" while training #54
Reference: mrq/ai-voice-cloning#54
Full stack trace included as a txt file, but I'll post excerpts here.
I think this is the main error, but I'm not 100% sure; I don't really know what's going on with it. It prepares the dataset (33 wav files), then it says it can't read the files?
Confused LOL
Those are just the same type of "non-error" that you get from inferencing/generating normally; they're pretty harmless.
Think you're just running out of RAM instead of VRAM. I say that since searching for the memory allocation failure for CUDA mostly turns up errors from people trying to compile CUDA things, while OOM on VRAM will throw the typical messages (and shouldn't really happen if BitsAndBytes is shown to be working).
I'd keep an eye on both system RAM consumption and VRAM consumption under Task Manager > Performance.
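If you'd rather log it from inside the training process than watch Task Manager, something like this rough sketch works (psutil and torch are assumed to be installed; log_memory and where you call it from are made up for illustration):

```python
# Hedged sketch: print system RAM and GPU VRAM usage at a given point in training.
# psutil and torch are assumed to be installed; log_memory() is a hypothetical helper.
import psutil
import torch

def log_memory(tag=""):
    ram = psutil.virtual_memory()
    print(f"[{tag}] system RAM: {ram.used / 2**30:.1f}/{ram.total / 2**30:.1f} GiB ({ram.percent}%)")
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[{tag}] VRAM: {allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")

# e.g. call log_memory("step") every few training steps to see which number creeps up
```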
Just wanting to follow up: as an incidental discovery, the amount of system RAM training can consume can really creep up on you. I haven't gotten any luxurious messages about memory allocation errors (I guess another benefit of Windows), but in the world of Linux, I've only caught on that I'm triggering OOM killers from the worker processes being spawned.
For now, you can manually edit how many workers are spawned in the generated train.yaml at line 14. Set n_workers to 2 or something; I haven't noticed any perceptible performance hits with just 2, even with multi-GPU training. I'll probably have it just default to 2, and/or add a field to set it.
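If you'd rather script the change than edit the file by hand, a minimal sketch along these lines would do it. The path and the datasets -> train nesting are assumptions about how the generated file is laid out; only the n_workers key itself comes from the note above.

```python
# Minimal sketch: lower the worker count in the generated training config.
# The path and the datasets -> train nesting are assumptions; adjust to match your file.
import yaml  # PyYAML

CONFIG_PATH = "./training/train.yaml"  # hypothetical location of the generated config

with open(CONFIG_PATH, "r") as f:
    cfg = yaml.safe_load(f)

# Fewer dataloader workers means fewer spawned processes and less system RAM.
cfg["datasets"]["train"]["n_workers"] = 2

with open(CONFIG_PATH, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```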
Added a field to reduce the worker size. By default, it's set to 2, as there don't seem to be any performance penalties.
If you're running into that memory issue again, regenerate your training configuration with a lower worker size.