Getting "RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR" while training #54

Closed
opened 2023-03-03 12:24:31 +00:00 by AI_Pleb · 3 comments

Full stack trace included as a txt file, but I'll post excerpts here.

 0%|          | 0/1 [00:00<?, ?it/s]G:\Tortoise-TTS\ai-voice-cloning\./dlas/codes\models\audio\tts\tacotron2\taco_utils.py:17: WavFileWarning: Chunk (non-data) not understood, skipping it.
[Training] [2023-03-03T12:15:17.542027]   sampling_rate, data = read(full_path)
[Training] [2023-03-03T12:15:18.434228] G:\Tortoise-TTS\ai-voice-cloning\venv\lib\site-packages\torch\optim\lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
[Training] [2023-03-03T12:15:18.434228]   warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[Training] [2023-03-03T12:15:20.304650] fatal   : Memory allocation failure
[Training] [2023-03-03T12:15:20.329655]
[Training] [2023-03-03T12:15:20.330656]   0%|          | 0/1 [00:08<?, ?it/s]
[Training] [2023-03-03T12:15:20.330656] Traceback (most recent call last):

I think this is the main error, but I'm not 100% sure; I don't really know what's going on with it. It prepares the dataset (33 wav files), then it says it can't read the files?
Confused LOL

Owner

> WavFileWarning: Chunk (non-data) not understood, skipping it.

Those are just the same type of "non-error" that you get from inferencing/generating normally; they're pretty harmless.
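For the curious, that warning comes from `scipy.io.wavfile.read` skipping over non-audio metadata chunks in the WAV; the sample data itself still loads. A minimal sketch (not the repo's code, and the path is made up) of how you could silence it locally if the log noise bothers you:

```
import warnings

from scipy.io import wavfile
from scipy.io.wavfile import WavFileWarning

with warnings.catch_warnings():
    # Some editors write extra LIST/INFO metadata chunks into WAVs; scipy skips
    # them and warns, but the audio data itself is read fine.
    warnings.simplefilter("ignore", WavFileWarning)
    sampling_rate, data = wavfile.read("voices/sample_01.wav")  # hypothetical path
```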

> fatal   : Memory allocation failure
> RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Think you're just running out of RAM instead of VRAM. I say that since searching for that memory allocation failure alongside CUDA mostly turns up people hitting it while compiling CUDA things, whereas OOMing on VRAM would throw the typical messages (and shouldn't really happen if BitsAndBytes is shown to be working).

I'd keep an eye on both system RAM consumption and VRAM consumption under Task Manager > Performance.
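If you'd rather have numbers in the console than watch Task Manager, a rough sketch of a helper you could call from the training loop (assumes `psutil` is installed and a CUDA device is present; not something the repo ships):

```
import psutil
import torch

def log_memory():
    # System RAM: this is what creeps up from dataloader workers.
    ram = psutil.virtual_memory()
    print(f"RAM:  {ram.used / 2**30:.1f} / {ram.total / 2**30:.1f} GiB ({ram.percent}%)")
    # VRAM as seen by PyTorch on the current device.
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"VRAM: {allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")
```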

Owner

Just wanting to follow up: as an incidental discovery, the amount of system RAM training can consume can really creep up on you. I haven't gotten any luxurious messages about memory allocation errors (I guess another benefit of Windows), but in the world of Linux, I've only just caught on that I'm triggering the OOM killer from the worker processes being spawned.

For now, you can manually edit how many workers are spawned in the generated `train.yaml` at line 14. Set `n_workers` to 2 or something; I haven't noticed any perceptible performance hits with just 2, even with multi-GPU training.
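If you'd rather script the tweak than hand-edit the file, a rough sketch with PyYAML; the config path is hypothetical, and the walk is deliberately generic so it doesn't depend on the exact nesting of the generated config (only the `n_workers` key name comes from the file):

```
import yaml

CONFIG = "./training/finetune/train.yaml"  # hypothetical path to the generated config

def cap_workers(node, limit=2):
    # Recursively clamp any n_workers value found anywhere in the parsed YAML.
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "n_workers" and isinstance(value, int):
                node[key] = min(value, limit)
            else:
                cap_workers(value, limit)
    elif isinstance(node, list):
        for item in node:
            cap_workers(item, limit)

with open(CONFIG, "r") as f:
    config = yaml.safe_load(f)
cap_workers(config)
with open(CONFIG, "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```

Note that `safe_dump` won't preserve any comments from the generated file, so this is only worth it if you regenerate the config often.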

I'll probably have it just default to 2, and/or add a field to set it.

Owner

Added a field to reduce the worker size. By default, it's set to 2, as there doesn't seem to be any performance penalty.

If you're running into that memory issue again, regenerate your training configuration with a lower worker size.

mrq closed this issue 2023-03-05 05:27:12 +00:00