FileNotFoundError immediately after starting training #3

Closed
opened 2023-02-18 13:40:19 +00:00 by sakharam_gatne · 18 comments

I'm not sure if this is a problem with my copy of the repo, but I get this error after I generate a configuration and try to run training using it. I don't think the problem is with my dataset structure, since the same dataset works fine with 152334H's Colab notebook.

The error's probably too vague to pinpoint the problem immediately, but if any other user's able to run training just fine on their computer, please tell me, and I'll try out fixes on my side.

(Also, sorry if you're not yet done implementing the entire training code, and I'm testing too early).

```
Unloading TTS to save VRAM.
Spawning process:  call.\train.bat ./training/TestTraining.yaml
Traceback (most recent call last):
  File "C:\Users\Username\Documents\GitHub\ai-voice-cloning\venv\lib\site-packages\gradio\routes.py", line 374, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\Username\Documents\GitHub\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 1017, in process_api
    result = await self.call_function(
  File "C:\Users\Username\Documents\GitHub\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 849, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\Username\Documents\GitHub\ai-voice-cloning\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\Username\Documents\GitHub\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\Username\Documents\GitHub\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\Username\Documents\GitHub\ai-voice-cloning\venv\lib\site-packages\gradio\utils.py", line 453, in async_iteration
    return next(iterator)
  File "C:\Users\Username\Documents\GitHub\ai-voice-cloning\src\utils.py", line 449, in run_training
    training_process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\lib\subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\lib\subprocess.py", line 1440, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
```
Owner

Should be fixed in 0dd5640a89. It was quick to figure out the issue, since past-me at least had the brains to have it print the command it executes.

Several brain worms went wrong with that line, `call.\train.bat ./training/TestTraining.yaml`:

  • I didn't put a comma between `'call'` and `'.\\train.bat'` in the argument list (see the sketch after this list)
  • I forgot that `call` only seemed to work when I was trying to make `shell=True` work
  • `.\\train.bat` actually doesn't work either for `subprocess.Popen`
  • Ironically, I extensively tested it originally on Windows, only to end up breaking it when extensively testing it for a Colab notebook
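
For reference, the shape of the fix is roughly the following. This is a minimal sketch rather than the exact code from the commit, with the YAML path taken from the log above:

```
import subprocess

# Each argument is its own list element; no "call" prefix is needed with
# shell=False (the default), and the batch file is looked up relative to
# the working directory the UI was started from.
cmd = ["train.bat", "./training/TestTraining.yaml"]

training_process = subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True,
)
```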

> (Also, sorry if you're not yet done implementing the entire training code, and I'm testing too early).

Nah, you're testing at just the right time. I finished it up last night, but I don't really have any way to test it myself. The Colab notebook I was using did that `Busy` "disconnect" and lost my progress when training something. I definitely need more people to try and break it for me.

mrq closed this issue 2023-02-18 14:19:01 +00:00
Author

I think commit 2615cafd75 broke something, because now I get this error while running start.bat:

```
Traceback (most recent call last):
  File "C:\Users\Username\Documents\GitHub\ai-voice-cloning\src\main.py", line 22, in <module>
    tts = setup_tortoise()
  File "C:\Users\Username\Documents\GitHub\ai-voice-cloning\src\utils.py", line 496, in setup_tortoise
    tts = TextToSpeech(minor_optimizations=not args.low_vram, autoregressive_model_path=args.autoregressive_model)
TypeError: TextToSpeech.__init__() got an unexpected keyword argument 'autoregressive_model_path'
```

EDIT: Also, I just noticed that the setup_training.bat script, when run by setup-cuda.bat, clones the training repo to a temporary folder. Thus, all dependencies are installed correctly, but the contents of the repo aren't copied to the ai-voice-cloning folder. So when I try to run training, I get

```
ModuleNotFoundError: No module named 'codes'
```
Owner

> I think commit 2615cafd75 broke something, because now I get this error while running start.bat

Did you manually pull with git pull? You'll need to re-install dependencies, as I updated [mrq/tortoise-tts] to allow selecting which autoregressive model to load. Do so with the update script.

> Also I just noticed that the setup_training.bat script, when run by setup-cuda.bat, clones the training repo to a temporary folder. Thus, all dependencies are installed correctly, but the contents of the repo aren't copied to the ai-voice-cloning folder

I don't exactly follow. I'll probe it later, but just move the dlas folder to the right place then.
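
For illustration, "move the dlas folder to the right place" boils down to something like this; the temp path below is purely hypothetical, so substitute wherever the clone actually landed:

```
import shutil

# Hypothetical source path: check where setup_training.bat actually cloned the training repo.
src = r"C:\Users\Username\AppData\Local\Temp\DL-Art-School"
# Destination inside the ai-voice-cloning folder, where training expects ./dlas to live.
dst = r"C:\Users\Username\Documents\GitHub\ai-voice-cloning\dlas"

shutil.move(src, dst)
```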

Owner

Sorry about that; several little brain worms happened, and all should be fixed in commit 996e5217d2.

Turns out anything after `deactivate` does not get called, at all.

Author

Thanks for the fix! I still get this error.

```
TypeError: TextToSpeech.__init__() got an unexpected keyword argument 'autoregressive_model_path'
```

I've run `update.bat`, `update-force.bat`, and `setup-cuda.bat` one by one, but if it's working as intended for others, I'll try doing a clean install of the repo tomorrow.

Owner

Run this to force install mrq/tortoise-tts in pip:

```
call .\venv\Scripts\activate.bat
pip install -U git+https://git.ecker.tech/mrq/tortoise-tts.git
```
Author

> Run this to force install mrq/tortoise-tts in pip:

That unfortunately didn't work. Then I ran

```
pip uninstall tortoise
pip install git+https://git.ecker.tech/mrq/tortoise-tts.git
```

and now it works. Weird...

Owner

Strange.

You didn't happen to migrate over your `tortoise-venv` folder, did you?

If you did, I can sort of make some assumptions on what happened:

  • mrq/tortoise-tts gets pulled, and the setup script was run
  • since the setup script still had `python setup.py install`, it installed (copied) the version of tortoise as it existed at setup time into its venv folder
  • the venv gets moved, so the old copy of tortoise still remains
  • either:
    • pip sees there's a tortoise package and, since there's no version requirement, won't bother updating unless explicitly asked with `-U`
    • pip sees there's a non-git package of tortoise and won't bother pulling by default (it seems any package installed from a git URL will always pull)

I might need to add a post-migration script to do what you did to fix it (uninstall then reinstall), or just explicitly say to not copy over the venv.
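
For what it's worth, a post-migration script along those lines could be as simple as the sketch below; this is just the idea (shelling out to the venv's pip to uninstall and reinstall), not an actual script from the repo:

```
import subprocess
import sys

# Assumes this is run with the venv's python, so sys.executable points at it.
pip = [sys.executable, "-m", "pip"]

# Drop the stale copy of tortoise that setup.py left behind in the migrated venv...
subprocess.run(pip + ["uninstall", "-y", "tortoise"], check=False)
# ...then pull the current package straight from the git URL.
subprocess.run(pip + ["install", "git+https://git.ecker.tech/mrq/tortoise-tts.git"], check=True)
```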


I had this error on Windows. I fixed it by dropping ffmpeg.exe into the root folder of the repo.

Author

> You didn't happen to migrate over your `tortoise-venv` folder, did you?

Yup, that's exactly what I did lol

I'm doing a fresh install now, as even after that problem I had another error while training... If reinstalling doesn't fix that, I'll post it.

Author

Now there's a fresh new error immediately after starting training:

```
TorToiSe initialized, ready for generation.
Traceback (most recent call last):
  File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\gradio\routes.py", line 374, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 1015, in process_api
    inputs = self.preprocess_data(fn_index, inputs, state)
  File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 913, in preprocess_data
    processed_input.append(block.preprocess(inputs[i]))
IndexError: list index out of range
```

It's definitely the last few commits that caused this, because I did a fresh install about half an hour ago, and ran the update script just now. After the fresh install, I was getting this error (after loading all necessary models for training, just before the actual training starts):

```
[Training] Loading from ./models/tortoise/dvae.pth
[Training] WARNING! Unable to find EMA network! Starting a new EMA from given model parameters.
[Training]
[Training]   0%|          | 0/44 [00:00<?, ?it/s]C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\optim\lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
[Training]   warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[Training] Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x000001855FF74DC0>
[Training] Traceback (most recent call last):
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1466, in __del__
[Training]     self._shutdown_workers()
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1411, in _shutdown_workers
[Training]     self._worker_result_queue.put((None, None))
[Training]   File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\queues.py", line 94, in put
[Training]     self._start_thread()
[Training]   File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\queues.py", line 177, in _start_thread
[Training]     self._thread.start()
[Training]   File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\lib\threading.py", line 935, in start
[Training]     _start_new_thread(self._bootstrap, ())
[Training] RuntimeError: can't start new thread
[Training]
[Training]   0%|          | 0/44 [00:57<?, ?it/s]
[Training] Traceback (most recent call last):
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\src\train.py", line 61, in <module>
[Training]     train(args.opt, args.launcher)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\src\train.py", line 53, in train
[Training]     trainer.do_training()
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\./dlas\codes\train.py", line 330, in do_training
[Training]     self.do_step(train_data)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\./dlas\codes\train.py", line 211, in do_step
[Training]     gradient_norms_dict = self.model.optimize_parameters(self.current_step, return_grad_norms=will_log)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\./dlas/codes\trainer\ExtensibleTrainer.py", line 302, in optimize_parameters
[Training]     ns = step.do_forward_backward(state, m, step_num, train=train_step, no_ddp_sync=(m+1 < self.batch_factor))
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\./dlas/codes\trainer\steps.py", line 246, in do_forward_backward
[Training]     injected = inj(local_state)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
[Training]     return forward_call(*input, **kwargs)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\./dlas/codes\trainer\injectors\audio_injectors.py", line 184, in forward
[Training]     codes = self.dvae.get_codebook_indices(inp)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
[Training]     return func(*args, **kwargs)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\./dlas/codes\models\audio\tts\lucidrains_dvae.py", line 25, in inner
[Training]     out = fn(model, *args, **kwargs)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\./dlas/codes\models\audio\tts\lucidrains_dvae.py", line 186, in get_codebook_indices
[Training]     logits = self.encoder(img).permute((0,2,3,1) if len(img.shape) == 4 else (0,2,1))
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
[Training]     return forward_call(*input, **kwargs)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\container.py", line 204, in forward
[Training]     input = module(input)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
[Training]     return forward_call(*input, **kwargs)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\container.py", line 204, in forward
[Training]     input = module(input)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
[Training]     return forward_call(*input, **kwargs)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\conv.py", line 313, in forward
[Training]     return self._conv_forward(input, self.weight, self.bias)
[Training]   File "C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\conv.py", line 309, in _conv_forward
[Training]     return F.conv1d(input, weight, bias, self.stride,
[Training] RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
[Training] You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
[Training]
[Training] import torch
[Training] torch.backends.cuda.matmul.allow_tf32 = True
[Training] torch.backends.cudnn.benchmark = True
[Training] torch.backends.cudnn.deterministic = False
[Training] torch.backends.cudnn.allow_tf32 = True
[Training] data = torch.randn([8, 80, 1, 1000], dtype=torch.float, device='cuda', requires_grad=True)
[Training] net = torch.nn.Conv2d(80, 512, kernel_size=[1, 3], padding=[0, 1], stride=[1, 2], dilation=[1, 1], groups=1)
[Training] net = net.cuda().float()
[Training] out = net(data)
[Training] out.backward(torch.randn_like(out))
[Training] torch.cuda.synchronize()
[Training]
[Training] ConvolutionParams
[Training]     memory_format = Contiguous
[Training]     data_type = CUDNN_DATA_FLOAT
[Training]     padding = [0, 1, 0]
[Training]     stride = [1, 2, 0]
[Training]     dilation = [1, 1, 0]
[Training]     groups = 1
[Training]     deterministic = false
[Training]     allow_tf32 = true
[Training] input: TensorDescriptor 00000186435C77B0
[Training]     type = CUDNN_DATA_FLOAT
[Training]     nbDims = 4
[Training]     dimA = 8, 80, 1, 1000,
[Training]     strideA = 80000, 1000, 1000, 1,
[Training] output: TensorDescriptor 00000186435C7820
[Training]     type = CUDNN_DATA_FLOAT
[Training]     nbDims = 4
[Training]     dimA = 8, 512, 1, 500,
[Training]     strideA = 256000, 500, 500, 1,
[Training] weight: FilterDescriptor 00000186430E8970
[Training]     type = CUDNN_DATA_FLOAT
[Training]     tensor_format = CUDNN_TENSOR_NCHW
[Training]     nbDims = 4
[Training]     dimA = 512, 80, 1, 3,
[Training] Pointer addresses:
[Training]     input: 00000007A26F0800
[Training]     output: 00000007A4000000
[Training]     weight: 0000000744F68400
```
Owner

That looks like an issue that crops up during training itself rather than the web UI. I'm assuming it offers a code block to try and reproduce it:

```
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([8, 80, 1, 1000], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(80, 512, kernel_size=[1, 3], padding=[0, 1], stride=[1, 2], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
```

I just ran it for shits and grins and nothing broke for me on my dingy 2060.

I'm not too sure what exactly would lead to a weird driver state like that, but as much as I hate to suggest it, I'll suggest (in no particular order):

  • under `Settings`, check `Defer TTS Load` and restart the UI to train again
  • restart your computer

I doubt it's a YAML training configuration issue, as it would've complained if:

  • you set too low of a batch size (anything 3 and below)
  • you had an older generated YAML config from before I cleaned up things
Owner

And as much as I hate providing StackOverflow references, [this one](https://stackoverflow.com/questions/62067849/pytorch-model-training-runtimeerror-cudnn-error-cudnn-status-internal-error) mentions:

> The error RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR is notoriously difficult to debug, but surprisingly often it's an out of memory problem. Usually, you would get the out of memory error, but depending on where it occurs, PyTorch cannot intercept the error and therefore not provide a meaningful error message.

It being a rather vague OOM issue, I would suggest enabling `Defer TTS Load` and restarting (because forcing TorToiSe to unload and calling GC just doesn't actually make it deallocate) to free up some VRAM, and lowering your batch size.

Author

> I just ran it for shits and grins and nothing broke for me on my dingy 2060.

Can you give me the parameters you used? I have almost the same GPU (2060 Laptop version). The program suggested a batch size of 64, which I reduced to 32, and I left all other parameters as is.
After loading the dvae, it did some "3D work", and then ran out of memory.

```
[Training] [2023-02-19T11:31:45.710841] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 6.00 GiB total capacity; 5.09 GiB already allocated; 0 bytes free; 5.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Also, I get this just after loading the dvae.

```
[Training] [2023-02-19T11:30:06.316706] WARNING! Unable to find EMA network! Starting a new EMA from given model parameters.
[Training] [2023-02-19T11:30:06.323711]
[Training] [2023-02-19T11:30:56.771552]   0%|          | 0/44 [00:00<?, ?it/s]C:\Users\Username\Desktop\ai-voice-cloning\venv\lib\site-packages\torch\optim\lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
```
Owner

> 2060

Yeah... you're going to have to CBT yourself and use a Colab notebook to train. I haven't gotten it to work locally on my 2060 at all, as even a batch size of 4 will cause it to OOM. I'm guessing it's a mix from all the shit that gets loaded during training and all the fragmentation that even something small like 20MiB can't get allocated.

I even tried that `max_split_size_mb` override, but I don't think it does anything. I could be setting it wrong, but it hasn't given any different results.
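
For reference, the override only does anything if it's in the environment before the first CUDA allocation; a minimal sketch of how it's usually applied (the 128 MiB value here is just an example):

```
import os

# Must be set before torch creates its CUDA caching allocator,
# i.e. before the first tensor is placed on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the env var on purpose

x = torch.zeros(1, device="cuda")  # the allocator is initialized here, with the setting applied
```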

Author

> I haven't gotten it to work locally on my 2060 at all

Oh ok, I'll just stick to the colab then.

By the way, I think ec550d74fd broke the notebook, because now it says `No module named 'tortoise'`. For some reason the tortoise-tts folder doesn't get copied to the ai-voice-cloning folder in Colab. Git cloning it there manually also doesn't seem to fix it.

EDIT: I moved `ai-voice-cloning/tortoise-tts/tortoise` to `ai-voice-cloning/tortoise`, and now it can locate the `api.py` file.

Owner

Yeah, I just realized that while fucking about with my Colab. I'll need to update it.

Owner

Crashed out before I could mention it: notebook updated in 3891870b5d.

mrq closed this issue 2023-02-20 15:46:52 +00:00