Discussion About BitsAndBytes Integration #25

Closed
opened 2023-02-23 03:54:28 +07:00 by mrq · 23 comments

~~Completely not coping for a lack of discussion or any way to provide news to my users.~~

My fellow gamers, I've struck gold once more with another impossible feat: training on low(er) VRAM cards (screenshot attached).

With it, I'm able to train models on my 2060 (batch size 3, mega batch factor 2).

These gains leverage TimDettmers/bitsandbytes, with deep magic wizardry involving quantization to net huge VRAM gains.

In short (at least, from my understanding of quantizing vertex data to "compress" meshes in graphics rendering), this will quantize float32 data into int8 and leverage special hardware on Turing/Ampere/Lovelace cards to compute on those tensors, achieving 4x VRAM savings and 4x data throughput.

In theory. This should pretty much only quantize the big models like the autoregressives and maybe the VQVAE, and any performance uplift would come from anything bandwidth limited (to my understanding at least, I am not a professional). I honestly don't care about performance uplifts, as I'm more focused on reducing VRAM usage.
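As a toy illustration of the quantization idea itself (plain PyTorch absmax scaling, nothing bitsandbytes-specific; it just shows where the 4x comes from):

```python
# Toy absmax int8 quantization in plain PyTorch, to illustrate the 4x size reduction.
# This is only the concept; bitsandbytes does this blockwise with far more machinery.
import torch

x = torch.randn(1024)                          # float32: 4 bytes per value
scale = x.abs().max() / 127.0                  # map the largest magnitude to 127
q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)   # 1 byte per value
x_hat = q.to(torch.float32) * scale            # dequantize: approximate reconstruction

print(x.element_size(), q.element_size())      # 4 vs 1 -> the "4x VRAM savings"
print((x - x_hat).abs().max().item())          # small, bounded quantization error
```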

For now, I'm training both locally and on a Paperspace instance with an A4000, and both seem to be training swimmingly. I'm not sure how much of a quality hit either will have from training effectively at int8, as float16 does have inherent quality problems.

> OK cool, how do I use it?

Right now, it should be as easy as running the update script for it to pull an update to mrq/DL-Art-School and install an extra dependency: bitsandbytes==0.35.0.

On Linux, no additional setup is required. On Windows:

  • open ./ai-voice-cloning/dlas/bitsandbytes_windows/ in one window
  • open ./ai-voice-cloning/venv/Lib/site-packages/bitsandbytes/
  • copy the folder from the first into the second

In the future, I'll include this for Windows in the setup batch files.

By default, the MITMing of some torch.optim calls is disabled, as I need to validate the output models, but it should be as simple as flipping a switch when it's ready.

If you're fiending to train absolutely now, you can open ./ai-voice-cloning/dlas/codes/torch_intermediary/__init__.py, and change lines 16 and 17 from False to True.
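For the curious, here's the general shape of that intermediary (a hypothetical sketch; the actual flag and function names in torch_intermediary/__init__.py may differ):

```python
# Hypothetical sketch of the torch_intermediary "MITM": the trainer imports its
# optimizers from this module instead of torch.optim, and a switch decides whether
# the stock class or the bitsandbytes 8-bit one is returned. Names are illustrative.
import torch
import bitsandbytes as bnb

USE_BNB_OPTIMIZERS = True   # stand-in for the line 16/17 booleans mentioned above

def AdamW(params, lr=1e-4, **kwargs):
    if USE_BNB_OPTIMIZERS:
        # keeping optimizer state quantized to 8 bits is the main VRAM win here
        return bnb.optim.AdamW8bit(params, lr=lr, **kwargs)
    return torch.optim.AdamW(params, lr=lr, **kwargs)
```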

> B-but what's the catch?

So far, I don't see any, except for annoying console spam on Windows.

BitsAndBytes also boasts optimizations for inferencing, and I can eventually leverage it for generating audio too, but I need to see how promising it is for training before I burden mrq/tortoise-tts with more intermediary systems.

mrq added the enhancement label 2023-02-23 03:54:28 +07:00
mrq added the news label 2023-02-23 03:55:33 +07:00
Poster
Owner

Success!

I was able to successfully train a model on my 2060. I didn't have anyone come and break in and CBT me, nor are the outputs from a model finetuned with this optimization degraded at all; they sound similar to a model trained without it (I trained it up until the same loss rate).

I've tested training on Windows and on a Paperspace instance running Linux, so both are good to go. I'm pretty sure bitsandbytes won't work with Linux + ROCm, as the libs seem to be dependent on a CUDA runtime, so ROCm's torch.cuda compatibility layer won't be able to catch it (on the other hand, you probably have more than enough VRAM anyway if you do have an AMD card).
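If anyone wants to guard that path programmatically, here's a rough sketch (my own illustration, not something in the repo): ROCm builds of torch expose torch.version.hip, which can be used to skip bitsandbytes when the stock CUDA build won't load.

```python
# Illustrative only, not code from the repo: detect a ROCm build of torch before
# enabling the bitsandbytes path. torch.version.hip is a version string on ROCm
# builds and None on CUDA builds.
import torch

def bitsandbytes_usable() -> bool:
    is_rocm = getattr(torch.version, "hip", None) is not None
    return torch.cuda.is_available() and not is_rocm
```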

The only things I haven't tested yet are:

  • setting up this repo from a clean install, and validating it works (it should-ish, as it worked on my paperspace instance)
  • setting up this repo from an existing install
    • existing users will need to follow the above post and copy the files from .\dlas\bitsandbytes_windows\

Hello. I have good news and a potential bug report.

I ran a 15 sample dataset, batch size 16, mega batch factor 4, at half precision. I ran it for 125 epochs for a quick stability test.

I ran this on an RTX 3070 with 8 GB of VRAM.

The VRAM usage hovered around 7.2-7.6 GB, but thankfully it never OOMed and it was able to complete its training.

(screenshot attached)

However, after loading up this new model and attempting to generate a new voice, I get the following error:

C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning>call .\venv\Scripts\activate.bat
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Loading TorToiSe... (using model: ./models/finetunes//125_gpt.pth)
Hardware acceleration found: cuda
Loaded TorToiSe, ready for generation.
Reading from latent: ./voices\21VO\cond_latents.pth
Traceback (most recent call last):
  File "C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning\src\webui.py", line 49, in run_generation
    sample, outputs, stats = generate(
  File "C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning\src\utils.py", line 225, in generate
    if prompt.strip() != "":
AttributeError: 'NoneType' object has no attribute 'strip'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning\venv\lib\site-packages\gradio\routes.py", line 384, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 1024, in process_api
    result = await self.call_function(
  File "C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 836, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning\venv\lib\site-packages\gradio\helpers.py", line 584, in tracked_fn
    response = fn(*args)
  File "C:\Users\LXC PC\Desktop\mrqtts\ai-voice-cloning\src\webui.py", line 78, in run_generation
    raise gr.Error(message)
gradio.exceptions.Error: "'NoneType' object has no attribute 'strip'"
Poster
Owner

> thankfully it never OOMed and it was able to complete its training.

Good to hear.

> if prompt.strip() != "":
> AttributeError: 'NoneType' object has no attribute 'strip'

Not too sure how you managed to gum it up, but it should be remedied in commit 1cbcf14cff.
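For anyone who hits it before pulling, the failure is just an unguarded prompt coming back as None from the gradio callback; a guard along these lines sidesteps it (my paraphrase, not necessarily the exact change in that commit):

```python
# Sketch of the kind of guard that avoids the NoneType traceback above;
# not necessarily the exact change in commit 1cbcf14cff.
def has_prompt(prompt) -> bool:
    # the callback can receive None instead of "" when the textbox is untouched
    return prompt is not None and prompt.strip() != ""
```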


@mrq I'm training with an approx. 100 sample dataset, with the exact same settings as you. It calculated the total steps as 15666.
(500 epochs * 94 lines / batch size 3)
When I start training, I get a progress bar which goes up to 31 (which is 94 lines / 3).

(screenshot attached)

In the "Run Training" tab in the WebUI, in the progress counter [x/xxxxx], 'x' goes up by one every time the progress bar completes. To my knowledge, that should be one epoch.

However, the WebUI shows the total steps [x / 15666] instead of the total epochs [x / 500]. So the ETA is a ridiculously high number, and it also doesn't save training states as configured (i.e. it won't save 5 times per epoch, but instead once every 4-5 epochs).

(screenshot attached)

Unless I'm misunderstanding and that's just how long it takes with bitsandbytes...

Poster
Owner

No, you're right; there's a bit of a discrepancy. The training output will interchange iterations and steps, as evident here:

23-02-23 02:30:25.212 - INFO: [epoch:124, iter:   2,750, lr:(6.250e-07,6.250e-07,)] step: 2.7500e+03 [...]

while I make the assertion that steps make up iterations, and iterations and epochs are interchangeable.

I'll adjust the parsing output to reflect this. Goes to show that I shouldn't have done this mostly on max batch sizes.
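To put numbers on it with the figures from the report above (illustrative bookkeeping, not the trainer's actual code):

```python
# Reproducing the numbers above: 94 lines, batch size 3, 500 target epochs.
dataset_lines = 94
batch_size = 3
target_epochs = 500

steps_per_epoch = dataset_lines // batch_size                # 31: the per-epoch progress bar
total_steps = (target_epochs * dataset_lines) // batch_size  # 15666: what the UI was counting

def epoch_of(step: int) -> int:
    return step // steps_per_epoch                           # what the UI should display instead

print(steps_per_epoch, total_steps)                          # 31 15666
```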


As I understand it, when the progress bar in the command line fills up, it's gone through the entire dataset, thus one epoch is complete. This is reflected in the webui by increasing the value of x in [x/xxxxx] by one. However, the value of xxxxx, which should be the number of epochs (or iterations), is currently the number of steps (which is a much higher number obviously). So the progress bar fills up much slower, and the ETA is way off.

> the iterations I pass to the YAML are interpreted as step counts for training.

I think that's what I mean as well, though I'm not sure.

> The training output will interchange iterations and steps, as evident here.

Right, that makes sense. Thanks!

Poster
Owner

Remedied in commit 487f2ebf32, at the cost of my Bateman finetune because I accidentally deleted it and not the re-test setup (oh well). Output units are now in terms of epochs, rather than the incestuous blend of iterations and steps.

Funny, since I was intending to make it show in terms of epochs anyways.


> because I accidentally deleted it and not the re-test setup

oh dear :(

By the way, another strange bug I've noticed sometimes while training: if you have Verbose Console Output checked and are on the Run Training sub-tab, the training goes extremely slowly, to the point where it looks like it's frozen. Changing the sub-tab or the tab fixes the issue, and the training resumes at normal speed.

I have no idea what could be causing it, or if it occurs every time, but I thought I'd let you know.

Poster
Owner

> oh dear :(

Not a huge deal, only the 5 hours I slept while it trained.

> I have no idea what could be causing it, or if it occurs every time, but I thought I'd let you know.

I'll look into it in a bit, but I wouldn't rely on verbose outputting since it interrupts the progress, and was a remnant of debugging.

mrq changed title from Highly Experimental BitsAndBytes Support to Discussion About BitsAndBytes Integration 2023-02-23 16:38:35 +07:00

I keep getting these errors while training. I did a clean install as well, but it still can't find libcudart.so.

Poster
Owner

Did you copy the files from .\ai-voice-cloning\dlas\bitsandbytes_windows\ into .\ai-voice-cloning\venv\Lib\site-packages\bitsandbytes\?

The setup-cuda.bat script should do it automagically with: copy .\dlas\bitsandbytes_windows\* .\venv\Lib\site-packages\bitsandbytes\. /Y


Hi
I had the same error.

copy .\dlas\bitsandbytes_windows\* .\venv\Lib\site-packages\bitsandbytes\. /Y
didn't overwrite for me; I still had to copy manually.

Now it's training fine on 8 GB of VRAM.

However, CUDA usage seems very low (just spikes, no constant usage), and CPU usage is high at a constant ~95% average.

I'll compare with the same model trained on Colab later.

Poster
Owner

> didn't overwrite for me; I still had to copy manually.

Noted, I'll dig around for the right flags to assume overwrite (although I thought that was what /Y did). I guess that might explain why, despite doing that command myself on my main setup, it didn't update to remove all the nagging messages.

> However, CUDA usage seems very low (just spikes, no constant usage), and CPU usage is high at a constant ~95% average.

That seems fairly right (the low GPU usage at least, I didn't take note of CPU utilization); I noticed on my 2060 that it was heavily underutilized during training (similar to how it is during inference/generating audio), while all the other training I did on paperspace instances had the GPU pretty much pinned.

I assume it's just because it's bottlenecked by low batch sizes. In other words, it's just starved for work because of low VRAM.

Poster
Owner

It turned out that I somehow assumed copy/xcopy just copied folders too, so all the files should copy now for bitsandbytes.

I didn't get a good chance to test with Verbose console whatever, but I made some changes so it only actually stores the last buffer_size messages instead of just slicing every time, although that issue should have affected normal training.
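Roughly the pattern, for the curious (a sketch assuming a bounded deque, not the actual diff):

```python
# Sketch of the "only store the last buffer_size messages" change described above,
# using a bounded deque instead of re-slicing a growing list on every new line.
from collections import deque

buffer_size = 8000                          # illustrative value
console_buffer = deque(maxlen=buffer_size)

def push_line(line: str) -> None:
    console_buffer.append(line)             # old entries fall off automatically

def console_text() -> str:
    return "\n".join(console_buffer)
```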

[Training] [2023-03-02T01:01:30.089720] 
[Training] [2023-03-02T01:01:30.089843]   0%|          | 0/2 [00:08<?, ?it/s]
[Training] [2023-03-02T01:01:30.089856] Traceback (most recent call last):
[Training] [2023-03-02T01:01:30.089863]   File "/home/linuxuser/Documents/ai-voice-cloning/./src/train.py", line 93, in <module>
[Training] [2023-03-02T01:01:30.089889]     train(args.opt, args.launcher)
[Training] [2023-03-02T01:01:30.089894]   File "/home/linuxuser/Documents/ai-voice-cloning/./src/train.py", line 80, in train
[Training] [2023-03-02T01:01:30.089912]     trainer.do_training()
[Training] [2023-03-02T01:01:30.089917]   File "/home/linuxuser/Documents/ai-voice-cloning/./dlas/codes/train.py", line 330, in do_training
[Training] [2023-03-02T01:01:30.089956]     self.do_step(train_data)
[Training] [2023-03-02T01:01:30.089961]   File "/home/linuxuser/Documents/ai-voice-cloning/./dlas/codes/train.py", line 211, in do_step
[Training] [2023-03-02T01:01:30.089984]     gradient_norms_dict = self.model.optimize_parameters(self.current_step, return_grad_norms=will_log)
[Training] [2023-03-02T01:01:30.089989]   File "/home/linuxuser/Documents/ai-voice-cloning/./dlas/codes/trainer/ExtensibleTrainer.py", line 373, in optimize_parameters
[Training] [2023-03-02T01:01:30.090025]     self.consume_gradients(state, step, it)
[Training] [2023-03-02T01:01:30.090029]   File "/home/linuxuser/Documents/ai-voice-cloning/./dlas/codes/trainer/ExtensibleTrainer.py", line 418, in consume_gradients
[Training] [2023-03-02T01:01:30.090060]     step.do_step(it)
[Training] [2023-03-02T01:01:30.090064]   File "/home/linuxuser/Documents/ai-voice-cloning/./dlas/codes/trainer/steps.py", line 365, in do_step
[Training] [2023-03-02T01:01:30.094302]     self.scaler.step(opt)
[Training] [2023-03-02T01:01:30.094315]   File "/home/linuxuser/Documents/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 313, in step
[Training] [2023-03-02T01:01:30.094468]     return optimizer.step(*args, **kwargs)
[Training] [2023-03-02T01:01:30.094475]   File "/home/linuxuser/Documents/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
[Training] [2023-03-02T01:01:30.094664]     return wrapped(*args, **kwargs)
[Training] [2023-03-02T01:01:30.094669]   File "/home/linuxuser/Documents/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 140, in wrapper
[Training] [2023-03-02T01:01:30.094889]     out = func(*args, **kwargs)
[Training] [2023-03-02T01:01:30.094897]   File "/home/linuxuser/Documents/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
[Training] [2023-03-02T01:01:30.094919]     return func(*args, **kwargs)
[Training] [2023-03-02T01:01:30.094923]   File "/home/linuxuser/Documents/ai-voice-cloning/venv/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 265, in step
[Training] [2023-03-02T01:01:30.094959]     self.update_step(group, p, gindex, pindex)
[Training] [2023-03-02T01:01:30.094963]   File "/home/linuxuser/Documents/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
[Training] [2023-03-02T01:01:30.094977]     return func(*args, **kwargs)
[Training] [2023-03-02T01:01:30.094981]   File "/home/linuxuser/Documents/ai-voice-cloning/venv/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 506, in update_step
[Training] [2023-03-02T01:01:30.095120]     F.optimizer_update_8bit_blockwise(
[Training] [2023-03-02T01:01:30.095126]   File "/home/linuxuser/Documents/ai-voice-cloning/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 858, in optimizer_update_8bit_blockwise
[Training] [2023-03-02T01:01:30.095654]     str2optimizer8bit_blockwise[optimizer_name][0](
[Training] [2023-03-02T01:01:30.095687] NameError: name 'str2optimizer8bit_blockwise' is not defined

I guess this is related: I can't train on AMD anymore; it seems to get stuck on bitsandbytes despite me unchecking all the boxes related to it in Gradio.

Poster
Owner

If you want to be a guinea pig, run:

source ./venv/bin/activate
pip3 uninstall bitsandbytes
pip install git+https://github.com/broncotc/bitsandbytes-rocm
deactivate

It's a specific variant for AMD cards (Linux only). I was going to get around to testing it end-of-day-tomorrow when my impulsive hardware purchase comes in, but you're free to try it for yourself.

If that won't work, then you should be able to force it off by editing ./src/train.py and copy-pasting this block (see the attached screenshot).


> If you want to be a guinea pig, run:
>
> source ./venv/bin/activate
> pip3 uninstall bitsandbytes
> pip install git+https://github.com/broncotc/bitsandbytes-rocm
> deactivate
>
> It's a specific variant for AMD cards (Linux only). I was going to get around to testing it end-of-day-tomorrow when my impulsive hardware purchase comes in, but you're free to try it for yourself.
>
> If that won't work, then you should be able to force it off by editing ./src/train.py and copy-pasting this block:

I installed the ROCm version and can confirm it's actually started training now on my 6900 XT; I'll drop an update in if/when the training finishes.

Poster
Owner

Naisu, I'll edit the setup-rocm.sh install script to have it install that.


> Naisu, I'll edit the setup-rocm.sh install script to have it install that.

do get this "error" now when launching, might need removing or contextualising for AMD users

CUDA SETUP: Setup Failed!
CUDA SETUP: Setup Failed!
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
<make_cmd here, commented out>
python setup.py install

It doesn't stop Tortoise from starting, though.

Poster
Owner

I guess I'll need to have the setup script do that. For now I'll just have it uninstall bitsandbytes when installing through ROCm, and tomorrow I should be able to get something cobbled to set it up.

Poster
Owner

Alrighty, I've successfully cobbled together my dedicated Linux + AMD system. Seems it's a bit of a chore to compile bitsandbytes-rocm. I might need to host a fork here with a slightly edited Makefile to make lives easier for Arch Linux users like myself (the fork's Makefile assumes hipcc under /usr/bin/, but also assumes some incs/libs under /opt/rocm-5.3.0/).

It's fairly simple, provided I didn't dirty my env vars. Just:

source ./venv/bin/activate
# make sure they're uninstalled
pip3 uninstall bitsandbytes
pip3 uninstall bitsandbytes-rocm
git clone -b rocm https://github.com/0cc4m/bitsandbytes-rocm/
cd bitsandbytes-rocm
make hip
CUDA_VERSION=gfx1030 python setup.py install # assumes you're using a 6XXX series card
python3 -m bitsandbytes # makes sure it works

I'll validate this works by reproducing it with a script, and then ship it off as setup-rocm-bnb.sh (able to be called for existing setups, and gets called for new setups).

Poster
Owner

Naisu. Added the setup script for bitsandbytes-rocm in e205322c8d. Technically you can delete bitsandbytes-rocm afterwards, as the compiled .egg gets copied over.


I'm an idiot that bought a Pascal Quadro instead of a 3090.

This is what I get:


CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: TODO: compile library for specific version: libbitsandbytes_cuda121_nocublaslt.so
CUDA SETUP: Defaulting to libbitsandbytes.so...
CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!
CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.

I'm told it WILL work for training. After a slight patch that fixes CUDA detection, it is possible to use it. Unfortunately, it is very slow: about half the speed of FP16.
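For reference, a quick way to check whether a card has the newer int8 path bitsandbytes wants (a sketch; Turing and newer report compute capability 7.5+, while Pascal's 6.1 does not):

```python
# Quick check of compute capability; bitsandbytes' fast int8 kernels (cublasLt) want
# Turing or newer (7.5+), so Pascal's 6.1 falls back to the slow "nocublaslt" path above.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability {major}.{minor}")
if (major, minor) < (7, 5):
    print("no int8 tensor cores: expect the bitsandbytes fallback path, and slower training")
```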

mrq closed this issue 2023-03-13 17:38:29 +07:00