Nvidia Driver Woes - Super slow training #399

Open
opened 2023-09-28 13:09:38 +00:00 by stilltravelling · 5 comments

It's been a while since I last fine-tuned a model, but I thought I would train again today. I'm using Windows 11 (this might affect 10 too), and when I tried today, training was horrendously slow. I tried updating my drivers (573.42) and training was still incredibly slow, something that would take at least 10 hours.

Now I've reinstalled some older drivers (531.79) and training is back to the speed it was (about 40 minutes). Does anyone know why the new drivers are so bad for training?

I should point out that generation did not seem to be impacted by the new drivers, just training.

Owner

I remember reading in the LLaMA sphere to stick with older Nvidia driver versions, because newer drivers will happily spill over-committed VRAM allocations onto system RAM. I haven't encountered it myself on my training rig with version 535.104.05, as I still get actual OOM errors.

If you're using the same dials and knobs then I'm not sure why it would be an issue, but under the new drivers it might be "mitigated" by lowering the batch size.
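If you want to confirm whether that's what's happening, a quick sanity check is to compare PyTorch's peak reserved memory against the card's dedicated VRAM after an iteration or two. A minimal sketch (just torch.cuda bookkeeping, nothing specific to this trainer; the variable names are only illustrative):

```python
import torch

# Compare PyTorch's peak reserved memory against the card's dedicated VRAM.
# Run after an iteration or two so the allocator has built up its working set.
device = torch.device("cuda:0")
total = torch.cuda.get_device_properties(device).total_memory  # dedicated VRAM, bytes
peak = torch.cuda.max_memory_reserved(device)                  # peak reserved by the allocator

print(f"dedicated:     {total / 2**30:.1f} GiB")
print(f"peak reserved: {peak / 2**30:.1f} GiB")
print(f"headroom:      {(total - peak) / 2**30:.1f} GiB")

# With little or no headroom, newer drivers may silently spill the overflow into
# shared system memory instead of raising an OOM, which tanks iteration speed;
# lowering the batch size keeps everything in dedicated VRAM.
```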

Thanks for the reply. I think you're right: I was watching dedicated GPU memory and shared GPU memory while using the newer drivers, and shared GPU memory seemed much, much higher than I remember. Sadly I didn't check with the older drivers, because the run was so quick and I was happy.

The dials and knobs were the same: batch size 48, gradient accumulation 16, 110 epochs, processing 78 files (probably a zany setup anyway, I was being very lazy). I suspect you're right about it spilling into system RAM with the newer drivers: if VRAM gets near the limit, they seem to push some of it into shared memory. Hopefully I'll get a bit of time to experiment later.

ASUS ROG STRIX - GeForce RTX 2080 Ti
Newer drivers:
23-09-28 11:37:31.534 - INFO: Random seed: 2255
23-09-28 11:37:38.446 - INFO: Number of training data elements: 78, iters: 2
23-09-28 11:37:38.446 - INFO: Total epochs needed: 110 for iters 220
23-09-28 11:37:54.366 - INFO: Loading model for [./models/tortoise/autoregressive.pth]
23-09-28 11:37:58.694 - INFO: Start training from epoch: 0, iter: 0
23-09-28 11:40:10.994 - INFO: Training Metrics: {"loss_text_ce": 5.364519119262695, "loss_mel_ce": 4.1386260986328125, "loss_gpt_total": 4.192271709442139, "lr": 0.0001, "it": 1, "step": 1, "steps": 1, "epoch": 0, "iteration_rate": 123.2040798664093}
23-09-28 11:48:02.006 - INFO: Training Metrics: {"loss_text_ce": 5.445357322692871, "loss_mel_ce": 3.8755671977996826, "loss_gpt_total": 3.930020809173584, "lr": 0.0001, "it": 2, "step": 1, "steps": 1, "epoch": 1, "iteration_rate": 470.49365401268005}

Older drivers (531.79):
23-09-28 14:00:28.698 - INFO: Random seed: 9818
23-09-28 14:00:30.855 - INFO: Number of training data elements: 78, iters: 2
23-09-28 14:00:30.855 - INFO: Total epochs needed: 110 for iters 220
23-09-28 14:00:45.393 - INFO: Loading model for [./models/tortoise/autoregressive.pth]
23-09-28 14:00:46.264 - INFO: Start training from epoch: 0, iter: 0
23-09-28 14:01:10.522 - INFO: Training Metrics: {"loss_text_ce": 5.283626556396484, "loss_mel_ce": 3.9180142879486084, "loss_gpt_total": 3.9708499908447266, "lr": 0.0001, "it": 1, "step": 1, "steps": 1, "epoch": 0, "iteration_rate": 13.430541515350342}
23-09-28 14:01:23.141 - INFO: Training Metrics: {"loss_text_ce": 5.097756385803223, "loss_mel_ce": 3.7688241004943848, "loss_gpt_total": 3.8198018074035645, "lr": 0.0001, "it": 2, "step": 1, "steps": 1, "epoch": 1, "iteration_rate": 11.764817714691162}
23-09-28 14:01:33.397 - INFO: Training Metrics: {"loss_text_ce": 5.057187557220459, "loss_mel_ce": 3.628704071044922, "loss_gpt_total": 3.67927622795105, "lr": 5e-05, "it": 3, "step": 1, "steps": 1, "epoch": 2, "iteration_rate": 9.741585493087769}
23-09-28 14:01:43.465 - INFO: Training Metrics: {"loss_text_ce": 5.017062187194824, "loss_mel_ce": 3.565491199493408, "loss_gpt_total": 3.615661382675171, "lr": 5e-05, "it": 4, "step": 1, "steps": 1, "epoch": 3, "iteration_rate": 9.618443250656128}
23-09-28 14:01:53.392 - INFO: Training Metrics: {"loss_text_ce": 4.947103977203369, "loss_mel_ce": 3.4956722259521484, "loss_gpt_total": 3.5451436042785645, "lr": 5e-05, "it": 5, "step": 1, "steps": 1, "epoch": 4, "iteration_rate": 9.516880750656128}

I'm very new to this. I'm creating my own voices from scratch, and training a model on ~90 minutes of good audio is taking 10-24 hours on my 3070. Is this normal? From what you're saying it should be a lot quicker!
Thanks.
My 'dials':
"epochs": 50,
"learning_rate": 1e-05,
"mel_lr_weight": 1,
"text_lr_weight": 0.01,
"learning_rate_scheme": "Multistep",
"learning_rate_schedule": "",
"learning_rate_restarts": 4,
"batch_size": 128,
"gradient_accumulation_size": 32,
"save_rate": 5,
"validation_rate": 5,
"half_p": false,
"bitsandbytes": true,
"validation_enabled": false,
"workers": 2,
"gpus": 1,
"source_model": "./models/tortoise/autoregressive.pth"

From my log:

23-10-02 22:33:58.007 - INFO: Random seed: 1907
23-10-02 22:33:59.122 - INFO: Number of training data elements: 1,138, iters: 9
23-10-02 22:33:59.122 - INFO: Total epochs needed: 50 for iters 450
23-10-02 22:34:06.853 - INFO: Loading model for [./models/tortoise/autoregressive.pth]
23-10-02 22:34:07.598 - INFO: Start training from epoch: 0, iter: 0
23-10-02 22:38:57.887 - INFO: Training Metrics: {"loss_text_ce": 5.198568820953369, "loss_mel_ce": 2.8442232608795166, "loss_gpt_total": 2.8962090015411377, "lr": 1e-05, "it": 1, "step": 1, "steps": 8, "epoch": 0, "iteration_rate": 282.99078822135925}
23-10-02 22:41:09.326 - INFO: Training Metrics: {"loss_text_ce": 5.175328731536865, "loss_mel_ce": 2.7385458946228027, "loss_gpt_total": 2.790299415588379, "lr": 1e-05, "it": 2, "step": 2, "steps": 8, "epoch": 0, "iteration_rate": 131.43751049041748}
23-10-02 22:43:45.563 - INFO: Training Metrics: {"loss_text_ce": 5.136586666107178, "loss_mel_ce": 2.6687371730804443, "loss_gpt_total": 2.7201030254364014, "lr": 1e-05, "it": 3, "step": 3, "steps": 8, "epoch": 0, "iteration_rate": 156.23569202423096}
23-10-02 22:47:49.691 - INFO: Training Metrics: {"loss_text_ce": 5.105230331420898, "loss_mel_ce": 2.618992805480957, "loss_gpt_total": 2.6700451374053955, "lr": 1e-05, "it": 4, "step": 4, "steps": 8, "epoch": 0, "iteration_rate": 244.1250512599945}
23-10-02 22:52:33.681 - INFO: Training Metrics: {"loss_text_ce": 5.0865912437438965, "loss_mel_ce": 2.5784993171691895, "loss_gpt_total": 2.6293649673461914, "lr": 1e-05, "it": 5, "step": 5, "steps": 8, "epoch": 0, "iteration_rate": 283.9891369342804}
23-10-02 22:56:07.132 - INFO: Training Metrics: {"loss_text_ce": 5.0709428787231445, "loss_mel_ce": 2.5455710887908936, "loss_gpt_total": 2.596280336380005, "lr": 1e-05, "it": 6, "step": 6, "steps": 8, "epoch": 0, "iteration_rate": 213.44964361190796}
23-10-02 23:00:44.174 - INFO: Training Metrics: {"loss_text_ce": 5.059730529785156, "loss_mel_ce": 2.5179443359375, "loss_gpt_total": 2.5685417652130127, "lr": 1e-05, "it": 7, "step": 7, "steps": 8, "epoch": 0, "iteration_rate": 277.0396902561188}
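For reference, the iteration counts in that log follow directly from the dataset size and batch size, so the run length mostly comes down to the per-iteration time. A rough sketch of the arithmetic (assuming the trainer simply ceil-divides the dataset by the batch size, which is what the log suggests):

```python
import math

dataset_size = 1138   # "Number of training data elements: 1,138"
batch_size = 128
epochs = 50

iters_per_epoch = math.ceil(dataset_size / batch_size)  # 9, matches "iters: 9"
total_iters = iters_per_epoch * epochs                   # 450, matches "for iters 450"

# Ballpark wall-clock time from the logged iteration_rate (seconds per iteration);
# the log above bounces between roughly 130 and 285 s on the 3070.
for secs_per_iter in (130, 285):
    hours = total_iters * secs_per_iter / 3600
    print(f"{secs_per_iter} s/iter -> ~{hours:.0f} h total")
```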

I know that the number of iterations will be higher, but each of my iterations is also taking much longer. Is that expected too?
Even if I had a dataset of 78 samples, I think it would still take much longer for me than it does for you.
I also notice your learning rate is much higher in the second set of logs.

The reason for the logs was to show how long each iteration took under the different drivers with the same data. You can see the gaps between the iteration timestamps are much smaller with the older drivers (around 10 seconds per iteration) than with the newer drivers I was trying (123 seconds, then 470 seconds). I gave up after two iterations because it was taking so long. The learning rate schedule was the same for both; the second log just shows more iterations.
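To put numbers on it, extrapolating those logged iteration_rate values over the full run gives roughly the times I mentioned in the first post. A quick sketch of that arithmetic (rates taken straight from the logs above):

```python
total_iters = 220  # "Total epochs needed: 110 for iters 220"

# Seconds per iteration, from the "iteration_rate" fields in the two logs.
old_driver_rate = 10    # 531.79: roughly 9.5-13 s per iteration once warmed up
new_driver_rate = 470   # newer drivers: the second iteration already took ~470 s

print(f"531.79:        ~{total_iters * old_driver_rate / 60:.0f} min total")   # ~37 min
print(f"newer drivers: ~{total_iters * new_driver_rate / 3600:.0f} h total")   # ~29 h
```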

A lot of it is simply trial and error: finding something that works with your dataset and tweaking the dials. Frustrating, but very rewarding when you get really good results at the end. Start off small and then build up.
