Update 'Training'

master
mrq 2023-03-09 19:49:00 +07:00
parent c94dc8652d
commit 41d297482b
1 changed files with 6 additions and 10 deletions

@@ -61,10 +61,10 @@ This will generate the YAML necessary to feed into training. For documentation's
* `Learning Rate`: rate that determines how fast a model will "learn". Higher values train faster, but at the risk of frying the model, overfitting, or other problems. The default is "sane" enough for safety, especially in the scope of retraining, but definitely needs some adjustments. If you want faster training, bump this up to `0.0001` (1e-4), but be wary that you may fry your finetune without tighter scheduling.
* `Text_CE LR Weight`: an experimental setting to govern how much weight to factor in with the provided learning rate. This is ***a highly experimental tunable***, and is only exposed so I don't need to edit it myself when testing it. ***Leave this to the default 0.01 unless you know what you are doing.***
* `Learning Rate Scheme`: sets the type of learning rate adjustments, each one exposes its own options:
* `Multistep`: MultiStepLR, will decay at fixed intervals by a factor (default set to 0.5, so it will halve at every milestone); see the sketch after this list.
- `Learning Rate Schedule`: a list of epochs at which to decay the learning rate. More experiments are needed to determine optimal schedules.
* `Cos. Annealing`: CosineAnnealingLR_Restart, will gradually decay the learning rate over training, and restarts periodically
- `Learning Rate Restarts`: how many times to "restart" the learning rate schedule, with each restart dampened by a decay.
* `Batch Size`: how large of a batch size for training. Larger batch sizes will result in faster training steps, but at the cost of increased VRAM consumption. This value must exceed the size of your dataset, and *should* be evenly divisible by your dataset size.
* `Gradient Accumulation Size` (*originally named `mega batch factor`*): at first this seemed very confusing, but it's very simple. This will further divide batches into mini-batches, process them in sequence, and only update the model after completing all mini-batches. This effectively saves VRAM by de-facto running at a smaller batch size, but without constantly updating the model, as if running at a larger batch size. This does have some quirks, like crashing when saving at a specific batch size:gradient accumulation ratio, odd pacing of training, etc.
* `Print Frequency`: how often the trainer should print its training statistics, in epochs. Printing takes a little bit of time, but it's a nice way to gauge how a finetune is baking, as it lists your losses and other statistics. This is purely for debugging and babysitting whether a model is being trained adequately. The web UI *should* parse the information from stdout, grab the total loss, and report it back.
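To make the two learning rate schemes above a bit more concrete, below is a minimal sketch using stock PyTorch schedulers. The trainer uses its own scheduler implementations under the hood, so the class names, milestones, and restart period here are illustrative assumptions rather than what the generated YAML actually uses.

```python
# Illustrative only: stock PyTorch stand-ins for the two schemes above.
# The milestones, restart period, and model are made up for the sake of the example.
import torch

model = torch.nn.Linear(10, 10)                        # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)   # the default learning rate above

# Multistep: decay the learning rate by a factor at each epoch milestone (0.5 = halve it).
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[10, 25, 40], gamma=0.5)
# Cos. Annealing with restarts would instead look something like:
# sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=25, eta_min=1e-7)

for epoch in range(50):
    # ... run one epoch of training here (forward, backward, opt.step()) ...
    sched.step()  # step once per epoch so the decay follows the schedule
    print(epoch, opt.param_groups[0]["lr"])
```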
@@ -91,13 +91,11 @@ Getting decent training results is quite the pickle, and it seems my nuggets of
- In other words, there's no (perceptible) difference between training for 25 epochs, then another 25 epochs, versus training for 50 epochs straight.
- **!**NOTE**!**: there seems to be some quirk with the learning rate scheduler, where it'll take some time for it to "re-adjust" when resuming, so keep in mind your learning rate might be un-decayed for a few iterations when training-then-resuming.
* (assumption) leave the learning rate where it's at, as training with BitsAndBytes or half-precision will yield worse results the higher the learning rate is.
- For smaller datasets, you can crank this up to the maximum (1e-4), as it will be quicker for the learning rate schedule to decay this.
* Text CE LR Ratio most definitely should not be touched.
- I'm under the impression that it actually wants a high loss ratio to not overfit.
* As for a learning rate schedule, I *feel* like very large datasets require tighter scheduling, but the suggested schedule was for a dataset of around either 4k or 7k files, so it should be fine regardless.
* With MultiStepLR for a learning rate schedule, I need to do better tests on when the LR should decay.
* Your batch size and gradient accumulation size greatly determine how much VRAM gets consumed. It is a bit tough to nail right, yet easy to get wrong and end up with suboptimal training (desu, it should be a ratio instead of a factor).
- The batch size divided by the gradient accumulation size determines how much VRAM gets used. For example, similar VRAM is consumed when using a ratio of 64:1, 128:2, 256:4, 512:8, or 1024:16 (see the sketch after this list).
- The only downside is that increasing your gradient accumulation size means more system RAM is consumed, or at least it appears to. Reduce the worker count if needed.
- The smaller your print and save frequencies, the more time training will pause to return metrics and dump to disk. I don't think printing very often will harm *too* much, but it pleases my autism to have a tight resolution for my training losses. Naturally, saving often also means more checkpoints on disk.
- I haven't thoroughly tested half-precision training, as using it was the means of reducing VRAM consumption before BitsAndBytes was integrated.
- In theory, this should be mutually exclusive with BitsAndBytes, as in, you can only enable one or the other.
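To illustrate what the batch size:gradient accumulation ratio boils down to, here is a minimal sketch of gradient accumulation in a plain PyTorch loop; the batch size, accumulation size, and model are made-up stand-ins, and the trainer's actual implementation differs in its details.

```python
# Minimal gradient accumulation sketch (plain PyTorch; values are assumptions for illustration).
# With batch_size=128 and accumulation=2, each forward/backward pass only holds 64 samples,
# so VRAM use roughly matches a true batch size of 64 -- hence 64:1 ~ 128:2 ~ 256:4 above.
import torch

model = torch.nn.Linear(512, 512)                      # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

batch_size, accumulation = 128, 2
mini_batch = batch_size // accumulation                # samples actually resident on the GPU at once

for step in range(10):
    opt.zero_grad()
    for _ in range(accumulation):
        x = torch.randn(mini_batch, 512)               # stand-in mini-batch
        loss = model(x).mean()
        (loss / accumulation).backward()               # gradients accumulate across mini-batches
    opt.step()                                         # the model only updates once per full batch
```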
@@ -182,15 +180,13 @@ I have yet to fully train a model with validation enabled to see how well it far
**!**NOTE**!**: This is Linux only, simply because I do not have a way to test it on Windows, nor the care to port the shell script to the batch script. This is left as an exercise to the Windows user.
If you have multiple GPUs, you can easily leverage them by simply specifying how many GPUs you have in the `Run Training` tab. With it, it'll divide the workload by splitting the batches among the pool of GPUs (at least, that's my understanding).
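For the curious, the batch splitting amounts to roughly the following in plain PyTorch with `torch.distributed`; the trainer handles all of this for you, so treat the launcher, sampler, and model below as a generic, assumed illustration rather than the project's actual code.

```python
# Generic multi-GPU sketch: each GPU gets its own process and a disjoint shard of the batches.
# Launched with something like: torchrun --nproc_per_node=2 train_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group("nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

dataset = TensorDataset(torch.randn(1024, 512))        # stand-in dataset
sampler = DistributedSampler(dataset)                  # splits the dataset across the GPU pool
loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=2)  # keep workers modest

model = DDP(torch.nn.Linear(512, 512).cuda(rank), device_ids=[rank])
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

for (x,) in loader:
    opt.zero_grad()
    loss = model(x.cuda(rank)).mean()
    loss.backward()                                    # gradients are all-reduced across GPUs here
    opt.step()
```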
However, training large datasets (several thousand+ lines) seems to introduce some instability (at least with ROCm backends). I've had so, so, so, ***so*** many headaches over the course of a week trying to train a large dataset:
* initially, I was able to leverage insane batch sizes with proportionally insane gradient accumulation sizes (I think something like bs=1024, ga=16) for a day, but recreating configurations with those values will bring about instability (after one epoch it'll crash Xorg, and I can never catch whether it's from a system OOM). This could just be from additional adjustments, however.
* the worker process count needs to be reduced, as more processes are spawned for each GPU, leading to more system RAM pressure. If you have tons and tons of system RAM, you shouldn't worry (something like 1.5X your combined VRAM size should be fine).
* using even a rather conservative (~2%) validation dataset size can cause the GPUs to crash or time out.
* I believe a GPU that finishes early will block until the other GPU finishes its data; even with GPUs at near-parity, this allegedly adds some delay (the efficacy of a remedy for this is still being tested).
* at least for CosineAnnealingLR, it does not seem to step down when using multiple GPUs.
* it's very, very easy to fuck it up where the text model trains "better" (relatively) compared to the mel (not good).
* there *may* be additional configuration changes needed for it, but some hints just cause more harm than good.
Smaller workloads (a few hundred lines) don't seem to have these egregious issues, at least in my recent memory.