Update 'Training'

master
mrq 2023-03-04 15:43:50 +07:00
parent ff1523741f
commit 7bd385f0a7
1 changed file with 4 additions and 6 deletions

@@ -62,7 +62,7 @@ This will generate the YAML necessary to feed into training. For documentation's
* `Text_CE LR Weight`: an experimental setting that governs how much weight to factor in with the provided learning rate. This is ***a highly experimental tunable***, and is only exposed so I don't need to edit it myself when testing it. ***Leave this at the default 0.01 unless you know what you are doing.***
* `Learning Rate Schedule`: a list of epochs at which to decay the learning rate. You really should leave this as the default.
* `Batch Size`: how large of a batch size to use for training. Larger batch sizes will result in faster training steps, but at the cost of increased VRAM consumption. This value must not exceed the size of your dataset, and your dataset size *should* be evenly divisible by it.
* `Mega Batch Factor`: "Gradient accumulation factor". This was commented rather oddly, implying you should decrease it to save on VRAM, when the inverse is true. If you're straining on VRAM, increase this, up to half of your batch size. I'm not too sure what the performance implications of this are, but I *feel* lower values train faster.
* `Gradient Accumulation Size` (*originally named `mega batch factor`*): at first this seemed very confusing, but it's actually simple. It further divides each batch into mini-batches, processes them in sequence, and only updates the model after all of the mini-batches have been run. This saves VRAM by de-facto running at a smaller batch size per pass, while still updating the model as if it were running at the larger batch size (see the sketch after this list). It does have some quirks, like crashing when saving at certain batch size:gradient accumulation ratios, odd pacing of training, etc.
* `Print Frequency`: how often (in epochs) the trainer should print its training statistics. Printing takes a little bit of time, but it's a nice way to gauge how a finetune is baking, as it lists your losses and other statistics. This is purely for debugging and babysitting whether a model is being trained adequately. The web UI *should* parse the information from stdout, grab the total loss, and report it back.
* `Save Frequency`: how often to save a copy of the model during training in epochs. It seems the training will save a normal copy, an `ema` version of the model, *AND* a backup archive containing both to resume from. If you're training on a Colab with your Drive mounted, these can easily rack up and eat your allotted space. You *can* delete older copies from training, but it's wise not to in case you want to resume from an older state.
* `Resume State Path`: the last training state saved to resume from. The general path structure is what the placeholder value is. This will resume from whatever iterations it was last at, and iterate from there until the target step count (for example, resuming from iteration 2500, while requesting 5000 iterations, will iterate 2500 more times).
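To make the `Gradient Accumulation Size` behavior above more concrete, here's a minimal PyTorch-style sketch of gradient accumulation. This is *not* the trainer's actual code; the names (`model`, `optimizer`, `loss_fn`, `batches`) are placeholders for whatever the training script actually uses.

```python
import torch

def train_epoch(model, optimizer, loss_fn, batches, grad_accum_size=4):
    """Minimal sketch: split each batch into `grad_accum_size` mini-batches,
    accumulate gradients across them, and only step the optimizer once per full batch."""
    model.train()
    for batch_inputs, batch_targets in batches:
        optimizer.zero_grad()
        # only one mini-batch needs to sit in VRAM at any given moment
        for inputs, targets in zip(batch_inputs.chunk(grad_accum_size),
                                   batch_targets.chunk(grad_accum_size)):
            loss = loss_fn(model(inputs), targets)
            # scale so the accumulated gradient matches a single full-batch pass
            (loss / grad_accum_size).backward()
        optimizer.step()  # one weight update per full batch, as if it were one large batch
```

With a batch size of 128 and a gradient accumulation size of 2, each forward/backward pass only sees 64 samples, which is where the VRAM savings come from, while the model still updates once per 128 samples.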
@@ -90,11 +90,9 @@ Getting decent training results is quite the pickle, and it seems my nuggets of
* Text CE LR Ratio most definitely should not be touched.
- I'm under the impression that it actually wants a high loss ratio to not overfit.
* As for a learning rate schedule, I *feel* like very large datasets require tighter scheduling, but the suggested schedule was for a dataset of around either 4k or 7k files, so it should be fine regardless.
* Your batch size and mega batch factor greatly determine how much VRAM gets consumed. It is a bit tough to nail right, yet easy to get wrong and end up with suboptimal training (desu, it should be a ratio instead of a factor).
- With a dataset size of 5304, a ratio of 80:1 consumes about the same VRAM as ratios of 128:2, 512:8, and 1024:16.
- The pacing of the training also seems similar for an equivalent ratio. On 2x6800XTs, it takes about 480 seconds to parse one epoch for a dataset size of 5304 (assuming my bottleneck is raw calculations, rather than VRAM speed).
- However, the bigger your mega batch factor, the "slower" the pacing of your training gets. In one training test with an astronomically high batch size and mega batch factor, my initial loss was 4.X, while reducing them to sensible numbers put my initial loss at 2.X.
- ***DO NOT*** think you can be clever by setting astronomical batch sizes and mega batch factors thinking you're getting a good deal on throughput. I imagine very high mega batch factors effectively stretch out how much training you actually need, defeating the purpose of large batch sizes.
* Your batch size and gradient accumulation size greatly determine how much VRAM gets consumed. It is a bit tough to nail right, yet easy to get wrong and end up with suboptimal training (desu, it should be a ratio instead of a factor).
- The batch size divided by the gradient accumulation size determines how much VRAM gets used. For example, similar VRAM is consumed with ratios of 64:1, 128:2, 256:4, 512:8, and 1024:16 (see the example after this list).
- I need to play around with this more, as I feel one set of understandings gets replaced with a different set once I actually apply them.
- The smaller your print and save frequencies (in epochs), the more often training will pause to report metrics and dump to disk. I don't think printing very often will harm *too* much, but it pleases my autism to have a tight resolution for my training losses. Naturally, saving often also means more checkpoints on disk.
- I haven't thoroughly tested half-precision training, as using it was the means of reducing VRAM consumption before BitsAndBytes was integrated.
+ in theory, this should be mutually exclusive with BitsAndBytes, as in, you can only enable one or the other.
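As a rough illustration of the batch size:gradient accumulation ratio mentioned above (a sketch, assuming VRAM usage roughly tracks the per-pass mini-batch size, i.e. batch size divided by gradient accumulation size):

```python
# Hypothetical illustration: the per-pass mini-batch is what actually sits in VRAM,
# so configurations with the same batch_size / grad_accum ratio should consume similar VRAM.
for batch_size, grad_accum in [(64, 1), (128, 2), (256, 4), (512, 8), (1024, 16)]:
    mini_batch = batch_size // grad_accum
    print(f"batch {batch_size:>4} / accumulation {grad_accum:>2} -> {mini_batch} samples per forward/backward pass")
```

Each of these works out to 64 samples per pass, which is why such different nominal batch sizes end up consuming similar VRAM.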