Commit Graph

148 Commits

Author SHA1 Message Date
mrq
2dd80a03ff stuff for interfacing with the loss scaler value (because I want to cap it) 2025-03-06 17:07:29 -06:00
mrq
a30dffcca7 wandb additions (to-do eventually, upload samples as artifacts) 2025-03-06 15:44:40 -06:00
mrq
1cd24f3381 a birdie tells me i should probably use a different optimizer (also preliminary support for native sparse attention but I don't know if I'll use it) 2025-03-04 14:53:02 -06:00
mrq
56f8be4d62 lol 2025-02-28 22:15:37 -06:00
mrq
ddc49c89c5 the learning rate scheduler pill is a tough pill to swallow 2025-02-28 22:12:19 -06:00
mrq
b97faa8173 fixes... 2025-02-28 18:53:07 -06:00
mrq
ceecac6ffe I think I made resp_parallel_training=True faster with loss factoring? 2025-02-26 23:13:32 -06:00
mrq
7d2e64630c lol 2025-02-26 10:49:06 -06:00
mrq
95da4e9405 made muon actually work by actually utilizing param groups (thanks APOLLO for reminding me this is the sane way to handle this split) 2025-02-26 10:39:13 -06:00
mrq
f593ee98fc ugh 2025-02-23 21:20:36 -06:00
mrq
cbf6b84e27 fixed grad norm and loss scale not reporting for local trainer 2025-02-23 19:08:26 -06:00
mrq
b640fabab5 borrowed muon since it might work better under deepspeed and not require cruft (even though it really does not like the masked-NAR), also make the masked-NAR faux-causal since it might help out more for cfg.model.version >= 7 2025-02-23 17:23:24 -06:00
mrq
3019c88799 separate mask token and stop token because this might cause issues 2025-02-23 11:36:32 -06:00
mrq
6634d07576 added muon optimizer through kludge hacks because it necessitates a second optimizer in tandem that seems to only sometimes work with deepspeed 2025-02-23 11:22:13 -06:00
mrq
ab0abd2b12 fixes fixes fixes (a quarter of my recently processed audio returned zero'd tensors......) 2025-02-22 09:07:33 -06:00
mrq
a65c8144f4 with the amount of tweaks I keep making I could have probably had the nvidia/audio-codec-44khz model realized already...... 2025-02-13 18:38:40 -06:00
mrq
e8f182b634 cleaned up loss calc code (it REALLY hates ignore_loss_for_inputs, but is fine with splitting with loss factors) 2025-02-13 09:35:27 -06:00
mrq
b52c5c5d80 this seems to work in testing 2025-02-12 16:16:04 -06:00
mrq
e029a8804d ironically none of this cruft gets the loss lower than the original way 2025-02-12 11:17:00 -06:00
mrq
e5916ea519 for my sanity it seems having extraneous tokens in the embedding/classifier has the loss/acc a little higher than it should 2025-02-11 14:47:35 -06:00
mrq
497bdfc67b more work (the wall is non-causal decoding......) 2024-12-22 20:11:31 -06:00
mrq
5f289db275 ugh 2024-12-22 16:15:24 -06:00
mrq
353e478e68 agony 2024-12-21 22:52:10 -06:00
mrq
4800e7179a remove nan checks because they cause problems in distributed training because I'm not syncing between GPUs (and nan losses get ignored anyway with loss scaling) 2024-12-15 09:42:54 -06:00
mrq
3dd31e74d1 finally figured out a clean way to handle "resuming" the tqdm bar 2024-12-14 18:44:43 -06:00
mrq
09804ecc16 APOLLO tweaks to make it work with deepspeed 2024-12-13 23:03:52 -06:00
mrq
64c67160a3 tweaks 2024-12-13 19:00:35 -06:00
mrq
0fbfb8bbe8 actually save the optimizer for the local engine backend because safetensors doesn't save it 2024-12-12 17:12:59 -06:00
mrq
f41251f648 more fixes for local engine backend 2024-12-12 14:38:42 -06:00
mrq
6b237ae5e3 tweaks for the local engine orchestrator (that I never caught since I always used the deepspeed backend) 2024-12-12 13:37:38 -06:00
mrq
9a62e3b824 APOLLO cringe (doesn't want to work with deepspeed) 2024-12-12 00:31:58 -06:00
mrq
8568a93dad added WER/SIM-O metrics, added APOLLO but I need to test it 2024-12-10 20:13:21 -06:00
mrq
61ed662856 ACTUALLY actually fix KD-loss (the -inf in the logits was caused by cringecode) 2024-12-07 12:31:54 -06:00
mrq
23d402bf01 added knowledge distillation in the trainer (sadly it is not agnostic because of the grave mistake of further processing the batch within the forward pass, so subsequent calls do not match......) 2024-12-05 23:05:52 -06:00
mrq
3fc0540f49 m 2024-11-21 15:07:46 -06:00
mrq
dfdba3f190 oops 2024-11-20 19:21:03 -06:00
mrq
cd6e9ba2f2 oops 2024-11-20 16:27:51 -06:00
mrq
1a73ac6a20 I cannot believe it's not actually called Wand DB (added wandb logging support since I think it would have been a much better way to look at my metrics) 2024-11-20 16:10:47 -06:00
mrq
190a917b3e I did it. 2024-11-19 12:24:33 -06:00
mrq
e412e98125 ugh 2024-11-14 07:34:22 -06:00
mrq
269648605e move NAR-len rvq level 0 to separate embedding 2024-11-13 11:38:58 -06:00
mrq
48490757da fixes 2024-11-10 20:37:50 -06:00
mrq
9cb0b6901b unified nar.py into ar_nar.py 2024-11-10 12:19:48 -06:00
mrq
e108c54daf new NAR-len training paradigm...... 2024-11-07 11:32:11 -06:00
mrq
c83670c38c Windows specific fixes (to-do: find libespeak-ng.dll automatically because it cannot be trusted to do it by default) 2024-11-03 19:19:15 -06:00
mrq
62fe5b0943 ughh 2024-11-01 22:36:48 -05:00
mrq
ef1c17430f skip step on nan loss (ironically I have not had a nan loss after adding this), throw exception with invalid cfg.dataset.sample_type and sample_order combination (because I was tricked by this in my yaml and had inconsistent vram usage) 2024-11-01 20:54:53 -05:00
mrq
4049f51ba9 added option to load lora directly from the model file itself with --lora 2024-10-26 00:13:10 -05:00
mrq
ccf71dc1b6 added option to load from a model state dict directly instead of a yaml (to-do: do this for LoRAs too), automatically download the default model if none is provided 2024-10-25 22:15:15 -05:00
mrq
75b90be325 cleaned up unused config flags, allow less strict yaml by pruning missing keys, renamed some dataset configs to be more unified 2024-10-17 17:06:48 -05:00