| 95da4e9405 | made muon actually work by actually utilizing param groups (thanks APOLLO for reminding me this is the sane way to handle this split) | 2025-02-26 10:39:13 -06:00 |
| cbf6b84e27 | fixed grad norm and loss scale not reporting for local trainer | 2025-02-23 19:08:26 -06:00 |
| b640fabab5 | borrowed muon since it might work better under deepspeed and not require cruft (even though it really does not like the masked-NAR); also make the masked-NAR faux-causal, since it might help for cfg.model.version >= 7 | 2025-02-23 17:23:24 -06:00 |