|
1cd24f3381
|
a birdie tells me i should probably use a different optimizer (also preliminary support for native sparse attention but I don't know if I'll use it)
|
2025-03-04 14:53:02 -06:00 |
|
|
ddc49c89c5
|
the learning rate scheduler pill is a tough pill to swallow
|
2025-02-28 22:12:19 -06:00 |
|
|
b640fabab5
|
borrowed muon since it might better work under deepspeed and not require cruft (even though it really does not like the masked-NAR, also make the masked-NAR faux-causal since it might better help out for cfg.model.version >= 7
|
2025-02-23 17:23:24 -06:00 |
|
|
3019c88799
|
separate mask token and stop token because this might cause issues
|
2025-02-23 11:36:32 -06:00 |
|
|
6634d07576
|
added muon optimizer through kludge hacks because it necessitates a second optimizer in tandum that seems to only sometimes work with deepspeed
|
2025-02-23 11:22:13 -06:00 |
|