|
a30dffcca7
|
wandb additions (to-do eventually, upload samples as artifacts)
|
2025-03-06 15:44:40 -06:00 |
|
|
1cd24f3381
|
a birdie tells me i should probably use a different optimizer (also preliminary support for native sparse attention but I don't know if I'll use it)
|
2025-03-04 14:53:02 -06:00 |
|
|
ddc49c89c5
|
the learning rate scheduler pill is a tough pill to swallow
|
2025-02-28 22:12:19 -06:00 |
|
|
7d2e64630c
|
lol
|
2025-02-26 10:49:06 -06:00 |
|
|
95da4e9405
|
made muon actually work by actually utilizing param groups (thanks APOLLO for reminding me this is the sane way to handle this split)
|
2025-02-26 10:39:13 -06:00 |
|
|
f593ee98fc
|
ugh
|
2025-02-23 21:20:36 -06:00 |
|
|
b640fabab5
|
borrowed muon since it might work better under deepspeed and not require cruft (even though it really does not like the masked-NAR), also made the masked-NAR faux-causal since it might help for cfg.model.version >= 7
|
2025-02-23 17:23:24 -06:00 |
|
|
3019c88799
|
separate mask token and stop token because this might cause issues
|
2025-02-23 11:36:32 -06:00 |
|
|
6634d07576
|
added muon optimizer through kludge hacks because it necessitates a second optimizer in tandem that seems to only sometimes work with deepspeed
|
2025-02-23 11:22:13 -06:00 |
|
|
ab0abd2b12
|
fixes fixes fixes (a quarter of my recently processed audio returned zero'd tensors......)
|
2025-02-22 09:07:33 -06:00 |
|
|
a65c8144f4
|
with the amount of tweaks I keep making I could have probably had the nvidia/audio-codec-44khz model realized already......
|
2025-02-13 18:38:40 -06:00 |
|
|
e8f182b634
|
cleaned up loss calc code (it REALLY hates ignore_loss_for_inputs, but is fine with splitting with loss factors)
|
2025-02-13 09:35:27 -06:00 |
|
|
b52c5c5d80
|
this seems to work in testing
|
2025-02-12 16:16:04 -06:00 |
|
|
e029a8804d
|
ironically none of this cruft gets the loss lower than the original way
|
2025-02-12 11:17:00 -06:00 |
|
|
e5916ea519
|
for my sanity it seems having extraneous tokens in the embedding/classifier has the loss/acc a little higher than it should be
|
2025-02-11 14:47:35 -06:00 |
|
|
497bdfc67b
|
more work (the wall is non-causal decoding......)
|
2024-12-22 20:11:31 -06:00 |
|
|
5f289db275
|
ugh
|
2024-12-22 16:15:24 -06:00 |
|
|
353e478e68
|
agony
|
2024-12-21 22:52:10 -06:00 |
|
|
3dd31e74d1
|
finally figured out a clean way to handle "resuming" the tqdm bar
|
2024-12-14 18:44:43 -06:00 |
|
|
09804ecc16
|
APOLLO tweaks to make it work with deepspeed
|
2024-12-13 23:03:52 -06:00 |
|
|
64c67160a3
|
tweaks
|
2024-12-13 19:00:35 -06:00 |
|
|
9a62e3b824
|
APOLLO cringe (doesn't want to work with deepspeed)
|
2024-12-12 00:31:58 -06:00 |
|
|
8568a93dad
|
added WER/SIM-O metrics, added APOLLO but I need to test it
|
2024-12-10 20:13:21 -06:00 |
|
|
61ed662856
|
ACTUALLY actually fix KD-loss (the -inf in the logits was caused by cringecode)
|
2024-12-07 12:31:54 -06:00 |
|
|
23d402bf01
|
added knowledge distillation in the trainer (sadly it is not agnostic because of the grave mistake of further processing the batch within the forward pass, so subsequent calls do not match......)
|
2024-12-05 23:05:52 -06:00 |
|
|
3fc0540f49
|
m
|
2024-11-21 15:07:46 -06:00 |
|
|
cd6e9ba2f2
|
oops
|
2024-11-20 16:27:51 -06:00 |
|
|
1a73ac6a20
|
I cannot believe it's not actually called Wand DB (added wandb logging support since I think it would have been a much better way to look at my metrics)
|
2024-11-20 16:10:47 -06:00 |
|
|
190a917b3e
|
I did it.
|
2024-11-19 12:24:33 -06:00 |
|
|
e412e98125
|
ugh
|
2024-11-14 07:34:22 -06:00 |
|
|
269648605e
|
move NAR-len rvq level 0 to separate embedding
|
2024-11-13 11:38:58 -06:00 |
|
|
48490757da
|
fixes
|
2024-11-10 20:37:50 -06:00 |
|
|
9cb0b6901b
|
unified nar.py into ar_nar.py
|
2024-11-10 12:19:48 -06:00 |
|
|
e108c54daf
|
new NAR-len training paradigm......
|
2024-11-07 11:32:11 -06:00 |
|
|
4049f51ba9
|
added option to load lora directly from the model file itself with --lora
|
2024-10-26 00:13:10 -05:00 |
|
|
ccf71dc1b6
|
added option to load from a model state dict directly instead of a yaml (to-do: do this for LoRAs too), automatically download the default model if none is provided
|
2024-10-25 22:15:15 -05:00 |
|
|
c8d4716a9f
|
ugh
|
2024-09-18 21:40:57 -05:00 |
|
|
413097f5f7
|
fixes
|
2024-09-05 21:42:59 -05:00 |
|
|
685f4faec0
|
ugh
|
2024-08-30 10:46:26 -05:00 |
|
|
32287710a2
|
moved prints to use logger, edited readme (fused_attn doesn't seem stable for training)
|
2024-08-29 13:27:16 -05:00 |
|
|
b7b99a25f1
|
added ability to specify attention backend for CLI and webui (because I'm tired of editing the yaml)
|
2024-08-26 19:33:51 -05:00 |
|
|
d19f93a2c0
|
documentation update
|
2024-08-04 00:14:49 -05:00 |
|
|
2cb465018b
|
implicitly load either normal pickled weights or safetensors on loading the model
|
2024-08-03 23:34:18 -05:00 |
|
|
c09133d00f
|
added safetensors support (with metadata) and feed whatever torch.load/torch.save into it
|
2024-08-03 23:15:20 -05:00 |
|
|
6a733eb2ed
|
changed torch.Tensor().to(device, dtype) to just torch.tensor(..., device=device, dtype=dtype) because it's been bothering my autism that I'm creating tensors then converting rather than creating with the right device/dtype, some 'optimization' to compile the model but it doesn't seem to do anything useful
|
2024-08-03 22:10:21 -05:00 |
|
|
66407e5bdb
|
tweaks for the NAR-len model, maybe
|
2024-08-03 08:40:39 -05:00 |
|
|
7a77978096
|
oversight with using resize_modules
|
2024-08-02 20:28:49 -05:00 |
|
|
b4c895114c
|
naive model offloading support (handles automatically splitting parts of the model to requested device per memory constraints, either inferred or requested in the yaml, input tensors are automatically migrated to the right device, it SEEMS to work for training under the test trainer when split between GPU and CPU) (this was specifically only because that Flux imagegen model released so I can test it there)
|
2024-08-01 20:12:06 -05:00 |
|
|
387358bc8a
|
fixes for the NAR-len model, documented some config options, and a better way to handle resizing modules on state_dict load
|
2024-07-31 20:35:09 -05:00 |
|
|
d7c6be6f78
|
fix weird regression in handling checkpoints when backend is local, but deepspeed checkpoints are in (it was handled with LoRA loading but not real loading...)
|
2024-07-30 22:15:56 -05:00 |
|