353e478e68 | agony | 2024-12-21 22:52:10 -06:00
4800e7179a | remove nan checks, since they cause problems in distributed training because I'm not syncing between GPUs (and nan losses get ignored anyway with loss scaling) | 2024-12-15 09:42:54 -06:00
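The problem is that a per-rank nan check makes ranks disagree on whether to run the optimizer step, which desyncs the collectives. For illustration, a minimal sketch of what a rank-synchronized version of the removed check could look like (`should_skip_step` is a hypothetical helper, not code from this repo):

```python
import torch
import torch.distributed as dist

def should_skip_step(loss: torch.Tensor) -> bool:
    """Nan check that stays in sync across ranks: every rank must agree
    on whether to skip, or one rank sits out the optimizer step (and its
    collectives) while the others block waiting for it."""
    flag = torch.tensor(float(not torch.isfinite(loss)), device=loss.device)
    if dist.is_available() and dist.is_initialized():
        # if ANY rank saw a nan/inf loss, all ranks skip together
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return flag.item() > 0
```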
3dd31e74d1 | finally figured out a clean way to handle "resuming" the tqdm bar | 2024-12-14 18:44:43 -06:00
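For reference, tqdm supports this directly: seeding the bar with `initial` keeps the counter, rate, and ETA consistent after a checkpoint resume. A minimal sketch, with `total_steps` and `resumed_step` standing in for whatever the trainer tracks:

```python
from tqdm import tqdm

total_steps = 10_000   # hypothetical schedule length
resumed_step = 4_200   # step restored from the checkpoint

# `initial` starts the bar partway through instead of at zero
bar = tqdm(total=total_steps, initial=resumed_step, desc="train")
for _ in range(resumed_step, total_steps):
    bar.update(1)
bar.close()
```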
09804ecc16 | APOLLO tweaks to make it work with deepspeed | 2024-12-13 23:03:52 -06:00
64c67160a3 | tweaks | 2024-12-13 19:00:35 -06:00
0fbfb8bbe8 | actually save the optimizer for the local engine backend because safetensors doesn't save it | 2024-12-12 17:12:59 -06:00
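This split is forced by the format: safetensors only serializes flat dicts of tensors, while optimizer state is a nested structure of tensors, step counters, and hyperparameters. A sketch of the two-file approach the message describes (model and file names are stand-ins):

```python
import torch
from safetensors.torch import save_file

model = torch.nn.Linear(4, 4)                     # stand-in model
optimizer = torch.optim.AdamW(model.parameters())

# weights fit safetensors' flat tensor-dict model...
save_file(model.state_dict(), "model.safetensors")
# ...but the optimizer state has to be pickled the classic way
torch.save(optimizer.state_dict(), "optimizer.pth")
```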
f41251f648 | more fixes for the local engine backend | 2024-12-12 14:38:42 -06:00
6b237ae5e3 | tweaks for the local engine orchestrator (fixing issues I never caught, since I always used the deepspeed backend) | 2024-12-12 13:37:38 -06:00
9a62e3b824 | APOLLO cringe (doesn't want to work with deepspeed) | 2024-12-12 00:31:58 -06:00
8568a93dad | added WER/SIM-O metrics; added APOLLO, but I still need to test it | 2024-12-10 20:13:21 -06:00
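For context on the two metrics: WER is a word-level edit distance between the transcription of the generated audio and the reference text, and SIM-O is conventionally the cosine similarity between speaker embeddings of the output and the original speaker. A rough sketch of the textbook definitions (the repo's implementation may differ):

```python
import torch

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = int(ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(ref), 1)

def sim_o(emb_out: torch.Tensor, emb_ref: torch.Tensor) -> float:
    """SIM-O: cosine similarity between speaker embeddings of the generated
    audio and the original reference."""
    return torch.nn.functional.cosine_similarity(emb_out, emb_ref, dim=-1).item()
```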
61ed662856 | ACTUALLY actually fix KD-loss (the -inf in the logits was caused by cringecode) | 2024-12-07 12:31:54 -06:00
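Why -inf logits matter here: once a -inf slips into the student or teacher logits, the KL term ends up multiplying a zero probability by an infinite log-probability and the loss turns to nan. One common guard (not necessarily this repo's fix, which removed the -inf at the source) is clamping to a finite floor:

```python
import torch

logits = torch.tensor([0.5, -float("inf"), 1.2])

# clamping keeps masked entries at effectively zero probability while
# staying finite, so downstream log_softmax/KL terms can't go nan
safe = logits.clamp(min=torch.finfo(logits.dtype).min)
log_probs = safe.log_softmax(dim=-1)
```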
23d402bf01 | added knowledge distillation in the trainer (sadly it is not agnostic because of the grave mistake of further processing the batch within the forward pass, so subsequent calls do not match...) | 2024-12-05 23:05:52 -06:00
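The distillation loss itself is presumably the textbook soft-target form: KL divergence between temperature-softened teacher and student distributions. A minimal sketch, which also shows why the batch mismatch the message complains about is fatal; the loss is only meaningful if teacher and student scored the exact same inputs:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Hinton-style distillation: KL between temperature-softened teacher
    and student distributions, rescaled by T^2 to keep gradient magnitudes
    comparable across temperatures. Both logit tensors must come from the
    same batch, token for token."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```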
3fc0540f49 | m | 2024-11-21 15:07:46 -06:00
dfdba3f190 | oops | 2024-11-20 19:21:03 -06:00
cd6e9ba2f2 | oops | 2024-11-20 16:27:51 -06:00
1a73ac6a20 | I cannot believe it's not actually called Wand DB (added wandb logging support, since I think it would have been a much better way to look at my metrics) | 2024-11-20 16:10:47 -06:00
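For reference, the wandb hookup is only a few calls; the project, run, and metric names below are hypothetical, mirroring whatever the trainer already logs:

```python
import wandb

wandb.init(project="vall-e", name="some-run")   # hypothetical names
wandb.log({"loss": 2.31, "grad_norm": 0.8, "lr": 1e-4}, step=1000)
```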
190a917b3e | I did it. | 2024-11-19 12:24:33 -06:00
e412e98125 | ugh | 2024-11-14 07:34:22 -06:00
269648605e | move NAR-len rvq level 0 to a separate embedding | 2024-11-13 11:38:58 -06:00
48490757da | fixes | 2024-11-10 20:37:50 -06:00
9cb0b6901b | unified nar.py into ar_nar.py | 2024-11-10 12:19:48 -06:00
e108c54daf | new NAR-len training paradigm... | 2024-11-07 11:32:11 -06:00
c83670c38c | Windows-specific fixes (to-do: find libespeak-ng.dll automatically, because it cannot be trusted to do it by default) | 2024-11-03 19:19:15 -06:00
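One plausible shape for that to-do, using phonemizer's documented PHONEMIZER_ESPEAK_LIBRARY override (the search paths are guesses at the usual eSpeak NG install locations):

```python
import os
from pathlib import Path

if os.name == "nt" and "PHONEMIZER_ESPEAK_LIBRARY" not in os.environ:
    for candidate in (
        Path(r"C:\Program Files\eSpeak NG\libespeak-ng.dll"),
        Path(r"C:\Program Files (x86)\eSpeak NG\libespeak-ng.dll"),
    ):
        if candidate.exists():
            # phonemizer reads this env var to locate the espeak-ng library
            os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = str(candidate)
            break
```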
62fe5b0943 | ughh | 2024-11-01 22:36:48 -05:00
ef1c17430f | skip step on nan loss (ironically, I have not had a nan loss since adding this); throw an exception on an invalid cfg.dataset.sample_type and sample_order combination (because I was tricked by this in my yaml and had inconsistent vram usage) | 2024-11-01 20:54:53 -05:00
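The skip itself is a one-line guard in the training loop; a sketch with `model`, `optimizer`, and `dataloader` as stand-ins. (Note the 4800e7179a commit further up: this exact check was later removed because the skip decision isn't synced across ranks in distributed training.)

```python
import torch

for batch in dataloader:              # stand-in training loop
    loss = model(batch)
    if not torch.isfinite(loss):      # nan or inf: drop the batch entirely
        optimizer.zero_grad()
        continue
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```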
4049f51ba9 | added option to load a lora directly from the model file itself with --lora | 2024-10-26 00:13:10 -05:00
ccf71dc1b6 | added option to load from a model state dict directly instead of a yaml (to-do: do this for LoRAs too); automatically download the default model if none is provided | 2024-10-25 22:15:15 -05:00
75b90be325 | cleaned up unused config flags, allow a less strict yaml by pruning missing keys, renamed some dataset configs to be more unified | 2024-10-17 17:06:48 -05:00
c8d4716a9f | ugh | 2024-09-18 21:40:57 -05:00
31e8b7edb8 | tweaks and fixes for lora stuff | 2024-09-08 18:05:21 -05:00
413097f5f7 | fixes | 2024-09-05 21:42:59 -05:00
d319d33368 | haha | 2024-09-04 14:52:26 -05:00
619369236b | ugh | 2024-08-30 21:10:57 -05:00
685f4faec0 | ugh | 2024-08-30 10:46:26 -05:00
32287710a2 | moved prints to use logger, edited readme (fused_attn doesn't seem stable for training) | 2024-08-29 13:27:16 -05:00
b7b99a25f1 | added ability to specify the attention backend from the CLI and webui (because I'm tired of editing the yaml) | 2024-08-26 19:33:51 -05:00
3a65cc4b22 | fix issue with sft and shared tensors... | 2024-08-04 19:56:21 -05:00
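Background on that error: safetensors refuses state dicts containing tensors that share storage (tied embeddings being the classic case). One way to sidestep it, not necessarily this repo's fix, is `save_model`, which detects and deduplicates the shared tensors:

```python
import torch
from safetensors.torch import save_model

model = torch.nn.Linear(4, 4)   # stand-in; imagine tied weights here

# save_model strips duplicate views of shared storage instead of
# raising the shared-tensor error that save_file would
save_model(model, "model.safetensors")
```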
d19f93a2c0 | documentation update | 2024-08-04 00:14:49 -05:00
2cb465018b | implicitly load either normal pickled weights or safetensors on loading the model | 2024-08-03 23:34:18 -05:00
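The dispatch can hang off the file extension alone; a sketch with a hypothetical `load_state_dict` helper:

```python
from pathlib import Path

import torch
from safetensors.torch import load_file

def load_state_dict(path: str, device: str = "cpu") -> dict:
    """Route .safetensors files through safetensors, everything else
    through classic pickled torch checkpoints."""
    p = Path(path)
    if p.suffix == ".safetensors":
        return load_file(p, device=device)
    return torch.load(p, map_location=device)
```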
c09133d00f | added safetensors support (with metadata), routing whatever went through torch.load/torch.save into it | 2024-08-03 23:15:20 -05:00
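One detail worth knowing here: safetensors metadata is restricted to string-to-string pairs, so anything structured gets JSON-encoded first. A sketch (the model, keys, and values are stand-ins):

```python
import json

import torch
from safetensors import safe_open
from safetensors.torch import save_file

model = torch.nn.Linear(4, 4)   # stand-in model

metadata = {"config": json.dumps({"arch": "hypothetical"})}
save_file(model.state_dict(), "model.safetensors", metadata=metadata)

# the metadata can be read back without touching the tensors
with safe_open("model.safetensors", framework="pt") as f:
    meta = f.metadata()
```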
6a733eb2ed | changed torch.Tensor().to(device, dtype) to just torch.tensor(..., device, dtype), because it's been bothering my autism that I'm creating tensors and then converting rather than creating them with the right device/dtype; also some 'optimization' to compile the model, but it doesn't seem to do anything useful | 2024-08-03 22:10:21 -05:00
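The change in question, side by side (assuming a CUDA device is available):

```python
import torch

# before: allocate with defaults, then convert device and dtype
x = torch.tensor([1, 2, 3]).to(device="cuda", dtype=torch.float16)

# after: one allocation, already on the right device with the right dtype
x = torch.tensor([1, 2, 3], device="cuda", dtype=torch.float16)
```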
66407e5bdb | tweaks for the NAR-len model, maybe | 2024-08-03 08:40:39 -05:00
7a77978096 | fix an oversight with using resize_modules | 2024-08-02 20:28:49 -05:00
b4c895114c | naive model offloading support: automatically splits parts of the model across the requested devices per memory constraints (either inferred or requested in the yaml), and input tensors are automatically migrated to the right device; it SEEMS to work for training under the test trainer when split between GPU and CPU (this was specifically only because that Flux imagegen model released, so I can test it there) | 2024-08-01 20:12:06 -05:00
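A toy version of the idea, to make the parenthetical concrete: greedily place top-level submodules on the fast device until a parameter budget runs out, park the rest on CPU, and use forward pre-hooks to migrate incoming tensors. The real implementation presumably also infers the budget and handles nested modules:

```python
import torch

def naive_offload(model: torch.nn.Module, budget_bytes: int, fast: str = "cuda") -> torch.nn.Module:
    used = 0
    for module in model.children():
        size = sum(p.numel() * p.element_size() for p in module.parameters())
        device = fast if used + size <= budget_bytes else "cpu"
        if device == fast:
            used += size
        module.to(device)

        # returning a tuple from a forward pre-hook replaces the args,
        # so inputs land on whichever device this submodule lives on
        def move_inputs(mod, args, _device=device):
            return tuple(a.to(_device) if isinstance(a, torch.Tensor) else a for a in args)

        module.register_forward_pre_hook(move_inputs)
    return model
```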
387358bc8a | fixes for the NAR-len model, documented some config options, and a better way to handle resizing modules on state_dict load | 2024-07-31 20:35:09 -05:00
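Resizing on load reduces to copying the overlapping rows of each mismatched weight; a sketch with a hypothetical `resize_weight` helper (new rows are zero-initialized here, where a real version might keep the model's fresh random init):

```python
import torch

def resize_weight(old: torch.Tensor, rows: int) -> torch.Tensor:
    """Grow or shrink a (vocab, dim) weight, preserving the rows that overlap."""
    new = torch.zeros(rows, old.shape[1], dtype=old.dtype)
    n = min(rows, old.shape[0])
    new[:n] = old[:n]
    return new

# e.g. a checkpoint embedding with 256 entries loaded into a model
# that now wants 260 (the numbers are made up)
resized = resize_weight(torch.randn(256, 1024), 260)
```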
d7c6be6f78 | fix a weird regression in handling checkpoints when the backend is local but deepspeed checkpoints are present (it was handled for LoRA loading but not real loading...) | 2024-07-30 22:15:56 -05:00
06e948aec1 | suppress warning on exit about distributed not being cleaned up (because I updated my system) | 2024-07-25 16:50:47 -05:00
188d116222 | some weird fixes for an equally weird regression with LoRA loading | 2024-07-22 20:47:24 -05:00
75b04686f8 | added prom-less training / inferencing, some other things | 2024-07-22 19:36:07 -05:00
d87b492295 | added rudimentary demo page creator (currently just embeds base64 wavs into the page; need to test not doing that) | 2024-07-19 20:49:40 -05:00
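The base64 embedding the message mentions is straightforward, which is also why the resulting pages balloon in size: the wav bytes go inline as a data URI. A sketch with a hypothetical `embed_wav` helper:

```python
import base64
from pathlib import Path

def embed_wav(path: str) -> str:
    """Inline a wav file into HTML as a data-URI <audio> element."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f'<audio controls src="data:audio/wav;base64,{b64}"></audio>'
```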