Commit Graph

82 Commits

Author · SHA1 · Message · Date
mrq · 5e9d1a5302 · one more time one more time (this normalization isn't a spook) · 2025-03-07 19:32:42 -06:00
mrq · 2dd80a03ff · stuff for interfacing with the loss scaler value (because I want to cap it) · 2025-03-06 17:07:29 -06:00
mrq · 56f8be4d62 · lol · 2025-02-28 22:15:37 -06:00
mrq · ddc49c89c5 · the learning rate scheduler pill is a tough pill to swallow · 2025-02-28 22:12:19 -06:00
mrq · b97faa8173 · fixes... · 2025-02-28 18:53:07 -06:00
mrq · ceecac6ffe · I think I made resp_parallel_training=True faster with loss factoring? · 2025-02-26 23:13:32 -06:00
mrq · f593ee98fc · ugh · 2025-02-23 21:20:36 -06:00
mrq · cbf6b84e27 · fixed grad norm and loss scale not reporting for local trainer · 2025-02-23 19:08:26 -06:00
mrq · 6634d07576 · added muon optimizer through kludge hacks because it necessitates a second optimizer in tandem that seems to only sometimes work with deepspeed · 2025-02-23 11:22:13 -06:00
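For context on the "second optimizer in tandem" note in 6634d07576: Muon is typically only applied to 2D weight matrices, with everything else handed to a conventional optimizer. A minimal sketch of that pairing follows; the `Muon` class and its constructor arguments are assumptions based on the public reference implementation, not this repo's actual wiring.

```python
import torch
# Assumption: "Muon" comes from the reference implementation
# (github.com/KellerJordan/Muon); exact constructor arguments vary by version.
from muon import Muon

def build_optimizers(model: torch.nn.Module, lr: float = 1e-4):
    # Muon handles the 2D weight matrices; embeddings, norms, biases, etc.
    # go to a second optimizer (AdamW here).
    muon_params  = [p for p in model.parameters() if p.ndim >= 2]
    other_params = [p for p in model.parameters() if p.ndim < 2]
    return [
        Muon(muon_params, lr=0.02, momentum=0.95),
        torch.optim.AdamW(other_params, lr=lr),
    ]

# Both optimizers have to be stepped and zeroed together every iteration,
# which is what makes wrapping them under a single DeepSpeed engine awkward.
```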
mrq · a65c8144f4 · with the amount of tweaks I keep making I could have probably had the nvidia/audio-codec-44khz model realized already...... · 2025-02-13 18:38:40 -06:00
mrq · 4800e7179a · remove nan checks because they cause problems in distributed training (I'm not syncing between GPUs, and nan losses get ignored anyway with loss scaling) · 2024-12-15 09:42:54 -06:00
mrq · 09804ecc16 · APOLLO tweaks to make it work with deepspeed · 2024-12-13 23:03:52 -06:00
mrq · 64c67160a3 · tweaks · 2024-12-13 19:00:35 -06:00
mrq · 0fbfb8bbe8 · actually save the optimizer for the local engine backend because safetensors doesn't save it · 2024-12-12 17:12:59 -06:00
mrq · f41251f648 · more fixes for local engine backend · 2024-12-12 14:38:42 -06:00
mrq · 6b237ae5e3 · tweaks for the local engine orchestrator (that I never caught since I always used the deepspeed backend) · 2024-12-12 13:37:38 -06:00
mrq · 23d402bf01 · added knowledge distillation in the trainer (sadly it is not agnostic because of the grave mistake of further processing the batch within the forward pass, so subsequent calls do not match......) · 2024-12-05 23:05:52 -06:00
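For the knowledge distillation commit 23d402bf01, here is a generic sketch of logit distillation (temperature-scaled KL divergence), not the repo's actual loss; the function name and temperature value are illustrative only.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Temperature-scaled KL divergence between student and teacher logits.

    Note: this assumes the teacher saw exactly the same processed batch as the
    student; the "subsequent calls do not match" caveat in the commit is about
    that assumption breaking when the batch is further processed in forward().
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean is the correct reduction for KL divergence per sample.
    return F.kl_div(s, t, reduction="batchmean") * (temperature ** 2)
```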
mrq · dfdba3f190 · oops · 2024-11-20 19:21:03 -06:00
mrq · cd6e9ba2f2 · oops · 2024-11-20 16:27:51 -06:00
mrq · 1a73ac6a20 · I cannot believe it's not actually called Wand DB (added wandb logging support since I think it would have been a much better way to look at my metrics) · 2024-11-20 16:10:47 -06:00
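For the wandb commit 1a73ac6a20, a minimal sketch of how metric logging with wandb generally looks; the project name, config values, and the dummy loss are placeholders, not this repo's settings.

```python
import random
import wandb

# Placeholders: project name and config are not the repo's actual values.
run = wandb.init(project="my-tts-training", config={"lr": 1e-4, "batch_size": 16})

for step in range(100):
    loss = random.random()  # stand-in for a real training step's loss
    wandb.log({"train/loss": loss}, step=step)

run.finish()
```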
mrq · 190a917b3e · I did it. · 2024-11-19 12:24:33 -06:00
mrq · c83670c38c · Windows specific fixes (to-do: find libespeak-ng.dll automatically because it cannot be trusted to do it by default) · 2024-11-03 19:19:15 -06:00
mrq · 62fe5b0943 · ughh · 2024-11-01 22:36:48 -05:00
mrq · 75b90be325 · cleaned up unused config flags, allow less strict yaml by pruning missing keys, renamed some dataset configs to be more unified · 2024-10-17 17:06:48 -05:00
mrq · 31e8b7edb8 · tweaks and fixes for lora stuffs · 2024-09-08 18:05:21 -05:00
mrq · d319d33368 · haha · 2024-09-04 14:52:26 -05:00
mrq · 619369236b · ugh · 2024-08-30 21:10:57 -05:00
mrq · 32287710a2 · moved prints to use logger, edited readme (fused_attn doesn't seem stable for training) · 2024-08-29 13:27:16 -05:00
mrq · 3a65cc4b22 · fix issue with sft and shared tensors... · 2024-08-04 19:56:21 -05:00
mrq · c09133d00f · added safetensors support (with metadata) and feed whatever torch.load/torch.save into it · 2024-08-03 23:15:20 -05:00
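For the safetensors commit c09133d00f, a minimal sketch of saving and reading back tensors plus metadata with the safetensors API; the file name, keys, and metadata values are examples, not the repo's actual ones.

```python
import torch
from safetensors import safe_open
from safetensors.torch import save_file, load_file

state_dict = {"weight": torch.zeros(4, 4)}

# safetensors metadata is string-to-string only, so structured data
# (e.g. a model config) has to be serialized first, commonly as JSON.
save_file(state_dict, "model.safetensors", metadata={"format": "pt", "note": "example"})

tensors = load_file("model.safetensors")            # tensors only
with safe_open("model.safetensors", framework="pt") as f:
    metadata = f.metadata()                         # the string metadata dict
```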
mrq · 06e948aec1 · suppress warning on exit about distributed not being cleaned up (because I updated my system) · 2024-07-25 16:50:47 -05:00
mrq · 188d116222 · some weird fixes for an equally weird regression with LoRA loading · 2024-07-22 20:47:24 -05:00
mrq · 75b04686f8 · added prom-less training / inferencing, some other things · 2024-07-22 19:36:07 -05:00
mrq · d87b492295 · added rudimentary demo page creator (currently just embeds base64 wavs into the page, need to test not doing that) · 2024-07-19 20:49:40 -05:00
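For the demo page commit d87b492295, a sketch of what "embeds base64 wavs into the page" typically means; the helper name is hypothetical and not from the repo.

```python
import base64
from pathlib import Path

def embed_wav(path: str) -> str:
    """Return an <audio> tag with the wav inlined as a base64 data URI.

    Simple and self-contained, but base64 inflates the payload by roughly a
    third of the raw wav size, which is presumably why the commit mentions
    testing an alternative.
    """
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f'<audio controls src="data:audio/wav;base64,{b64}"></audio>'
```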
mrq · fe0f235335 · mechanism to store the model config inside the weights and load them, some other things to allow LoRA training on the RetNet (gradient checkpointing will gripe about inputs not having require_grad and nothing seems to remedy it) · 2024-07-16 18:23:13 -05:00
mrq · 1a392b69f6 · local training backend should be a bit more aware of variable batch sizes, maybe · 2024-06-28 22:39:05 -05:00
mrq · 7cfb78fa64 · enable LoRA for targeted RVQ levels (to experiment with, seems to help) · 2024-06-17 21:45:03 -05:00
mrq · 7047fcc6e2 · actually make deepspeed work with LoRAs · 2024-06-17 13:55:37 -05:00
mrq · 1d159b1476 · updated export routine to split LoRA weights from the state dict (should work with deepspeed) · 2024-06-17 13:28:18 -05:00
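For the export-split commit 1d159b1476, a sketch of splitting a merged state dict into base and adapter weights by key name; the "lora_" marker is an assumption, the repo may tag its adapter parameters differently.

```python
def split_lora_weights(state_dict: dict, marker: str = "lora_"):
    """Split a merged state dict into (base, lora) dicts by key name."""
    base = {k: v for k, v in state_dict.items() if marker not in k}
    lora = {k: v for k, v in state_dict.items() if marker in k}
    return base, lora
```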
mrq · 726a4b613f · naive, rudimentary DeepSpeed support (just live with the LoRA weights living with the original weights, they can be split later) · 2024-06-17 13:17:24 -06:00
mrq · bd0bc10ec0 · added LoRA policy to decide what layer of the model gets adapted based on simple inclusion/exclusion terms · 2024-06-17 13:05:06 -06:00
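For the LoRA policy commit bd0bc10ec0, a sketch of what "simple inclusion/exclusion terms" can look like: a module is adapted if its name contains any include term and no exclude term. The function name and default terms are illustrative, not the repo's.

```python
def lora_policy(module_name: str,
                include=("attn", "mlp"),
                exclude=("embed", "lm_head")) -> bool:
    """Decide whether a module should receive a LoRA adapter."""
    if any(term in module_name for term in exclude):
        return False
    return any(term in module_name for term in include)

# e.g. lora_policy("layers.0.attn.q_proj") -> True
#      lora_policy("tok_embed")            -> False
```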
mrq · 45a39fb79f · very rudimentary lora support (no deepspeed support, tested training and saving but not loading yet) · 2024-06-17 00:09:16 -05:00
mrq · a7a6e0ac76 · validated that inferencing works, changed some defaults (NAR benefits from greedy sampling) · 2024-06-09 17:11:38 -05:00
mrq · 4ade2b60ee · ugh · 2024-06-06 21:57:11 -05:00
mrq · fcac9503e2 · cleanup · 2024-06-06 13:08:02 -05:00
mrq · e50edc3b48 · added a flag to convert to a HF compatible model on export by stitching things · 2024-06-03 22:34:47 -05:00
mrq · 934672252b · feverish cleanup · 2024-06-03 21:28:49 -05:00
mrq · c2a436d368 · somehow between training sessions grad_norm = None even though it worked before · 2024-06-02 08:29:27 -05:00
mrq · 827cf632e7 · report current loss scale and adjust grad norm by loss scale (for deepspeed) · 2024-06-01 10:44:32 -05:00
mrq · 856545f8bb · nan loss detection (should have added it earlier), loss scaling for local backend + fp16 · 2024-05-11 22:23:29 -05:00
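For the nan detection and fp16 loss scaling commit 856545f8bb, a minimal sketch of what a local (non-DeepSpeed) fp16 backend typically does with PyTorch's dynamic loss scaler plus a NaN guard; the function signature and batch shape are assumptions, not the repo's trainer.

```python
import math
import torch

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16

def train_step(model, batch: dict, optimizer) -> float | None:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch)
    # NaN/inf loss detection: skip the step instead of poisoning the weights.
    if not math.isfinite(loss.item()):
        return None
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skips the optimizer step if inf/nan grads are found
    scaler.update()          # grows/shrinks the loss scale accordingly
    return loss.item()
```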