Commit Graph

96 Commits

Author SHA1 Message Date
mrq
413097f5f7 fixes 2024-09-05 21:42:59 -05:00
mrq
d319d33368 haha 2024-09-04 14:52:26 -05:00
mrq
619369236b ugh 2024-08-30 21:10:57 -05:00
mrq
685f4faec0 ugh 2024-08-30 10:46:26 -05:00
mrq
32287710a2 moved prints to use logger, edited readme (fused_attn doesnt seem stable for training) 2024-08-29 13:27:16 -05:00
mrq
b7b99a25f1 added ability to specify attention backend for CLI and webui (because im tired of editing the yaml) 2024-08-26 19:33:51 -05:00
mrq
3a65cc4b22 fix issue with sft and shared tensors... 2024-08-04 19:56:21 -05:00
mrq
d19f93a2c0 documentation update 2024-08-04 00:14:49 -05:00
mrq
2cb465018b implicitly load either normal pickled weights or safetensors on loading the model 2024-08-03 23:34:18 -05:00
mrq
c09133d00f added safetensors support (with metadata) and feed whatever torch.load/torch.save into it 2024-08-03 23:15:20 -05:00
mrq
6a733eb2ed changed torch.Tensor().to(device, dtype) to just torch.tensor(..., device, dtype) because it's been bothering my autism that I'm creating tensors then converting rather than creating with the right device/dtype, some 'optimization' to compile the model but it doesnt seem to do anything useful 2024-08-03 22:10:21 -05:00
mrq
66407e5bdb tweaks for the NAR-len model, maybe 2024-08-03 08:40:39 -05:00
mrq
7a77978096 oversight with using resize_modules 2024-08-02 20:28:49 -05:00
mrq
b4c895114c naive model offloading support (handles automatically splitting parts of the model to requested device per memory constraints, either inferred or requested in the yaml, input tensors are automatically migrated to the right device, it SEEMS to work for training under the test trainer when split between GPU and CPU) (this was specifically only because that Flux imagegen model released so I can test it there) 2024-08-01 20:12:06 -05:00
mrq
387358bc8a fixes for the NAR-len model, and documentation some config options, and a better way to handle resizing modules on state_dict load 2024-07-31 20:35:09 -05:00
mrq
d7c6be6f78 fix weird regression in handling checkpoints when backend is local, but deepspeed checkpoints are in (it was handled with LoRA loading but not real loading...) 2024-07-30 22:15:56 -05:00
mrq
06e948aec1 suppress warning on exit about distributed not being cleaned up (because I updated my system) 2024-07-25 16:50:47 -05:00
mrq
188d116222 some weird fixes for an equally weird regression with LoRA loading 2024-07-22 20:47:24 -05:00
mrq
75b04686f8 added prom-less training / inferencing, some other things 2024-07-22 19:36:07 -05:00
mrq
d87b492295 added rudimentary demo page creator (currently just embeds base64 wavs into the page, need to test not doing that) 2024-07-19 20:49:40 -05:00
mrq
d53038a9e4 actually have split classifiers working 2024-07-19 15:33:31 -05:00
mrq
fe0f235335 mechanism to store the model config inside the weights and load them, some other things to allow LoRA training on the RetNet (gradient checkpointing will gripe about inputs not having require_grad and nothing seems to remedy it) 2024-07-16 18:23:13 -05:00
mrq
3acc54df22 allow loading a different model within the web ui (apparently I did not have the web UI in the documentation) 2024-07-15 19:59:48 -05:00
mrq
c4dd523b6f change from chunk-slicing paths for distributed dataloader to instead interleave 2024-06-29 10:10:35 -05:00
mrq
dd40463803 limit eval size because the training batch size seems to be used for the eval dataloader, somehow (bandaid) 2024-06-29 09:11:28 -05:00
mrq
1a392b69f6 local training backend should be a bit more aware of variable batch sizes, maybe 2024-06-28 22:39:05 -05:00
mrq
8fffb94964 backport fix from tortoise_tts with local trainer + loading state when training lora 2024-06-25 13:41:29 -05:00
mrq
8a986eb480 load exported LoRA weights if exists (to-do: make a better LoRA loading mechanism) 2024-06-18 21:45:46 -05:00
mrq
7cfb78fa64 enable LoRA for targetted RVQ levels (to experiment with, seems to help) 2024-06-17 21:45:03 -05:00
mrq
7047fcc6e2 actually make deepspeed work with LoRAs 2024-06-17 13:55:37 -05:00
mrq
1d159b1476 updated export routine to split LoRA weights from the state dict (should work with deepspeed) 2024-06-17 13:28:18 -05:00
mrq
726a4b613f naive, rudimentary DeepSpeed support (just live with the LoRA weights living with the original weights, they can be split later) 2024-06-17 13:17:24 -05:00
mrq
bd0bc10ec0 added LoRA policy to decide what layer of the model gets adapted based on simple inclusion/exclusion terms 2024-06-17 13:05:06 -05:00
mrq
45a39fb79f very rudimentary lora support (no deepspeed support, tested training and saving but not loading yet) 2024-06-17 00:09:16 -05:00
mrq
19410a919e ugh 2024-06-15 12:29:03 -05:00
mrq
a7a6e0ac76 validated that inferencing works, changed some defaults (NAR benefits from greedy sampling) 2024-06-09 17:11:38 -05:00
mrq
132a02c48b sanity cleanup, backup config yaml for each log file 2024-06-09 11:22:52 -05:00
mrq
8d92dac829 forgot I renamed this 2024-06-09 11:12:30 -05:00
mrq
4ade2b60ee ugh 2024-06-06 21:57:11 -05:00
mrq
fcac9503e2 cleanup 2024-06-06 13:08:02 -05:00
mrq
b2194b859a re-added loading multiple models because I'm now entertaining having split AR/NAR models again (and need a way to load both at once) 2024-06-06 09:48:43 -05:00
mrq
b05a905b95 ugh 2024-06-05 21:02:05 -05:00
mrq
e50edc3b48 added a flag to convert to a HF compatible model on export by stitching things 2024-06-03 22:34:47 -05:00
mrq
934672252b feverish cleanup 2024-06-03 21:28:49 -05:00
mrq
c2a436d368 somehow between training sessions grad_norm = None even though it worked before 2024-06-02 08:29:27 -05:00
mrq
827cf632e7 report current loss scale and adjust grad norm by loss scale (for deepspeed) 2024-06-01 10:44:32 -05:00
mrq
856545f8bb nan loss detection (should have added it earlier), loss scaling for local backend + fp16 2024-05-11 22:23:29 -05:00
mrq
88e9b9caff local ddp fix 2024-05-11 17:29:01 -05:00
mrq
71e373064f remove redundant loss, tweak readme 2024-05-11 15:02:47 -05:00
mrq
c22a177cf8 forgot to pass warmup to schedule free 2024-05-09 22:18:49 -05:00