354 Commits

Author SHA1 Message Date
mrq
726a4b613f naive, rudimentary DeepSpeed support (just live with the LoRA weights living with the original weights, they can be split later) 2024-06-17 13:17:24 -05:00
mrq
bd0bc10ec0 added LoRA policy to decide what layer of the model gets adapted based on simple inclusion/exclusion terms 2024-06-17 13:05:06 -05:00
mrq
be051d9544 added other LoRA method using parametrization rather than linear injection 2024-06-17 09:58:34 -05:00
mrq
45a39fb79f very rudimentary lora support (no deepspeed support, tested training and saving but not loading yet) 2024-06-17 00:09:16 -05:00
mrq
19410a919e ugh 2024-06-15 12:29:03 -05:00
mrq
d343bde09b residual_in_fp32=False for mamba arch backends because it breaks the classifier (output projection / lm head / what-have-you) under AMP 2024-06-15 12:08:03 -05:00
mrq
ccb14c06ef mamba2-hf using vasqu/mamba2-torch because it lets me use mamba2 without triton ops (my 4xV100s are not happy training mamba2 because of triton) 2024-06-14 19:42:17 -05:00
mrq
31f71fa134 sampler update (some brainworm just never actually had a sampler for sample_type=path) 2024-06-14 16:55:40 -05:00
mrq
b3b67f34ac added option to sort paths by durations to better group equal-length sequences together (and there was maybe a logic error from creating the samplers and then interleave-reordering paths, desyncing them, maybe) 2024-06-13 22:37:34 -05:00
mrq
83eab4fa59 actually going for the suggested "2x layers, no intermediate scaling" is wrong for VALL-E, directly copying the normal transformer structure fixes mamba2 performance in the test trainer 2024-06-13 20:08:22 -05:00
mrq
26da24fd8d mamba updated to fix that pesky NaN error during training 2024-06-13 12:38:33 -05:00
mrq
bcf3910a17 the NAR only dream is dead (it just won't work) 2024-06-12 19:49:47 -05:00
mrq
a9353cf9fa ugh 2024-06-12 00:14:29 -05:00
mrq
cca542a4c0 ugh 2024-06-11 23:59:28 -05:00
mrq
65a8960305 option to split classifier per-level instead of sharing one (at this point I'm just scrambling to try and cope with training a DAC model, the NAR is being a pain) 2024-06-11 22:28:59 -05:00
mrq
a7a6e0ac76 validated that inferencing works, changed some defaults (NAR benefits from greedy sampling) 2024-06-09 17:11:38 -05:00
mrq
234f9efc6e ugh 2024-06-09 11:39:43 -05:00
mrq
132a02c48b sanity cleanup, backup config yaml for each log file 2024-06-09 11:22:52 -05:00
mrq
8d92dac829 forgot I renamed this 2024-06-09 11:12:30 -05:00
mrq
80f9530840 ugh 2024-06-09 01:43:44 -05:00
mrq
5c732b72ee ugh 2024-06-08 20:34:00 -05:00
mrq
8d068fa3f9 reticulating splines 2024-06-08 20:30:15 -05:00
mrq
ead3e2f0cb ugh 2024-06-08 16:14:57 -05:00
mrq
b072f9b96b fixes 2024-06-08 16:01:34 -05:00
mrq
58fb0a84db added experimental NAR only model (inferences text length, need more experimenting), AudioEmbedding logic cleanup (I still think it's being done wrong) 2024-06-08 15:42:02 -05:00
mrq
e35a91c67a ugh 2024-06-07 21:56:14 -05:00
mrq
7d6fff24f9 un-tensor'd quant_level marker since it doesn't need to be one (I forgot why I had it as one but nothing seems to need it as a tensor that didn't already make it one) 2024-06-07 20:46:22 -05:00
mrq
b0158a61d5 fixed some logic errors with training (grabbing wrong quant level...) 2024-06-07 20:34:36 -05:00
mrq
eafa622be2 I forgot the actual reason I was cleaning things up was to re-include prom loss calculation (I realized the reason I did this was because of a prom embedding oversight, it seems to work now) 2024-06-07 20:29:25 -05:00
mrq
da8242d086 finally got around to removing omegaconf 2024-06-07 20:23:53 -05:00
mrq
4ade2b60ee ugh 2024-06-06 21:57:11 -05:00
mrq
f9f309281a ugh 2024-06-06 20:55:27 -05:00
mrq
a5c90348d9 head hurt 2024-06-06 20:51:31 -05:00
mrq
516b0894d7 m 2024-06-06 19:41:26 -05:00
mrq
ee25d2e62e removed the need to supply targ_list + different AudioEmbedding + other things 2024-06-06 18:52:41 -05:00
mrq
fcac9503e2 cleanup 2024-06-06 13:08:02 -05:00
mrq
b2194b859a re-added loading multiple models because I'm now entertaining having split AR/NAR models again (and need a way to load both at once) 2024-06-06 09:48:43 -05:00
mrq
b05a905b95 ugh 2024-06-05 21:02:05 -05:00
mrq
4073656293 oops 2024-06-05 20:53:10 -05:00
mrq
ff6fe6f1bc cleanup 2024-06-05 20:30:43 -05:00
mrq
880b4ecd1b cleanup, putting some thoughts in comments before I forget about them 2024-06-05 19:50:06 -05:00
mrq
3cfc8a96bb oops 2024-06-05 10:30:04 -05:00
mrq
48cd1054f9 madness 2024-06-04 23:48:51 -05:00
mrq
9e3f2e300f experimental "just have a token for what rvq level we're on" that seems to help all models (mamba almost works, but it might just have to be relegated as a pure AR model) 2024-06-04 23:23:31 -05:00
mrq
e0886c5a78 re-added mamba as a possible non-experimental arch backend (test trainer will set it as AR only, doing any NAR tasks lobotomizes it) 2024-06-04 22:41:22 -05:00
mrq
687c71e028 disable accuracy calc because it breaks with actual batched training even though it shouldn't 2024-06-04 22:13:44 -05:00
mrq
d005e24953 oops 2024-06-04 22:10:04 -05:00
mrq
0f7f3ae754 added loss calc split and acc for experimental model 2024-06-04 22:04:40 -05:00
mrq
014e565c4b tweaks 2024-06-04 20:41:13 -05:00
mrq
6d5bd0156a fixes 2024-06-04 18:50:48 -05:00
mrq
ed3aeaf3a1 copy pasted from test to actual trainer 2024-06-04 18:40:30 -05:00
mrq
0aa01ba31a forgot one crucial detail (you *need* the previous RVQ level to keep coherence between all RVQ levels) (experimental deinterleaved is a bit crusty though) 2024-06-04 18:30:30 -05:00
mrq
2ffad5cb6f typo 2024-06-04 14:20:57 -05:00
mrq
406ff7bbe1 re-implemented config.model.interleave for the HF-compat experimental method 2024-06-04 14:19:52 -05:00
mrq
c93d5863fd fixes 2024-06-04 00:07:00 -05:00
mrq
186b93a77e oops 2024-06-03 22:35:55 -05:00
mrq
e50edc3b48 added a flag to convert to a HF compatible model on export by stitching things 2024-06-03 22:34:47 -05:00
mrq
934672252b feverish cleanup 2024-06-03 21:28:49 -05:00
mrq
7feeb944a0 probably insane with even entertaining going this route 2024-06-03 20:26:27 -05:00
mrq
c2a436d368 somehow between training sessions grad_norm = None even though it worked before 2024-06-02 08:29:27 -05:00
mrq
c1fcd889d5 reverted automatically disabling split loss calc, since it seems that it's actually calculating loss on prom that causes the oddities, maybe 2024-06-01 12:34:59 -05:00
mrq
8cf176ab46 ugh 2024-06-01 10:46:42 -05:00
mrq
827cf632e7 report current loss scale and adjust grad norm by loss scale (for deepspeed) 2024-06-01 10:44:32 -05:00
mrq
d0ebce6bac ugh 2024-06-01 10:30:13 -05:00
mrq
39bc019142 actually save per-rank sampler states 2024-06-01 09:46:32 -05:00
mrq
74df2f5332 split sampler dict by global_rank, also handle splitting dataset paths by global_rank if sampler_type == path (because I do not trust DistributedSampler) (need to test) 2024-06-01 09:29:49 -05:00
mrq
31785f4eeb actually don't default to compute split losses, test bitnet model doesn't seem to be doing things right (despite debug printouts showing they're roughly the same logit/loss sequences, could just be bitnet linears being not up to par on actual models) 2024-06-01 09:12:51 -05:00
mrq
e9c87060df oops 2024-05-31 22:22:28 -05:00
mrq
b482ca19ff added model config option to set KV head count for MQA/GQA instead of MHA for llama-based models (I think it's negligible both ways on such a small model size) 2024-05-31 19:32:37 -05:00
mrq
e15c6c74c3 correctness 2024-05-30 20:50:45 -05:00
mrq
da473295b7 better way to compute per-segment losses 2024-05-28 19:29:54 -05:00
mrq
6c49ad06a3 forgot to reinclude mult by loss factors 2024-05-27 20:40:21 -05:00
mrq
b82f0d5c0c finally nailed the issue that caused logging to break on one machine but not another (bitnet includes zetascale which is a parasite that will break logging) 2024-05-27 19:47:58 -05:00
mrq
c0ac84c795 uh 2024-05-27 19:05:56 -05:00
mrq
197d517181 ugh 2024-05-27 17:09:35 -05:00
mrq
5af6f41c94 added loss calcs against prom (requires the right settings for not shit results, disabled by default) 2024-05-27 08:43:00 -05:00
mrq
05cd8b797e nevermind it breaks training 2024-05-25 18:03:43 -05:00
mrq
85f9684720 some cleanup 2024-05-25 17:46:52 -05:00
mrq
d760924719 added kludgy eval only so I don't have to start training, type eval, stop training, then delete the logs for that session 2024-05-25 17:39:51 -05:00
mrq
ddbacde0d1 DAC just doesn't work well enough...... 2024-05-25 11:07:52 -05:00
mrq
e3ef89f5aa 100x better for subtrain/eval to be by group instead 2024-05-19 16:40:14 -05:00
mrq
458b95d196 added option to split between text loss and audio loss (to-do: document this better), because it may or may not be a problem with LLaMA-backed models because my loss hovers around 3.9 / 56% accuracy despite sounding decent at the moment 2024-05-19 11:23:56 -05:00
mrq
74e531d391 ugh 2024-05-18 12:02:56 -05:00
mrq
4bc7e5a6d1 fix loading without needing an hdf5 dataset already prepped (and some other incidental speedups during dataloader prep) 2024-05-18 07:14:26 -05:00
mrq
d88a5ca183 ugh 2024-05-16 07:25:33 -05:00
mrq
d9aabfa3ae final tweaks, hopefully, again 2024-05-15 23:04:19 -05:00
mrq
8d79f78e0a god I need to replace omegaconf 2024-05-12 14:01:52 -05:00
mrq
5eb5db7f7f just don't use DAC 24KHz, it's bad 2024-05-12 13:41:17 -05:00
mrq
230da8b559 should be the final things to scramble around for, DAC's 24KHz model is unusable for this, but both encodec's 24KHz and DAC's 44KHz work 2024-05-12 13:22:08 -05:00
mrq
2437a86efa ugh 2024-05-12 13:02:15 -05:00
mrq
4f1593c8db a bunch of shit to salvage my old encodec-quantized audio because dac-encoded audio just does not want to converge 2024-05-12 10:17:29 -05:00
mrq
917eeb40d2 ughhh 2024-05-12 08:22:39 -05:00
mrq
9910c75d5a checkpointing for bitnet impl 2024-05-12 07:52:54 -05:00
mrq
14709ac67f ughh 2024-05-12 07:30:59 -05:00
mrq
3774fcbdee ugh 2024-05-11 22:58:38 -05:00
mrq
856545f8bb nan loss detection (should have added it earlier), loss scaling for local backend + fp16 2024-05-11 22:23:29 -05:00
mrq
a755eb3c62 ugh 2024-05-11 17:34:45 -05:00
mrq
88e9b9caff local ddp fix 2024-05-11 17:29:01 -05:00
mrq
3337c69e5a leverage between xformers and torch.backends.cuda.sdp_kernel for attention 2024-05-11 17:14:05 -05:00
mrq
d33c7bb7cf ugh 2024-05-11 16:47:19 -05:00