Commit Graph

94 Commits

Author | SHA1 | Message | Date
mrq | e0886c5a78 | re-added mamba as a possible non-experimental arch backend (test trainer will set it as AR only, doing any NAR tasks lobotomizes it) | 2024-06-04 22:41:22 -05:00
mrq | 934672252b | feverish cleanup | 2024-06-03 21:28:49 -05:00
mrq | 7feeb944a0 | probably insane with even entertaining going this route | 2024-06-03 20:26:27 -05:00
mrq | b482ca19ff | added model config option to set KV head count for MQA/GQA instead of MHA for llama-based models (i think its very negligible both ways on such a small model size) | 2024-05-31 19:32:37 -05:00
mrq | e15c6c74c3 | correctness | 2024-05-30 20:50:45 -05:00
mrq | da473295b7 | better way to compute per-segment losses | 2024-05-28 19:29:54 -05:00
mrq | 6c49ad06a3 | forgot to reinclude mult by loss factors | 2024-05-27 20:40:21 -05:00
mrq | b82f0d5c0c | finally nailed the issue that caused logging to break on one machine but not another (bitnet includes zetascale which is a parasite that will break logging) | 2024-05-27 19:47:58 -05:00
mrq | c0ac84c795 | uh | 2024-05-27 19:05:56 -05:00
mrq | 197d517181 | ugh | 2024-05-27 17:09:35 -05:00
mrq | 5af6f41c94 | added loss calcs against prom (requires the right settings for not shit results, disabled by default) | 2024-05-27 08:43:00 -05:00
mrq | 458b95d196 | added option to split between text loss and audio loss (to-do: document this better), because it may or may not be a problem with LLaMA-backed models because my loss hovers around 3.9 / 56% accuracy despite sounding decent at the moment | 2024-05-19 11:23:56 -05:00
mrq | 917eeb40d2 | ughhh | 2024-05-12 08:22:39 -05:00
mrq | 9910c75d5a | checkpointing for bitnet impl | 2024-05-12 07:52:54 -05:00
mrq | 14709ac67f | ughh | 2024-05-12 07:30:59 -05:00
mrq | a755eb3c62 | ugh | 2024-05-11 17:34:45 -05:00
mrq | 88e9b9caff | local ddp fix | 2024-05-11 17:29:01 -05:00
mrq | 3337c69e5a | leverage between xformers and torch.backends.cuda.sdp_kernel for attention | 2024-05-11 17:14:05 -05:00
mrq | d33c7bb7cf | ugh | 2024-05-11 16:47:19 -05:00
mrq | 2109712e5b | resolve deprecation warning that doesn't show on my old training rig but does on my new one | 2024-05-09 23:25:44 -05:00
mrq | 1547de5020 | haha... | 2024-05-09 23:15:52 -05:00
mrq | 0d5d545a40 | crammed in DAdaptation (doesn't seem worth it) and ScheduleFree (forgot I wanted to weeks ago, seems promising), optimization wrapper cleanup, test trainer changes, etc. | 2024-05-09 20:28:20 -05:00
mrq | 33b7f81b94 | small cleanups | 2024-05-04 22:37:22 -05:00
mrq | 253441b750 | forgot to disable verbose flag | 2024-05-04 13:13:52 -05:00
mrq | 3dca1125f5 | implemented xformers in HF's Llama (because theres no flash attention for Volta cards) | 2024-05-04 13:07:45 -05:00
mrq | ffa200eec7 | added option to specify frames per second for the given audio representation (Encodec is 75Hz, DAC is 41Hz (at 24K sources)) | 2024-05-04 12:05:41 -05:00
mrq | b5d1456a09 | backwards compat for my shitty old weights (was testing if disabling AudioEmbedding summing magically made things better (it did not)) | 2024-04-29 22:14:01 -05:00
mrq | 5120ffdda7 | god it would be nice to know the best way to handle audio embeddings, because I genuinely don't know without skimming through papers or devoting X amount of GPU hours in training | 2024-04-29 18:24:05 -05:00
mrq | b0bd88833c | refractor cleanup, had a revelation on how I can handle a batch of varying tasks | 2024-04-16 21:04:48 -05:00
mrq | 467fa1c5ee | wrapper fixes | 2024-04-16 10:19:02 -05:00
mrq | aa1e25fbf5 | backwards compat for old YAMLs with models, option to set flash attention 2 for Llama (and derivatives), included syncdoth/RetNets torchscale retnet for shits and grins, etc. | 2024-04-16 10:02:31 -05:00
mrq | 545162195b | deprecate sole AR/NAR model by only keeping the AR+NAR (the beauty of no one using this is that I can break compat as much as I want), add tone token for when I classify my dataset with tone/emotion in the future, some other things | 2024-04-15 19:54:32 -05:00
mrq | d69a00e389 | Properly pass retention_mask for retnet-HF, attempt to fix recurrent forward for retnet (doesn't work still) | 2024-04-14 13:12:50 -05:00
mrq | 9d97eb5104 | added FP8 support through NVIDIA/TransformerEngine, added RetNet_HF through syncdoth/RetNet (as an alternative to branch away from torchscale) | 2024-04-08 20:14:51 -05:00
mrq | 7075c2a5f0 | added an option to allow injecting embeddings from another model, because it dawned upon me how valuable embeddings from a good model can be for subsequent trainings (defined under cfg.models._embeddings as a relative path to the yaml) | 2024-04-04 19:11:49 -05:00
mrq | 35d78a2bb0 | Yet Another Underlying Transformer Implementation (BitNet, will give it a few days to see how it fares) | 2024-02-29 20:29:17 -06:00
mrq | 3da1518ace | added Mistral (non-Mixtral) backend, useless optimization when not training, proper adjustment of the LR for Prodigyopt through d_coeff (maybe), recurrent sampling for LLaMA/Mistral/Mixtral backends (again, doesn't actually work) | 2024-01-31 21:48:36 -06:00
mrq | cce929e136 | nasty hotfix for transformer's Mixtral throwing an error when batch sizes > 1 | 2024-01-26 19:41:12 -06:00
mrq | e799665759 | experimental weighting of prom/resp embeds | 2024-01-25 12:18:48 -06:00
mrq | c690aa509d | fixes and compat (MoE-fying an existing model and retraining from there just ruins it after a second of audio...) | 2023-12-25 21:20:32 -06:00
mrq | 0db3203b21 | added LLaMA/Mixtral (if experts>1) model arches, utilize XMoE's loss as well, set MoE frequency to 1 to make every layer MoE'd for RetNet, etc. (going to do tests without burning out again to see how things go) | 2023-12-22 19:27:36 -06:00
mrq | 9c198eb75a | added torchscale XMOE integration (because Mixtral 8x7B seems very promising and I want to see if it works) | 2023-12-20 18:45:58 -06:00
mrq | 9a6040383e | make validation samplers ignore sampler type | 2023-10-22 09:01:47 -05:00
mrq | a539f6889f | mucked around with the loss calculation, this seems better? | 2023-10-13 18:22:21 -05:00
mrq | 65f500083d | tweaks to try and get deepspeed quantized inferencing, validating bitsandbytes and deepspeed quantization, nothing seems to work | 2023-10-12 22:21:43 -05:00
mrq | 08bae355eb | actually use langs from the dataloader | 2023-10-11 21:21:50 -05:00
mrq | 3af19d79fd | oops | 2023-10-11 20:49:54 -05:00
mrq | 8740cdefc6 | added initial support for languages (still testing, marked as model version 3), added experimental 'context extend by limiting the resp context' (untested) | 2023-10-11 20:38:40 -05:00
mrq | 7facacf7c9 | separated samplers into its own file, don't bother copying the logits back to the GPU after sampling, it's not necessary | 2023-10-11 12:25:31 -05:00
mrq | 47b3077415 | fixed mirostat issue | 2023-10-10 18:09:49 -05:00