Commit Graph

95 Commits

Author SHA1 Message Date
mrq
d69a00e389 Properly pass retention_mask for retnet-HF, attempt to fix recurrent forward for retnet (doesn't work still) 2024-04-14 13:12:50 -05:00
mrq
f0c4baeb25 added Adagrad (experimenting with it), added 'extended' model size (16 layers instead of 12, experimenting with it) 2024-04-09 22:04:01 -05:00
mrq
4d75ee066c actually do the Linear replacement with TE's Linear 2024-04-09 14:41:13 -05:00
mrq
9d97eb5104 added FP8 support through NVIDIA/TransformerEngine, added RetNet_HF through syncdoth/RetNet (as an alternative to branch away from torchscale) 2024-04-08 20:14:51 -05:00
mrq
7075c2a5f0 added an option to allow injecting embeddings from another model, because it dawned upon me how valuable embeddings from a good model can be for subsequent trainings (defined under cfg.models._embeddings as a relative path to the yaml) 2024-04-04 19:11:49 -05:00
mrq
91062361af tweaks 2024-03-01 20:38:06 -06:00
mrq
f3c59c3e7e cleaner replacement code (because I realized BitNet had an implementation for it too), added calculating gradient norm and performing gradient clipping in local trainer (non-deepspeed) 2024-03-01 20:18:43 -06:00
mrq
47435207f7 Added cfg.bitsandbytes.replace as a less intrusive alternative to cfg.bitsandbytes.inject to replace all Linear modules in a model 2024-03-01 19:20:10 -06:00
mrq
35d78a2bb0 Yet Another Underlying Transformer Implementation (BitNet, will give it a few days to see how it fares) 2024-02-29 20:29:17 -06:00
mrq
3da1518ace added Mistral (non-Mixtral) backend, useless optimization when not training, proper adjustment of the LR for Prodigyopt through d_coeff (maybe), recurrent sampling for LLaMA/Mistral/Mixtral backends (again, doesn't actually work) 2024-01-31 21:48:36 -06:00
mrq
e799665759 experimental weighting of prom/resp embeds 2024-01-25 12:18:48 -06:00
mrq
c690aa509d fixes and compat (MoE-fying an existing model and retraining from there just ruins it after a second of audio...) 2023-12-25 21:20:32 -06:00
mrq
0db3203b21 added LLaMA/Mixtral (if experts>1) model arches, utilize XMoE's loss as well, set MoE frequency to 1 to make every layer MoE'd for RetNet, etc. (going to do tests without burning out again to see how things go) 2023-12-22 19:27:36 -06:00
mrq
9c198eb75a added torchscale XMOE integration (because Mixtral 8x7B seems very promising and I want to see if it works) 2023-12-20 18:45:58 -06:00
mrq
ed54f4ebec un 'experimental' the better target sequence preparation 2023-10-22 09:06:59 -05:00
mrq
09cda7d3f9 added sampling by speaker group name (might be better to de-emphasize the LibriVox/Audiobooks that are in large numbers, and emphasize the smaller pools), log cleanup 2023-10-16 19:30:38 -05:00
mrq
a539f6889f mucked around with the loss calculation, this seems better? 2023-10-13 18:22:21 -05:00
mrq
08bae355eb actually use langs from the dataloader 2023-10-11 21:21:50 -05:00
mrq
8740cdefc6 added initial support for languages (still testing, marked as model version 3), added experimental 'context extend by limiting the resp context' (untested) 2023-10-11 20:38:40 -05:00
mrq
7facacf7c9 separated samplers into its own file, don't bother copying the logits back to the GPU after sampling, it's not necessary 2023-10-11 12:25:31 -05:00
mrq
e727b6e5c1 changed dynamic temperature trigger to be a min-(n)ar-temp value between [0,(n)ar-temp), flags to set min temp, checkbox in web UI to request it 2023-10-10 17:02:33 -05:00
mrq
87db03dd93 trim the input prompt to 3 seconds when training NAR tasks (marked as experimental; the paper mentions doing so, but I don't know how much this would harm the retention heads) 2023-10-09 22:03:58 -05:00
mrq
27483e56f0 disabled preparing of SpeechX tasks, added dynamic temperature testing (to-do: test it, credited in the function) 2023-10-09 13:01:40 -05:00
mrq
777ba43305 oops 2023-10-03 15:01:37 -05:00
mrq
d12877ee09 added option to set probability of selecting the AR during training under a monolithic AR+NAR, added some more to-dos while I have them in mind 2023-10-02 16:52:42 -05:00
mrq
c0b25541e3 restructured some things with the model to remove dead weights 2023-09-20 19:10:59 -05:00
mrq
a6bfe43590 added mirostat sampling (given a partially trained model, it got far decent output than I expected, need to test on a better trained model) 2023-09-18 18:55:41 -05:00
mrq
4aef798135 added picking final candidate based on sum of score instead of first candidate (this changes nothing). 2023-09-13 13:19:11 -05:00
mrq
23a5fdd645 implemented a naive beam search (I really should be taking a break) 2023-09-12 21:28:07 -05:00
mrq
a6ae344e5b some comments 2023-09-12 16:04:45 -05:00
mrq
d07c63b9d8 unified more things with training the AR+NAR monolothic model 2023-09-12 15:54:41 -05:00
mrq
40ef34e1ca this embedding class definitely works, and migrating from the previous embedding weights seems to work. 2023-09-11 14:13:42 -05:00
mrq
a1f250ffac set default max_levels for NAR to 0 and implicitly set it to max resps levels because the previous way was implicitly assuming all models were outputting at 1+7 RVQ bins. 2023-09-10 20:33:33 -05:00
mrq
671dca88ee throw error when no reference audio is provided in the web UI because someone keeps doing that in the HF space 2023-09-10 15:50:50 -05:00
mrq
ba71020318 added option to limit (or exceed) inferenced RVQ-bin levels through the NAR 2023-09-10 13:50:13 -05:00
mrq
10c34c5b98 added a length-based decay factor for repetition penalty 2023-09-08 21:02:00 -05:00
mrq
14c78bae39 added lots of sampling options (top-k/top-p, repetition penalty, length penalty) 2023-09-08 20:30:54 -05:00
mrq
f69aad9c65 some day I'll get it right 2023-09-08 15:36:26 -05:00
mrq
b2907ae7e0 seems that my PromEmbedding/RespEmbedding doesn't actually work all that well, naively using dedicated MultiEmbeddings for AR/NAR in the monolithic model is the best way to go 2023-09-08 01:03:24 -05:00
mrq
ab5134f385 tweaks and fixes 2023-09-07 17:08:38 -05:00
mrq
b2c2dec291 added homebrewed per-RVQ-bin embedding solutions 2023-09-07 16:48:02 -05:00
mrq
e7a67410d1 oops 2023-09-07 09:14:03 -05:00
mrq
712808494f added support for optional prodigy optimizer (https://github.com/konstmish/prodigy) although it consumes a lot more VRAM per parameter 2023-09-06 20:33:16 -05:00
mrq
7ce06432fd fixed the AR+NAR dual model, the resp_emb has to be split up (classifier might too) 2023-09-06 19:33:39 -05:00
mrq
100ca6b7d0 added option to use SGD optimizer through the YAML, added option to pass in additional optimizer parameters through the YAML, added experimental unified AR+NAR model (does not seem fruitful in testing) 2023-09-06 18:58:35 -05:00