|
e0886c5a78
|
re-added Mamba as a possible non-experimental arch backend (the test trainer will set it as AR only, since doing any NAR tasks lobotomizes it)
|
2024-06-04 22:41:22 -05:00 |
|
|
687c71e028
|
disabled accuracy calculation because it breaks with actual batched training, even though it shouldn't
|
2024-06-04 22:13:44 -05:00 |
|
|
d005e24953
|
oops
|
2024-06-04 22:10:04 -05:00 |
|
|
0f7f3ae754
|
added loss calculation split and accuracy for the experimental model
|
2024-06-04 22:04:40 -05:00 |
|
|
014e565c4b
|
tweaks
|
2024-06-04 20:41:13 -05:00 |
|
|
6d5bd0156a
|
fixes
|
2024-06-04 18:50:48 -05:00 |
|
|
ed3aeaf3a1
|
copy-pasted from the test trainer to the actual trainer
|
2024-06-04 18:40:30 -05:00 |
|
|
0aa01ba31a
|
forgot one crucial detail: you *need* the previous RVQ level to keep coherence between all RVQ levels (the experimental deinterleaved mode is a bit crusty, though)
|
2024-06-04 18:30:30 -05:00 |
|
|
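Context for the commit above: in residual vector quantization, each level quantizes the residual left behind by the levels before it, so level-n codes are only meaningful on top of the reconstruction from levels 0..n-1. A toy sketch with made-up codebooks (not the repo's code):

```python
# Toy residual VQ: each level quantizes what the previous levels failed to capture.
import torch

def nearest(codebook, x):                 # index of the closest codebook entry
    return torch.cdist(x[None], codebook).argmin()

torch.manual_seed(0)
codebooks = [torch.randn(8, 4) for _ in range(3)]   # 3 levels, 8 entries each, dim 4
x = torch.randn(4)

residual, recon, codes = x.clone(), torch.zeros(4), []
for cb in codebooks:
    idx = nearest(cb, residual)
    codes.append(int(idx))
    recon = recon + cb[idx]               # level n only makes sense *added to* levels < n
    residual = x - recon

print(codes, torch.norm(x - recon))       # error shrinks as levels accumulate
```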
2ffad5cb6f
|
typo
|
2024-06-04 14:20:57 -05:00 |
|
|
406ff7bbe1
|
re-implemented config.model.interleave for the HF-compat experimental method
|
2024-06-04 14:19:52 -05:00 |
|
|
c93d5863fd
|
fixes
|
2024-06-04 00:07:00 -05:00 |
|
|
934672252b
|
feverish cleanup
|
2024-06-03 21:28:49 -05:00 |
|
|
7feeb944a0
|
probably insane for even entertaining going this route
|
2024-06-03 20:26:27 -05:00 |
|
|
b482ca19ff
|
added model config option to set KV head count for MQA/GQA instead of MHA for LLaMA-based models (I think it's very negligible either way at such a small model size)
|
2024-05-31 19:32:37 -05:00 |
|
|
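For reference, a minimal sketch of how the KV-head option above maps onto HF transformers' LlamaConfig (the repo's actual config key may be named differently): setting num_key_value_heads below num_attention_heads gives GQA, and 1 gives MQA.

```python
# Minimal sketch, assuming HF transformers' LlamaConfig is the backend being configured.
from transformers import LlamaConfig, LlamaModel

config = LlamaConfig(
    vocab_size=1024,
    hidden_size=1024,
    num_hidden_layers=12,
    num_attention_heads=16,
    num_key_value_heads=4,   # 16 => MHA, 4 => GQA (heads share 4 KV groups), 1 => MQA
)
model = LlamaModel(config)
```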
e15c6c74c3
|
correctness
|
2024-05-30 20:50:45 -05:00 |
|
|
da473295b7
|
better way to compute per-segment losses
|
2024-05-28 19:29:54 -05:00 |
|
|
6c49ad06a3
|
forgot to re-include multiplying by the loss factors
|
2024-05-27 20:40:21 -05:00 |
|
|
b82f0d5c0c
|
finally nailed the issue that caused logging to break on one machine but not another (BitNet includes zetascale, which is a parasite that will break logging)
|
2024-05-27 19:47:58 -05:00 |
|
|
c0ac84c795
|
uh
|
2024-05-27 19:05:56 -05:00 |
|
|
197d517181
|
ugh
|
2024-05-27 17:09:35 -05:00 |
|
|
5af6f41c94
|
added loss calculations against the prom (requires the right settings for decent results; disabled by default)
|
2024-05-27 08:43:00 -05:00 |
|
|
ddbacde0d1
|
DAC just doesn't work well enough......
|
2024-05-25 11:07:52 -05:00 |
|
|
e3ef89f5aa
|
it's 100x better for the subtrain/eval split to be by group instead
|
2024-05-19 16:40:14 -05:00 |
|
|
458b95d196
|
added option to split between text loss and audio loss (to-do: document this better); it may or may not be a problem with LLaMA-backed models, since my loss hovers around 3.9 / 56% accuracy despite sounding decent at the moment
|
2024-05-19 11:23:56 -05:00 |
|
|
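A hedged sketch of what the split above looks like (segment boundaries and names here are illustrative, not the repo's): compute cross-entropy separately over the text-token span and the audio-token span of the target sequence, so each can be logged and weighted on its own.

```python
import torch
import torch.nn.functional as F

def split_losses(logits, targets, text_len):
    # logits: (seq, vocab), targets: (seq,); the first `text_len` targets are text tokens
    text_loss  = F.cross_entropy(logits[:text_len], targets[:text_len])
    audio_loss = F.cross_entropy(logits[text_len:], targets[text_len:])
    return {"text": text_loss, "audio": audio_loss}

# e.g. total = losses["text"] * text_weight + losses["audio"] * audio_weight
```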
917eeb40d2
|
ughhh
|
2024-05-12 08:22:39 -05:00 |
|
|
9910c75d5a
|
checkpointing for bitnet impl
|
2024-05-12 07:52:54 -05:00 |
|
|
14709ac67f
|
ughh
|
2024-05-12 07:30:59 -05:00 |
|
|
a755eb3c62
|
ugh
|
2024-05-11 17:34:45 -05:00 |
|
|
88e9b9caff
|
local DDP fix
|
2024-05-11 17:29:01 -05:00 |
|
|
3337c69e5a
|
allow choosing between xformers and torch.backends.cuda.sdp_kernel for attention
|
2024-05-11 17:14:05 -05:00 |
|
|
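A hedged sketch of that kind of backend toggle (the flag name and plumbing in the repo will differ); note that xformers expects (batch, seq, heads, dim) while SDPA expects (batch, heads, seq, dim):

```python
import torch
import torch.nn.functional as F

def attend(q, k, v, backend: str = "sdpa"):
    # q, k, v: (batch, heads, seq, head_dim)
    if backend == "xformers":
        import xformers.ops as xops
        # xformers wants (batch, seq, heads, head_dim)
        out = xops.memory_efficient_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
        )
        return out.transpose(1, 2)
    # let PyTorch pick among flash / memory-efficient / math kernels
    with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_mem_efficient=True, enable_math=True):
        return F.scaled_dot_product_attention(q, k, v)
```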
d33c7bb7cf
|
ugh
|
2024-05-11 16:47:19 -05:00 |
|
|
0b6499601b
|
sanitizing
|
2024-05-11 16:31:05 -05:00 |
|
|
2109712e5b
|
resolve deprecation warning that doesn't show on my old training rig but does on my new one
|
2024-05-09 23:25:44 -05:00 |
|
|
1547de5020
|
haha...
|
2024-05-09 23:15:52 -05:00 |
|
|
0d5d545a40
|
crammed in DAdaptation (doesn't seem worth it) and ScheduleFree (forgot I wanted to try it weeks ago; seems promising), optimization wrapper cleanup, test trainer changes, etc.
|
2024-05-09 20:28:20 -05:00 |
|
|
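ScheduleFree is worth a note because it is stateful about train/eval: the wrapper has to call optimizer.train() / optimizer.eval() around training and validation. A minimal sketch with the `schedulefree` package (not the repo's wrapper):

```python
import torch
import schedulefree

model = torch.nn.Linear(16, 16)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

optimizer.train()                      # must be in train mode while stepping
for _ in range(10):
    loss = model(torch.randn(4, 16)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
optimizer.eval()                       # and in eval mode for validation/checkpointing
```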
33b7f81b94
|
small cleanups
|
2024-05-04 22:37:22 -05:00 |
|
|
253441b750
|
forgot to disable verbose flag
|
2024-05-04 13:13:52 -05:00 |
|
|
3dca1125f5
|
implemented xformers in HF's Llama (because there's no flash attention for Volta cards)
|
2024-05-04 13:07:45 -05:00 |
|
|
ffa200eec7
|
added option to specify frames per second for the given audio representation (Encodec is 75Hz; DAC is 41Hz at 24K sources)
|
2024-05-04 12:05:41 -05:00 |
|
|
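The frame rate matters because durations in seconds have to be converted to token counts per codec. A small sketch using the numbers from the message above (the dict and function here are illustrative, not the repo's option):

```python
CODEC_FPS = {
    "encodec": 75,  # EnCodec at 24 kHz: 75 frames per second
    "dac": 41,      # Descript Audio Codec on 24 kHz sources: ~41 frames per second
}

def seconds_to_frames(seconds: float, codec: str) -> int:
    return round(seconds * CODEC_FPS[codec])

print(seconds_to_frames(3.0, "encodec"))  # 225
print(seconds_to_frames(3.0, "dac"))      # 123
```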
c494894261
|
simple DDP wrapper (for my NVLink test)
|
2024-05-04 11:48:26 -05:00 |
|
|
a7b43b98b5
|
renamed cfg.bitsandbytes to cfg.optimizations (and having it serve as cfg.optimizations.bitsandbytes)
|
2024-05-02 20:08:59 -05:00 |
|
|
b5d1456a09
|
backwards compat for my shitty old weights (was testing whether disabling AudioEmbedding summing magically made things better; it did not)
|
2024-04-29 22:14:01 -05:00 |
|
|
5120ffdda7
|
god it would be nice to know the best way to handle audio embeddings, because I genuinely don't know without skimming through papers or devoting X amount of GPU hours in training
|
2024-04-29 18:24:05 -05:00 |
|
|
caad7ee3c9
|
final tweaks, hopefully
|
2024-04-28 22:28:29 -05:00 |
|
|
b251669536
|
forgot to fix up the test trainer
|
2024-04-21 14:58:04 -05:00 |
|
|
4f5c9e518a
|
actually use the passed-through sample rate from encode for DAC because it does its own resampling I guess
|
2024-04-18 13:32:41 -05:00 |
|
|
5ff2b4aab5
|
finally swallowing the Descript-Audio-Codec pill (I guess I'm going to have to regenerate my entire dataset)
|
2024-04-17 20:39:35 -05:00 |
|
|
b0bd88833c
|
refactor cleanup, had a revelation on how I can handle a batch of varying tasks
|
2024-04-16 21:04:48 -05:00 |
|
|
467fa1c5ee
|
wrapper fixes
|
2024-04-16 10:19:02 -05:00 |
|
|
aa1e25fbf5
|
backwards compat for old YAMLs with models, option to set flash attention 2 for Llama (and derivatives), included syncdoth/RetNet's torchscale RetNet for shits and grins, etc.
|
2024-04-16 10:02:31 -05:00 |
|
|
545162195b
|
deprecate the sole AR/NAR models by only keeping the AR+NAR (the beauty of no one using this is that I can break compat as much as I want), add a tone token for when I classify my dataset with tone/emotion in the future, and some other things
|
2024-04-15 19:54:32 -05:00 |
|
|
d69a00e389
|
Properly pass retention_mask for retnet-HF, attempt to fix recurrent forward for retnet (still doesn't work)
|
2024-04-14 13:12:50 -05:00 |
|
|
f0c4baeb25
|
added Adagrad (experimenting with it), added 'extended' model size (16 layers instead of 12, experimenting with it)
|
2024-04-09 22:04:01 -05:00 |
|
|
4d75ee066c
|
actually do the Linear replacement with TE's Linear
|
2024-04-09 14:41:13 -05:00 |
|
|
9d97eb5104
|
added FP8 support through NVIDIA/TransformerEngine , added RetNet_HF through syncdoth/RetNet (as an alternative to branch away from torchscale)
|
2024-04-08 20:14:51 -05:00 |
|
|
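A minimal sketch of the TransformerEngine side of the FP8 support above (FP8 needs Hopper/Ada hardware; the repo's actual wiring goes through its own wrapper):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

layer = te.Linear(1024, 1024).cuda()      # TE's drop-in Linear replacement
fp8_recipe = recipe.DelayedScaling()       # default delayed-scaling FP8 recipe

x = torch.randn(8, 1024, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```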
7075c2a5f0
|
added an option to allow injecting embeddings from another model, because it dawned upon me how valuable embeddings from a good model can be for subsequent trainings (defined under cfg.models._embeddings as a relative path to the yaml)
|
2024-04-04 19:11:49 -05:00 |
|
|
91062361af
|
tweaks
|
2024-03-01 20:38:06 -06:00 |
|
|
f3c59c3e7e
|
cleaner replacement code (because I realized BitNet had an implementation for it too), added calculating gradient norm and performing gradient clipping in local trainer (non-deepspeed)
|
2024-03-01 20:18:43 -06:00 |
|
|
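For the gradient-norm / clipping part of the commit above, a minimal self-contained sketch of how a local (non-DeepSpeed) loop typically does it; the max_norm value is a hypothetical config setting:

```python
import torch

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
max_grad_norm = 1.0                      # hypothetical config value

loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()
# clip_grad_norm_ clips in place and returns the pre-clip total norm (useful to log)
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
optimizer.zero_grad()
print(float(grad_norm))
```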
47435207f7
|
Added cfg.bitsandbytes.replace as a less intrusive alternative to cfg.bitsandbytes.inject to replace all Linear modules in a model
|
2024-03-01 19:20:10 -06:00 |
|
|
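A hedged sketch of what a "replace" approach (as opposed to "inject") typically looks like: recursively swap nn.Linear modules for bitsandbytes' 8-bit Linear after the model is built. Names here are illustrative, not the repo's.

```python
import torch.nn as nn
import bitsandbytes as bnb

def replace_linears(module: nn.Module) -> nn.Module:
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,
            ))
        else:
            replace_linears(child)       # recurse into nested submodules
    return module
```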
35d78a2bb0
|
Yet Another Underlying Transformer Implementation (BitNet, will give it a few days to see how it fares)
|
2024-02-29 20:29:17 -06:00 |
|
|
3da1518ace
|
added Mistral (non-Mixtral) backend, useless optimization when not training, proper adjustment of the LR for Prodigyopt through d_coeff (maybe), recurrent sampling for LLaMA/Mistral/Mixtral backends (again, doesn't actually work)
|
2024-01-31 21:48:36 -06:00 |
|
|
cce929e136
|
nasty hotfix for transformers' Mixtral throwing an error when batch sizes > 1
|
2024-01-26 19:41:12 -06:00 |
|
|
e799665759
|
experimental weighting of prom/resp embeds
|
2024-01-25 12:18:48 -06:00 |
|
|
c690aa509d
|
fixes and compat (MoE-fying an existing model and retraining from there just ruins it after a second of audio...)
|
2023-12-25 21:20:32 -06:00 |
|
|
e513d2ef19
|
experts weren't forwarded into the constructor (wasted a few days of training garbage)
|
2023-12-23 16:08:17 -06:00 |
|
|
0db3203b21
|
added LLaMA/Mixtral (if experts>1) model arches, utilize XMoE's loss as well, set MoE frequency to 1 to make every layer MoE'd for RetNet, etc. (going to do tests without burning out again to see how things go)
|
2023-12-22 19:27:36 -06:00 |
|
|
9c198eb75a
|
added torchscale XMOE integration (because Mixtral 8x7B seems very promising and I want to see if it works)
|
2023-12-20 18:45:58 -06:00 |
|
|
ed54f4ebec
|
removed the 'experimental' marking from the better target sequence preparation
|
2023-10-22 09:06:59 -05:00 |
|
|
9a6040383e
|
make validation samplers ignore sampler type
|
2023-10-22 09:01:47 -05:00 |
|
|
09cda7d3f9
|
added sampling by speaker group name (might be better to de-emphasize the LibriVox/Audiobooks that are in large numbers, and emphasize the smaller pools), log cleanup
|
2023-10-16 19:30:38 -05:00 |
|
|
a539f6889f
|
mucked around with the loss calculation, this seems better?
|
2023-10-13 18:22:21 -05:00 |
|
|
65f500083d
|
tweaks to try and get deepspeed quantized inferencing, validating bitsandbytes and deepspeed quantization, nothing seems to work
|
2023-10-12 22:21:43 -05:00 |
|
|
08bae355eb
|
actually use langs from the dataloader
|
2023-10-11 21:21:50 -05:00 |
|
|
3af19d79fd
|
oops
|
2023-10-11 20:49:54 -05:00 |
|
|
8740cdefc6
|
added initial support for languages (still testing, marked as model version 3), added experimental 'context extend by limiting the resp context' (untested)
|
2023-10-11 20:38:40 -05:00 |
|
|
7facacf7c9
|
separated samplers into their own file; don't bother copying the logits back to the GPU after sampling, it's not necessary
|
2023-10-11 12:25:31 -05:00 |
|
|
47b3077415
|
fixed mirostat issue
|
2023-10-10 18:09:49 -05:00 |
|
|
99e980d323
|
documentation and more better-er attribution
|
2023-10-10 17:15:16 -05:00 |
|
|
e727b6e5c1
|
changed the dynamic temperature trigger to be a min-(n)ar-temp value within [0, (n)ar-temp); added flags to set the min temp and a checkbox in the web UI to request it
|
2023-10-10 17:02:33 -05:00 |
|
|
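For context, a hedged sketch of the entropy-scaled "dynamic temperature" idea being toggled here (the exact formulation and parameter names in the repo may differ): the effective temperature is interpolated between min_temp and max_temp based on the normalized entropy of the distribution.

```python
import torch

def dynamic_temperature(logits, min_temp=0.5, max_temp=1.2):
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-10)).sum()
    max_entropy = torch.log(torch.tensor(float(logits.numel())))
    # confident (low-entropy) distributions get a temperature near min_temp
    t = min_temp + (max_temp - min_temp) * (entropy / max_entropy)
    return logits / t

# trigger convention from the message above: dynamic temp is requested by setting
# the min temp within [0, temp); otherwise sampling falls back to the static temp
```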
ec25f56bd9
|
using torch.max fixes things, somehow, for dynamic temp sampling
|
2023-10-10 16:42:24 -05:00 |
|
|
87db03dd93
|
trim the input prompt to 3 seconds when training NAR tasks (marked as experimental; the paper mentions doing so, but I don't know how much this would harm the retention heads)
|
2023-10-09 22:03:58 -05:00 |
|
|
26fbb92ec6
|
reduced dynamic temperature threshold to > 1.0, as it seems to not quite be useful for audio LMs; sped up any sampling that touches logits by copying them to the CPU first (accessing tensors on the GPU is slow as balls)
|
2023-10-09 14:46:17 -05:00 |
|
|
27483e56f0
|
disabled preparing of SpeechX tasks, added dynamic temperature testing (to-do: test it, credited in the function)
|
2023-10-09 13:01:40 -05:00 |
|
|
2deb995cc9
|
updated setup script
|
2023-10-06 20:08:28 -05:00 |
|
|
63cc9cf37a
|
added compat flags for torchscale because the maintainer for torchscale broke compat for existing models
|
2023-10-05 16:39:46 -05:00 |
|
|
777ba43305
|
oops
|
2023-10-03 15:01:37 -05:00 |
|
|
d12877ee09
|
added option to set probability of selecting the AR during training under a monolithic AR+NAR, added some more to-dos while I have them in mind
|
2023-10-02 16:52:42 -05:00 |
|
|
c0b25541e3
|
restructured some things with the model to remove dead weights
|
2023-09-20 19:10:59 -05:00 |
|
|
a6bfe43590
|
added mirostat sampling (given a partially trained model, it got far more decent output than I expected; need to test on a better-trained model)
|
2023-09-18 18:55:41 -05:00 |
|
|
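A hedged sketch of Mirostat (v2-style) sampling for reference; the repo's implementation and parameter names may differ. The sampler keeps a running "mu" so the surprise of emitted tokens tracks a target tau.

```python
import torch

def mirostat_sample(logits, state, tau=3.0, eta=0.1):
    mu = state.get("mu", 2.0 * tau)
    probs = torch.softmax(logits, dim=-1)
    surprise = -torch.log2(probs)                     # per-token surprise in bits
    mask = surprise <= mu                             # keep tokens under the surprise cap
    if not mask.any():                                # always keep at least the argmax
        mask[probs.argmax()] = True
    filtered = torch.where(mask, probs, torch.zeros_like(probs))
    token = torch.multinomial(filtered / filtered.sum(), 1).item()
    state["mu"] = mu - eta * (surprise[token].item() - tau)   # feedback toward tau
    return token
```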
2567e082b5
|
UGH
|
2023-09-16 00:26:13 -05:00 |
|
|
22ffaf3a33
|
have the loss for the NAR not ignore the text prompt; I imagine this should help the NAR and explains why it's always had a bit of an issue with training
|
2023-09-15 19:08:44 -05:00 |
|
|
4aef798135
|
added picking the final candidate based on the sum of scores instead of the first candidate (this changes nothing).
|
2023-09-13 13:19:11 -05:00 |
|
|
23a5fdd645
|
implemented a naive beam search (I really should be taking a break)
|
2023-09-12 21:28:07 -05:00 |
|
|
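A toy sketch of what a naive beam search over an autoregressive step looks like (`step`, the beam width, and the fixed step count are placeholders, not the repo's API):

```python
import torch

def beam_search(step, start_tokens, width=4, max_steps=16):
    # each beam is (tokens, cumulative log-probability)
    beams = [(list(start_tokens), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            logprobs = torch.log_softmax(step(tokens), dim=-1)   # (vocab,)
            top = logprobs.topk(width)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((tokens + [int(idx)], score + float(lp)))
        # keep the `width` best candidates by cumulative score
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0][0]       # best-scoring candidate (sum of log-probs)
```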
a6ae344e5b
|
some comments
|
2023-09-12 16:04:45 -05:00 |
|
|
d07c63b9d8
|
unified more things with training the AR+NAR monolithic model
|
2023-09-12 15:54:41 -05:00 |
|
|
40ef34e1ca
|
this embedding class definitely works, and migrating from the previous embedding weights seems to work.
|
2023-09-11 14:13:42 -05:00 |
|
|
a1f250ffac
|
set the default max_levels for the NAR to 0 and implicitly set it to the max resps levels, because the previous way implicitly assumed all models were outputting at 1+7 RVQ bins.
|
2023-09-10 20:33:33 -05:00 |
|
|
671dca88ee
|
throw error when no reference audio is provided in the web UI because someone keeps doing that in the HF space
|
2023-09-10 15:50:50 -05:00 |
|
|
ba71020318
|
added option to limit (or exceed) inferenced RVQ-bin levels through the NAR
|
2023-09-10 13:50:13 -05:00 |
|
|
10c34c5b98
|
added a length-based decay factor for repetition penalty
|
2023-09-08 21:02:00 -05:00 |
|
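A hedged sketch of a length-decayed repetition penalty like the one described above (the exact decay form and defaults in the repo may differ): tokens seen further back in the output are penalized less than tokens emitted just now.

```python
import torch

def repetition_penalty(logits, previous_tokens, penalty=1.25, decay=0.9):
    for distance, token in enumerate(reversed(previous_tokens)):
        factor = 1.0 + (penalty - 1.0) * (decay ** distance)   # decays toward no penalty
        if logits[token] > 0:
            logits[token] = logits[token] / factor
        else:
            logits[token] = logits[token] * factor
    return logits
```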