|
b2194b859a
|
re-added loading multiple models because I'm now entertaining having split AR/NAR models again (and need a way to load both at once)
|
2024-06-06 09:48:43 -05:00 |
|
|
b05a905b95
|
ugh
|
2024-06-05 21:02:05 -05:00 |
|
|
4073656293
|
oops
|
2024-06-05 20:53:10 -05:00 |
|
|
ff6fe6f1bc
|
cleanup
|
2024-06-05 20:30:43 -05:00 |
|
|
880b4ecd1b
|
cleanup, putting some thoughts in comments before I forget about them
|
2024-06-05 19:50:06 -05:00 |
|
|
3cfc8a96bb
|
oops
|
2024-06-05 10:30:04 -05:00 |
|
|
48cd1054f9
|
madness
|
2024-06-04 23:48:51 -05:00 |
|
|
9e3f2e300f
|
experimental "just have a token for what rvq level we're on" that seems to help all models (mamba almost works, but it might just have to be relegated as a pure AR model)
|
2024-06-04 23:23:31 -05:00 |
|
|
e0886c5a78
|
re-added mamba as a possible non-experimental arch backend (test trainer will set it as AR only, doing any NAR tasks lobotomizes it)
|
2024-06-04 22:41:22 -05:00 |
|
|
934672252b
|
feverish cleanup
|
2024-06-03 21:28:49 -05:00 |
|
|
7feeb944a0
|
probably insane with even entertaining going this route
|
2024-06-03 20:26:27 -05:00 |
|
|
b482ca19ff
|
added model config option to set KV head count for MQA/GQA instead of MHA for llama-based models (I think it's negligible both ways on such a small model size)
|
2024-05-31 19:32:37 -05:00 |
|
|
e15c6c74c3
|
correctness
|
2024-05-30 20:50:45 -05:00 |
|
|
da473295b7
|
better way to compute per-segment losses
|
2024-05-28 19:29:54 -05:00 |
|
|
6c49ad06a3
|
forgot to reinclude mult by loss factors
|
2024-05-27 20:40:21 -05:00 |
|
|
b82f0d5c0c
|
finally nailed the issue that caused logging to break on one machine but not another (bitnet includes zetascale which is a parasite that will break logging)
|
2024-05-27 19:47:58 -05:00 |
|
|
c0ac84c795
|
uh
|
2024-05-27 19:05:56 -05:00 |
|
|
197d517181
|
ugh
|
2024-05-27 17:09:35 -05:00 |
|
|
5af6f41c94
|
added loss calcs against prom (requires the right settings for not shit results, disabled by default)
|
2024-05-27 08:43:00 -05:00 |
|
|
458b95d196
|
added option to split between text loss and audio loss (to-do: document this better); this may or may not be a problem with LLaMA-backed models, since my loss hovers around 3.9 / 56% accuracy despite sounding decent at the moment
|
2024-05-19 11:23:56 -05:00 |
|
|
917eeb40d2
|
ughhh
|
2024-05-12 08:22:39 -05:00 |
|
|
9910c75d5a
|
checkpointing for bitnet impl
|
2024-05-12 07:52:54 -05:00 |
|
|
14709ac67f
|
ughh
|
2024-05-12 07:30:59 -05:00 |
|
|
a755eb3c62
|
ugh
|
2024-05-11 17:34:45 -05:00 |
|
|
88e9b9caff
|
local ddp fix
|
2024-05-11 17:29:01 -05:00 |
|
|
3337c69e5a
|
leverage between xformers and torch.backends.cuda.sdp_kernel for attention
|
2024-05-11 17:14:05 -05:00 |
|
|
d33c7bb7cf
|
ugh
|
2024-05-11 16:47:19 -05:00 |
|
|
2109712e5b
|
resolve deprecation warning that doesn't show on my old training rig but does on my new one
|
2024-05-09 23:25:44 -05:00 |
|
|
1547de5020
|
haha...
|
2024-05-09 23:15:52 -05:00 |
|
|
0d5d545a40
|
crammed in DAdaptation (doesn't seem worth it) and ScheduleFree (forgot I wanted to weeks ago, seems promising), optimization wrapper cleanup, test trainer changes, etc.
|
2024-05-09 20:28:20 -05:00 |
|
|
33b7f81b94
|
small cleanups
|
2024-05-04 22:37:22 -05:00 |
|
|
253441b750
|
forgot to disable verbose flag
|
2024-05-04 13:13:52 -05:00 |
|
|
3dca1125f5
|
implemented xformers in HF's Llama (because there's no flash attention for Volta cards)
|
2024-05-04 13:07:45 -05:00 |
|
|
ffa200eec7
|
added option to specify frames per second for the given audio representation (Encodec is 75Hz, DAC is 41Hz (at 24K sources))
|
2024-05-04 12:05:41 -05:00 |
|
|
b5d1456a09
|
backwards compat for my shitty old weights (was testing if disabling AudioEmbedding summing magically made things better (it did not))
|
2024-04-29 22:14:01 -05:00 |
|
|
5120ffdda7
|
god it would be nice to know the best way to handle audio embeddings, because I genuinely don't know without skimming through papers or devoting X amount of GPU hours in training
|
2024-04-29 18:24:05 -05:00 |
|
|
b0bd88833c
|
refactor cleanup, had a revelation on how I can handle a batch of varying tasks
|
2024-04-16 21:04:48 -05:00 |
|
|
467fa1c5ee
|
wrapper fixes
|
2024-04-16 10:19:02 -05:00 |
|
|
aa1e25fbf5
|
backwards compat for old YAMLs with models, option to set flash attention 2 for Llama (and derivatives), included syncdoth/RetNet's torchscale retnet for shits and grins, etc.
|
2024-04-16 10:02:31 -05:00 |
|
|
545162195b
|
deprecate sole AR/NAR model by only keeping the AR+NAR (the beauty of no one using this is that I can break compat as much as I want), add tone token for when I classify my dataset with tone/emotion in the future, some other things
|
2024-04-15 19:54:32 -05:00 |
|
|
d69a00e389
|
Properly pass retention_mask for retnet-HF, attempt to fix recurrent forward for retnet (doesn't work still)
|
2024-04-14 13:12:50 -05:00 |
|
|
9d97eb5104
|
added FP8 support through NVIDIA/TransformerEngine, added RetNet_HF through syncdoth/RetNet (as an alternative to branch away from torchscale)
|
2024-04-08 20:14:51 -05:00 |
|
|
7075c2a5f0
|
added an option to allow injecting embeddings from another model, because it dawned upon me how valuable embeddings from a good model can be for subsequent trainings (defined under cfg.models._embeddings as a relative path to the yaml)
|
2024-04-04 19:11:49 -05:00 |
|
|
35d78a2bb0
|
Yet Another Underlying Transformer Implementation (BitNet, will give it a few days to see how it fares)
|
2024-02-29 20:29:17 -06:00 |
|
|
3da1518ace
|
added Mistral (non-Mixtral) backend, useless optimization when not training, proper adjustment of the LR for Prodigyopt through d_coeff (maybe), recurrent sampling for LLaMA/Mistral/Mixtral backends (again, doesn't actually work)
|
2024-01-31 21:48:36 -06:00 |
|
|
cce929e136
|
nasty hotfix for transformers' Mixtral throwing an error when batch sizes > 1
|
2024-01-26 19:41:12 -06:00 |
|
|
e799665759
|
experimental weighting of prom/resp embeds
|
2024-01-25 12:18:48 -06:00 |
|
|
c690aa509d
|
fixes and compat (MoE-fying an existing model and retraining from there just ruins it after a second of audio...)
|
2023-12-25 21:20:32 -06:00 |
|
|
0db3203b21
|
added LLaMA/Mixtral (if experts>1) model arches, utilize XMoE's loss as well, set MoE frequency to 1 to make every layer MoE'd for RetNet, etc. (going to do tests without burning out again to see how things go)
|
2023-12-22 19:27:36 -06:00 |
|
|
9c198eb75a
|
added torchscale XMOE integration (because Mixtral 8x7B seems very promising and I want to see if it works)
|
2023-12-20 18:45:58 -06:00 |
|