Commit Graph

11 Commits

Author | SHA1 | Message | Date
mrq | c6a38693a2 | This better work | 2024-11-09 18:04:59 -06:00
mrq | a22534e8f4 | layer skip training implemented (need to gut the inferencing from the repo, and to actually see if the model can benefit from this) | 2024-10-30 20:05:45 -05:00
mrq | 0d706ec6a1 | added fused_attn (triton-based fused attention) and simply just query for flash_attn under rocm | 2024-08-26 19:13:34 -05:00
mrq | 6b0891448c | pain (some shit to try and get some flash attention for ROCm (gfx1100) through triton fused attention but no good) | 2024-08-25 20:07:27 -05:00
mrq | debcc93e7e | add adapted MixtralAttention for when I make a bad decision to actually train a MoE | 2024-08-04 22:03:22 -05:00
mrq | 10aaf840e7 | added export option to convert Llama to MixtralMoE for another dumb experiment | 2024-08-04 20:25:06 -05:00
mrq | 11fa3da665 | some cleanup, fixed the wrapper attention to explicitly use other sdpa backends | 2024-08-03 19:51:00 -05:00
mrq | ccb14c06ef | mamba2-hf using vasqu/mamba2-torch because it lets me use mamba2 without triton ops (training with my 4xV100s are not happy with mamba2 because of triton) | 2024-06-14 19:42:17 -05:00
mrq | 83eab4fa59 | actually going for the suggested "2x layers, no intermediate scaling" is wrong for VALL-E, directly copying the normal transformer structure fixes mamba2 performance in the test trainer | 2024-06-13 20:08:22 -05:00
mrq | 65a8960305 | option to split classifier per-level instead of sharing one (at this point I'm just scrambling to try and cope with training a DAC model, the NAR is being a pain) | 2024-06-11 22:28:59 -05:00
mrq | ff6fe6f1bc | cleanup | 2024-06-05 20:30:43 -05:00