|
5d80a2d0d4
|
fixed NAR-len issues with non-English maybe (langs weren't being passed), added interface to inference in batches through tts.batched_inference (no support for rolling context/prefixes because there's no way to do that), demo page uses batched inferencing now
|
2024-12-07 19:21:05 -06:00 |
|
|
1f54bf5b40
|
revert sageattn back to optional dependency because it's not on windows, force resize_modules on by default because I broke something
|
2024-12-07 17:09:39 -06:00 |
|
|
218d0e29fd
|
ugh (batchmean actually expects batch=seq_len, and not the actual batch)
|
2024-12-07 12:39:01 -06:00 |
|
|
61ed662856
|
ACTUALLY actually fix KD-loss (the -inf in the logits was caused by cringecode)
|
2024-12-07 12:31:54 -06:00 |
|
|
f97e8b0c7f
|
ACTUALLY do KD-loss because of an oversight with masked_select outputting 1D tensors that get softmax'd in total
|
2024-12-07 09:52:51 -06:00 |
|
|
34a66e1052
|
agnostified KD
|
2024-12-06 23:53:46 -06:00 |
|
|
953d3eb030
|
ugh
|
2024-12-06 22:35:30 -06:00 |
|
|
42fafbaaca
|
actually fixed knowledge distillation because of errant -inf logits causing problems and needed to be filtered (and splitting text language / output audio language because it helps)
|
2024-12-06 21:55:20 -06:00 |
|
|
23d402bf01
|
added knowledge distillation in the trainer (sadly it is not agnostic because of the grave mistake of further processing the batch within the forward pass, so subsequent calls do not match......)
|
2024-12-05 23:05:52 -06:00 |
|
|
4e21df8092
|
oops
|
2024-12-04 21:24:22 -06:00 |
|
|
c66a53492c
|
forgot to add NLTK as a dependency, promoted sageattn as a default dependency since it works fine enough and seems agnostic
|
2024-12-04 20:33:25 -06:00 |
|
|
93d27be539
|
rolling context finally (use last N utterances as the prefix for the next gen), option to split input text prompt by sentences instead of lines (or no splitting)
|
2024-12-04 20:31:44 -06:00 |
|
|
9dff68c0c5
|
NAR-len tweaks (remasks a small amount of tokens per step, it seems to help with reducing the number of steps needed some of the time?, disable CFG for the first half to speed things up)
|
2024-12-04 09:30:29 -06:00 |
|
|
cf97560e70
|
minimum CFG of 3 for NAR-len because it seems the model will auto-default to NAR-len now
|
2024-12-03 19:40:05 -06:00 |
|
|
ca31da0a95
|
sageattn (forgot to bother with testing this the other day, seems fine)
|
2024-12-03 15:14:57 -06:00 |
|
|
31ab90d84a
|
cringe code to convert to LlamaForCausalLM-happy weights + tokenizer dict (still need to write logic to actually use these weights for proper inferencing)
|
2024-12-03 10:18:58 -06:00 |
|
|
84a05acb6d
|
touch ups in docs
|
2024-12-02 19:10:42 -06:00 |
|
|
dcaf38b359
|
fixed training tqdm being stubborn
|
2024-11-23 09:45:23 -06:00 |
|
|
41d7c30ea5
|
added much cleaner non-causal mask generation
|
2024-11-22 19:43:32 -06:00 |
|
|
c99a74e834
|
actually generate a causal mask because it seems sometimes it does not actually generate one because it makes assumptions
|
2024-11-22 18:30:24 -06:00 |
|
|
ccee5fc11c
|
that was actually all pointless since sdpa always had an attention mask fed to it and does not need is_causal to implicitly generate one
|
2024-11-22 16:51:50 -06:00 |
|
|
4aa685e749
|
what has science done
|
2024-11-22 16:45:40 -06:00 |
|
|
147219a5e0
|
huge oversight in the attention masking......... (I realized I have not been providing a non-causal mask to non-causal tasks)
|
2024-11-22 13:44:43 -06:00 |
|
|
24d888c47c
|
temporarily dropping support for xformers because it's breaking when using an attention mask (which I don't remember commenting out when being passed), default to not use wandb because it's being a pain when doing tests and not actual sessions
|
2024-11-22 11:29:12 -06:00 |
|
|
8aafae91fd
|
don't use timeembedding
|
2024-11-21 23:14:52 -06:00 |
|
|
2cef97e43f
|
cleanup
|
2024-11-21 23:08:43 -06:00 |
|
|
3fc0540f49
|
m
|
2024-11-21 15:07:46 -06:00 |
|
|
6845c447c9
|
added more harvard sentences to load from a text file
|
2024-11-21 13:18:11 -06:00 |
|
|
2a084544e8
|
moved duration padding for NAR-len to be a scalar instead (since it seems longer utterances need it much more so than shorter utterances)
|
2024-11-21 13:04:07 -06:00 |
|
|
6aee08f9c0
|
moved stuff in the web UI around (un-experimented the max NAR-len steps because it's kind of important to adjust this value for better sounding audio / quicker generated audio)
|
2024-11-20 20:37:33 -06:00 |
|
|
dfdba3f190
|
oops
|
2024-11-20 19:21:03 -06:00 |
|
|
cd6e9ba2f2
|
oops
|
2024-11-20 16:27:51 -06:00 |
|
|
1a73ac6a20
|
I cannot believe it's not actually called Wand DB (added wandb logging support since I think it would have been a much better way to look at my metrics)
|
2024-11-20 16:10:47 -06:00 |
|
|
67f7bad168
|
added mixed modality AR+NAR-len to generate a short prefix through the AR, then inference with said prefix through the NAR-len (need to experiment with it more to ensure that the masked off tokens are the only tokens getting updated)
|
2024-11-20 14:22:12 -06:00 |
|
|
db64e6cb59
|
dependency updates (gradio 5.x now works on my machine)
|
2024-11-20 12:33:01 -06:00 |
|
|
efeb55e1b7
|
documentation update
|
2024-11-19 19:19:34 -06:00 |
|
|
b1369e7824
|
better modality selection (pick AR+NAR by default for the ar+nar model, pick NAR-len by default for the nar-len model), lowered default CFG because a high CFG makes the AR+NAR output sped up (but can't be too low since it's required for the NAR-len)
|
2024-11-19 18:51:17 -06:00 |
|
|
190a917b3e
|
I did it.
|
2024-11-19 12:24:33 -06:00 |
|
|
0e621354e7
|
cleaned up classifier-free guidance logit processing (in order to try and cope with a bad nar-len model)
|
2024-11-19 10:30:05 -06:00 |
|
|
5ba80686e1
|
two weeks of agony concludes
|
2024-11-18 21:29:28 -06:00 |
|
|
2b29790173
|
oops
|
2024-11-18 14:12:26 -06:00 |
|
|
4a71981456
|
normalize sampler index by batch size (if not using batched sampler), add option to cap out utterances for a speaker, some other things
|
2024-11-18 12:46:50 -06:00 |
|
|
6cfdf94bf9
|
swap priority to use nar-len if available, added notes
|
2024-11-18 09:40:04 -06:00 |
|
|
069b27570f
|
set option to set training masking ratio (I don't think for tts a fixed masking ratio is beneficial since the magic of the AR+NAR is being able to still reference the prior sequence of tokens for predicting things)
|
2024-11-17 17:04:07 -06:00 |
|
|
88d840218d
|
default set cfg strength to 3.0 since the reference model is updated
|
2024-11-17 10:23:40 -06:00 |
|
|
a3e1fa3518
|
ugh
|
2024-11-17 09:28:33 -06:00 |
|
|
23fdba0c98
|
tweaks and changes
|
2024-11-16 15:49:06 -06:00 |
|
|
2fbeacfe92
|
ugh
|
2024-11-14 22:18:33 -06:00 |
|
|
39096f8ff3
|
redid loss calculation to be cleaner, and position ID generation, and other things (I might need to train the NAR-len from scratch and not resume from an existing checkpoint.........)
|
2024-11-14 22:17:47 -06:00 |
|
|
ef05c951ff
|
adjust fp16 loss scaling since I fried a model overnight when it hit 8K scale
|
2024-11-14 09:23:52 -06:00 |
|