|
dcaf38b359
|
fixed training tqdm being stubborn
|
2024-11-23 09:45:23 -06:00 |
|
|
41d7c30ea5
|
added much cleaner non-causal mask generation
|
2024-11-22 19:43:32 -06:00 |
|
|
c99a74e834
|
actually generate a causal mask because it seems sometimes it does not actually generate one because it makes assumptions
|
2024-11-22 18:30:24 -06:00 |
|
|
ccee5fc11c
|
that was actually all pointless since sdpa always had an attention mask fed to it and does not need is_causal to implicitly generate one
|
2024-11-22 16:51:50 -06:00 |
|
|
4aa685e749
|
what has science done
|
2024-11-22 16:45:40 -06:00 |
|
|
147219a5e0
|
huge oversight in the attention masking......... (i realized I have not been providing a non-causal mask to non-causal tasks)
|
2024-11-22 13:44:43 -06:00 |
|
|
24d888c47c
|
temporarily dropping support for xformers because it's breaking when using an attention mask (which i dont remember commenting it out when being passed), default to not use wandb because it's being a pain when doing tests and not actual sessionsS)
|
2024-11-22 11:29:12 -06:00 |
|
|
8aafae91fd
|
dont use timeembedding
|
2024-11-21 23:14:52 -06:00 |
|
|
2cef97e43f
|
cleanup
|
2024-11-21 23:08:43 -06:00 |
|
|
3fc0540f49
|
m
|
2024-11-21 15:07:46 -06:00 |
|
|
6845c447c9
|
added more harvard sentences to load from a text file
|
2024-11-21 13:18:11 -06:00 |
|
|
2a084544e8
|
moved duration padding for NAR-len to be a scalar instead (since it seems longer utterances need it much more so than shorter utterances)
|
2024-11-21 13:04:07 -06:00 |
|
|
6aee08f9c0
|
moved stuff in the web UI around (un-experimented the max NAR-len steps because its kind of important to adjust this value for better sounding audio / quicker generated audio)
|
2024-11-20 20:37:33 -06:00 |
|
|
dfdba3f190
|
oops
|
2024-11-20 19:21:03 -06:00 |
|
|
cd6e9ba2f2
|
oops
|
2024-11-20 16:27:51 -06:00 |
|
|
1a73ac6a20
|
I cannot believe it's not actually called Wand DB (added wandb logging support since I think it would have been a much better way to look at my metrics)
|
2024-11-20 16:10:47 -06:00 |
|
|
67f7bad168
|
added mixed modality AR+NAR-len to generate a short prefix through the AR, then inference with said prefix through the NAR-len (need to experiment with it more to ensure that the masked off tokens are the only tokens getting updated)
|
2024-11-20 14:22:12 -06:00 |
|
|
db64e6cb59
|
dependency updates (gradio 5.x now works on my machine)
|
2024-11-20 12:33:01 -06:00 |
|
|
efeb55e1b7
|
documentation update
|
2024-11-19 19:19:34 -06:00 |
|
|
b1369e7824
|
better modality selection (pick AR+NAR by default for the ar+nar model, pick NAR-len by default for the nar-len model), lowered default CFG because it makes the AR+NAR output sped up (but can't be too low since it's required for the NAR-len)
|
2024-11-19 18:51:17 -06:00 |
|
|
190a917b3e
|
I did it.
|
2024-11-19 12:24:33 -06:00 |
|
|
0e621354e7
|
cleaned up classifier-free guidance logit processing (in order to try and cope with a bad nar-len model)
|
2024-11-19 10:30:05 -06:00 |
|
|
5ba80686e1
|
two weeks of agony concludes
|
2024-11-18 21:29:28 -06:00 |
|
|
2b29790173
|
oops
|
2024-11-18 14:12:26 -06:00 |
|
|
4a71981456
|
normalize sampler index by batch size (if not using batched sampler), add option to cap out utterances for a speaker, some other things
|
2024-11-18 12:46:50 -06:00 |
|
|
6cfdf94bf9
|
swap priority to use nar-len if available, added notes
|
2024-11-18 09:40:04 -06:00 |
|
|
069b27570f
|
set option to set training masking ratio (I don't think for tts a fixed masking ratio is beneficial since the magic of the AR+NAR is being able to still reference the prior sequence of tokens for predicting things)
|
2024-11-17 17:04:07 -06:00 |
|
|
88d840218d
|
default set cfg strength to 3.0 since the reference model is updated
|
2024-11-17 10:23:40 -06:00 |
|
|
a3e1fa3518
|
ugh
|
2024-11-17 09:28:33 -06:00 |
|
|
23fdba0c98
|
tweaks and changes
|
2024-11-16 15:49:06 -06:00 |
|
|
2fbeacfe92
|
ugh
|
2024-11-14 22:18:33 -06:00 |
|
|
39096f8ff3
|
redid loss calculation to be cleaner, and position ID generation, and other things (I might need to train the NAR-len from scratch and not resume from an existing checkpoint.........)
|
2024-11-14 22:17:47 -06:00 |
|
|
ef05c951ff
|
adjust fp16 loss scaling since I fried a model overnight when it hit 8K scale
|
2024-11-14 09:23:52 -06:00 |
|
|
e412e98125
|
ugh
|
2024-11-14 07:34:22 -06:00 |
|
|
c00fc18b62
|
actually use the right embedding for nar-len
|
2024-11-13 18:04:04 -06:00 |
|
|
3ea8a610d6
|
fix STT
|
2024-11-13 14:27:15 -06:00 |
|
|
910033343c
|
overhauled how the right resp level / classifier gets picked to avoid cringemath
|
2024-11-13 13:31:17 -06:00 |
|
|
269648605e
|
move NAR-len rvq level 0 to separate embedding
|
2024-11-13 11:38:58 -06:00 |
|
|
29e45be0b4
|
tweaks to bucket sampling
|
2024-11-13 11:09:24 -06:00 |
|
|
b2eca271a8
|
ugh
|
2024-11-13 10:35:44 -06:00 |
|
|
be83ddabaa
|
better causal-ness for split loss calc, and also do masking for NAR-len for it
|
2024-11-13 10:17:52 -06:00 |
|
|
6b76419123
|
ugh
|
2024-11-13 09:54:20 -06:00 |
|
|
ad7cfffc00
|
NAR-len RVQ-0 was being trained causally.............
|
2024-11-13 09:43:50 -06:00 |
|
|
976ee87f6f
|
resume iteration step in tqdm trainer, warn to logger if the sampler state dict was invalidated
|
2024-11-13 09:09:28 -06:00 |
|
|
8286aa54c8
|
do not pass timestep token/embedding since it doesn't seem to matter at all after all, fixed training masking rate to 80% because a paper said so
|
2024-11-13 09:07:10 -06:00 |
|
|
caf721c67b
|
set it to zero because it'll make the stop token hide more often than not
|
2024-11-12 22:30:50 -06:00 |
|
|
0f2584eba7
|
new meme sampler PogChamp new meme sampler PogChamp (it sort of helps?)
|
2024-11-12 22:30:09 -06:00 |
|
|
663f07038d
|
haha... (do not create a token dropout/noise mask when not training (this sadly didnt fix NAR-len output))
|
2024-11-12 16:41:58 -06:00 |
|
|
b09328069e
|
actually do CFG sampling for base AR+NAR tasks
|
2024-11-12 13:42:39 -06:00 |
|
|
2495a7ef67
|
Fixed STT in the web UI
|
2024-11-12 12:49:53 -06:00 |
|