Commit Graph

68 Commits

Author SHA1 Message Date
mrq 5fe01ffc6c more notes / re-enabled top-k/p samplers for new implementation 2025-04-19 14:04:34 -05:00
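The re-enabled top-k/top-p samplers follow the standard filtering recipe; a minimal PyTorch sketch of it (the function name and defaults are illustrative, not this repo's actual API):

```python
import torch

def top_k_top_p(logits: torch.Tensor, top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    # top-k: mask everything below the k-th largest logit
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # top-p (nucleus): drop the low-probability tail of the sorted distribution
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        remove[..., 1:] = remove[..., :-1].clone()  # shift so the cutoff token survives
        remove[..., 0] = False
        logits = logits.masked_fill(remove.scatter(-1, sorted_idx, remove), float("-inf"))
    return logits
```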
mrq d9e18037cc new implementation tweaks and fixes to make it actually better (there were a lot of badwrong things being done that harmed the output quality, will evaluate the model further) 2025-04-18 20:36:44 -05:00
mrq 98d1d8cb1e added some more notes, tweaks (RIP DAC, it's over) 2025-04-17 20:24:40 -05:00
mrq 6d42c9ae23 how foolish of me, not having a softmax as float32 (maybe addresses an emergent regression where bfloat16 training shits the bed while float16+loss scaling doesn't) 2025-04-07 22:51:52 -05:00
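Upcasting the softmax to float32 is a common stability fix: bfloat16 has so few mantissa bits that the exponentiate-and-normalize step can degenerate, while float16 plus loss scaling sidesteps the problem differently. A minimal sketch of the idea (not the repo's exact code):

```python
import torch

def stable_softmax(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # upcast so exp/normalize happens at full precision, then cast back
    # to whatever dtype the surrounding autocast region expects
    return torch.softmax(logits.float(), dim=dim).to(logits.dtype)
```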
mrq d6cd848c32 goodbye nvidia/audio-codec-44khz, crossed fingers for DAC again 2025-04-06 21:05:29 -05:00
mrq 2e93438867 reintroduced sampler_type = speaker because I think this might salvage the nemo model to have better speaker similarities 2025-04-03 19:01:10 -05:00
mrq 0e995dbf2c is this my last cope (falling back to explicit duration prediction, as this regression just won't go away) (also the smaller model was lobotomized because of my ROCm setup having a botched SDPA for who knows why) 2025-04-02 17:01:24 -05:00
mrq 6ae282e090 re-added the noise dataloader sampler for the old implementation's other tasks that require it 2025-03-28 15:07:06 -05:00
mrq 90b3509404 I'll just cope and say I cannot apply segmented attention masks to the smaller model as it's too trained on not doing it, and the regression came from dumb python aliasing rules 2025-03-27 13:27:51 -05:00
mrq 2fd82a7a22 cannot get segmented mask to actually work without gradients exploding (need to find a different way to do duration prediction...) 2025-03-27 00:51:41 -05:00
mrq 4d777b5618 add remark that segmented attention actually might be broken (for some reason this only emerged recently, need to investigate) 2025-03-26 12:08:47 -05:00
mrq 8641c87611 nothing could go wrong part 2 (reverted and rewrote commits since there was a nasty regression) 2025-03-25 23:06:16 -05:00
mrq aa8b32d97e added more notes (although I could have sworn I had more notes that I can't recall) 2025-03-25 18:53:06 -05:00
mrq df5b870908 added remark about not using sliding attention 2025-03-22 12:44:34 -05:00
mrq 9a7458cf17 fixed inferencing since I did delete the len_emb, some more notes on the model since it seems I just had bad experimental settings 2025-03-19 22:41:48 -05:00
mrq 81acd565b3 re-enable these 2025-03-18 20:59:33 -05:00
mrq b0dba9db07 this may bite me in the ass 2025-03-17 21:46:50 -05:00
mrq 2dfef693c4 comments for clarity 2025-03-16 11:30:23 -05:00
mrq 9cfbf94b1c config-ify the len_loss_factor 2025-03-14 20:30:48 -05:00
mrq ba5f3d19b4 use the FSQ-targeted encoder/decoder wholly as it works for EnCodec too, while the RVQ-targeted encoder/decoder doesn't (and some notes) 2025-03-12 22:47:19 -05:00
mrq 5c512717a6 len prediction for new model (and remove logit normalization since it kills inferencing) 2025-03-11 20:33:09 -05:00
mrq 5cd71ef238 QoL so I can stop having to manually inject different configs 2025-03-06 14:48:14 -06:00
mrq 2fb2b732fc wow that was fast 2025-03-04 23:17:18 -06:00
mrq 0451f75e33 now that the new model seems a little more promising, I can re-document things non-cynically 2025-03-03 13:21:41 -06:00
mrq 3f1070f575 tweaks 2025-03-02 22:36:25 -06:00
mrq 4afa4ccce5 at wits' end (perhaps the semantic token approach is the toughest pill to swallow) 2025-03-01 21:03:25 -06:00
mrq a174c33db6 a gorillionth time's the charm (aka: the encoder/decoder pill is a tough pill to swallow) 2025-02-28 17:56:50 -06:00
mrq eff180248c decoupled llama backend to avoid any funny changes from transformers, removed other backends since I don't think I'll ever bother using them 2025-02-27 19:00:37 -06:00
mrq 95da4e9405 made muon actually work by actually utilizing param groups (thanks APOLLO for reminding me this is the sane way to handle this split) 2025-02-26 10:39:13 -06:00
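Muon-style optimizers only apply their orthogonalized update to 2D weight matrices, so the usual pattern is to split parameters into groups and let an AdamW-style update handle the rest. A hedged sketch of such a split (the `use_muon` group flag and the name filters are assumptions, not necessarily this repo's exact scheme):

```python
import torch

def build_param_groups(model: torch.nn.Module) -> list[dict]:
    # Muon only makes sense for 2D weight matrices; embeddings, output
    # heads, norms, and biases fall back to AdamW-style updates
    muon, fallback = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim == 2 and "embed" not in name and "head" not in name:
            muon.append(p)
        else:
            fallback.append(p)
    # "use_muon" is a hypothetical per-group flag; implementations vary in
    # how they mark which group receives the orthogonalized update
    return [
        {"params": muon, "use_muon": True},
        {"params": fallback, "use_muon": False},
    ]
```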
mrq 92139b6da9 additional cruft, added a note in documentation to be aware of NUMA node topology when running vall_e.emb.process with more than one process 2025-02-18 19:56:30 -06:00
mrq 0dc49ef4d5 documentation update while I wait for more audio (between 4 and 8 seconds per utterance) to quantize for nvidia/audio-codec-44khz (I was foolish to think I could get something serviceable with just 4 seconds max for an utterance) 2025-02-15 17:42:06 -06:00
mrq 04fef5dad5 agony 2025-02-12 00:18:24 -06:00
mrq 1c0ed6abac added notes on this unfruitful experiment 2025-02-11 16:21:43 -06:00
mrq 9fa87c417a added option to use raw text rather than the IPA phonemes (it requires a model trained on raw text) 2025-01-06 00:10:43 -06:00
mrq 9b0d2ccbe1 2024-12-26 21:42:17 -06:00
mrq 59bf6b8b33 exposed additional tasks (ns, sr, vc) (vc is experimental) 2024-12-20 11:15:29 -06:00
mrq 8515038968 imagine my disappointment when the epoch finished just for it to throw an exception 2024-12-16 18:28:01 -06:00
mrq f41251f648 more fixes for local engine backend 2024-12-12 14:38:42 -06:00
mrq 8568a93dad added WER/SIM-O metrics, added APOLLO but I need to test it 2024-12-10 20:13:21 -06:00
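WER here is the usual word-level Levenshtein distance normalized by reference length; a self-contained sketch of that metric (SIM-O additionally needs a speaker-embedding model and is omitted):

```python
def wer(reference: str, hypothesis: str) -> float:
    # word error rate: word-level edit distance / reference word count
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[-1][-1] / max(1, len(ref))
```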
mrq a6c745bafb Chinese (Mandarin?) support added (I guess I don't need pinyin, but tone markers are handled), Korean validated, vocab adjusted 2024-12-09 14:26:19 -06:00
mrq a032ff588f doc update, added automatically deducing language from a given text, also checks if the input is already phonemized text to allow direct control without being cringe (procrastinating adding WER/SIM-O) 2024-12-07 22:34:25 -06:00
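Detecting already-phonemized input can be as simple as a character-range check; the heuristic below is a hypothetical approximation of the idea, not the repo's implementation:

```python
def looks_phonemized(text: str) -> bool:
    # hypothetical heuristic: IPA Extensions (U+0250..U+02AF) and spacing
    # modifier letters (stress/length marks, U+02B0..U+02FF) rarely appear
    # in ordinary orthographic text
    return any(0x0250 <= ord(ch) <= 0x02FF for ch in text)
```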
mrq 93d27be539 rolling context finally (use last N utterances as the prefix for the next gen), option to split input text prompt by sentences instead of lines (or no splitting) 2024-12-04 20:31:44 -06:00
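Rolling context amounts to keeping a sliding window of the last N generated utterances and prepending it as the prompt for the next one; a hedged sketch (the `tts` call signature is hypothetical):

```python
from collections import deque

def synthesize_document(tts, sentences, max_history: int = 4):
    # keep the last N generations and feed them back in as the prefix
    history = deque(maxlen=max_history)
    clips = []
    for sentence in sentences:
        clip = tts(text=sentence, prefix=list(history))  # hypothetical signature
        history.append(clip)
        clips.append(clip)
    return clips
```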
mrq 9dff68c0c5 NAR-len tweaks (remasks a small amount of tokens per step, which seems to help reduce the number of steps needed some of the time; disables CFG for the first half to speed things up) 2024-12-04 09:30:29 -06:00
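One way to realize "remask a small amount per step": each step fills the currently masked positions, remasks only the least-confident few for the next step, and skips classifier-free guidance during the first half of the schedule. Everything below (call signatures, defaults) is illustrative, not the repo's code:

```python
import torch

def demask(model, tokens, mask, steps=25, remask_p=0.05, cfg_scale=3.0):
    # tokens: (seq,) long; mask: (seq,) bool marking still-masked positions
    for step in range(steps):
        logits = model(tokens)                      # hypothetical conditional call
        if step >= steps // 2:                      # CFG only in the second half
            uncond = model(tokens, drop_cond=True)  # hypothetical unconditional call
            logits = uncond + cfg_scale * (logits - uncond)
        conf, sampled = logits.softmax(dim=-1).max(dim=-1)
        tokens = torch.where(mask, sampled, tokens)  # only masked slots change
        # remask a small fraction of the least-confident freshly filled slots
        k = max(1, int(remask_p * tokens.numel()))
        worst = conf.masked_fill(~mask, float("inf")).topk(k, largest=False).indices
        mask = torch.zeros_like(mask)
        mask[worst] = True
    return tokens
```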
mrq ca31da0a95 sageattn (forgot to bother with testing this the other day, seems fine) 2024-12-03 15:14:57 -06:00
mrq 31ab90d84a cringe code to convert to LlamaForCausalLM-happy weights + tokenizer dict (still need to write logic to actually use these weights for proper inferencing) 2024-12-03 10:18:58 -06:00
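Converting to LlamaForCausalLM-compatible weights is mostly a state-dict key remap; in the sketch below the source prefixes are made up, while the targets follow HF's standard Llama key layout:

```python
def remap_to_llama(state_dict: dict) -> dict:
    # hypothetical source prefixes on the left; standard HF
    # LlamaForCausalLM key prefixes on the right
    rules = {
        "model.embedding.": "model.embed_tokens.",
        "model.blocks.":    "model.layers.",
        "classifier.":      "lm_head.",
    }
    remapped = {}
    for key, weight in state_dict.items():
        for src, dst in rules.items():
            if key.startswith(src):
                key = dst + key[len(src):]
                break
        remapped[key] = weight
    return remapped
```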
mrq 84a05acb6d touch-ups in docs 2024-12-02 19:10:42 -06:00
mrq 67f7bad168 added mixed modality AR+NAR-len to generate a short prefix through the AR, then inference with said prefix through the NAR-len (need to experiment with it more to ensure that the masked-off tokens are the only tokens getting updated) 2024-11-20 14:22:12 -06:00
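The mixed-modality pass described here reduces to: let the AR commit to a short prefix, then hand that prefix to the NAR-len as frozen context while the remainder stays masked. A hedged sketch with hypothetical call signatures:

```python
import torch

def mixed_ar_nar(ar, nar, prompt, total_len, prefix_len=16, mask_token=0):
    # 1) the AR commits to a short prefix (hypothetical .generate signature)
    prefix = ar.generate(prompt, max_tokens=prefix_len)
    # 2) the NAR-len fills in the rest; the prefix is excluded from the mask,
    #    so only the masked-off positions should ever be rewritten
    tokens = torch.full((total_len,), mask_token, dtype=torch.long)
    tokens[:prefix_len] = prefix
    mask = torch.ones(total_len, dtype=torch.bool)
    mask[:prefix_len] = False
    return nar.demask(prompt, tokens, mask)  # hypothetical NAR-len entry point
```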
mrq efeb55e1b7 documentation update 2024-11-19 19:19:34 -06:00
mrq 190a917b3e I did it. 2024-11-19 12:24:33 -06:00
mrq 5ba80686e1 two weeks of agony concludes 2024-11-18 21:29:28 -06:00