Commit Graph

Author SHA1 Message Date
mrq  5788db849b  added extremely barebones vall_e.cpp so I can stop having to juggle this file around so much  2024-12-21 10:57:02 -06:00
mrq  59bf6b8b33  exposed additional task (ns, sr, vc) (vc is experimental)  2024-12-20 11:15:29 -06:00
mrq  ed152f78df  tweaks to prompt duration to allow me to divorce how I use it for training from how I'm using it for the demo page, and demo page tweaks to make my life easier  2024-12-17 19:33:04 -06:00
mrq  c2e17e287b  really shoddy voice conversion implementation (it sort of works...)  2024-12-16 22:54:53 -06:00
mrq  cddf8ca814  sort batches to try and reduce number of padded tokens in batched inference (also commented out F5 samples getting added to the demo page because I would have to regenerate them)  2024-12-11 22:45:38 -06:00
mrq  20b87bfbd0  store metrics and only recalculate them if the output file is newer than the metrics file  2024-12-11 20:55:43 -06:00
mrq  b81a98799b  uplifting transformer's WavLM stuff to do speaker verification instead  2024-12-11 19:30:05 -06:00
mrq  8568a93dad  added WER/SIM-O metrics, added APOLLO but I need to test it  2024-12-10 20:13:21 -06:00
mrq  1d460b9fe3  logic fixes, I feel like output is better? (also NAR can have a temperature, I imagine it couldn't because it was having a causal mask passed to it for the longest time before I caught it a month ago)  2024-12-08 14:52:47 -06:00
mrq  0c5a458b00  deduce language per line as a cheap way to allow for cross-lingual switching, kinda  2024-12-07 22:57:29 -06:00
mrq  a032ff588f  doc update, added automatically deducing language from a given text, also checks if the input is already phonemized text to allow direct control without being cringe (procrastinating adding WER/SIM-O)  2024-12-07 22:34:25 -06:00
mrq  5d80a2d0d4  fixed NAR-len issues with non-english maybe (langs weren't being passed), added interface to inference in batches through tts.batched_inference (no support for rolling context/prefixes because there's no way to do that), demo page uses batched inferencing now  2024-12-07 19:21:05 -06:00
mrq  42fafbaaca  actually fixed knowledge distillation because of errant -inf logits causing problems and needing to be filtered (and splitting text language / output audio language because it helps)  2024-12-06 21:55:20 -06:00
mrq  93d27be539  rolling context finally (use last N utterances as the prefix for the next gen), option to split input text prompt by sentences instead of lines (or no splitting)  2024-12-04 20:31:44 -06:00
mrq  6845c447c9  added more harvard sentences to load from a text file  2024-11-21 13:18:11 -06:00
mrq  2a084544e8  moved duration padding for NAR-len to be a scalar instead (since it seems longer utterances need it much more so than shorter utterances)  2024-11-21 13:04:07 -06:00
mrq  6aee08f9c0  moved stuff in the web UI around (un-experimented the max NAR-len steps because its kind of important to adjust this value for better sounding audio / quicker generated audio)  2024-11-20 20:37:33 -06:00
mrq  67f7bad168  added mixed modality AR+NAR-len to generate a short prefix through the AR, then inference with said prefix through the NAR-len (need to experiment with it more to ensure that the masked off tokens are the only tokens getting updated)  2024-11-20 14:22:12 -06:00
mrq  b1369e7824  better modality selection (pick AR+NAR by default for the ar+nar model, pick NAR-len by default for the nar-len model), lowered default CFG because it makes the AR+NAR output sped up (but can't be too low since it's required for the NAR-len)  2024-11-19 18:51:17 -06:00
mrq  5ba80686e1  two weeks of agony concludes  2024-11-18 21:29:28 -06:00
mrq  6cfdf94bf9  swap priority to use nar-len if available, added notes  2024-11-18 09:40:04 -06:00
mrq  39096f8ff3  redid loss calculation to be cleaner, and position ID generation, and other things (I might need to train the NAR-len from scratch and not resume from an existing checkpoint.........)  2024-11-14 22:17:47 -06:00
mrq  2f56696506  overhauled inference/sampler kwargs to stop being a bloated mess  2024-11-11 20:21:16 -06:00
mrq  9cb0b6901b  unified nar.py into ar_nar.py  2024-11-10 12:19:48 -06:00
mrq  a9d2faf2d7  all I can do now until I wait for the model to (re)train for pure NAR  2024-11-09 22:57:34 -06:00
mrq  77ff23e319  repeat extend the prom to fill the initial tokens for nar-len (it somewhat works, the model just needs to train more)  2024-11-06 23:29:53 -06:00
mrq  d229725c76  more adjustments (adjustments of early-exit entropy/varentropy thresholds, default rep pen being 1.5, experimental refine-on-stop, etc.)  2024-11-03 18:31:28 -06:00
mrq  aee08b7307  changed layerskip float16 training warning (since it didnt seem to fry on my 4xV100 system)  2024-11-03 09:58:29 -06:00
mrq  ec79230965  shuffled web UI options hidden by cfg.experimental to its own tab, expose early exit selection to inferencing (it kinda works naively, still need to implement self-speculation)  2024-11-01 21:30:06 -05:00
mrq  4049f51ba9  added option to load lora directly from the model file itself with --lora  2024-10-26 00:13:10 -05:00
mrq  ccf71dc1b6  added option to load from a model state dict directly instead of a yaml (to-do: do this for LoRAs too), automatically download the default model if none is provided  2024-10-25 22:15:15 -05:00
mrq  71731ed785  added prefixing with silence (was to test something, currently hidden under cfg.experimental=True)  2024-10-18 17:19:52 -05:00
mrq  6b04c13c56  print warning if audio promptless inferencing with low AR temp (it really doesn't like low temps / greedy sampling)  2024-10-18 17:01:40 -05:00
mrq  c8f31db1de  default to greedy sample AR (i should probably test this more but it seems to pass my harvard sentences and tongue twisters)  2024-10-18 16:58:56 -05:00
mrq  fc8dfd8617  made greedy AR sampling viable (and preferable), with caveats (per comment in vall_e.models.ar_nar)  2024-10-18 16:55:00 -05:00
mrq  8b6095f681  saner defaults, maybe  2024-10-17 14:37:21 -05:00
mrq  48461833c2  ugh  2024-10-15 19:30:43 -05:00
mrq  eea70f5698  kludge fix for an oversight in the model when trying to train for longer input prompt durations......  2024-10-15 19:25:03 -05:00
mrq  04e983b86b  modified demo page to be more modular with demoing comparisons, actually provide a path to use modified naive attention, entropix sampling is not tied to an experimental yaml flag now  2024-10-12 11:27:55 -05:00
mrq  d0ab7d755a  added min-p (really does not seem useful since it's very sensitive), more tweaks to entropix  2024-10-11 22:36:06 -05:00
mrq  75a4c866d6  more demo page tweaks, added arg to force enable/disable LoRAs for inferencing (to-do: setup arg flags to handle this, and checkbox in web UI)  2024-10-10 19:04:12 -05:00
mrq  2ea978f318  added --eval-random-text-prompts to use random text prompts for eval pass, added --random-prompts for demo page and --lora to use a sample with the lora disabled, probably finally fixed validation dataloader breaking on eval  2024-10-10 13:40:25 -05:00
mrq  4a8e3ccf06  README tweaks, added --input-prompt-prefix as an experiment (its literally better to just not do this, but i'll retain it in case i have a revelation on how to improve it)  2024-10-04 18:57:19 -05:00
mrq  4f3c7a37c8  also do text similarities (dont know what use I'll have for this)  2024-09-10 16:45:59 -05:00
mrq  1c615a0f52  helper script (vall_e.emb.similar) to figure out the best way to compute similarity scores for audio (iunno how to go about it desu)  2024-09-10 16:34:23 -05:00
mrq  54203c059d  validated rep pen for STT (sometimes needed to wrangle the model)  2024-09-08 08:30:30 -05:00
mrq  a6ad0577b8  cleanup the resultant text from STT  2024-09-06 18:44:25 -05:00
mrq  4bd9bb39c8  webui for STT (still need to bake the model to handle it better, a few hours so far has it generating what looks like a normal transcription but it does not correlate to the audio right now)  2024-09-06 15:13:04 -05:00
mrq  94cf81d38c  tweak  2024-09-05 23:21:18 -05:00
mrq  32287710a2  moved prints to use logger, edited readme (fused_attn doesnt seem stable for training)  2024-08-29 13:27:16 -05:00