Commit Graph

96 Commits

Author SHA1 Message Date
mrq
cddf8ca814 sort batches to try and reduce number of padded tokens in batched inference (also commented out F5 samples getting added to the demo page because I would have to regenerate them) 2024-12-11 22:45:38 -06:00
mrq
20b87bfbd0 store metrics and only recalculate them if the output file is newer than the metrics file 2024-12-11 20:55:43 -06:00
mrq
b81a98799b uplifting transformer's WavLM stuff to do speaker verification instead 2024-12-11 19:30:05 -06:00
mrq
8568a93dad added WER/SIM-O metrics, added APOLLO but I need to test it 2024-12-10 20:13:21 -06:00
mrq
1d460b9fe3 logic fixes, I feel like output is better? (also NAR can have a temperature, I imagine it couldn't because it was having a causal mask passed to it for the longest time before I caught it a month ago) 2024-12-08 14:52:47 -06:00
mrq
0c5a458b00 deduce language per line to allow for a cheap way to allow for cross-lingual switching, kinda 2024-12-07 22:57:29 -06:00
mrq
a032ff588f doc update, added automatically deducing language from a given text, also checks if the input is already phonemized text to allow direct control without being cringe (procrastinating adding WER/SIM-O) 2024-12-07 22:34:25 -06:00
mrq
5d80a2d0d4 fixed NAR-len issues with non-english maybe (langs weren't being passed), added interface to inference in batches through tts.batched_inference (no support for rolling context/prefixes because there's no way to do that), demo page uses batched inferencing now 2024-12-07 19:21:05 -06:00
mrq
42fafbaaca actually fixed knowledge distillation because of errant -inf logits causing problems and needed to be filtered (and splitting text language / output audio language because it helps) 2024-12-06 21:55:20 -06:00
mrq
93d27be539 rolling context finally (use last N utterances as the prefix for the next gen), option to split input text prompt by sentences instead of lines (or no splitting) 2024-12-04 20:31:44 -06:00
mrq
6845c447c9 added more harvard sentences to load from a text file 2024-11-21 13:18:11 -06:00
mrq
2a084544e8 moved duration padding for NAR-len to be a scalar instead (since it seems longer utterances need it much more so than shorter utterances) 2024-11-21 13:04:07 -06:00
mrq
6aee08f9c0 moved stuff in the web UI around (un-experimented the max NAR-len steps because it's kind of important to adjust this value for better sounding audio / quicker generated audio) 2024-11-20 20:37:33 -06:00
mrq
67f7bad168 added mixed modality AR+NAR-len to generate a short prefix through the AR, then inference with said prefix through the NAR-len (need to experiment with it more to ensure that the masked off tokens are the only tokens getting updated) 2024-11-20 14:22:12 -06:00
mrq
b1369e7824 better modality selection (pick AR+NAR by default for the ar+nar model, pick NAR-len by default for the nar-len model), lowered default CFG because it makes the AR+NAR output sped up (but can't be too low since it's required for the NAR-len) 2024-11-19 18:51:17 -06:00
mrq
5ba80686e1 two weeks of agony concludes 2024-11-18 21:29:28 -06:00
mrq
6cfdf94bf9 swap priority to use nar-len if available, added notes 2024-11-18 09:40:04 -06:00
mrq
39096f8ff3 redid loss calculation to be cleaner, and position ID generation, and other things (I might need to train the NAR-len from scratch and not resume from an existing checkpoint.........) 2024-11-14 22:17:47 -06:00
mrq
2f56696506 overhauled inference/sampler kwargs to stop being a bloated mess 2024-11-11 20:21:16 -06:00
mrq
9cb0b6901b unified nar.py into ar_nar.py 2024-11-10 12:19:48 -06:00
mrq
a9d2faf2d7 all I can do now until I wait for the model to (re)train for pure NAR 2024-11-09 22:57:34 -06:00
mrq
77ff23e319 repeat extend the prom to fill the initial tokens for nar-len (it somewhat works, the model just needs to train more) 2024-11-06 23:29:53 -06:00
mrq
d229725c76 more adjustments (adjustments of early-exit entropy/varentropy thresholds, default rep pen being 1.5, experimental refine-on-stop, etc.) 2024-11-03 18:31:28 -06:00
mrq
aee08b7307 changed layerskip float16 training warning (since it didnt seem to fry on my 4xV100 system) 2024-11-03 09:58:29 -06:00
mrq
ec79230965 shuffled web UI options hidden by cfg.experimental to its own tab, expose early exit selection to inferencing (it kinda works naively, still need to implement self-speculation) 2024-11-01 21:30:06 -05:00
mrq
4049f51ba9 added option to load lora directly from the model file itself with --lora 2024-10-26 00:13:10 -05:00
mrq
ccf71dc1b6 added option to load from a model state dict directly instead of a yaml (to-do: do this for LoRAs too), automatically download the default model if none is provided 2024-10-25 22:15:15 -05:00
mrq
71731ed785 added prefixing with silence (was to test something, currently hidden under cfg.experimental=True) 2024-10-18 17:19:52 -05:00
mrq
6b04c13c56 print warning if audio promptless inferencing with low AR temp (it really doesn't like low temps / greedy sampling) 2024-10-18 17:01:40 -05:00
mrq
c8f31db1de default to greedy sample AR (i should probably test this more but it seems to pass my harvard sentences and tongue twisters) 2024-10-18 16:58:56 -05:00
mrq
fc8dfd8617 made greedy AR sampling viable (and preferable), with caveats (per comment in vall_e.models.ar_nar) 2024-10-18 16:55:00 -05:00
mrq
8b6095f681 saner defaults, maybe 2024-10-17 14:37:21 -05:00
mrq
48461833c2 ugh 2024-10-15 19:30:43 -05:00
mrq
eea70f5698 kludge fix for an oversight in the model when trying to train for longer input prompt durations...... 2024-10-15 19:25:03 -05:00
mrq
04e983b86b modified demo page to be more modular with demoing comparisons, actually provide a path to use modified naive attention, entropix sampling is not tied to an experimental yaml flag now 2024-10-12 11:27:55 -05:00
mrq
d0ab7d755a added min-p (really does not seem useful since it's very sensitive), more tweaks to entropix 2024-10-11 22:36:06 -05:00
mrq
75a4c866d6 more demo page tweaks, added arg to force enable/disable LoRAs for inferencing (to-do: setup arg flags to handle this, and checkbox in web UI) 2024-10-10 19:04:12 -05:00
mrq
2ea978f318 added --eval-random-text-prompts to use random text prompts for eval pass, added --random-prompts for demo page and --lora to use a sample with the lora disabled, probably finally fixed validation dataloader breaking on eval 2024-10-10 13:40:25 -05:00
mrq
4a8e3ccf06 README tweaks, added --input-prompt-prefix as an experiment (its literally better to just not do this, but i'll retain it in case i have a revelation on how to improve it) 2024-10-04 18:57:19 -05:00
mrq
4f3c7a37c8 also do text similarities (dont know what use I'll have for this) 2024-09-10 16:45:59 -05:00
mrq
1c615a0f52 helper script (vall_e.emb.similar) to figure out the best way to compute similarity scores for audio (iunno how to go about it desu) 2024-09-10 16:34:23 -05:00
mrq
54203c059d validated rep pen for STT (sometimes needed to wrangle the model) 2024-09-08 08:30:30 -05:00
mrq
a6ad0577b8 cleanup the resultant text from STT 2024-09-06 18:44:25 -05:00
mrq
4bd9bb39c8 webui for STT (still need to bake the model to handle it better, a few hours so far has it generate what looks like a normal transcription but does not correlate to the audio right now) 2024-09-06 15:13:04 -05:00
mrq
94cf81d38c tweak 2024-09-05 23:21:18 -05:00
mrq
32287710a2 moved prints to use logger, edited readme (fused_attn doesnt seem stable for training) 2024-08-29 13:27:16 -05:00
mrq
b7b99a25f1 added ability to specify attention backend for CLI and webui (because im tired of editing the yaml) 2024-08-26 19:33:51 -05:00
mrq
d7c6be6f78 fix weird regression in handling checkpoints when backend is local, but deepspeed checkpoints are in (it was handled with LoRA loading but not real loading...) 2024-07-30 22:15:56 -05:00
mrq
c2f5b916fc added what I think is DRY sampling 2024-07-29 19:15:07 -05:00
mrq
75b04686f8 added prom-less training / inferencing, some other things 2024-07-22 19:36:07 -05:00