Commit Graph

550 Commits

Author SHA1 Message Date
mrq
4049f51ba9 added option to load lora directly from the model file itself with --lora 2024-10-26 00:13:10 -05:00
mrq
ccf71dc1b6 added option to load from a model state dict directly instead of a yaml (to-do: do this for LoRAs too), automatically download the default model if none is provided 2024-10-25 22:15:15 -05:00
mrq
a96f5aee32 adjusted how i want to pass eval kwargs 2024-10-25 20:38:09 -05:00
mrq
92e6bff6dc actually ar temp 0.5 with rep pen 1.125 seems to have the benefits of better outputs without it degrading some of the time but not all the time 2024-10-23 00:03:35 -05:00
mrq
8920e5e86b actually have beam_width in the webUI work 2024-10-22 22:06:22 -05:00
mrq
910571ad34 too brainlet to diagnose why low temp / greedy sampling is randomly unstable some of the time 2024-10-22 20:13:54 -05:00
mrq
8eb9a4056b modified default arguments (ar temp = 0 and rep pen = 1.125 seems to be stable, at least given the few things i tested), do not pass top k/top p/min p to NAR even though technically none of those things should matter when greedy sampling 2024-10-22 18:12:39 -05:00
mrq
1a02cd5bce modify demo template to say F5 instead of YourTTS, swap LoRA comparison around to make the lora'd the base file, and the no-lora the suffix'd file 2024-10-21 19:52:02 -05:00
mrq
02dfc60ac3 ugh 2024-10-18 17:23:22 -05:00
mrq
71731ed785 added prefixing with silence (was to test something, currently hidden under cfg.experimental=True) 2024-10-18 17:19:52 -05:00
mrq
6b04c13c56 print warning if audio promtpless inferencing with low AR temp (it really doesn't like low temps / greedy sampling) 2024-10-18 17:01:40 -05:00
mrq
c8f31db1de default to greedy sample AR (i should probably test this more but it seems to pass my harvard sentences and tongue twisters) 2024-10-18 16:58:56 -05:00
mrq
fc8dfd8617 made greedy AR sampling viable (and preferable), with caveats (per comment in vall_e.models.ar_nar) 2024-10-18 16:55:00 -05:00
mrq
07f4935a75 more tweaks 2024-10-18 13:19:36 -05:00
mrq
0dfab973e7 oops 2024-10-18 09:40:06 -05:00
mrq
75b90be325 cleaned up unused config flags, allow less strict yaml by pruning missing keys, renamed some dataset configs to be more unified 2024-10-17 17:06:48 -05:00
mrq
8b6095f681 saner defaults, maybe 2024-10-17 14:37:21 -05:00
mrq
f88097ccf6 add config option to set the rate of sampling randomly vs similar speakers during training 2024-10-16 14:27:58 -05:00
mrq
48461833c2 ugh 2024-10-15 19:30:43 -05:00
mrq
eea70f5698 kludge fix for an oversight in the model when trying to train for longer input prompt durations...... 2024-10-15 19:25:03 -05:00
mrq
84005c5b00 entropix apparently processes the entire sequence of logits but it falls apart when doing that 2024-10-13 12:01:12 -05:00
mrq
c800d28bb8 respect attention defined in the yaml for web UI (which might explain why theres been a discrepancy in outputs for me) 2024-10-13 11:02:24 -05:00
mrq
ed6b7a690f ugh......... 2024-10-13 00:26:46 -05:00
mrq
d405f243d4 at wits end in trying to output the right attention scores 2024-10-12 23:53:13 -05:00
mrq
70cf694cfd output attention scores for SDPA/flash, since naive attention seems broken 2024-10-12 12:09:17 -05:00
mrq
541e45263c ugh 2024-10-12 11:29:16 -05:00
mrq
04e983b86b modified demo page to be more modular with demoing comparisons, actually provide a path to use modified naive attention, entropix sampling is not tied to an experimental yaml flag now 2024-10-12 11:27:55 -05:00
mrq
666e8038fb ugh 2024-10-12 10:41:35 -05:00
mrq
3d6ef9666b overridden naive llama attention to get the right score values that entropix needs 2024-10-12 10:05:47 -05:00
mrq
40b089daf3 lol 2024-10-12 09:57:34 -05:00
mrq
d6f7c86a5c entropix tweaks (it doesn't output garbage but it loves to go for silence) 2024-10-12 09:46:18 -05:00
mrq
d0ab7d755a added min-p (really does not seem useful since it's very sensitive), more tweaks to entropix 2024-10-11 22:36:06 -05:00
mrq
bef43a0c18 added experimental entropix sampling support 2024-10-11 21:18:26 -05:00
mrq
85d85c1351 more arg creep for demo page 2024-10-10 19:40:01 -05:00
mrq
301468f519 << 2024-10-10 19:13:52 -05:00
mrq
75a4c866d6 more demo page tweaks, added arg to force enable/disable LoRAs for inferencing (to-do: setup arg flags to handle this, and checkbox in web UI) 2024-10-10 19:04:12 -05:00
mrq
96d05be73c demo page tweaks 2024-10-10 13:52:37 -05:00
mrq
2ea978f318 added --eval-random-text-prompts to use random text prompts for eval pass, added --random-prompts for demo page and --lora to use a sample with the lora disabled, probably finally fixed validation dataloader breaking on eval 2024-10-10 13:40:25 -05:00
mrq
52299127ab fix vall_e.emb.process 2024-10-08 20:00:34 -05:00
mrq
0656a762af fix vall_e.emb.transcriber 2024-10-08 19:24:43 -05:00
mrq
acdce66d4e readme tweaks, set the (unused) default model download URL back to the base ar+nar-llama-8 model, as ar+nar-tts+stt-llama-8 was renamed back to it since it performs well 2024-10-05 22:53:53 -05:00
mrq
84c7419001 faster 2024-10-04 22:30:47 -05:00
mrq
a507b769a1 sped up inferencing by not doing .tolist() for rep pen / length pen (and a bug fix in the web UI from prev commit) 2024-10-04 22:18:20 -05:00
mrq
4a8e3ccf06 README tweaks, added --input-prompt-prefix as an experiment (its literally better to just not do this, but i'll retain it in case i have a revelation on how to improve it) 2024-10-04 18:57:19 -05:00
mrq
a9fa0898a9 tweaked demo page script to sample speakers instead 2024-09-28 10:50:26 -05:00
mrq
2f1dca3089 added language selection in web UI, tweaked demo script 2024-09-28 09:49:45 -05:00
mrq
10df2ef5f3 fixed oversight where input audio does not resample (lol...) 2024-09-27 20:27:53 -05:00
mrq
039482a48e don't do eval on stt because it's so slow and I don't even bother doing any metrics against it anyways (to-do: make this a flag) 2024-09-26 18:56:57 -05:00
mrq
ff7a1b4163 coerce into path for other sampler_types (it's required for sampling for similar utterances) 2024-09-26 18:37:56 -05:00
mrq
f24547ad4e add top_k sampling / offset for prompt similar utterance sampling 2024-09-26 16:26:40 -05:00
mrq
9da630f73a swap order of demo entries, as the model prioritizes adhering to the speaker prompt more (instead of trying to match the ground truth magically) 2024-09-25 23:31:24 -05:00
mrq
e84d466261 vall_e.plot tweaks 2024-09-24 20:05:10 -05:00
mrq
c5e9142863 added option to retokenize phonemes for hdf5 (to save having to remake my hdf5 file) 2024-09-21 13:08:01 -05:00
mrq
536c11c4ac actually validated and fixed sampling similar utterances for the prompt (hopefully nothing else is needed) 2024-09-21 12:59:51 -05:00
mrq
d31f27119a regex replace out the (lang) markers in espeak, updated tokenizer vocab as lazily as possible to not have unk tokens 2024-09-21 12:29:28 -05:00
mrq
769f67dcfe actually fix validation of phonemes in the symmap 2024-09-21 12:19:34 -05:00
mrq
c8d4716a9f ugh 2024-09-18 21:40:57 -05:00
mrq
fe241f6a99 support for wildcard in training/validation/noise dataset array (to-do: a better way to query between metadata folder and data folder) 2024-09-18 21:34:43 -05:00
mrq
b5bec0c9ce oops, turns out these are not split by speaker names already........ (also added sampling the dataset in the webui for easy viewing) 2024-09-18 20:19:46 -05:00
mrq
fa9d3f6c06 lang fixes / reworked phoneme symmap validation 2024-09-18 19:36:03 -05:00
mrq
84647f588a more tweaks 2024-09-18 16:43:57 -05:00
mrq
ebac1db16c maybe final tweaks, I really needed to unify my json read/write and orjson is proven to be fast enough for me to try and rely on it more 2024-09-17 22:57:04 -05:00
mrq
6ceed866b5 *faster* 2024-09-17 22:44:36 -05:00
mrq
f00283440c faster 2024-09-17 22:26:31 -05:00
mrq
be22b65300 solved my problem 2024-09-17 21:58:44 -05:00
mrq
8f41d1b324 more tweaks 2024-09-17 16:26:30 -05:00
mrq
804ddb5182 optimizations (6 hours to do cosine similarities on a speaker set of just 17k utterances................) 2024-09-17 15:51:45 -05:00
mrq
a9fbe81f98 oops 2024-09-17 15:25:12 -05:00
mrq
c440c4fe7e relegated processing similarity data into vall_e.emb.similarity since it's easier, seems to work? 2024-09-17 14:37:21 -05:00
mrq
56f25f7a9b more stuff for similar-speaker prompt sampling (to-do: actually test if this works...) 2024-09-16 23:10:29 -05:00
mrq
69f140ba45 fix oversight with phonemizing french because espeak defines french as fr-fr instead of fr (even though spain spanish is es and not es-sp or some shit, but portugal portuguese is pt-pt) 2024-09-13 12:53:36 -05:00
mrq
4f3c7a37c8 also do text similarities (dont know what use I'll have for this) 2024-09-10 16:45:59 -05:00
mrq
1c615a0f52 helper script (vall_e.emb.similar) to figure out the best way to compute similarity scores for audio (iunno how to go about it desu) 2024-09-10 16:34:23 -05:00
mrq
d059f6f56d added helper script to process Emilia (amphion/Emilia-Dataset), clean up espeak phonemes for non-English transcriptions with English words (because for some reason espeak injects (en){word}(lang) markers and it's annoying) 2024-09-09 09:57:32 -05:00
mrq
31e8b7edb8 tweaks and fixes for lora stuffs 2024-09-08 18:05:21 -05:00
mrq
54203c059d validated rep pen for STT (sometimes needed to wrangle the model) 2024-09-08 08:30:30 -05:00
mrq
6a967f91b9 oops 2024-09-07 22:13:49 -05:00
mrq
5d66a7db52 webui cleanup, more tweaks, default to safetensors in config 2024-09-07 21:45:05 -05:00
mrq
a6ad0577b8 cleanup the resultant text from STT 2024-09-06 18:44:25 -05:00
mrq
fa93061b3e more fixes, moved sampler state dict to a better place, eval works again 2024-09-06 16:59:56 -05:00
mrq
4bd9bb39c8 webui for STT (still need to bake the model to handle it better, a few hours so far has it generate what looks like a normal transcription but does not correlate to the audio right now) 2024-09-06 15:13:04 -05:00
mrq
d33a906119 cleanup for AR_NAR inferencing to allow both TTS and STT tasks simultaneously (need to have training eval do this to though) 2024-09-06 14:30:12 -05:00
mrq
341e19162b fixes, again 2024-09-06 11:41:41 -05:00
mrq
94cf81d38c tweak 2024-09-05 23:21:18 -05:00
mrq
413097f5f7 fixes 2024-09-05 21:42:59 -05:00
mrq
54547b74d8 experimental implementation of STT (need to actually test on a model, test trainer seems to work) 2024-09-05 20:43:20 -05:00
mrq
d319d33368 haha 2024-09-04 14:52:26 -05:00
mrq
619369236b ugh 2024-08-30 21:10:57 -05:00
mrq
168e203942 ugh 2024-08-30 14:39:07 -05:00
mrq
685f4faec0 ugh 2024-08-30 10:46:26 -05:00
mrq
32287710a2 moved prints to use logger, edited readme (fused_attn doesnt seem stable for training) 2024-08-29 13:27:16 -05:00
mrq
d423bc03c2 fixed attentions for MoE 2024-08-27 17:02:42 -05:00
mrq
b7b99a25f1 added ability to specify attention backend for CLI and webui (because im tired of editing the yaml) 2024-08-26 19:33:51 -05:00
mrq
0d706ec6a1 added fused_attn (triton-based fused attention) and simply just query for flash_attn under rocm 2024-08-26 19:13:34 -05:00
mrq
6b0891448c pain (some shit to try and get some flash attention for ROCm (gfx1100) through triton fused attention but no good) 2024-08-25 20:07:27 -05:00
mrq
40e1799adc fixed xformers and flash_attn to actually work now 2024-08-19 01:03:35 -05:00
mrq
29c35528e5 the sooner I accept there's no FA for V100s the sooner I'll go to bed 2024-08-18 23:54:33 -05:00
mrq
d636edd3a2 added flash_attn LlamaAttention (including flash_attn==1.0.9) 2024-08-18 20:51:14 -05:00
mrq
054d28573a my DAC dataset again managed to only have some utterances with only 8 of 9 RVQ levels, this fixes an oversight from it 2024-08-09 21:18:01 -05:00
mrq
2a1794c084 ughghghhhh 2024-08-09 21:15:01 -05:00