Commit Graph

583 Commits

Author SHA1 Message Date
mrq
f88097ccf6 add config option to set the rate of sampling randomly vs similar speakers during training 2024-10-16 14:27:58 -05:00
mrq
48461833c2 ugh 2024-10-15 19:30:43 -05:00
mrq
eea70f5698 kludge fix for an oversight in the model when trying to train for longer input prompt durations...... 2024-10-15 19:25:03 -05:00
mrq
84005c5b00 entropix apparently processes the entire sequence of logits but it falls apart when doing that 2024-10-13 12:01:12 -05:00
mrq
c800d28bb8 respect attention defined in the yaml for web UI (which might explain why theres been a discrepancy in outputs for me) 2024-10-13 11:02:24 -05:00
mrq
ed6b7a690f ugh......... 2024-10-13 00:26:46 -05:00
mrq
d405f243d4 at wits end in trying to output the right attention scores 2024-10-12 23:53:13 -05:00
mrq
70cf694cfd output attention scores for SDPA/flash, since naive attention seems broken 2024-10-12 12:09:17 -05:00
mrq
541e45263c ugh 2024-10-12 11:29:16 -05:00
mrq
04e983b86b modified demo page to be more modular with demoing comparisons, actually provide a path to use modified naive attention, entropix sampling is not tied to an experimental yaml flag now 2024-10-12 11:27:55 -05:00
mrq
666e8038fb ugh 2024-10-12 10:41:35 -05:00
mrq
3d6ef9666b overridden naive llama attention to get the right score values that entropix needs 2024-10-12 10:05:47 -05:00
mrq
40b089daf3 lol 2024-10-12 09:57:34 -05:00
mrq
d6f7c86a5c entropix tweaks (it doesn't output garbage but it loves to go for silence) 2024-10-12 09:46:18 -05:00
mrq
d0ab7d755a added min-p (really does not seem useful since it's very sensitive), more tweaks to entropix 2024-10-11 22:36:06 -05:00
mrq
bef43a0c18 added experimental entropix sampling support 2024-10-11 21:18:26 -05:00
mrq
85d85c1351 more arg creep for demo page 2024-10-10 19:40:01 -05:00
mrq
301468f519 << 2024-10-10 19:13:52 -05:00
mrq
75a4c866d6 more demo page tweaks, added arg to force enable/disable LoRAs for inferencing (to-do: setup arg flags to handle this, and checkbox in web UI) 2024-10-10 19:04:12 -05:00
mrq
96d05be73c demo page tweaks 2024-10-10 13:52:37 -05:00
mrq
2ea978f318 added --eval-random-text-prompts to use random text prompts for eval pass, added --random-prompts for demo page and --lora to use a sample with the lora disabled, probably finally fixed validation dataloader breaking on eval 2024-10-10 13:40:25 -05:00
mrq
52299127ab fix vall_e.emb.process 2024-10-08 20:00:34 -05:00
mrq
0656a762af fix vall_e.emb.transcriber 2024-10-08 19:24:43 -05:00
mrq
acdce66d4e readme tweaks, set the (unused) default model download URL back to the base ar+nar-llama-8 model, as ar+nar-tts+stt-llama-8 was renamed back to it since it performs well 2024-10-05 22:53:53 -05:00
mrq
84c7419001 faster 2024-10-04 22:30:47 -05:00
mrq
a507b769a1 sped up inferencing by not doing .tolist() for rep pen / length pen (and a bug fix in the web UI from prev commit) 2024-10-04 22:18:20 -05:00
mrq
4a8e3ccf06 README tweaks, added --input-prompt-prefix as an experiment (its literally better to just not do this, but i'll retain it in case i have a revelation on how to improve it) 2024-10-04 18:57:19 -05:00
mrq
a9fa0898a9 tweaked demo page script to sample speakers instead 2024-09-28 10:50:26 -05:00
mrq
2f1dca3089 added language selection in web UI, tweaked demo script 2024-09-28 09:49:45 -05:00
mrq
10df2ef5f3 fixed oversight where input audio does not resample (lol...) 2024-09-27 20:27:53 -05:00
mrq
039482a48e don't do eval on stt because it's so slow and I don't even bother doing any metrics against it anyways (to-do: make this a flag) 2024-09-26 18:56:57 -05:00
mrq
ff7a1b4163 coerce into path for other sampler_types (it's required for sampling for similar utterances) 2024-09-26 18:37:56 -05:00
mrq
f24547ad4e add top_k sampling / offset for prompt similar utterance sampling 2024-09-26 16:26:40 -05:00
mrq
9da630f73a swap order of demo entries, as the model prioritizes adhering to the speaker prompt more (instead of trying to match the ground truth magically) 2024-09-25 23:31:24 -05:00
mrq
e84d466261 vall_e.plot tweaks 2024-09-24 20:05:10 -05:00
mrq
c5e9142863 added option to retokenize phonemes for hdf5 (to save having to remake my hdf5 file) 2024-09-21 13:08:01 -05:00
mrq
536c11c4ac actually validated and fixed sampling similar utterances for the prompt (hopefully nothing else is needed) 2024-09-21 12:59:51 -05:00
mrq
d31f27119a regex replace out the (lang) markers in espeak, updated tokenizer vocab as lazily as possible to not have unk tokens 2024-09-21 12:29:28 -05:00
mrq
769f67dcfe actually fix validation of phonemes in the symmap 2024-09-21 12:19:34 -05:00
mrq
c8d4716a9f ugh 2024-09-18 21:40:57 -05:00
mrq
fe241f6a99 support for wildcard in training/validation/noise dataset array (to-do: a better way to query between metadata folder and data folder) 2024-09-18 21:34:43 -05:00
mrq
b5bec0c9ce oops, turns out these are not split by speaker names already........ (also added sampling the dataset in the webui for easy viewing) 2024-09-18 20:19:46 -05:00
mrq
fa9d3f6c06 lang fixes / reworked phoneme symmap validation 2024-09-18 19:36:03 -05:00
mrq
84647f588a more tweaks 2024-09-18 16:43:57 -05:00
mrq
ebac1db16c maybe final tweaks, I really needed to unify my json read/write and orjson is proven to be fast enough for me to try and rely on it more 2024-09-17 22:57:04 -05:00
mrq
6ceed866b5 *faster* 2024-09-17 22:44:36 -05:00
mrq
f00283440c faster 2024-09-17 22:26:31 -05:00
mrq
be22b65300 solved my problem 2024-09-17 21:58:44 -05:00
mrq
8f41d1b324 more tweaks 2024-09-17 16:26:30 -05:00
mrq
804ddb5182 optimizations (6 hours to do cosine similarities on a speaker set of just 17k utterances................) 2024-09-17 15:51:45 -05:00
mrq
a9fbe81f98 oops 2024-09-17 15:25:12 -05:00
mrq
c440c4fe7e relegated processing similarity data into vall_e.emb.similarity since it's easier, seems to work? 2024-09-17 14:37:21 -05:00
mrq
56f25f7a9b more stuff for similar-speaker prompt sampling (to-do: actually test if this works...) 2024-09-16 23:10:29 -05:00
mrq
69f140ba45 fix oversight with phonemizing french because espeak defines french as fr-fr instead of fr (even though spain spanish is es and not es-sp or some shit, but portugal portuguese is pt-pt) 2024-09-13 12:53:36 -05:00
mrq
4f3c7a37c8 also do text similarities (dont know what use I'll have for this) 2024-09-10 16:45:59 -05:00
mrq
1c615a0f52 helper script (vall_e.emb.similar) to figure out the best way to compute similarity scores for audio (iunno how to go about it desu) 2024-09-10 16:34:23 -05:00
mrq
d059f6f56d added helper script to process Emilia (amphion/Emilia-Dataset), clean up espeak phonemes for non-English transcriptions with English words (because for some reason espeak injects (en){word}(lang) markers and it's annoying) 2024-09-09 09:57:32 -05:00
mrq
31e8b7edb8 tweaks and fixes for lora stuffs 2024-09-08 18:05:21 -05:00
mrq
54203c059d validated rep pen for STT (sometimes needed to wrangle the model) 2024-09-08 08:30:30 -05:00
mrq
6a967f91b9 oops 2024-09-07 22:13:49 -05:00
mrq
5d66a7db52 webui cleanup, more tweaks, default to safetensors in config 2024-09-07 21:45:05 -05:00
mrq
a6ad0577b8 cleanup the resultant text from STT 2024-09-06 18:44:25 -05:00
mrq
fa93061b3e more fixes, moved sampler state dict to a better place, eval works again 2024-09-06 16:59:56 -05:00
mrq
4bd9bb39c8 webui for STT (still need to bake the model to handle it better, a few hours so far has it generate what looks like a normal transcription but does not correlate to the audio right now) 2024-09-06 15:13:04 -05:00
mrq
d33a906119 cleanup for AR_NAR inferencing to allow both TTS and STT tasks simultaneously (need to have training eval do this to though) 2024-09-06 14:30:12 -05:00
mrq
341e19162b fixes, again 2024-09-06 11:41:41 -05:00
mrq
94cf81d38c tweak 2024-09-05 23:21:18 -05:00
mrq
413097f5f7 fixes 2024-09-05 21:42:59 -05:00
mrq
54547b74d8 experimental implementation of STT (need to actually test on a model, test trainer seems to work) 2024-09-05 20:43:20 -05:00
mrq
d319d33368 haha 2024-09-04 14:52:26 -05:00
mrq
619369236b ugh 2024-08-30 21:10:57 -05:00
mrq
168e203942 ugh 2024-08-30 14:39:07 -05:00
mrq
685f4faec0 ugh 2024-08-30 10:46:26 -05:00
mrq
32287710a2 moved prints to use logger, edited readme (fused_attn doesnt seem stable for training) 2024-08-29 13:27:16 -05:00
mrq
d423bc03c2 fixed attentions for MoE 2024-08-27 17:02:42 -05:00
mrq
b7b99a25f1 added ability to specify attention backend for CLI and webui (because im tired of editing the yaml) 2024-08-26 19:33:51 -05:00
mrq
0d706ec6a1 added fused_attn (triton-based fused attention) and simply just query for flash_attn under rocm 2024-08-26 19:13:34 -05:00
mrq
6b0891448c pain (some shit to try and get some flash attention for ROCm (gfx1100) through triton fused attention but no good) 2024-08-25 20:07:27 -05:00
mrq
40e1799adc fixed xformers and flash_attn to actually work now 2024-08-19 01:03:35 -05:00
mrq
29c35528e5 the sooner I accept there's no FA for V100s the sooner I'll go to bed 2024-08-18 23:54:33 -05:00
mrq
d636edd3a2 added flash_attn LlamaAttention (including flash_attn==1.0.9) 2024-08-18 20:51:14 -05:00
mrq
054d28573a my DAC dataset again managed to only have some utterances with only 8 of 9 RVQ levels, this fixes an oversight from it 2024-08-09 21:18:01 -05:00
mrq
2a1794c084 ughghghhhh 2024-08-09 21:15:01 -05:00
mrq
ed373957e2 maybe not 2024-08-09 11:38:08 -05:00
mrq
c658a7b440 make loss scaling opt-in rather than automatically determined (because it seems a DAC-based model really doesnt like loss scaling) 2024-08-09 10:51:36 -05:00
mrq
d04f6911b4 oops 2024-08-08 19:38:55 -05:00
mrq
0aa59e6f3f uncommented block that writes the metadata on HDF5 creation 2024-08-08 19:21:29 -05:00
mrq
79a6781c9e fix vall_e.data --action=hdf5 actually transcribing because past me completely forgot it tried to already put the transcribe/process dataset scripts inside the module before 2024-08-08 07:51:42 -05:00
mrq
949339a3fa do not include SDPA attention if there's no available SDPA backends 2024-08-06 20:42:39 -05:00
mrq
613024ec0d ugh 2024-08-06 20:35:15 -05:00
mrq
eac353cd0b busy work and cleanup while I wait for 1TB of audio to quantize... again. 2024-08-06 20:23:33 -05:00
mrq
f284c7ea9c do mixed-precision for AMP inside the compress function itself, because the loudness function gripes when using a float16 (non-power of 2 lengths) or bfloat16 (something about views for bfloat16) 2024-08-06 15:08:37 -05:00
mrq
b6ba2cc8e7 tweaked vall_e.emb.process to instead process audio one file at a time instead of all the files for a given speaker to avoid OOMing on less-memory-filled systems with --low-memory 2024-08-06 14:24:40 -05:00
mrq
9710b06b74 tweaks and things 2024-08-06 08:17:25 -05:00
mrq
134dac8c2b re-adapted process_libritts.py to a 'better' way (better because it processed without needing to shuffle a bunch of things and adapt to cope or something) 2024-08-05 20:34:58 -05:00
mrq
3f73fcca29 oops 2024-08-05 20:12:13 -05:00
mrq
597441e48b moved transcribe and process dataset scripts to vall_e/emb within the module itself, argparse-ified transcription script 2024-08-05 19:40:50 -05:00
mrq
7cdfa3dc0c updated process_datasets.py, added argparsing so I can mostly stop manually editing things, and some other cleanup 2024-08-05 15:59:25 -05:00
mrq
debcc93e7e add adapted MixtralAttention for when I make a bad decision to actually train a MoE 2024-08-04 22:03:22 -05:00
mrq
10aaf840e7 added export option to convert Llama to MixtralMoE for another dumb experiment 2024-08-04 20:25:06 -05:00
mrq
3a65cc4b22 fix issue with sft and shared tensors... 2024-08-04 19:56:21 -05:00
mrq
23f3b56fda oops 2024-08-04 08:18:57 -05:00
mrq
d19f93a2c0 documentation update 2024-08-04 00:14:49 -05:00
mrq
2cb465018b implicitly load either normal pickled weights or safetensors on loading the model 2024-08-03 23:34:18 -05:00
mrq
c09133d00f added safetensors support (with metadata) and feed whatever torch.load/torch.save into it 2024-08-03 23:15:20 -05:00
mrq
6a733eb2ed changed torch.Tensor().to(device, dtype) to just torch.tensor(..., device, dtype) because it's been bothering my autism that I'm creating tensors then converting rather than creating with the right device/dtype, some 'optimization' to compile the model but it doesnt seem to do anything useful 2024-08-03 22:10:21 -05:00
mrq
ab673e0426 add cap for NAR-len training, to avoid any weird cases in early training where it'll just mess up and generate long lengths 2024-08-03 21:00:32 -05:00
mrq
4d2b88b164 throw exception if training, but no model is set to train (because i ran into this wondering what the hell was happening) 2024-08-03 20:51:23 -05:00
mrq
d0a5c7eca2 more coping with the NAR len 2024-08-03 20:23:36 -05:00
mrq
11fa3da665 some cleanup, fixed the wrapper attention to explicitly use other sdpa backends 2024-08-03 19:51:00 -05:00
mrq
9564ecda43 wrapper attention class for other sdpa backends + xformers seems to have broke... 2024-08-03 15:12:11 -05:00
mrq
9e1989be1b tweaked initial NAR pass's initial token embeddings to use a different value, or osmething 2024-08-03 09:01:37 -05:00
mrq
26f74c5739 somehow fixed non-unified position IDs for the NAR-len 2024-08-03 08:43:42 -05:00
mrq
66407e5bdb tweaks for the NAR-len model, maybe 2024-08-03 08:40:39 -05:00
mrq
97c5241bef fixes, throw an exception when using NAR only model with non-unified position IDs, since for some reason it outputs garbage for the NAR 2024-08-02 22:25:49 -05:00
mrq
4456d3172b that's what I get for testing without hdf5 on my previous machine.... 2024-08-02 20:44:01 -05:00
mrq
7a77978096 oversight with using resize_modules 2024-08-02 20:28:49 -05:00
mrq
808a79ebaf oops 2024-08-01 22:56:04 -05:00
mrq
443422ecb5 ugh, finally got some form of offloading working (need to test if it works on different GPUs, but GPU and CPU offloading seems to work in the test trainer) 2024-08-01 22:43:39 -05:00
mrq
c9ec6b28ef it actually wasn't working because Engines.__init__() automatically moves the entire module to the requested device, which was being called after offloading the model in the test trainer (and it seems I cant do it without injecting a bunch of shit in modeling_llama.py) 2024-08-01 20:56:28 -05:00
mrq
b4c895114c naive model offloading support (handles automatically splitting parts of the model to requested device per memory constraints, either inferred or requested in the yaml, input tensors are automatically migrated to the right device, it SEEMS to work for training under the test trainer when split between GPU and CPU) (this was specifically only because that Flux imagegen model released so I can test it there) 2024-08-01 20:12:06 -05:00
mrq
387358bc8a fixes for the NAR-len model, and documentation some config options, and a better way to handle resizing modules on state_dict load 2024-07-31 20:35:09 -05:00
mrq
52d13b321f I rather have it default to non-strict loading instead so I can clean up YAMLs 2024-07-30 22:24:38 -05:00
mrq
d7c6be6f78 fix weird regression in handling checkpoints when backend is local, but deepspeed checkpoints are in (it was handled with LoRA loading but not real loading...) 2024-07-30 22:15:56 -05:00
mrq
07f8e2ad06 added option to set the causal size (how many tokens to sample per AR step), but requires the model to be trained for this (which explains why recurrent chunk sampling just doesn't work for the retnet tests, obvious in hindsight) 2024-07-30 20:53:51 -05:00
mrq
ebf848d249 possible speedup for samplers that require a list of previous tokens (the DRY sampler made me realize that I should copy the tolist() thing from the rep pen sampler for everything else) 2024-07-29 20:23:26 -05:00
mrq
55b0121b1a trying (and failing) to nail a weird regression in fancier attentions 2024-07-29 19:53:37 -05:00
mrq
c2f5b916fc added what I think is DRY sampling 2024-07-29 19:15:07 -05:00
mrq
ce8bb1e4f7 sanity cleanups with weird off-by-one-ness, cleaned up and validated vall_e.models.experimental works again 2024-07-27 15:36:05 -05:00
mrq
06e948aec1 suppress warning on exit about distributed not being cleaned up (because I updated my system) 2024-07-25 16:50:47 -05:00
mrq
682e4387dc oops (fixed proms being erased from a config oversight) 2024-07-25 12:39:57 -05:00
mrq
1acb0e9c84 added experimental training setting to perform token dropout to MAYBE compensate for errors from the preceding RVQ level (two types: token error offset, token dropout embedding replace) 2024-07-24 19:35:17 -05:00
mrq
611a1c4bdc might help 2024-07-22 20:57:01 -05:00
mrq
188d116222 some weird fixes for an equally weird regression with LoRA loading 2024-07-22 20:47:24 -05:00
mrq
e33c4b0cb1 oops 2024-07-22 19:38:39 -05:00
mrq
75b04686f8 added prom-less training / inferencing, some other things 2024-07-22 19:36:07 -05:00
mrq
491ae2a684 some insanity for sanity checks (some phonemes from phonemizing japanese are not in my tokenizer...) 2024-07-22 00:30:40 -05:00
mrq
ad024f400f actually pass language into dataset process script, fix coercing japanese into hiragana because espeak does not like kanji 2024-07-21 23:21:37 -05:00
mrq
3e5ca3a201 more demo page tweaks 2024-07-21 19:31:13 -05:00
mrq
7366f36f81 oops 2024-07-21 19:17:25 -05:00
mrq
e19aa643a6 cleaned up demo page creation, added option to pass in RVQ level sampling distribution for training 2024-07-21 19:12:03 -05:00
mrq
ba7ee8c0ee added demo link to readme 2024-07-19 21:22:30 -05:00
mrq
9ec88d9444 validated passing URI path for assets instead of base64 encoding them 2024-07-19 21:07:17 -05:00
mrq
d87b492295 added rudimentary demo page creator (currently just embeds base64 wavs into the page, need to test not doing that) 2024-07-19 20:49:40 -05:00
mrq
d53038a9e4 actually have split classifiers working 2024-07-19 15:33:31 -05:00
mrq
692d09f9c1 eval/validation fix for SpeechX tasks 2024-07-19 09:16:37 -05:00
mrq
28a674e0f1 fixes... 2024-07-18 23:25:32 -05:00
mrq
39f961abcd test trainer (vall_e.models.ar_nar) tests some SpeechX features 2024-07-18 18:46:45 -05:00
mrq
83a0954f85 fixes for re-introducing SpeechX tasks (need to actually validate if these all do the right things) 2024-07-18 17:16:32 -05:00
mrq
bccbb77a1a added option to either naively concat codes to concat audio waveforms (prior behavior) or to decode => concat => encode instead (although this only currently happens for prom sampling if an utternace is too small) 2024-07-18 16:48:41 -05:00