Commit Graph

623 Commits

Author SHA1 Message Date
mrq
d606a693ff eval fix for nar-len 2024-11-06 23:14:16 -06:00
mrq
105ed51159 I guess I'll fall for the NAR-len meme again (I don't know where my previous weights are, so I need to train it again to test something) 2024-11-06 19:17:12 -06:00
mrq
bcabde3454 more notes 2024-11-06 13:51:28 -06:00
mrq
bfc5e1d723 agony 2024-11-05 22:30:49 -06:00
mrq
aefe8fcdad UGH 2024-11-05 22:13:58 -06:00
mrq
556d9db0d5 web UI support for HF ZeroGPU 2024-11-05 21:38:02 -06:00
mrq
e58a9469a3 move layerskip to experimental settings....... 2024-11-05 20:37:06 -06:00
mrq
bbc2de3713 ugh 2024-11-05 11:50:05 -06:00
mrq
9e65e05e83 more Windows-specific fixes, limit gradio to <5.0.0 on Linux (it works on Windows, but not on my Linux machine™) 2024-11-04 18:00:33 -06:00
mrq
c83670c38c Windows-specific fixes (to-do: find libespeak-ng.dll automatically, because it cannot be trusted to do it by default) 2024-11-03 19:19:15 -06:00
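Since phonemizer won't reliably find the eSpeak NG library on Windows, the usual workaround is to locate the DLL yourself and set the environment variable phonemizer reads. A minimal sketch, assuming the standard eSpeak NG install paths (the candidate paths and helper name are hypothetical, not the repo's actual code):

```python
import os
from pathlib import Path

def set_espeak_library():
    """Point phonemizer at libespeak-ng.dll on Windows (assumed install paths)."""
    candidates = [
        Path(os.environ.get("PROGRAMFILES", r"C:\Program Files")) / "eSpeak NG" / "libespeak-ng.dll",
        Path(os.environ.get("PROGRAMFILES(X86)", r"C:\Program Files (x86)")) / "eSpeak NG" / "libespeak-ng.dll",
    ]
    for dll in candidates:
        if dll.exists():
            # phonemizer checks this environment variable when loading espeak.
            os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = str(dll)
            return str(dll)
    return None
```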
mrq
d229725c76 more adjustments (adjustments of early-exit entropy/varentropy thresholds, default rep pen being 1.5, experimental refine-on-stop, etc.) 2024-11-03 18:31:28 -06:00
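For reference, the entropy/varentropy that these early-exit thresholds get compared against can be computed from the logits roughly like this (a sketch of the standard definitions; the actual thresholds and where the repo applies them are not shown):

```python
import torch
import torch.nn.functional as F

def entropy_varentropy(logits: torch.Tensor):
    """Entropy and varentropy of the next-token distribution at the last position."""
    log_probs = F.log_softmax(logits[..., -1, :], dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)
    # Varentropy: variance of the per-token surprisal around its mean (the entropy).
    varentropy = (probs * (log_probs + entropy.unsqueeze(-1)) ** 2).sum(dim=-1)
    return entropy, varentropy
```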
mrq
aee08b7307 changed layerskip float16 training warning (since it didn't seem to fry on my 4xV100 system) 2024-11-03 09:58:29 -06:00
mrq
3826f9bae4 saner mask creation? (it doesn't matter, KV cache won't work) 2024-11-02 21:00:21 -05:00
mrq
ded746e157 very, very naive layerskip speculative sampling (it just checks if the current layer's state is good enough) 2024-11-02 11:49:05 -05:00
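A minimal sketch of what "checks if the current layer's state is good enough" could look like: run decoder layers one at a time, project each intermediate state through the LM head, and exit once the prediction is confident. All names (`layers`, `norm`, `lm_head`) and the entropy threshold are hypothetical, and real decoder layers need attention masks / KV caches this omits:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def early_exit_forward(layers, norm, lm_head, hidden, entropy_threshold=0.1):
    """Naive self-speculation: exit at the first layer whose prediction
    already looks confident enough (low entropy at the last position)."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        logits = lm_head(norm(hidden))[..., -1, :]
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
        if entropy.max() < entropy_threshold:  # "good enough" at this layer
            return logits, i
    return logits, len(layers) - 1
```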
mrq
62fe5b0943 ughh 2024-11-01 22:36:48 -05:00
mrq
ec79230965 shuffled web UI options hidden by cfg.experimental to its own tab, expose early exit selection to inferencing (it kinda works naively, still need to implement self-speculation) 2024-11-01 21:30:06 -05:00
mrq
ef1c17430f skip step on nan loss (ironically I have not had a nan loss after adding this), throw exception with invalid cfg.dataset.sample_type and sample_order combination (because I was tricked by this in my yaml and had inconsistent vram usage) 2024-11-01 20:54:53 -05:00
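The NaN-skip itself is a small guard around the optimizer step; a sketch assuming a plain PyTorch loop (no AMP or DeepSpeed wrapping):

```python
import torch

def training_step(model, batch, optimizer):
    """Skip the optimizer step when the loss is non-finite, instead of
    letting a single bad batch poison the weights."""
    loss = model(batch)  # hypothetical: forward pass returns a scalar loss
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)
        return None  # step skipped
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```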
mrq
fb8faa295b actually float16(+AMP) and layerskip is bad and will kill the model...... 2024-11-01 18:36:44 -05:00
mrq
edf1e66bf9 layerskip_r=6 fries the model so hard the loss is sub-1... 2024-11-01 17:06:07 -05:00
mrq
9b6c57bc57 third time's the charm (for some reason it escaped me that I should treat early exit loss as an aux_loss to be used with the normal loss, as if I was training a MoE's router) 2024-11-01 12:50:37 -05:00
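The MoE-router analogy amounts to adding the early-exit heads' losses as a weighted auxiliary term on top of the final layer's loss. A sketch, with the weighting scheme assumed:

```python
import torch
import torch.nn.functional as F

def layerskip_loss(per_layer_logits, targets, aux_weight=0.1):
    """Main loss from the final layer, plus a weighted auxiliary loss
    averaged over the early-exit heads (weighting scheme is an assumption).
    per_layer_logits: list of [B, T, V] tensors; targets: [B, T]."""
    main = F.cross_entropy(per_layer_logits[-1].flatten(0, 1), targets.flatten())
    aux = torch.stack([
        F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        for logits in per_layer_logits[:-1]
    ]).mean()
    return main + aux_weight * aux
```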
mrq
76ebef45dc off-by-one... 2024-10-31 13:24:48 -05:00
mrq
b63293cbbe ugh 2024-10-30 22:49:11 -05:00
mrq
a22534e8f4 layer skip training implemented (need to gut the inferencing from the repo, and to actually see if the model can benefit from this) 2024-10-30 20:05:45 -05:00
mrq
4049f51ba9 added option to load lora directly from the model file itself with --lora 2024-10-26 00:13:10 -05:00
mrq
ccf71dc1b6 added option to load from a model state dict directly instead of a yaml (to-do: do this for LoRAs too), automatically download the default model if none is provided 2024-10-25 22:15:15 -05:00
mrq
a96f5aee32 adjusted how i want to pass eval kwargs 2024-10-25 20:38:09 -05:00
mrq
92e6bff6dc actually, AR temp 0.5 with rep pen 1.125 seems to give the better outputs without the intermittent degradation 2024-10-23 00:03:35 -05:00
mrq
8920e5e86b actually have beam_width in the webUI work 2024-10-22 22:06:22 -05:00
mrq
910571ad34 too brainlet to diagnose why low temp / greedy sampling is randomly unstable some of the time 2024-10-22 20:13:54 -05:00
mrq
8eb9a4056b modified default arguments (ar temp = 0 and rep pen = 1.125 seems to be stable, at least given the few things i tested), do not pass top k/top p/min p to NAR even though technically none of those things should matter when greedy sampling 2024-10-22 18:12:39 -05:00
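The rep pen here is presumably the usual CTRL-style repetition penalty, applied to the logits before the (greedy) argmax. A sketch, with the 1.125 value taken from this commit and the function shape assumed:

```python
import torch

def apply_repetition_penalty(logits, prev_tokens, penalty=1.125):
    """CTRL-style repetition penalty: dampen logits of already-emitted tokens.
    Positive logits are divided by the penalty, negative ones multiplied.
    logits: [B, V]; prev_tokens: [B, T] of token ids."""
    score = logits.gather(-1, prev_tokens)
    score = torch.where(score > 0, score / penalty, score * penalty)
    return logits.scatter(-1, prev_tokens, score)

# With AR temp 0, sampling reduces to:
#   next_token = apply_repetition_penalty(logits, prev_tokens).argmax(dim=-1)
```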
mrq
1a02cd5bce modify demo template to say F5 instead of YourTTS, swap LoRA comparison around to make the lora'd the base file, and the no-lora the suffix'd file 2024-10-21 19:52:02 -05:00
mrq
02dfc60ac3 ugh 2024-10-18 17:23:22 -05:00
mrq
71731ed785 added prefixing with silence (was to test something, currently hidden under cfg.experimental=True) 2024-10-18 17:19:52 -05:00
mrq
6b04c13c56 print warning when inferencing without an audio prompt at low AR temp (it really doesn't like low temps / greedy sampling) 2024-10-18 17:01:40 -05:00
mrq
c8f31db1de default to greedy sample AR (i should probably test this more but it seems to pass my harvard sentences and tongue twisters) 2024-10-18 16:58:56 -05:00
mrq
fc8dfd8617 made greedy AR sampling viable (and preferable), with caveats (per comment in vall_e.models.ar_nar) 2024-10-18 16:55:00 -05:00
mrq
07f4935a75 more tweaks 2024-10-18 13:19:36 -05:00
mrq
0dfab973e7 oops 2024-10-18 09:40:06 -05:00
mrq
75b90be325 cleaned up unused config flags, allow less strict yaml by pruning missing keys, renamed some dataset configs to be more unified 2024-10-17 17:06:48 -05:00
mrq
8b6095f681 saner defaults, maybe 2024-10-17 14:37:21 -05:00
mrq
f88097ccf6 add config option to set the rate of sampling randomly vs similar speakers during training 2024-10-16 14:27:58 -05:00
mrq
48461833c2 ugh 2024-10-15 19:30:43 -05:00
mrq
eea70f5698 kludge fix for an oversight in the model when trying to train for longer input prompt durations...... 2024-10-15 19:25:03 -05:00
mrq
84005c5b00 entropix apparently processes the entire sequence of logits but it falls apart when doing that 2024-10-13 12:01:12 -05:00
mrq
c800d28bb8 respect attention defined in the yaml for web UI (which might explain why there's been a discrepancy in outputs for me) 2024-10-13 11:02:24 -05:00
mrq
ed6b7a690f ugh......... 2024-10-13 00:26:46 -05:00
mrq
d405f243d4 at wits end in trying to output the right attention scores 2024-10-12 23:53:13 -05:00
mrq
70cf694cfd output attention scores for SDPA/flash, since naive attention seems broken 2024-10-12 12:09:17 -05:00
mrq
541e45263c ugh 2024-10-12 11:29:16 -05:00
mrq
04e983b86b modified demo page to be more modular with demoing comparisons, actually provide a path to use modified naive attention, entropix sampling is not tied to an experimental yaml flag now 2024-10-12 11:27:55 -05:00
mrq
666e8038fb ugh 2024-10-12 10:41:35 -05:00
mrq
3d6ef9666b overridden naive llama attention to get the right score values that entropix needs 2024-10-12 10:05:47 -05:00
mrq
40b089daf3 lol 2024-10-12 09:57:34 -05:00
mrq
d6f7c86a5c entropix tweaks (it doesn't output garbage but it loves to go for silence) 2024-10-12 09:46:18 -05:00
mrq
d0ab7d755a added min-p (really does not seem useful since it's very sensitive), more tweaks to entropix 2024-10-11 22:36:06 -05:00
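Min-p filtering keeps only tokens whose probability is at least min_p times the top token's probability, which is why it is so sensitive: the cutoff scales with the model's confidence. A sketch (the default value is an assumption):

```python
import torch
import torch.nn.functional as F

def min_p_filter(logits, min_p=0.05):
    """Min-p sampling filter: keep tokens with probability >= min_p * p(top token);
    mask the rest to -inf before sampling."""
    probs = F.softmax(logits, dim=-1)
    top_p, _ = probs.max(dim=-1, keepdim=True)
    mask = probs < (min_p * top_p)
    return logits.masked_fill(mask, float("-inf"))
```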
mrq
bef43a0c18 added experimental entropix sampling support 2024-10-11 21:18:26 -05:00
mrq
85d85c1351 more arg creep for demo page 2024-10-10 19:40:01 -05:00
mrq
301468f519 << 2024-10-10 19:13:52 -05:00
mrq
75a4c866d6 more demo page tweaks, added arg to force enable/disable LoRAs for inferencing (to-do: setup arg flags to handle this, and checkbox in web UI) 2024-10-10 19:04:12 -05:00
mrq
96d05be73c demo page tweaks 2024-10-10 13:52:37 -05:00
mrq
2ea978f318 added --eval-random-text-prompts to use random text prompts for eval pass, added --random-prompts for demo page and --lora to use a sample with the lora disabled, probably finally fixed validation dataloader breaking on eval 2024-10-10 13:40:25 -05:00
mrq
52299127ab fix vall_e.emb.process 2024-10-08 20:00:34 -05:00
mrq
0656a762af fix vall_e.emb.transcriber 2024-10-08 19:24:43 -05:00
mrq
acdce66d4e readme tweaks, set the (unused) default model download URL back to the base ar+nar-llama-8 model, as ar+nar-tts+stt-llama-8 was renamed back to it since it performs well 2024-10-05 22:53:53 -05:00
mrq
84c7419001 faster 2024-10-04 22:30:47 -05:00
mrq
a507b769a1 sped up inferencing by not doing .tolist() for rep pen / length pen (and a bug fix in the web UI from prev commit) 2024-10-04 22:18:20 -05:00
mrq
4a8e3ccf06 README tweaks, added --input-prompt-prefix as an experiment (it's literally better to just not do this, but i'll retain it in case i have a revelation on how to improve it) 2024-10-04 18:57:19 -05:00
mrq
a9fa0898a9 tweaked demo page script to sample speakers instead 2024-09-28 10:50:26 -05:00
mrq
2f1dca3089 added language selection in web UI, tweaked demo script 2024-09-28 09:49:45 -05:00
mrq
10df2ef5f3 fixed oversight where input audio does not resample (lol...) 2024-09-27 20:27:53 -05:00
mrq
039482a48e don't do eval on stt because it's so slow and I don't even bother doing any metrics against it anyways (to-do: make this a flag) 2024-09-26 18:56:57 -05:00
mrq
ff7a1b4163 coerce into path for other sampler_types (it's required for sampling for similar utterances) 2024-09-26 18:37:56 -05:00
mrq
f24547ad4e add top_k sampling / offset for prompt similar utterance sampling 2024-09-26 16:26:40 -05:00
mrq
9da630f73a swap order of demo entries, as the model prioritizes adhering to the speaker prompt more (instead of trying to match the ground truth magically) 2024-09-25 23:31:24 -05:00
mrq
e84d466261 vall_e.plot tweaks 2024-09-24 20:05:10 -05:00
mrq
c5e9142863 added option to retokenize phonemes for hdf5 (to save having to remake my hdf5 file) 2024-09-21 13:08:01 -05:00
mrq
536c11c4ac actually validated and fixed sampling similar utterances for the prompt (hopefully nothing else is needed) 2024-09-21 12:59:51 -05:00
mrq
d31f27119a regex replace out the (lang) markers in espeak, updated tokenizer vocab as lazily as possible to not have unk tokens 2024-09-21 12:29:28 -05:00
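The (lang) markers espeak injects on language switches can be stripped with a single regex; a sketch (the exact pattern used in the repo may differ):

```python
import re

# espeak-ng wraps language switches like "(en)word(fr)"; strip the markers
# so they never reach the tokenizer (pattern assumed from the commit message).
LANG_MARKER = re.compile(r"\([a-z]{2,3}(?:-[a-z]{2,3})?\)")

def strip_lang_markers(phonemes: str) -> str:
    return LANG_MARKER.sub("", phonemes)

assert strip_lang_markers("(en)hello(fr) bonjour") == "hello bonjour"
```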
mrq
769f67dcfe actually fix validation of phonemes in the symmap 2024-09-21 12:19:34 -05:00
mrq
c8d4716a9f ugh 2024-09-18 21:40:57 -05:00
mrq
fe241f6a99 support for wildcard in training/validation/noise dataset array (to-do: a better way to query between metadata folder and data folder) 2024-09-18 21:34:43 -05:00
mrq
b5bec0c9ce oops, turns out these are not split by speaker names already........ (also added sampling the dataset in the webui for easy viewing) 2024-09-18 20:19:46 -05:00
mrq
fa9d3f6c06 lang fixes / reworked phoneme symmap validation 2024-09-18 19:36:03 -05:00
mrq
84647f588a more tweaks 2024-09-18 16:43:57 -05:00
mrq
ebac1db16c maybe final tweaks, I really needed to unify my json read/write and orjson is proven to be fast enough for me to try and rely on it more 2024-09-17 22:57:04 -05:00
mrq
6ceed866b5 *faster* 2024-09-17 22:44:36 -05:00
mrq
f00283440c faster 2024-09-17 22:26:31 -05:00
mrq
be22b65300 solved my problem 2024-09-17 21:58:44 -05:00
mrq
8f41d1b324 more tweaks 2024-09-17 16:26:30 -05:00
mrq
804ddb5182 optimizations (6 hours to do cosine similarities on a speaker set of just 17k utterances................) 2024-09-17 15:51:45 -05:00
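The obvious optimization for all-pairs cosine similarity is to normalize once and do a single matmul rather than looping over pairs; a sketch (whether the repo does exactly this is an assumption):

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(embeddings: torch.Tensor) -> torch.Tensor:
    """All-pairs cosine similarity for one speaker's utterance embeddings
    in a single matmul. embeddings: [N, D] -> similarity matrix [N, N]."""
    normed = F.normalize(embeddings, dim=-1)
    return normed @ normed.T

# For 17k utterances the [N, N] float32 result is ~1.2 GB, so this wants
# chunking or a GPU, but it beats a Python loop over 17k^2 pairs by hours.
sims = pairwise_cosine(torch.randn(17_000, 512))
```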
mrq
a9fbe81f98 oops 2024-09-17 15:25:12 -05:00
mrq
c440c4fe7e relegated processing similarity data into vall_e.emb.similarity since it's easier, seems to work? 2024-09-17 14:37:21 -05:00
mrq
56f25f7a9b more stuff for similar-speaker prompt sampling (to-do: actually test if this works...) 2024-09-16 23:10:29 -05:00
mrq
69f140ba45 fix oversight with phonemizing french because espeak defines french as fr-fr instead of fr (even though spain spanish is es and not es-sp or some shit, but portugal portuguese is pt-pt) 2024-09-13 12:53:36 -05:00
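In practice this kind of fix boils down to a small normalization table mapping the project's language codes to the ones espeak actually accepts; a sketch:

```python
# espeak-ng is inconsistent about region suffixes: French is "fr-fr" and
# Portuguese is "pt-pt", but Spanish is plain "es". Normalize before phonemizing.
ESPEAK_LANG_MAP = {
    "fr": "fr-fr",
    "pt": "pt-pt",
}

def to_espeak_lang(code: str) -> str:
    return ESPEAK_LANG_MAP.get(code, code)
```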
mrq
4f3c7a37c8 also do text similarities (dont know what use I'll have for this) 2024-09-10 16:45:59 -05:00
mrq
1c615a0f52 helper script (vall_e.emb.similar) to figure out the best way to compute similarity scores for audio (iunno how to go about it desu) 2024-09-10 16:34:23 -05:00
mrq
d059f6f56d added helper script to process Emilia (amphion/Emilia-Dataset), clean up espeak phonemes for non-English transcriptions with English words (because for some reason espeak injects (en){word}(lang) markers and it's annoying) 2024-09-09 09:57:32 -05:00
mrq
31e8b7edb8 tweaks and fixes for lora stuffs 2024-09-08 18:05:21 -05:00
mrq
54203c059d validated rep pen for STT (sometimes needed to wrangle the model) 2024-09-08 08:30:30 -05:00
mrq
6a967f91b9 oops 2024-09-07 22:13:49 -05:00
mrq
5d66a7db52 webui cleanup, more tweaks, default to safetensors in config 2024-09-07 21:45:05 -05:00
mrq
a6ad0577b8 cleanup the resultant text from STT 2024-09-06 18:44:25 -05:00
mrq
fa93061b3e more fixes, moved sampler state dict to a better place, eval works again 2024-09-06 16:59:56 -05:00
mrq
4bd9bb39c8 webui for STT (still need to bake the model to handle it better, a few hours so far has it generate what looks like a normal transcription but does not correlate to the audio right now) 2024-09-06 15:13:04 -05:00
mrq
d33a906119 cleanup for AR_NAR inferencing to allow both TTS and STT tasks simultaneously (need to have training eval do this too, though) 2024-09-06 14:30:12 -05:00
mrq
341e19162b fixes, again 2024-09-06 11:41:41 -05:00
mrq
94cf81d38c tweak 2024-09-05 23:21:18 -05:00
mrq
413097f5f7 fixes 2024-09-05 21:42:59 -05:00
mrq
54547b74d8 experimental implementation of STT (need to actually test on a model, test trainer seems to work) 2024-09-05 20:43:20 -05:00
mrq
d319d33368 haha 2024-09-04 14:52:26 -05:00
mrq
619369236b ugh 2024-08-30 21:10:57 -05:00
mrq
168e203942 ugh 2024-08-30 14:39:07 -05:00
mrq
685f4faec0 ugh 2024-08-30 10:46:26 -05:00
mrq
32287710a2 moved prints to use logger, edited readme (fused_attn doesn't seem stable for training) 2024-08-29 13:27:16 -05:00
mrq
d423bc03c2 fixed attentions for MoE 2024-08-27 17:02:42 -05:00
mrq
b7b99a25f1 added ability to specify attention backend for CLI and webui (because im tired of editing the yaml) 2024-08-26 19:33:51 -05:00
mrq
0d706ec6a1 added fused_attn (triton-based fused attention) and simply just query for flash_attn under rocm 2024-08-26 19:13:34 -05:00
mrq
6b0891448c pain (some shit to try and get some flash attention for ROCm (gfx1100) through triton fused attention but no good) 2024-08-25 20:07:27 -05:00
mrq
40e1799adc fixed xformers and flash_attn to actually work now 2024-08-19 01:03:35 -05:00
mrq
29c35528e5 the sooner I accept there's no FA for V100s the sooner I'll go to bed 2024-08-18 23:54:33 -05:00
mrq
d636edd3a2 added flash_attn LlamaAttention (including flash_attn==1.0.9) 2024-08-18 20:51:14 -05:00
mrq
054d28573a my DAC dataset again managed to only have some utterances with only 8 of 9 RVQ levels, this fixes an oversight from it 2024-08-09 21:18:01 -05:00
mrq
2a1794c084 ughghghhhh 2024-08-09 21:15:01 -05:00
mrq
ed373957e2 maybe not 2024-08-09 11:38:08 -05:00
mrq
c658a7b440 make loss scaling opt-in rather than automatically determined (because it seems a DAC-based model really doesnt like loss scaling) 2024-08-09 10:51:36 -05:00
mrq
d04f6911b4 oops 2024-08-08 19:38:55 -05:00
mrq
0aa59e6f3f uncommented block that writes the metadata on HDF5 creation 2024-08-08 19:21:29 -05:00
mrq
79a6781c9e fix vall_e.data --action=hdf5 actually transcribing, because past me completely forgot the transcribe/process dataset scripts were already moved inside the module 2024-08-08 07:51:42 -05:00
mrq
949339a3fa do not include SDPA attention if there's no available SDPA backends 2024-08-06 20:42:39 -05:00
mrq
613024ec0d ugh 2024-08-06 20:35:15 -05:00
mrq
eac353cd0b busy work and cleanup while I wait for 1TB of audio to quantize... again. 2024-08-06 20:23:33 -05:00
mrq
f284c7ea9c do mixed-precision for AMP inside the compress function itself, because the loudness function gripes when using a float16 (non-power of 2 lengths) or bfloat16 (something about views for bfloat16) 2024-08-06 15:08:37 -05:00
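A sketch of the shape of this fix: keep the loudness measurement in float32 and only autocast the actual encode. The codec method names here are hypothetical stand-ins:

```python
import torch

def compress(audio: torch.Tensor, codec, device="cuda"):
    """Quantize audio under autocast, but measure loudness in float32:
    float16 trips on non-power-of-2 lengths and bfloat16 on view()s
    inside the loudness routine (per the commit message)."""
    loudness = codec.loudness(audio.float())       # hypothetical, kept in float32
    with torch.autocast(device_type=device, dtype=torch.float16):
        codes = codec.encode(audio.to(device))     # hypothetical codec API
    return codes, loudness
```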
mrq
b6ba2cc8e7 tweaked vall_e.emb.process to instead process audio one file at a time instead of all the files for a given speaker to avoid OOMing on less-memory-filled systems with --low-memory 2024-08-06 14:24:40 -05:00
mrq
9710b06b74 tweaks and things 2024-08-06 08:17:25 -05:00
mrq
134dac8c2b re-adapted process_libritts.py to a 'better' way (better because it processed without needing to shuffle a bunch of things and adapt to cope or something) 2024-08-05 20:34:58 -05:00
mrq
3f73fcca29 oops 2024-08-05 20:12:13 -05:00
mrq
597441e48b moved transcribe and process dataset scripts to vall_e/emb within the module itself, argparse-ified transcription script 2024-08-05 19:40:50 -05:00
mrq
7cdfa3dc0c updated process_datasets.py, added argparsing so I can mostly stop manually editing things, and some other cleanup 2024-08-05 15:59:25 -05:00
mrq
debcc93e7e add adapted MixtralAttention for when I make a bad decision to actually train a MoE 2024-08-04 22:03:22 -05:00
mrq
10aaf840e7 added export option to convert Llama to MixtralMoE for another dumb experiment 2024-08-04 20:25:06 -05:00
mrq
3a65cc4b22 fix issue with sft and shared tensors... 2024-08-04 19:56:21 -05:00
mrq
23f3b56fda oops 2024-08-04 08:18:57 -05:00
mrq
d19f93a2c0 documentation update 2024-08-04 00:14:49 -05:00
mrq
2cb465018b implicitly load either normal pickled weights or safetensors on loading the model 2024-08-03 23:34:18 -05:00
mrq
c09133d00f added safetensors support (with metadata) and feed whatever torch.load/torch.save into it 2024-08-03 23:15:20 -05:00
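For reference, the safetensors round-trip with metadata looks roughly like this (metadata values must be strings, and cloning tensors first sidesteps the shared-storage issue addressed in 3a65cc4b22):

```python
import torch
from safetensors.torch import save_file, load_file

model = torch.nn.Linear(4, 4)  # stand-in for the actual model

# safetensors stores a flat {name: tensor} dict plus string-only metadata;
# aliased (shared-storage) tensors must be cloned or save_file rejects them.
state_dict = {k: v.detach().clone() for k, v in model.state_dict().items()}
save_file(state_dict, "model.sft", metadata={"format": "pt"})

model.load_state_dict(load_file("model.sft", device="cpu"))
```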
mrq
6a733eb2ed changed torch.Tensor().to(device, dtype) to just torch.tensor(..., device, dtype), because it's been bothering my autism that I'm creating tensors then converting rather than creating with the right device/dtype; some 'optimization' to compile the model, but it doesn't seem to do anything useful 2024-08-03 22:10:21 -05:00
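The difference, for reference (the dtype and device values here are just for illustration):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# before: allocate a float32 tensor, then allocate again and copy on .to()
x = torch.Tensor([1, 2, 3]).to(device=device, dtype=torch.float16)

# after: allocate once with the right device/dtype from the start
x = torch.tensor([1, 2, 3], device=device, dtype=torch.float16)
```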
mrq
ab673e0426 add cap for NAR-len training, to avoid any weird cases in early training where it'll just mess up and generate long lengths 2024-08-03 21:00:32 -05:00
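The cap is presumably just a clamp on the predicted duration; a sketch with an assumed limit:

```python
import torch

def cap_duration(pred_len: torch.Tensor, max_len: int = 500) -> torch.Tensor:
    """Clamp the NAR-len duration prediction so a poorly-trained early model
    can't commit to absurdly long generations (the 500 cap is an assumption)."""
    return pred_len.clamp(min=1, max=max_len)
```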
mrq
4d2b88b164 throw exception if training, but no model is set to train (because i ran into this wondering what the hell was happening) 2024-08-03 20:51:23 -05:00
mrq
d0a5c7eca2 more coping with the NAR len 2024-08-03 20:23:36 -05:00
mrq
11fa3da665 some cleanup, fixed the wrapper attention to explicitly use other sdpa backends 2024-08-03 19:51:00 -05:00