vall-e

mrq/vall-e

Author	SHA1	Message	Date
mrq	67a9401cce	oops	2025-02-06 15:14:14 -06:00
mrq	712ce4af5d	maybe fixed errors with DAC backend, added option to limit by duration in emb.process (because I only really need short utterances right now and I'm not ready to spend a week on processing everything again)	2025-02-06 12:37:18 -06:00
mrq	299cc88821	re-added amp encoding/decoding for audio, possible bad idea to ignore using amp instead if requested	2025-02-05 21:55:06 -06:00
mrq	7592befc53	updated vall_e.emb.process to allow for batched processing, some typo fixes (it's painfully slow on my 7900XTX...)	2025-02-05 21:13:20 -06:00
mrq	79c504c278	cleaned up encode/decode functions to make them a little more coherent, added option to batch encode/decode (would have been very nice in the past, but this should speed things up for me when i fall for the latest meme codec)	2025-02-05 20:54:31 -06:00
mrq	84174c1c1b	oops	2025-02-05 10:25:03 -06:00
mrq	bb2ebe1ca2	fixed issues that may arise from updating transformers with attention, added nvidia/audio-codec-44khz backend support (by gutting everything necessary because I do NOT want to install more dependencies)	2025-02-04 20:30:07 -06:00
mrq	0841f366e8	I should really just grab modelling_llama wholesale (fix for the adapted attention class)	2025-01-28 21:55:05 -06:00
mrq	e5f9da2221	oops	2025-01-21 11:59:24 -06:00
mrq	69c1d2991f	updated mixtral backend (need this for something else)	2025-01-20 21:50:56 -06:00
mrq	1a26f789a5	added option to playback audio directly, removed no-phonemize option since I swear it worked in testing but it doesn't actually work	2025-01-12 21:52:49 -06:00
mrq	9fa87c417a	added option to use raw text rather than the IPA phonemes (it requires a model trained on raw text)	2025-01-06 00:10:43 -06:00
mrq	3ab11bdc7b	oops	2025-01-05 23:53:17 -06:00
mrq	b445f4abb6	experimental	2025-01-05 19:05:00 -06:00
mrq	2e6a7625e4	experimental	2025-01-05 12:47:03 -06:00
mrq	31cfef59c4	when you do more training thinking the original model that can do NS/SR got deleted but it was actually a string not having its quotes in the right place.......	2024-12-27 18:16:57 -06:00
mrq	9b0d2ccbe1		2024-12-26 21:42:17 -06:00
mrq	59f56ad099	cleanup	2024-12-24 23:14:32 -06:00
mrq	82e8592f2a	working vall_e.cpp	2024-12-24 17:54:48 -06:00
mrq	497bdfc67b	more work (the wall is non-causal decoding......)	2024-12-22 20:11:31 -06:00
mrq	5f289db275	ugh	2024-12-22 16:15:24 -06:00
mrq	0d4329d2e3	sanity cleanup	2024-12-22 15:05:45 -06:00
mrq	353e478e68	agony	2024-12-21 22:52:10 -06:00
mrq	5788db849b	added extremely barebones vall_e.cpp so I can stop having to juggle this file around so much	2024-12-21 10:57:02 -06:00
mrq	91caf00212	ugh	2024-12-20 17:13:37 -06:00
mrq	d85273609e	corrected export.py's --hf	2024-12-20 15:17:13 -06:00
mrq	59bf6b8b33	exposed additional task (ns, sr, vc) (vc is experimental)	2024-12-20 11:15:29 -06:00
mrq	53230efd74	changed prompt_inject_noise to prompt_inject_noise_p so I can have another reason to do this post-training	2024-12-19 19:28:50 -06:00
mrq	e7e7f48043	livid	2024-12-19 19:25:27 -06:00
mrq	8838babcba	sanity checks (and I realized that the model actually had langs set to 4 in the yaml for KO/ZH so................)	2024-12-19 19:08:57 -06:00
mrq	7617b6485f	instead just compute a bunch of stuff on the transcriptions to store later in different names so I can just retrieve what I want, also added tongue twisters for nefarious reasons	2024-12-18 23:43:11 -06:00
mrq	4775edaa41	added text cleaning/normalization for wer purposes but it amounts to nothing desu	2024-12-18 19:58:53 -06:00
mrq	9090c34f10	cringe script to process seed-tts-eval's eval dataset into something i can easily use	2024-12-17 22:47:12 -06:00
mrq	ed152f78df	tweaks to prompt duration to allow me to divorce how i use it for training with how I'm using it for the demo page, and demo page tweaks to make my life easier	2024-12-17 19:33:04 -06:00
mrq	7129582303	actually do proper wer/cer calculation by un-normalizing the scores	2024-12-17 14:22:30 -06:00
mrq	c2c6d912ac	actually do speaker verification	2024-12-17 10:11:14 -06:00
mrq	c2e17e287b	really shoddy voice conversion implementation (it sort of works...)	2024-12-16 22:54:53 -06:00
mrq	8515038968	imagine my disappointment when the epoch finished just for it to throw an exception	2024-12-16 18:28:01 -06:00
mrq	4a65ac9eb7	oops	2024-12-15 17:21:51 -06:00
mrq	cd4a5f427c	KO/ZH model soon	2024-12-15 17:01:14 -06:00
mrq	4800e7179a	remove nan checks because they cause problems in distributed training because I'm not syncing between GPUs (and nan losses get ignored anyways with loss scaling)	2024-12-15 09:42:54 -06:00
mrq	2ba6b483dc	ugh	2024-12-14 22:43:51 -06:00
mrq	3dd31e74d1	finally figured out a clean way to handle "resuming" the tqdm bar	2024-12-14 18:44:43 -06:00
mrq	35389481ee	move lazy-stored ortho matrix to the grad device for apollo because agony	2024-12-13 23:22:26 -06:00
mrq	09804ecc16	APOLLO tweaks to make it work with deepspeed	2024-12-13 23:03:52 -06:00
mrq	64c67160a3	tweaks	2024-12-13 19:00:35 -06:00
mrq	0fbfb8bbe8	actually save the optimizer for the local engine backend because safetensors doesn't save it	2024-12-12 17:12:59 -06:00
mrq	f41251f648	more fixes for local engine backend	2024-12-12 14:38:42 -06:00
mrq	6b237ae5e3	tweaks for the local engine orchestrator (that I never caught since I always used the deepspeed backend)	2024-12-12 13:37:38 -06:00
mrq	9a62e3b824	APOLLO cringe (doesn't want to work with deepspeed)	2024-12-12 00:31:58 -06:00
mrq	cddf8ca814	sort batches to try and reduce number of padded tokens in batched inference (also commented out F5 samples getting added to the demo page because I would have to regenerate them)	2024-12-11 22:45:38 -06:00
mrq	20b87bfbd0	store metrics and only recalculate them if the output file is newer than the metrics file	2024-12-11 20:55:43 -06:00
mrq	0c69e798f7	template cleanup	2024-12-11 20:06:55 -06:00
mrq	7e54e897f7	also shifted to transformers' pipeline for transcribing	2024-12-11 19:57:53 -06:00
mrq	b81a98799b	uplifting transformers' WavLM stuff to do speaker verification instead	2024-12-11 19:30:05 -06:00
mrq	6468e5d124	lol	2024-12-11 19:10:32 -06:00
mrq	6f1ee0c6fa	Added CER, transcription/similarity model args in demo	2024-12-10 21:00:51 -06:00
mrq	8568a93dad	added WER/SIM-O metrics, added APOLLO but I need to test it	2024-12-10 20:13:21 -06:00
mrq	a6c745bafb	chinese (mandarin?) support added (I guess I don't need pinyin, but tone markers are handled), korean validated, vocab adjusted	2024-12-09 14:26:19 -06:00
mrq	3ef8894290	oops	2024-12-08 15:24:21 -06:00
mrq	1d460b9fe3	logic fixes, I feel like output is better? (also NAR can have a temperature, I imagine it couldn't because it was having a causal mask passed to it for the longest time before I caught it a month ago)	2024-12-08 14:52:47 -06:00
mrq	0c5a458b00	deduce language per line to allow for a cheap way to allow for cross-lingual switching, kinda	2024-12-07 22:57:29 -06:00
mrq	a032ff588f	doc update, added automatically deducing language from a given text, also checks if the input is already phonemized text to allow direct control without being cringe (procrastinating adding WER/SIM-O)	2024-12-07 22:34:25 -06:00
mrq	5d80a2d0d4	fixed NAR-len issues with non-english maybe (langs weren't being passed), added interface to inference in batches through tts.batched_inference (no support for rolling context/prefixes because there's no way to do that), demo page uses batched inferencing now	2024-12-07 19:21:05 -06:00
mrq	1f54bf5b40	revert sageattn back to optional dependency because it's not on windows, force resize_modules on by default because I broke something	2024-12-07 17:09:39 -06:00
mrq	218d0e29fd	ugh (batchmean actually expects batch=seq_len, and not the actual batch)	2024-12-07 12:39:01 -06:00
mrq	61ed662856	ACTUALLY actually fix KD-loss (the -inf in the logits was caused by cringecode)	2024-12-07 12:31:54 -06:00
mrq	f97e8b0c7f	ACTUALLY do KD-loss because of an oversight with masked_select outputting 1D tensors that get softmax'd in total	2024-12-07 09:52:51 -06:00
mrq	34a66e1052	agnostified KD	2024-12-06 23:53:46 -06:00
mrq	953d3eb030	ugh	2024-12-06 22:35:30 -06:00
mrq	42fafbaaca	actually fixed knowledge distillation because of errant -inf logits causing problems and needed to be filtered (and splitting text language / output audio language because it helps)	2024-12-06 21:55:20 -06:00
mrq	23d402bf01	added knowledge distillation in the trainer (sadly it is not agnostic because of the grave mistake of further processing the batch within the forward pass, so subsequent calls do not match......)	2024-12-05 23:05:52 -06:00
mrq	4e21df8092	oops	2024-12-04 21:24:22 -06:00
mrq	93d27be539	rolling context finally (use last N utterances as the prefix for the next gen), option to split input text prompt by sentences instead of lines (or no splitting)	2024-12-04 20:31:44 -06:00
mrq	9dff68c0c5	NAR-len tweaks (remasks a small amount of tokens per step, it seems to help with reducing the number of steps needed some of the time?, disable CFG for the first half to speed things up)	2024-12-04 09:30:29 -06:00
mrq	cf97560e70	minimum CFG of 3 for NAR-len because it seems the model will auto-default to NAR-len now	2024-12-03 19:40:05 -06:00
mrq	ca31da0a95	sageattn (forgot to bother with testing this the other day, seems fine)	2024-12-03 15:14:57 -06:00
mrq	31ab90d84a	cringe code to convert to LlamaForCausalLM-happy weights + tokenizer dict (still need to write logic to actually use these weights for proper inferencing)	2024-12-03 10:18:58 -06:00
mrq	84a05acb6d	touch ups in docs	2024-12-02 19:10:42 -06:00
mrq	dcaf38b359	fixed training tqdm being stubborn	2024-11-23 09:45:23 -06:00
mrq	41d7c30ea5	added much cleaner non-causal mask generation	2024-11-22 19:43:32 -06:00
mrq	c99a74e834	actually generate a causal mask because it seems sometimes it does not actually generate one because it makes assumptions	2024-11-22 18:30:24 -06:00
mrq	ccee5fc11c	that was actually all pointless since sdpa always had an attention mask fed to it and does not need is_causal to implicitly generate one	2024-11-22 16:51:50 -06:00
mrq	4aa685e749	what has science done	2024-11-22 16:45:40 -06:00
mrq	147219a5e0	huge oversight in the attention masking......... (i realized I have not been providing a non-causal mask to non-causal tasks)	2024-11-22 13:44:43 -06:00
mrq	24d888c47c	temporarily dropping support for xformers because it's breaking when using an attention mask (which I don't remember commenting out when being passed), default to not use wandb because it's being a pain when doing tests and not actual sessions	2024-11-22 11:29:12 -06:00
mrq	8aafae91fd	don't use timeembedding	2024-11-21 23:14:52 -06:00
mrq	2cef97e43f	cleanup	2024-11-21 23:08:43 -06:00
mrq	3fc0540f49	m	2024-11-21 15:07:46 -06:00
mrq	6845c447c9	added more harvard sentences to load from a text file	2024-11-21 13:18:11 -06:00
mrq	2a084544e8	moved duration padding for NAR-len to be a scalar instead (since it seems longer utterances need it much more so than shorter utterances)	2024-11-21 13:04:07 -06:00
mrq	6aee08f9c0	moved stuff in the web UI around (un-experimented the max NAR-len steps because it's kind of important to adjust this value for better sounding audio / quicker generated audio)	2024-11-20 20:37:33 -06:00
mrq	dfdba3f190	oops	2024-11-20 19:21:03 -06:00
mrq	cd6e9ba2f2	oops	2024-11-20 16:27:51 -06:00
mrq	1a73ac6a20	I cannot believe it's not actually called Wand DB (added wandb logging support since I think it would have been a much better way to look at my metrics)	2024-11-20 16:10:47 -06:00
mrq	67f7bad168	added mixed modality AR+NAR-len to generate a short prefix through the AR, then inference with said prefix through the NAR-len (need to experiment with it more to ensure that the masked off tokens are the only tokens getting updated)	2024-11-20 14:22:12 -06:00
mrq	db64e6cb59	dependency updates (gradio 5.x now works on my machine)	2024-11-20 12:33:01 -06:00
mrq	b1369e7824	better modality selection (pick AR+NAR by default for the ar+nar model, pick NAR-len by default for the nar-len model), lowered default CFG because it makes the AR+NAR output sped up (but can't be too low since it's required for the NAR-len)	2024-11-19 18:51:17 -06:00
mrq	190a917b3e	I did it.	2024-11-19 12:24:33 -06:00
mrq	0e621354e7	cleaned up classifier-free guidance logit processing (in order to try and cope with a bad nar-len model)	2024-11-19 10:30:05 -06:00
mrq	5ba80686e1	two weeks of agony concludes	2024-11-18 21:29:28 -06:00
mrq	2b29790173	oops	2024-11-18 14:12:26 -06:00
mrq	4a71981456	normalize sampler index by batch size (if not using batched sampler), add option to cap out utterances for a speaker, some other things	2024-11-18 12:46:50 -06:00
mrq	6cfdf94bf9	swap priority to use nar-len if available, added notes	2024-11-18 09:40:04 -06:00
mrq	069b27570f	set option to set training masking ratio (I don't think for tts a fixed masking ratio is beneficial since the magic of the AR+NAR is being able to still reference the prior sequence of tokens for predicting things)	2024-11-17 17:04:07 -06:00
mrq	88d840218d	default set cfg strength to 3.0 since the reference model is updated	2024-11-17 10:23:40 -06:00
mrq	a3e1fa3518	ugh	2024-11-17 09:28:33 -06:00
mrq	23fdba0c98	tweaks and changes	2024-11-16 15:49:06 -06:00
mrq	2fbeacfe92	ugh	2024-11-14 22:18:33 -06:00
mrq	39096f8ff3	redid loss calculation to be cleaner, and position ID generation, and other things (I might need to train the NAR-len from scratch and not resume from an existing checkpoint.........)	2024-11-14 22:17:47 -06:00
mrq	ef05c951ff	adjust fp16 loss scaling since I fried a model overnight when it hit 8K scale	2024-11-14 09:23:52 -06:00
mrq	e412e98125	ugh	2024-11-14 07:34:22 -06:00
mrq	c00fc18b62	actually use the right embedding for nar-len	2024-11-13 18:04:04 -06:00
mrq	3ea8a610d6	fix STT	2024-11-13 14:27:15 -06:00
mrq	910033343c	overhauled how the right resp level / classifier gets picked to avoid cringemath	2024-11-13 13:31:17 -06:00
mrq	269648605e	move NAR-len rvq level 0 to separate embedding	2024-11-13 11:38:58 -06:00
mrq	29e45be0b4	tweaks to bucket sampling	2024-11-13 11:09:24 -06:00
mrq	b2eca271a8	ugh	2024-11-13 10:35:44 -06:00
mrq	be83ddabaa	better causal-ness for split loss calc, and also do masking for NAR-len for it	2024-11-13 10:17:52 -06:00
mrq	6b76419123	ugh	2024-11-13 09:54:20 -06:00
mrq	ad7cfffc00	NAR-len RVQ-0 was being trained causally.............	2024-11-13 09:43:50 -06:00
mrq	976ee87f6f	resume iteration step in tqdm trainer, warn to logger if the sampler state dict was invalidated	2024-11-13 09:09:28 -06:00
mrq	8286aa54c8	do not pass timestep token/embedding since it doesn't seem to matter at all after all, fixed training masking rate to 80% because a paper said so	2024-11-13 09:07:10 -06:00
mrq	caf721c67b	set it to zero because it'll make the stop token hide more often than not	2024-11-12 22:30:50 -06:00
mrq	0f2584eba7	new meme sampler PogChamp new meme sampler PogChamp (it sort of helps?)	2024-11-12 22:30:09 -06:00
mrq	663f07038d	haha... (do not create a token dropout/noise mask when not training (this sadly didn't fix NAR-len output))	2024-11-12 16:41:58 -06:00
mrq	b09328069e	actually do CFG sampling for base AR+NAR tasks	2024-11-12 13:42:39 -06:00
mrq	2495a7ef67	Fixed STT in the web UI	2024-11-12 12:49:53 -06:00
mrq	8927bad7bc	actually fixed rep pen (for ar and nar, it seems to help with nar unmasking)	2024-11-11 21:40:19 -06:00
mrq	ec92613847	actually pass input prompt length size to inference	2024-11-11 20:39:48 -06:00
mrq	b1df6a7bed	reverted rep pen sampler due to a regression	2024-11-11 20:35:08 -06:00
mrq	b1f4db39c8	threw in CFG sampling for normal model as well to experiment with	2024-11-11 20:27:38 -06:00
mrq	2f56696506	overhauled inference/sampler kwargs to stop being a bloated mess	2024-11-11 20:21:16 -06:00
mrq	354f8e059d	store dataset hash alongside state dict so it can be ignored if mismatched	2024-11-11 18:16:56 -06:00
mrq	f7b8b1e825	dropped subtrain dataloader since it's useless to duplicate	2024-11-11 17:00:49 -06:00
mrq	cf9df71f2c	use homebrewed caching system for dataloader paths / durations (I'm pretty sure I am now triggering OOM killers with my entire dataset used)	2024-11-11 16:32:08 -06:00
mrq	a748e223ce	tweaks	2024-11-11 12:40:41 -06:00
mrq	48490757da	fixes	2024-11-10 20:37:50 -06:00
mrq	9def34cd66	lol	2024-11-10 12:48:41 -06:00
mrq	9cb0b6901b	unified nar.py into ar_nar.py	2024-11-10 12:19:48 -06:00
mrq	a9d2faf2d7	all I can do now until I wait for the model to (re)train for pure NAR	2024-11-09 22:57:34 -06:00
mrq	ad7e290a5e	ugh (ROCm seems to silently clamp any token value >= logits.shape[-1] for loss calculation, while cuda will throw an assert, making it hard to find this dumb fuckup)	2024-11-09 19:40:02 -06:00
mrq	943fe70c10	I don't know why this fixes an assert thrown but it does	2024-11-09 19:04:13 -06:00
mrq	f50d92ba6c	Almost made a mistake	2024-11-09 18:12:54 -06:00
mrq	c6a38693a2	This better work	2024-11-09 18:04:59 -06:00
mrq	8b3d1cf70a	Something's Wrong	2024-11-09 15:07:43 -06:00
mrq	dcd5fecff3	some cleanup while I wait for the NAR-len to train to an acceptable state (currently it performs okay, but only on audio after 3 seconds or so)	2024-11-09 12:12:46 -06:00
mrq	69b0b3b854	set timestep tensor to whatever the time embedding's dtype is because it'll gripe under amp	2024-11-09 00:11:16 -06:00
mrq	5a09a5f6e9	I forgot about the time embedding...	2024-11-08 22:46:26 -06:00
mrq	811b15d280	I suppose I just have a shit training method since the sampler is as solid as I can get it...............	2024-11-08 22:05:41 -06:00
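
Commit cddf8ca814 above mentions sorting batches to reduce the number of padded tokens in batched inference. The following is a minimal sketch of that general idea, not the repository's code: `make_batches` and `padded_tokens` are hypothetical helper names, and the lengths are made-up values standing in for token counts.

```python
from typing import List, Sequence


def make_batches(lengths: Sequence[int], batch_size: int) -> List[List[int]]:
    """Group item indices into batches of similar length (hypothetical helper)."""
    # Sort indices by sequence length so neighbours in a batch are close in size.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padded_tokens(lengths: Sequence[int], batches: List[List[int]]) -> int:
    """Count the padding needed when each batch is padded to its longest item."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


# Illustrative token counts only; not taken from any dataset.
lengths = [12, 97, 15, 101, 14, 95]
naive_batches = [[0, 1, 2], [3, 4, 5]]        # batches in input order
sorted_batches = make_batches(lengths, 3)     # batches grouped by length

print(padded_tokens(lengths, naive_batches))   # 260
print(padded_tokens(lengths, sorted_batches))  # 14
```

Because the batches hold original indices, outputs can be written back into input order after inference.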

781 Commits
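
Commit 20b87bfbd0 describes storing metrics and only recalculating them if the output file is newer than the metrics file. A minimal sketch of that kind of modification-time check, assuming metrics are kept as JSON next to the output; `load_or_compute_metrics` and the `compute` callback are hypothetical names, not the repository's API.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def load_or_compute_metrics(
    output_path: Path,
    metrics_path: Path,
    compute: Callable[[Path], Dict[str, float]],
) -> Dict[str, float]:
    """Reuse stored metrics unless the output file is newer than them."""
    if (
        metrics_path.exists()
        and metrics_path.stat().st_mtime >= output_path.stat().st_mtime
    ):
        # Stored metrics are at least as recent as the output: reuse them.
        return json.loads(metrics_path.read_text())
    # Output changed (or no metrics yet): recompute and store.
    metrics = compute(output_path)
    metrics_path.write_text(json.dumps(metrics))
    return metrics
```

The comparison relies on filesystem modification times, so regenerating an output file is enough to invalidate its stored metrics.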