vall-e

mrq/vall-e

Author	SHA1	Message	Date
mrq	8568a93dad	added WER/SIM-O metrics, added APOLLO but I need to test it	2024-12-10 20:13:21 -06:00
mrq	a6c745bafb	chinese (mandarin?) support added (I guess I don't need pinyin, but tone markers are handled), korean validated, vocab adjusted	2024-12-09 14:26:19 -06:00
mrq	3ef8894290	oops	2024-12-08 15:24:21 -06:00
mrq	1d460b9fe3	logic fixes, I feel like output is better? (also NAR can have a temperature, I imagine it couldn't because a causal mask was being passed to it for the longest time before I caught it a month ago)	2024-12-08 14:52:47 -06:00
mrq	0c5a458b00	deduce language per line to allow for a cheap way to allow for cross-lingual switching, kinda	2024-12-07 22:57:29 -06:00
mrq	a032ff588f	doc update, added automatically deducing language from a given text, also checks if the input is already phonemized text to allow direct control without being cringe (procrastinating adding WER/SIM-O)	2024-12-07 22:34:25 -06:00
mrq	5d80a2d0d4	fixed NAR-len issues with non-english maybe (langs weren't being passed), added interface to inference in batches through tts.batched_inference (no support for rolling context/prefixes because there's no way to do that), demo page uses batched inferencing now	2024-12-07 19:21:05 -06:00
mrq	1f54bf5b40	revert sageattn back to optional dependency because it's not on windows, force resize_modules on by default because I broke something	2024-12-07 17:09:39 -06:00
mrq	218d0e29fd	ugh (batchmean actually expects batch=seq_len, and not the actual batch)	2024-12-07 12:39:01 -06:00
mrq	61ed662856	ACTUALLY actually fix KD-loss (the -inf in the logits was caused by cringecode)	2024-12-07 12:31:54 -06:00
mrq	f97e8b0c7f	ACTUALLY do KD-loss because of an oversight with masked_select outputting 1D tensors that get softmax'd in total	2024-12-07 09:52:51 -06:00
mrq	34a66e1052	agnostified KD	2024-12-06 23:53:46 -06:00
mrq	953d3eb030	ugh	2024-12-06 22:35:30 -06:00
mrq	42fafbaaca	actually fixed knowledge distillation because of errant -inf logits causing problems and needed to be filtered (and splitting text language / output audio language because it helps)	2024-12-06 21:55:20 -06:00
mrq	23d402bf01	added knowledge distillation in the trainer (sadly it is not agnostic because of the grave mistake of further processing the batch within the forward pass, so subsequent calls do not match......)	2024-12-05 23:05:52 -06:00
mrq	4e21df8092	oops	2024-12-04 21:24:22 -06:00
mrq	93d27be539	rolling context finally (use last N utterances as the prefix for the next gen), option to split input text prompt by sentences instead of lines (or no splitting)	2024-12-04 20:31:44 -06:00
mrq	9dff68c0c5	NAR-len tweaks (remasks a small amount of tokens per step, it seems to help with reducing the number of steps needed some of the time?, disable CFG for the first half to speed things up)	2024-12-04 09:30:29 -06:00
mrq	cf97560e70	minimum CFG of 3 for NAR-len because it seems the model will auto-default to NAR-len now	2024-12-03 19:40:05 -06:00
mrq	ca31da0a95	sageattn (forgot to bother with testing this the other day, seems fine)	2024-12-03 15:14:57 -06:00
mrq	31ab90d84a	cringe code to convert to LlamaForCausalLM-happy weights + tokenizer dict (still need to write logic to actually use these weights for proper inferencing)	2024-12-03 10:18:58 -06:00
mrq	84a05acb6d	touch ups in docs	2024-12-02 19:10:42 -06:00
mrq	dcaf38b359	fixed training tqdm being stubborn	2024-11-23 09:45:23 -06:00
mrq	41d7c30ea5	added much cleaner non-causal mask generation	2024-11-22 19:43:32 -06:00
mrq	c99a74e834	actually generate a causal mask because it seems sometimes it does not actually generate one because it makes assumptions	2024-11-22 18:30:24 -06:00
mrq	ccee5fc11c	that was actually all pointless since sdpa always had an attention mask fed to it and does not need is_causal to implicitly generate one	2024-11-22 16:51:50 -06:00
mrq	4aa685e749	what has science done	2024-11-22 16:45:40 -06:00
mrq	147219a5e0	huge oversight in the attention masking... (I realized I have not been providing a non-causal mask to non-causal tasks)	2024-11-22 13:44:43 -06:00
mrq	24d888c47c	temporarily dropping support for xformers because it's breaking when using an attention mask (which I don't remember commenting out when being passed), default to not use wandb because it's being a pain when doing tests and not actual sessions	2024-11-22 11:29:12 -06:00
mrq	8aafae91fd	don't use timeembedding	2024-11-21 23:14:52 -06:00
mrq	2cef97e43f	cleanup	2024-11-21 23:08:43 -06:00
mrq	3fc0540f49	m	2024-11-21 15:07:46 -06:00
mrq	6845c447c9	added more harvard sentences to load from a text file	2024-11-21 13:18:11 -06:00
mrq	2a084544e8	moved duration padding for NAR-len to be a scalar instead (since it seems longer utterances need it much more so than shorter utterances)	2024-11-21 13:04:07 -06:00
mrq	6aee08f9c0	moved stuff in the web UI around (un-experimented the max NAR-len steps because it's kind of important to adjust this value for better sounding audio / quicker generated audio)	2024-11-20 20:37:33 -06:00
mrq	dfdba3f190	oops	2024-11-20 19:21:03 -06:00
mrq	cd6e9ba2f2	oops	2024-11-20 16:27:51 -06:00
mrq	1a73ac6a20	I cannot believe it's not actually called Wand DB (added wandb logging support since I think it would have been a much better way to look at my metrics)	2024-11-20 16:10:47 -06:00
mrq	67f7bad168	added mixed modality AR+NAR-len to generate a short prefix through the AR, then inference with said prefix through the NAR-len (need to experiment with it more to ensure that the masked off tokens are the only tokens getting updated)	2024-11-20 14:22:12 -06:00
mrq	db64e6cb59	dependency updates (gradio 5.x now works on my machine)	2024-11-20 12:33:01 -06:00
mrq	b1369e7824	better modality selection (pick AR+NAR by default for the ar+nar model, pick NAR-len by default for the nar-len model), lowered default CFG because it makes the AR+NAR output sped up (but can't be too low since it's required for the NAR-len)	2024-11-19 18:51:17 -06:00
mrq	190a917b3e	I did it.	2024-11-19 12:24:33 -06:00
mrq	0e621354e7	cleaned up classifier-free guidance logit processing (in order to try and cope with a bad nar-len model)	2024-11-19 10:30:05 -06:00
mrq	5ba80686e1	two weeks of agony concludes	2024-11-18 21:29:28 -06:00
mrq	2b29790173	oops	2024-11-18 14:12:26 -06:00
mrq	4a71981456	normalize sampler index by batch size (if not using batched sampler), add option to cap out utterances for a speaker, some other things	2024-11-18 12:46:50 -06:00
mrq	6cfdf94bf9	swap priority to use nar-len if available, added notes	2024-11-18 09:40:04 -06:00
mrq	069b27570f	set option to set training masking ratio (I don't think for tts a fixed masking ratio is beneficial since the magic of the AR+NAR is being able to still reference the prior sequence of tokens for predicting things)	2024-11-17 17:04:07 -06:00
mrq	88d840218d	default set cfg strength to 3.0 since the reference model is updated	2024-11-17 10:23:40 -06:00
mrq	a3e1fa3518	ugh	2024-11-17 09:28:33 -06:00

624 Commits