Commit Graph

68 Commits

Author SHA1 Message Date
mrq 5fe01ffc6c more notes / re-enabled top-k/p samplers for new implementation 2025-04-19 14:04:34 -05:00
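The re-enabled top-k/top-p samplers follow the standard filtering recipe; a minimal PyTorch sketch of it (the function name and defaults are illustrative, not this repo's actual API):

```python
import torch

def top_k_top_p(logits: torch.Tensor, top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    # top-k: mask everything below the k-th largest logit
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # top-p (nucleus): drop the low-probability tail of the sorted distribution
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        remove[..., 1:] = remove[..., :-1].clone()  # shift so the cutoff token survives
        remove[..., 0] = False
        logits = logits.masked_fill(remove.scatter(-1, sorted_idx, remove), float("-inf"))
    return logits
```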
mrq d9e18037cc new implementation tweaks and fixes to make it actually better (there were a lot of badwrong things being done that harmed the output quality, will evaluate the model further) 2025-04-18 20:36:44 -05:00
mrq 98d1d8cb1e added some more notes, tweaks (RIP DAC, it's over) 2025-04-17 20:24:40 -05:00
mrq 6d42c9ae23 how foolish of me, not having a softmax as float32 (maybe addresses an emergent regression where bfloat16 training shits the bed while float16+loss scaling doesn't) 2025-04-07 22:51:52 -05:00
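Upcasting the softmax to float32 is a common stability fix: bfloat16 has so few mantissa bits that the exponentiate-and-normalize step can degenerate, while float16 plus loss scaling sidesteps the problem differently. A minimal sketch of the idea (not the repo's exact code):

```python
import torch

def stable_softmax(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # upcast so exp/normalize happens at full precision, then cast back
    # to whatever dtype the surrounding autocast region expects
    return torch.softmax(logits.float(), dim=dim).to(logits.dtype)
```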
mrq d6cd848c32 goodbye nvidia/audio-codec-44khz, crossed fingers for DAC again 2025-04-06 21:05:29 -05:00
mrq 2e93438867 reintroduced sampler_type = speaker because I think this might salvage the nemo model to have better speaker similarities 2025-04-03 19:01:10 -05:00
mrq 0e995dbf2c is this my last cope (falling back to explicit duration prediction, as this regression just won't go away) (also the smaller model was lobotomized because of my ROCm setup having a botched SDPA for who knows why) 2025-04-02 17:01:24 -05:00
mrq 6ae282e090 re-added the noise dataloader sampler for the old implementation's other tasks that require it 2025-03-28 15:07:06 -05:00
mrq 90b3509404 I'll just cope and say I cannot apply segmented attention masks to the smaller model as it's too trained on not doing it, and the regression came from dumb python aliasing rules 2025-03-27 13:27:51 -05:00
mrq 2fd82a7a22 cannot get segmented mask to actually work without gradients exploding (need to find a different way to do duration prediction...) 2025-03-27 00:51:41 -05:00
mrq 4d777b5618 add remark that segmented attention actually might be broken (for some reason this only emerged recently, need to investigate) 2025-03-26 12:08:47 -05:00
mrq 8641c87611 nothing could go wrong part 2 (reverted and rewrote commits since there was a nasty regression) 2025-03-25 23:06:16 -05:00
mrq aa8b32d97e added more notes (although I could have sworn I had more notes that I can't recall) 2025-03-25 18:53:06 -05:00
mrq df5b870908 added remark about not using sliding attention 2025-03-22 12:44:34 -05:00
mrq 9a7458cf17 fixed inferencing since I did delete the len_emb, some more notes on the model since it seems I just had bad experimental settings 2025-03-19 22:41:48 -05:00
mrq 81acd565b3 re-enable these 2025-03-18 20:59:33 -05:00
mrq b0dba9db07 this may bite me in the ass 2025-03-17 21:46:50 -05:00
mrq 2dfef693c4 comments for clarity 2025-03-16 11:30:23 -05:00
mrq 9cfbf94b1c config-ify the len_loss_factor 2025-03-14 20:30:48 -05:00
mrq ba5f3d19b4 use the FSQ-targeted encoder/decoder wholly as it works for EnCodec too, while the RVQ-targeted encoder/decoder doesn't (and some notes) 2025-03-12 22:47:19 -05:00
mrq 5c512717a6 len prediction for new model (and remove logit normalization since it kills inferencing) 2025-03-11 20:33:09 -05:00
mrq 5cd71ef238 QoL so I can stop having to manually inject different configs 2025-03-06 14:48:14 -06:00
mrq 2fb2b732fc wow that was fast 2025-03-04 23:17:18 -06:00
mrq 0451f75e33 now that the new model seems a little more promising, I can re-document things non-cynically 2025-03-03 13:21:41 -06:00
mrq 3f1070f575 tweaks 2025-03-02 22:36:25 -06:00
mrq 4afa4ccce5 at wits' end (perhaps the semantic token approach is the toughest pill to swallow) 2025-03-01 21:03:25 -06:00
mrq a174c33db6 a gorillionth time's the charm (aka: the encoder/decoder pill is a tough pill to swallow) 2025-02-28 17:56:50 -06:00
mrq eff180248c decoupled llama backend to avoid any funny changes from transformers, removed other backends since I don't think I'll ever bother using them 2025-02-27 19:00:37 -06:00
mrq 95da4e9405 made muon actually work by actually utilizing param groups (thanks APOLLO for reminding me this is the sane way to handle this split) 2025-02-26 10:39:13 -06:00
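Muon-style optimizers only apply their orthogonalized update to 2D weight matrices, so the usual pattern is to split parameters into groups and let an AdamW-style update handle the rest. A hedged sketch of such a split (the `use_muon` group flag and the name filters are assumptions, not necessarily this repo's exact scheme):

```python
import torch

def build_param_groups(model: torch.nn.Module) -> list[dict]:
    # Muon only makes sense for 2D weight matrices; embeddings, output
    # heads, norms, and biases fall back to AdamW-style updates
    muon, fallback = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim == 2 and "embed" not in name and "head" not in name:
            muon.append(p)
        else:
            fallback.append(p)
    # "use_muon" is a hypothetical per-group flag; implementations vary in
    # how they mark which group receives the orthogonalized update
    return [
        {"params": muon, "use_muon": True},
        {"params": fallback, "use_muon": False},
    ]
```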
mrq 92139b6da9 additional cruft, added a note in documentation to be aware of NUMA node topology when running vall_e.emb.process with more than one process 2025-02-18 19:56:30 -06:00
mrq 0dc49ef4d5 documentation update while I wait for more audio (between 4 and 8 seconds per utterance) to quantize for nvidia/audio-codec-44khz (I was foolish to think I could get something serviceable with just 4 seconds max for an utterance) 2025-02-15 17:42:06 -06:00
mrq 04fef5dad5 agony 2025-02-12 00:18:24 -06:00
mrq 1c0ed6abac added notes on this unfruitful experiment 2025-02-11 16:21:43 -06:00
mrq 9fa87c417a added option to use raw text rather than the IPA phonemes (it requires a model trained on raw text) 2025-01-06 00:10:43 -06:00
mrq 9b0d2ccbe1 2024-12-26 21:42:17 -06:00
mrq 59bf6b8b33 exposed additional tasks (ns, sr, vc) (vc is experimental) 2024-12-20 11:15:29 -06:00
mrq 8515038968 imagine my disappointment when the epoch finished just for it to throw an exception 2024-12-16 18:28:01 -06:00
mrq f41251f648 more fixes for local engine backend 2024-12-12 14:38:42 -06:00
mrq 8568a93dad added WER/SIM-O metrics, added APOLLO but I need to test it 2024-12-10 20:13:21 -06:00
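WER here is the usual word-level Levenshtein distance normalized by reference length; a self-contained sketch of that metric (SIM-O additionally needs a speaker-embedding model and is omitted):

```python
def wer(reference: str, hypothesis: str) -> float:
    # word error rate: word-level edit distance / reference word count
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[-1][-1] / max(1, len(ref))
```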
mrq a6c745bafb Chinese (Mandarin?) support added (I guess I don't need pinyin, but tone markers are handled), Korean validated, vocab adjusted 2024-12-09 14:26:19 -06:00
mrq a032ff588f doc update, added automatically deducing language from a given text, also checks if the input is already phonemized text to allow direct control without being cringe (procrastinating adding WER/SIM-O) 2024-12-07 22:34:25 -06:00
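Detecting already-phonemized input can be as simple as a character-range check; the heuristic below is a hypothetical approximation of the idea, not the repo's implementation:

```python
def looks_phonemized(text: str) -> bool:
    # hypothetical heuristic: IPA Extensions (U+0250..U+02AF) and spacing
    # modifier letters (stress/length marks, U+02B0..U+02FF) rarely appear
    # in ordinary orthographic text
    return any(0x0250 <= ord(ch) <= 0x02FF for ch in text)
```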
mrq 93d27be539 rolling context finally (use last N utterances as the prefix for the next gen), option to split input text prompt by sentences instead of lines (or no splitting) 2024-12-04 20:31:44 -06:00
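Rolling context amounts to keeping a sliding window of the last N generated utterances and prepending it as the prompt for the next one; a hedged sketch (the `tts` call signature is hypothetical):

```python
from collections import deque

def synthesize_document(tts, sentences, max_history: int = 4):
    # keep the last N generations and feed them back in as the prefix
    history = deque(maxlen=max_history)
    clips = []
    for sentence in sentences:
        clip = tts(text=sentence, prefix=list(history))  # hypothetical signature
        history.append(clip)
        clips.append(clip)
    return clips
```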
mrq 9dff68c0c5 NAR-len tweaks (remasks a small amount of tokens per step, which seems to help reduce the number of steps needed some of the time; disables CFG for the first half to speed things up) 2024-12-04 09:30:29 -06:00
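One way to realize "remask a small amount per step": each step fills the currently masked positions, remasks only the least-confident few for the next step, and skips classifier-free guidance during the first half of the schedule. Everything below (call signatures, defaults) is illustrative, not the repo's code:

```python
import torch

def demask(model, tokens, mask, steps=25, remask_p=0.05, cfg_scale=3.0):
    # tokens: (seq,) long; mask: (seq,) bool marking still-masked positions
    for step in range(steps):
        logits = model(tokens)                      # hypothetical conditional call
        if step >= steps // 2:                      # CFG only in the second half
            uncond = model(tokens, drop_cond=True)  # hypothetical unconditional call
            logits = uncond + cfg_scale * (logits - uncond)
        conf, sampled = logits.softmax(dim=-1).max(dim=-1)
        tokens = torch.where(mask, sampled, tokens)  # only masked slots change
        # remask a small fraction of the least-confident freshly filled slots
        k = max(1, int(remask_p * tokens.numel()))
        worst = conf.masked_fill(~mask, float("inf")).topk(k, largest=False).indices
        mask = torch.zeros_like(mask)
        mask[worst] = True
    return tokens
```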
mrq ca31da0a95 sageattn (forgot to bother with testing this the other day, seems fine) 2024-12-03 15:14:57 -06:00
mrq 31ab90d84a cringe code to convert to LlamaForCausalLM-happy weights + tokenizer dict (still need to write logic to actually use these weights for proper inferencing) 2024-12-03 10:18:58 -06:00
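Converting to LlamaForCausalLM-compatible weights is mostly a state-dict key remap; in the sketch below the source prefixes are made up, while the targets follow HF's standard Llama key layout:

```python
def remap_to_llama(state_dict: dict) -> dict:
    # hypothetical source prefixes on the left; standard HF
    # LlamaForCausalLM key prefixes on the right
    rules = {
        "model.embedding.": "model.embed_tokens.",
        "model.blocks.":    "model.layers.",
        "classifier.":      "lm_head.",
    }
    remapped = {}
    for key, weight in state_dict.items():
        for src, dst in rules.items():
            if key.startswith(src):
                key = dst + key[len(src):]
                break
        remapped[key] = weight
    return remapped
```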
mrq 84a05acb6d touch-ups in docs 2024-12-02 19:10:42 -06:00
mrq 67f7bad168 added mixed modality AR+NAR-len to generate a short prefix through the AR, then inference with said prefix through the NAR-len (need to experiment with it more to ensure that the masked-off tokens are the only tokens getting updated) 2024-11-20 14:22:12 -06:00
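The mixed-modality pass described here reduces to: let the AR commit to a short prefix, then hand that prefix to the NAR-len as frozen context while the remainder stays masked. A hedged sketch with hypothetical call signatures:

```python
import torch

def mixed_ar_nar(ar, nar, prompt, total_len, prefix_len=16, mask_token=0):
    # 1) the AR commits to a short prefix (hypothetical .generate signature)
    prefix = ar.generate(prompt, max_tokens=prefix_len)
    # 2) the NAR-len fills in the rest; the prefix is excluded from the mask,
    #    so only the masked-off positions should ever be rewritten
    tokens = torch.full((total_len,), mask_token, dtype=torch.long)
    tokens[:prefix_len] = prefix
    mask = torch.ones(total_len, dtype=torch.bool)
    mask[:prefix_len] = False
    return nar.demask(prompt, tokens, mask)  # hypothetical NAR-len entry point
```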
mrq efeb55e1b7 documentation update 2024-11-19 19:19:34 -06:00
mrq 190a917b3e I did it. 2024-11-19 12:24:33 -06:00
mrq 5ba80686e1 two weeks of agony concludes 2024-11-18 21:29:28 -06:00