mrq/vall-e

Author	SHA1	Message	Date
mrq	4e7d885542	lol	2025-02-28 18:06:41 -06:00
mrq	a174c33db6	a gorillionth time's the charm (aka: the encoder/decoder pill is a tough pill to swallow)	2025-02-28 17:56:50 -06:00
mrq	09d82a26fe	ugh	2025-02-28 01:06:38 -06:00
mrq	93feb5660f	do not like that	2025-02-27 23:59:56 -06:00
mrq	f4f435d7f5	when you already had these ideas to stabilize training but you just ignored them	2025-02-27 23:39:20 -06:00
mrq	0a45c9c042	fix attention backend not being used	2025-02-27 21:38:38 -06:00
mrq	b8e9f3d785	maybe this will work	2025-02-27 20:42:12 -06:00
mrq	01e96bafc9	ugh	2025-02-27 19:05:32 -06:00
mrq	eff180248c	decoupled llama backend to avoid any funny changes from transformers, removed other backends since i dont think i'll ever bother using them	2025-02-27 19:00:37 -06:00
mrq	ceecac6ffe	I think I made resp_parallel_training=True faster with loss factoring?	2025-02-26 23:13:32 -06:00
mrq	cbd4d7d7f4	ugh	2025-02-26 21:31:10 -06:00
mrq	2ea387c08a	segregated experimental changes into its own streamlined file to avoid breaking the existing model, and it can pivot to the cleaned up code if it actually works (nothing is working)	2025-02-26 21:26:13 -06:00
mrq	95da4e9405	made muon actually work by actually utilizing param groups (thanks APOLLO for reminding me this is the sane way to handle this split)	2025-02-26 10:39:13 -06:00
mrq	de27115bb7	there's something wrong with it on my 4xV100 rig......	2025-02-25 15:14:08 -06:00
mrq	db181f8e88	only do auto=equal for nemo as its an FSQ	2025-02-24 21:07:44 -06:00
mrq	a5a04c39ef	when the	2025-02-24 21:03:23 -06:00
mrq	918e0dbac1	small slop cleanup	2025-02-24 19:03:53 -06:00
mrq	0f39f4d7a1	lol	2025-02-24 17:51:35 -06:00
mrq	33d5a7109a	its a miracle i was able to get a semblance of audio with the naive AudioEncoder (now it interleaves properly)	2025-02-24 14:39:12 -06:00
mrq	8f5a3997bd	another experimental flag	2025-02-24 13:50:41 -06:00
mrq	b640fabab5	borrowed muon since it might work better under deepspeed and not require cruft (even though it really does not like the masked-NAR), also make the masked-NAR faux-causal since it might help out for cfg.model.version >= 7	2025-02-23 17:23:24 -06:00
mrq	8f3c3e01ee	oops	2025-02-23 12:09:56 -06:00
mrq	b39aaacd77	oops	2025-02-23 11:55:43 -06:00
mrq	3019c88799	separate mask token and stop token because this might cause issues	2025-02-23 11:36:32 -06:00
mrq	6634d07576	added muon optimizer through kludge hacks because it necessitates a second optimizer in tandem that seems to only sometimes work with deepspeed	2025-02-23 11:22:13 -06:00
mrq	67a6009555	(finally) added parallel AR for cfg.model.version >= 7 (nvidia/audio-codec-44khz is being a pain and it might require training purely AR first......)	2025-02-23 08:31:03 -06:00
mrq	ab0abd2b12	fixes fixes fixes (a quarter of my recently processed audio returned zero'd tensors......)	2025-02-22 09:07:33 -06:00
mrq	13c3a08853	nevermind thats slow	2025-02-14 16:35:17 -06:00
mrq	285e493b12	ugh..........	2025-02-14 16:24:34 -06:00
mrq	a65c8144f4	with the amount of tweaks I keep making I could have probably had the nvidia/audio-codec-44khz model realized already......	2025-02-13 18:38:40 -06:00
mrq	e3becec0e8	more better-er loss calc I suppose	2025-02-13 12:49:53 -06:00
mrq	e8f182b634	cleaned up loss calc code (it REALLY hates ignore_loss_for_inputs, but is fine with splitting with loss factors)	2025-02-13 09:35:27 -06:00
mrq	319ca09a4f	cleanup	2025-02-12 23:36:32 -06:00
mrq	b52c5c5d80	this seems to work in testing	2025-02-12 16:16:04 -06:00
mrq	e029a8804d	ironically none of this cruft gets the loss lower than the original way	2025-02-12 11:17:00 -06:00
mrq	4b31f5c808	this seems preferable	2025-02-12 00:36:50 -06:00
mrq	04fef5dad5	agony	2025-02-12 00:18:24 -06:00
mrq	075ffef68a	ugh	2025-02-09 13:02:51 -06:00
mrq	47eb498046	more tweaks	2025-02-06 23:26:26 -06:00
mrq	79c504c278	cleaned up encode/decode functions to make them a little more coherent, added option to batch encode/decode (would have been very nice in the past, but this should speed things up for me when i fall for the latest meme codec)	2025-02-05 20:54:31 -06:00
mrq	bb2ebe1ca2	fixed issues that may arise from updating transformers with attention, added nvidia/audio-codec-44khz backend support (by gutting everything necessary because I do NOT want to install more dependencies)	2025-02-04 20:30:07 -06:00
mrq	0841f366e8	I should really just grab modelling_llama wholesale (fix for the adapted attention class)	2025-01-28 21:55:05 -06:00
mrq	e5f9da2221	oops	2025-01-21 11:59:24 -06:00
mrq	69c1d2991f	updated mixtral backend (need this for something else)	2025-01-20 21:50:56 -06:00
mrq	1a26f789a5	added option to playback audio directly, removed no-phonemize option since I swear it worked in testing but it doesn't actually work	2025-01-12 21:52:49 -06:00
mrq	3ab11bdc7b	oops	2025-01-05 23:53:17 -06:00
mrq	b445f4abb6	experimental	2025-01-05 19:05:00 -06:00
mrq	2e6a7625e4	experimental	2025-01-05 12:47:03 -06:00
mrq	9b0d2ccbe1		2024-12-26 21:42:17 -06:00
mrq	59f56ad099	cleanup	2024-12-24 23:14:32 -06:00
mrq	82e8592f2a	working vall_e.cpp	2024-12-24 17:54:48 -06:00
mrq	497bdfc67b	more work (the wall is non-causal decoding......)	2024-12-22 20:11:31 -06:00
mrq	5f289db275	ugh	2024-12-22 16:15:24 -06:00
mrq	0d4329d2e3	sanity cleanup	2024-12-22 15:05:45 -06:00
mrq	353e478e68	agony	2024-12-21 22:52:10 -06:00
mrq	91caf00212	ugh	2024-12-20 17:13:37 -06:00
mrq	59bf6b8b33	exposed additional tasks (ns, sr, vc) (vc is experimental)	2024-12-20 11:15:29 -06:00
mrq	e7e7f48043	livid	2024-12-19 19:25:27 -06:00
mrq	c2c6d912ac	actually do speaker verification	2024-12-17 10:11:14 -06:00
mrq	c2e17e287b	really shoddy voice conversion implementation (it sort of works...)	2024-12-16 22:54:53 -06:00
mrq	8515038968	imagine my disappointment when the epoch finished just for it to throw an exception	2024-12-16 18:28:01 -06:00
mrq	4a65ac9eb7	oops	2024-12-15 17:21:51 -06:00
mrq	9a62e3b824	APOLLO cringe (doesn't want to work with deepspeed)	2024-12-12 00:31:58 -06:00
mrq	cddf8ca814	sort batches to try and reduce number of padded tokens in batched inference (also commented out F5 samples getting added to the demo page because I would have to regenerate them)	2024-12-11 22:45:38 -06:00
mrq	6468e5d124	lol	2024-12-11 19:10:32 -06:00
mrq	3ef8894290	oops	2024-12-08 15:24:21 -06:00
mrq	1d460b9fe3	logic fixes, I feel like output is better? (also NAR can have a temperature, I imagine it couldn't because it was having a causal mask passed to it for the longest time before I caught it a month ago)	2024-12-08 14:52:47 -06:00
mrq	5d80a2d0d4	fixed NAR-len issues with non-english maybe (langs weren't being passed), added interface to inference in batches through tts.batched_inference (no support for rolling context/prefixes because there's no way to do that), demo page uses batched inferencing now	2024-12-07 19:21:05 -06:00
mrq	61ed662856	ACTUALLY actually fix KD-loss (the -inf in the logits was caused by cringecode)	2024-12-07 12:31:54 -06:00
mrq	34a66e1052	agnostified KD	2024-12-06 23:53:46 -06:00
mrq	953d3eb030	ugh	2024-12-06 22:35:30 -06:00
mrq	42fafbaaca	actually fixed knowledge distillation because of errant -inf logits causing problems and needed to be filtered (and splitting text language / output audio language because it helps)	2024-12-06 21:55:20 -06:00
mrq	23d402bf01	added knowledge distillation in the trainer (sadly it is not agnostic because of the grave mistake of further processing the batch within the forward pass, so subsequent calls do not match......)	2024-12-05 23:05:52 -06:00
mrq	93d27be539	rolling context finally (use last N utterances as the prefix for the next gen), option to split input text prompt by sentences instead of lines (or no splitting)	2024-12-04 20:31:44 -06:00
mrq	9dff68c0c5	NAR-len tweaks (remasks a small amount of tokens per step, it seems to help with reducing the number of steps needed some of the time?, disable CFG for the first half to speed things up)	2024-12-04 09:30:29 -06:00
mrq	cf97560e70	minimum CFG of 3 for NAR-len because it seems the model will auto-default to NAR-len now	2024-12-03 19:40:05 -06:00
mrq	ca31da0a95	sageattn (forgot to bother with testing this the other day, seems fine)	2024-12-03 15:14:57 -06:00
mrq	84a05acb6d	touch ups in docs	2024-12-02 19:10:42 -06:00
mrq	dcaf38b359	fixed training tqdm being stubborn	2024-11-23 09:45:23 -06:00
mrq	41d7c30ea5	added much cleaner non-causal mask generation	2024-11-22 19:43:32 -06:00
mrq	c99a74e834	actually generate a causal mask because it seems sometimes it does not actually generate one because it makes assumptions	2024-11-22 18:30:24 -06:00
mrq	ccee5fc11c	that was actually all pointless since sdpa always had an attention mask fed to it and does not need is_causal to implicitly generate one	2024-11-22 16:51:50 -06:00
mrq	4aa685e749	what has science done	2024-11-22 16:45:40 -06:00
mrq	147219a5e0	huge oversight in the attention masking......... (i realized I have not been providing a non-causal mask to non-causal tasks)	2024-11-22 13:44:43 -06:00
mrq	24d888c47c	temporarily dropping support for xformers because it's breaking when using an attention mask (which i dont remember commenting out when being passed), default to not use wandb because it's being a pain when doing tests and not actual sessions	2024-11-22 11:29:12 -06:00
mrq	8aafae91fd	dont use timeembedding	2024-11-21 23:14:52 -06:00
mrq	2cef97e43f	cleanup	2024-11-21 23:08:43 -06:00
mrq	67f7bad168	added mixed modality AR+NAR-len to generate a short prefix through the AR, then inference with said prefix through the NAR-len (need to experiment with it more to ensure that the masked off tokens are the only tokens getting updated)	2024-11-20 14:22:12 -06:00
mrq	b1369e7824	better modality selection (pick AR+NAR by default for the ar+nar model, pick NAR-len by default for the nar-len model), lowered default CFG because it makes the AR+NAR output sped up (but can't be too low since it's required for the NAR-len)	2024-11-19 18:51:17 -06:00
mrq	190a917b3e	I did it.	2024-11-19 12:24:33 -06:00
mrq	0e621354e7	cleaned up classifier-free guidance logit processing (in order to try and cope with a bad nar-len model)	2024-11-19 10:30:05 -06:00
mrq	5ba80686e1	two weeks of agony concludes	2024-11-18 21:29:28 -06:00
mrq	2b29790173	oops	2024-11-18 14:12:26 -06:00
mrq	6cfdf94bf9	swap priority to use nar-len if available, added notes	2024-11-18 09:40:04 -06:00
mrq	069b27570f	added option to set training masking ratio (I don't think for tts a fixed masking ratio is beneficial since the magic of the AR+NAR is being able to still reference the prior sequence of tokens for predicting things)	2024-11-17 17:04:07 -06:00
mrq	88d840218d	default set cfg strength to 3.0 since the reference model is updated	2024-11-17 10:23:40 -06:00
mrq	a3e1fa3518	ugh	2024-11-17 09:28:33 -06:00
mrq	23fdba0c98	tweaks and changes	2024-11-16 15:49:06 -06:00
mrq	2fbeacfe92	ugh	2024-11-14 22:18:33 -06:00
mrq	39096f8ff3	redid loss calculation to be cleaner, and position ID generation, and other things (I might need to train the NAR-len from scratch and not resume from an existing checkpoint.........)	2024-11-14 22:17:47 -06:00

472 Commits