vall-e

mrq/vall-e

Author	SHA1	Message	Date
mrq	0e995dbf2c	is this my last cope (falling back to explicit duration prediction, as this regression just won't go away) (also the smaller model was lobotomized because of my ROCm setup having a botched SDPA for who knows why)	2025-04-02 17:01:24 -05:00
mrq	6ee505cffd	fixed dac	2025-03-12 23:17:27 -05:00
mrq	1d3290b023	could have sworn this worked before, might have broke it when i decoupled from omegaconf	2025-03-01 19:30:26 -06:00
mrq	17094b8002	reticulating splines	2025-03-01 17:48:51 -06:00
mrq	b640fabab5	borrowed muon since it might better work under deepspeed and not require cruft (even though it really does not like the masked-NAR), also make the masked-NAR faux-causal since it might better help out for cfg.model.version >= 7	2025-02-23 17:23:24 -06:00
mrq	953015748f	ugh	2025-02-07 20:49:28 -06:00
mrq	299cc88821	re-added amp encoding/decoding for audio, possible bad idea to ignore using amp instead if requested	2025-02-05 21:55:06 -06:00
mrq	79c504c278	cleaned up encode/decode functions to make them a little more coherent, added option to batch encode/decode (would have been very nice in the past, but this should speed things up for me when i fall for the latest meme codec)	2025-02-05 20:54:31 -06:00
mrq	84174c1c1b	oops	2025-02-05 10:25:03 -06:00
mrq	bb2ebe1ca2	fixed issues that may arise from updating transformers with attention, added nvidia/audio-codec-44khz backend support (by gutting everything necessary because I do NOT want to install more dependencies)	2025-02-04 20:30:07 -06:00
mrq	bbc2de3713	ugh	2024-11-05 11:50:05 -06:00
mrq	52299127ab	fix vall_e.emb.process	2024-10-08 20:00:34 -05:00
mrq	10df2ef5f3	fixed oversight where input audio does not resample (lol...)	2024-09-27 20:27:53 -05:00
mrq	56f25f7a9b	more stuff for similar-speaker prompt sampling (to-do: actually test if this works...)	2024-09-16 23:10:29 -05:00
mrq	32287710a2	moved prints to use logger, edited readme (fused_attn doesnt seem stable for training)	2024-08-29 13:27:16 -05:00
mrq	054d28573a	my DAC dataset again managed to only have some utterances with only 8 of 9 RVQ levels, this fixes an oversight from it	2024-08-09 21:18:01 -05:00
mrq	eac353cd0b	busy work and cleanup while I wait for 1TB of audio to quantize... again.	2024-08-06 20:23:33 -05:00
mrq	f284c7ea9c	do mixed-precision for AMP inside the compress function itself, because the loudness function gripes when using a float16 (non-power of 2 lengths) or bfloat16 (something about views for bfloat16)	2024-08-06 15:08:37 -05:00
mrq	b6ba2cc8e7	tweaked vall_e.emb.process to instead process audio one file at a time instead of all the files for a given speaker to avoid OOMing on less-memory-filled systems with --low-memory	2024-08-06 14:24:40 -05:00
mrq	9710b06b74	tweaks and things	2024-08-06 08:17:25 -05:00
mrq	75b04686f8	added prom-less training / inferencing, some other things	2024-07-22 19:36:07 -05:00
mrq	28a674e0f1	fixes...	2024-07-18 23:25:32 -05:00
mrq	bccbb77a1a	added option to either naively concat codes to concat audio waveforms (prior behavior) or to decode => concat => encode instead (although this only currently happens for prom sampling if an utterance is too small)	2024-07-18 16:48:41 -05:00
mrq	7b210d9738	sanity cleanup	2024-07-04 15:58:08 -05:00
mrq	1ecf2793f4	(commented-out) support for facebookresearch/AudioDec, but support really didn't wow me (so I commented it out until I figure out why my output audio is super crusty with AudioDec)	2024-07-04 15:40:51 -05:00
mrq	b21f74a5c5	added summing of external embeddings (at this point i dont think any amount of cope bandaids will get DAC to train nicely, I think the RVQ levels the NAR tends to add too much noise if they're not accurate)	2024-06-29 23:42:30 -05:00
mrq	793ccb16fb	ugh	2024-06-29 22:14:35 -05:00
mrq	2808f881c8	cleaned up subjugated audio embedding into a flag, flag can also have it include the original, underlying embedding as well (it seems to do better when set to inclusive)	2024-06-29 21:46:35 -05:00
mrq	ec5eaebcbc	experimental method of using DAC's quantizer "embeddings" to see if it helps with model quality	2024-06-29 19:46:11 -05:00
mrq	234f9efc6e	ugh	2024-06-09 11:39:43 -05:00
mrq	ddbacde0d1	DAC just doesn't work well enough......	2024-05-25 11:07:52 -05:00
mrq	74e531d391	ugh	2024-05-18 12:02:56 -05:00
mrq	5eb5db7f7f	just don't use DAC 24Khz, it's bad	2024-05-12 13:41:17 -05:00
mrq	230da8b559	should be the final things to scramble around for, DAC's 24KHz model is unusable for this, but both encodec's 24KHz and DAC's 44KHz work	2024-05-12 13:22:08 -05:00
mrq	2437a86efa	ugh	2024-05-12 13:02:15 -05:00
mrq	4f1593c8db	a bunch of shit to salvage my old encodec-quantized audio because dac-encoded audio just does not want to converge	2024-05-12 10:17:29 -05:00
mrq	14709ac67f	ughh	2024-05-12 07:30:59 -05:00
mrq	c4b696ebeb	oops	2024-05-09 22:33:40 -05:00
mrq	0d5d545a40	crammed in DAdaptation (doesn't seem worth it) and ScheduleFree (forgot I wanted to weeks ago, seems promising), optimization wrapper cleanup, test trainer changes, etc.	2024-05-09 20:28:20 -05:00
mrq	c6e0f905b5	final tweaks (again) before training restarts	2024-05-08 02:11:38 -05:00
mrq	215800484d	correcting my wrong of assuming I could just use raw 24Khz audio in the 44Khz DAC without too much of an issue (there are issues)	2024-05-04 23:49:15 -05:00
mrq	9f738fbd5b	seems I actually don't need RVQ bins 9-32 with the 24Khz DAC model........ (time to requantize my audio...)	2024-05-04 23:09:18 -05:00
mrq	a8ffa88844	it slipped my mind that technically DAC can be used at any sample rate, since it models waveforms; make it a config YAML option to allow this behavior	2024-04-19 18:36:54 -05:00
mrq	8214aa23d7	converting over to a different intermediary dataset format	2024-04-18 21:24:06 -05:00
mrq	4f5c9e518a	actually use the passed-through sample rate from encode for DAC because it does its own resampling I guess	2024-04-18 13:32:41 -05:00
mrq	2e9e6e68f7	Forgot I need to use the DAC's 44K model because 24K model has 32 codebooks instead of 9.	2024-04-17 20:59:25 -05:00
mrq	5ff2b4aab5	finally swallowing the Descript-Audio-Codec pill (I guess I'm going to have to regenerate my entire dataset)	2024-04-17 20:39:35 -05:00
mrq	545162195b	deprecate sole AR/NAR model by only keeping the AR+NAR (the beauty of no one using this is that I can break compat as much as I want), add tone token for when I classify my dataset with tone/emotion in the future, some other things	2024-04-15 19:54:32 -05:00
mrq	09cda7d3f9	added sampling by speaker group name (might be better to de-emphasize the LibriVox/Audiobooks that are in large numbers, and emphasize the smaller pools), log cleanup	2023-10-16 19:30:38 -05:00
mrq	2bc2d08b09	(need to verify) added modifying model size and config bool to align with VALL-E continuous' methodology	2023-09-01 17:19:34 -05:00

61 Commits