vall-e

Author	SHA1	Message	Date
mrq	6a733eb2ed	changed torch.Tensor().to(device, dtype) to just torch.tensor(..., device, dtype) because it's been bothering my autism that I'm creating tensors then converting rather than creating with the right device/dtype, some 'optimization' to compile the model but it doesnt seem to do anything useful	2024-08-03 22:10:21 -05:00
mrq	97c5241bef	fixes, throw an exception when using NAR only model with non-unified position IDs, since for some reason it outputs garbage for the NAR	2024-08-02 22:25:49 -05:00
mrq	4456d3172b	that's what I get for testing without hdf5 on my previous machine....	2024-08-02 20:44:01 -05:00
mrq	ce8bb1e4f7	sanity cleanups with weird off-by-one-ness, cleaned up and validated vall_e.models.experimental works again	2024-07-27 15:36:05 -05:00
mrq	682e4387dc	oops (fixed proms being erased from a config oversight)	2024-07-25 12:39:57 -05:00
mrq	75b04686f8	added prom-less training / inferencing, some other things	2024-07-22 19:36:07 -05:00
mrq	491ae2a684	some insanity for sanity checks (some phonemes from phonemizing japanese are not in my tokenizer...)	2024-07-22 00:30:40 -05:00
mrq	e19aa643a6	cleaned up demo page creation, added option to pass in RVQ level sampling distribution for training	2024-07-21 19:12:03 -05:00
mrq	d87b492295	added rudimentary demo page creator (currently just embeds base64 wavs into the page, need to test not doing that)	2024-07-19 20:49:40 -05:00
mrq	28a674e0f1	fixes...	2024-07-18 23:25:32 -05:00
mrq	39f961abcd	test trainer (vall_e.models.ar_nar) tests some SpeechX features	2024-07-18 18:46:45 -05:00
mrq	83a0954f85	fixes for re-introducing SpeechX tasks (need to actually validate if these all do the right things)	2024-07-18 17:16:32 -05:00
mrq	bccbb77a1a	added option to either naively concat codes to concat audio waveforms (prior behavior) or to decode => concat => encode instead (although this only currently happens for prom sampling if an utterance is too small)	2024-07-18 16:48:41 -05:00
mrq	97e768601c	re-introducing SpeechX tasks (need to validate them all, everything works with base tts anyways)	2024-07-18 16:16:14 -05:00
mrq	3acc54df22	allow loading a different model within the web ui (apparently I did not have the web UI in the documentation)	2024-07-15 19:59:48 -05:00
mrq	312a8e3ead	add shuffle to samplers that can support it	2024-06-30 11:36:46 -05:00
mrq	bc2a6fa756	sanity cleanup: moved experimental features under its own thing	2024-06-30 10:37:33 -05:00
mrq	793ccb16fb	ugh	2024-06-29 22:14:35 -05:00
mrq	c4dd523b6f	change from chunk-slicing paths for distributed dataloader to instead interleave	2024-06-29 10:10:35 -05:00
mrq	dd40463803	limit eval size because the training batch size seems to be used for the eval dataloader, somehow (bandaid)	2024-06-29 09:11:28 -05:00
mrq	591d3ac848	have eval dataloader use eval batch size for batchedordersampler	2024-06-28 22:44:00 -05:00
mrq	83075c1505	sort duration buckets to ensure that paths sorted-by-duration are actually sorted by duration (because i didnt know that python dicts can have non-strings as keys), added batching samples based on total duration to ensure best training throughput	2024-06-28 22:28:54 -05:00
mrq	8fffb94964	backport fix from tortoise_tts with local trainer + loading state when training lora	2024-06-25 13:41:29 -05:00
mrq	19410a919e	ugh	2024-06-15 12:29:03 -05:00
mrq	d343bde09b	residual_in_fp32=False for mamba arch backends because it breaks the classifier (output projection / lm head / what-have-you) under AMP	2024-06-15 12:08:03 -05:00
mrq	31f71fa134	sampler update (some brainworm just never actually had a sampler for sample_type=path)	2024-06-14 16:55:40 -05:00
mrq	b3b67f34ac	added option to sort paths by durations to better group equally lengthed sequences together (and there was maybe a logic error from creating the samplers and then interleave-reordering paths, desyncing them, maybe)	2024-06-13 22:37:34 -05:00
mrq	cca542a4c0	ugh	2024-06-11 23:59:28 -05:00
mrq	65a8960305	option to split classifier per-level instead of sharing one (at this point I'm just scrambling to try and cope with training a DAC model, the NAR is being a pain)	2024-06-11 22:28:59 -05:00
mrq	234f9efc6e	ugh	2024-06-09 11:39:43 -05:00
mrq	132a02c48b	sanity cleanup, backup config yaml for each log file	2024-06-09 11:22:52 -05:00
mrq	4ade2b60ee	ugh	2024-06-06 21:57:11 -05:00
mrq	014e565c4b	tweaks	2024-06-04 20:41:13 -05:00
mrq	6d5bd0156a	fixes	2024-06-04 18:50:48 -05:00
mrq	ed3aeaf3a1	copy pasted from test to actual trainer	2024-06-04 18:40:30 -05:00
mrq	0aa01ba31a	forgot one crucial detail (you need the previous RVQ level to keep coherence between all RVQ levels) (experimental deinterleaved is a bit crusty though)	2024-06-04 18:30:30 -05:00
mrq	406ff7bbe1	re-implemented config.model.interleave for the HF-compat experimental method	2024-06-04 14:19:52 -05:00
mrq	c93d5863fd	fixes	2024-06-04 00:07:00 -05:00
mrq	934672252b	feverish cleanup	2024-06-03 21:28:49 -05:00
mrq	8cf176ab46	ugh	2024-06-01 10:46:42 -05:00
mrq	d0ebce6bac	ugh	2024-06-01 10:30:13 -05:00
mrq	74df2f5332	split sampler dict by global_rank, also handle splitting dataset paths by global_rank if sampler_type == path (because I do not trust DistributedSampler) (need to test)	2024-06-01 09:29:49 -05:00
mrq	ddbacde0d1	DAC just doesn't work well enough......	2024-05-25 11:07:52 -05:00
mrq	e3ef89f5aa	100x better for subtrain/eval to be by group instead	2024-05-19 16:40:14 -05:00
mrq	4bc7e5a6d1	fix loading without needing an hdf5 dataset already prepped (and some other incidental speedups during dataloader prep)	2024-05-18 07:14:26 -05:00
mrq	d88a5ca183	ugh	2024-05-16 07:25:33 -05:00
mrq	d9aabfa3ae	final tweaks, hopefully, again	2024-05-15 23:04:19 -05:00
mrq	2437a86efa	ugh	2024-05-12 13:02:15 -05:00
mrq	4f1593c8db	a bunch of shit to salvage my old encodec-quantized audio because dac-encoded audio just does not want to converge	2024-05-12 10:17:29 -05:00
mrq	14709ac67f	ughh	2024-05-12 07:30:59 -05:00
mrq	3774fcbdee	ugh	2024-05-11 22:58:38 -05:00
mrq	4d93a16ef7	might just be better to explicitly define prompt duration ranges, especially under a "train small contexts then increase it" training paradigm	2024-05-11 09:50:54 -05:00
mrq	0d5d545a40	crammed in DAdaptation (doesn't seem worth it) and ScheduleFree (forgot I wanted to weeks ago, seems promising), optimization wrapper cleanup, test trainer changes, etc.	2024-05-09 20:28:20 -05:00
mrq	c6e0f905b5	final tweaks (again) before training restarts	2024-05-08 02:11:38 -05:00
mrq	33b7f81b94	small cleanups	2024-05-04 22:37:22 -05:00
mrq	ffa200eec7	added option to specify frames per second for the given audio representation (Encodec is 75Hz, DAC is 41Hz (at 24K sources))	2024-05-04 12:05:41 -05:00
mrq	b5d1456a09	backwards compat for my shitty old weights (was testing if disabling AudioEmbedding summing magically made things better (it did not))	2024-04-29 22:14:01 -05:00
mrq	6a11bc9cb6	update tokenizer because, for some reason, it had the wrong order for the special tokens to where eos = unk	2024-04-29 09:09:26 -05:00
mrq	57810e4ba4	metadata only path (might drop HDF5 since its giving file sizes twice as large as my actual unpacked dataset)	2024-04-28 23:03:09 -05:00
mrq	caad7ee3c9	final tweaks, hopefully	2024-04-28 22:28:29 -05:00
mrq	ffc334cf58	added dataset transcription helper script (now I don't ever have to touch ai-voice-cloning) (to-do: unify scripts into the module)	2024-04-21 17:43:20 -05:00
mrq	071fb97777	dataset preparation script updates, caved and am using HF tokenizer now	2024-04-21 14:49:18 -05:00
mrq	8214aa23d7	converting over to a different intermediary dataset format	2024-04-18 21:24:06 -05:00
mrq	4f5c9e518a	actually use the passed-through sample rate from encode for DAC because it does its own resampling I guess	2024-04-18 13:32:41 -05:00
mrq	545162195b	deprecate sole AR/NAR model by only keeping the AR+NAR (the beauty of no one using this is that I can break compat as much as I want), add tone token for when I classify my dataset with tone/emotion in the future, some other things	2024-04-15 19:54:32 -05:00
mrq	9c198eb75a	added torchscale XMOE integration (because Mixtral 8x7B seems very promising and I want to see if it works)	2023-12-20 18:45:58 -06:00
mrq	0aa2a3cc07	evaluation/validation passes language ID during training (oops)	2023-10-29 12:00:40 -05:00
mrq	9a6040383e	make validation samplers ignore sampler type	2023-10-22 09:01:47 -05:00
mrq	3195026dba	fixed issue with the 'add another target audio to artificially create longer sequences' for HDF5 just duplicating the utterance initially sampled	2023-10-18 20:38:33 -05:00
mrq	09cda7d3f9	added sampling by speaker group name (might be better to de-emphasize the LibriVox/Audiobooks that are in large numbers, and emphasize the smaller pools), log cleanup	2023-10-16 19:30:38 -05:00
mrq	65f500083d	tweaks to try and get deepspeed quantized inferencing, validating bitsandbytes and deepspeed quantization, nothing seems to work	2023-10-12 22:21:43 -05:00
mrq	8740cdefc6	added initial support for languages (still testing, marked as model version 3), added experimental 'context extend by limiting the resp context' (untested)	2023-10-11 20:38:40 -05:00
mrq	6045cbce94	added experimental option to append utterances for training target (emphasis on experimental)	2023-10-11 17:32:45 -05:00
mrq	b4405c98ea	remove double spaces in the text phonemes (might have caused problems.........)	2023-10-10 19:18:24 -05:00
mrq	87db03dd93	trim the input prompt to 3 seconds when training NAR tasks (marked as experimental; the paper mentions doing so, but I don't know how much this would harm the retention heads)	2023-10-09 22:03:58 -05:00
mrq	893a610fad	cleanup, use deepspeed inferencing pathway if requested	2023-10-09 15:24:04 -05:00
mrq	27483e56f0	disabled preparing of SpeechX tasks, added dynamic temperature testing (to-do: test it, credited in the function)	2023-10-09 13:01:40 -05:00
mrq	82f02ae9b1	oops	2023-10-06 09:26:52 -05:00
mrq	d12877ee09	added option to set probability of selecting the AR during training under a monolithic AR+NAR, added some more to-dos while I have them in mind	2023-10-02 16:52:42 -05:00
mrq	a6bfe43590	added mirostat sampling (given a partially trained model, it got far decent output than I expected, need to test on a better trained model)	2023-09-18 18:55:41 -05:00
mrq	d07c63b9d8	unified more things with training the AR+NAR monolithic model	2023-09-12 15:54:41 -05:00
mrq	40ef34e1ca	this embedding class definitely works, and migrating from the previous embedding weights seems to work.	2023-09-11 14:13:42 -05:00
mrq	8a6c203277	added per-speaker samplers	2023-09-03 21:27:13 -05:00
mrq	922404285c	fixed segfault from tts-c task token exceeding being too big (inserted it in the hypothetical svc task token because in reality that is never ever going to be a feasible task to train against)	2023-09-02 19:25:43 -05:00
mrq	4613781e23	integrated plot script, added tts-c task token to help the model be able to mix between normal VALL-E and VALL-E continuous	2023-09-02 16:29:53 -05:00
mrq	71e68a8528	tweaked tts-continuous task	2023-09-02 13:39:17 -05:00
mrq	57db3ccfa8	shuffled VALL-E continuous as a task tts-c instead, logic fixes for it	2023-09-02 12:23:40 -05:00
mrq	2bc2d08b09	(need to verify) added modifying model size and config bool to align with VALL-E continuous' methodology	2023-09-01 17:19:34 -05:00
mrq	5c8694db8e	nasty bandaid if there's no validation dataset specified during training (for example, during finetunes)	2023-08-30 18:23:05 -05:00
mrq	7f4388e591	added total samples processed and tokens processed (len of text tokens + len of target response tokens)	2023-08-28 11:02:45 -05:00
mrq	87c4bfedba	added ability to mark models as disabled for training, and hotloading them for eval/validation (useful if training only one model, or training a model per GPU)	2023-08-27 12:26:12 -05:00
mrq	165a1154e0	Undo naive=False test flag, this shouldn't have made its way in	2023-08-26 22:00:43 -05:00
mrq	78378ed1ce	overhauled dataloading code to be marginally faster, mostly cleaned up, and can leverage a metadata json to help things out	2023-08-26 19:53:23 -05:00
mrq	0517d620b8	fixes with the local backend	2023-08-24 17:05:56 -05:00
mrq	22904a8639	more oversights fixed because I've been using a cached dataloader forever now and didn't catch these problems	2023-08-24 10:25:33 -05:00
mrq	5873c27f1a	ops	2023-08-24 09:20:47 -05:00
mrq	4585824cd3	tweaks, including exporting on save/quit	2023-08-23 16:43:03 -05:00
mrq	9c5a33bfd2	added repo with my weights so far	2023-08-22 13:09:44 -05:00
mrq	7b1b82e0e5	inferencing cleanup	2023-08-20 21:36:02 -05:00
mrq	a47029065b	I don't know if the lack of start/stop tokens being added was causing my inference tests to fail, but it seems better now	2023-08-20 19:21:54 -05:00
mrq	fc576010ce	wrapped saving the checkpoint in a try/catch so I can stop waking up to the damn trainer crashing because it ran out of disk space; I'd much rather it keep training to give me time to eventually clear up disk space rather than it silently restarting on its own	2023-08-20 06:29:17 -05:00
mrq	2d1a9f10c0	nightmare of spaghetti that might break compat; mechanism to increase RVQ bins of an existing model without retraining, keeps sampled proms/resps at max RVQ level and trim off excess levels according to what model receives them, some other things I already forgot (I really hope no one else has weights being baked right now)	2023-08-19 15:06:33 -05:00
mrq	f7f6d3bf6d	validated that SpeechX tasks cse and nse works, added a method to test each task by invoking `python3 -m vall_e.data --action=tasks --tasks='sr,se,cse,nse'`	2023-08-19 09:50:07 -05:00
mrq	6ca347e1e1	literally had a urethra moment before going to bed with a way to implement cse/nse tasks	2023-08-19 01:16:46 -05:00
mrq	8f42c578c9	setting up for allowing training for a partial amount of the speechx tasks (do NOT try this at home yet without a proper model, as performance is predicated on having a solid base vall-e model for the tasks	2023-08-19 00:16:08 -05:00
mrq	ae9d38aa31	forgot to have it pull from specified noise to the hdf5 dataset	2023-08-18 23:57:07 -05:00
mrq	77292c42f9	tested the training preparation for tasks ns, sr, and tse (I don't expect it to go well with only 2 RVQ bins)	2023-08-18 23:55:40 -05:00
mrq	bbb0563b3d	pseudocode polyfill stub some other flavor of working on adding the tasks	2023-08-18 22:22:13 -05:00
mrq	2a71486cb6	preparing for SpeechX extensions	2023-08-18 20:58:07 -05:00
mrq	ced31fd9b7	removed the sampler as it's very misleading	2023-08-18 14:47:48 -05:00
mrq	8e7f900210	forgot the =	2023-08-17 19:07:59 -05:00
mrq	3ff7cf8341	maybe fix evaluation dataset not being capped to cfg.evaluation.size	2023-08-17 18:56:37 -05:00
mrq	ee58db746f	actually make the evaluation dataset shuffled for sample_type=speaker	2023-08-17 15:04:45 -05:00
mrq	18403a3523	maybe fixes eval dataloader not shuffling under distributed	2023-08-17 13:41:53 -05:00
mrq	b5f247aa11	just nuked about 9 hours of progress because I didn't make sure it pruned only on the global leader	2023-08-16 23:37:52 -05:00
mrq	44c08d828e	added sample_type that samples from speakers to truly balance an epoch by speakers rather than the entire dataset and a sampler that tries to balance by speakers	2023-08-16 19:39:21 -05:00
mrq	277c759ab1	fixed issue with non-distributed training, oops	2023-08-14 21:42:35 -05:00
mrq	5fa86182b5	oops	2023-08-14 10:50:40 -05:00
mrq	d7deaf6def	distributed training works now (hopefully)	2023-08-13 22:07:45 -05:00
mrq	bf8cedc9dd	Rewrite init	2023-08-02 21:53:35 +00:00


220 Commits