mrq/vall-e

Author	SHA1	Message	Date
mrq	7047fcc6e2	actually make deepspeed work with LoRAs	2024-06-17 13:55:37 -05:00
mrq	1d159b1476	updated export routine to split LoRA weights from the state dict (should work with deepspeed)	2024-06-17 13:28:18 -05:00
mrq	726a4b613f	naive, rudimentary DeepSpeed support (just live with the LoRA weights living with the original weights, they can be split later)	2024-06-17 13:17:24 -05:00
mrq	bd0bc10ec0	added LoRA policy to decide what layer of the model gets adapted based on simple inclusion/exclusion terms	2024-06-17 13:05:06 -05:00
mrq	45a39fb79f	very rudimentary lora support (no deepspeed support, tested training and saving but not loading yet)	2024-06-17 00:09:16 -05:00
mrq	a7a6e0ac76	validated that inferencing works, changed some defaults (NAR benefits from greedy sampling)	2024-06-09 17:11:38 -05:00
mrq	4ade2b60ee	ugh	2024-06-06 21:57:11 -05:00
mrq	fcac9503e2	cleanup	2024-06-06 13:08:02 -05:00
mrq	e50edc3b48	added a flag to convert to a HF compatible model on export by stitching things	2024-06-03 22:34:47 -05:00
mrq	934672252b	feverish cleanup	2024-06-03 21:28:49 -05:00
mrq	c2a436d368	somehow between training sessions grad_norm = None even though it worked before	2024-06-02 08:29:27 -05:00
mrq	827cf632e7	report current loss scale and adjust grad norm by loss scale (for deepspeed)	2024-06-01 10:44:32 -05:00
mrq	856545f8bb	nan loss detection (should have added it earlier), loss scaling for local backend + fp16	2024-05-11 22:23:29 -05:00
mrq	88e9b9caff	local ddp fix	2024-05-11 17:29:01 -05:00
mrq	71e373064f	remove redundant loss, tweak readme	2024-05-11 15:02:47 -05:00
mrq	8aa1b2dabf	documentation update	2024-05-04 21:03:46 -05:00
mrq	9d97eb5104	added FP8 support through `NVIDIA/TransformerEngine`, added RetNet_HF through `syncdoth/RetNet` (as an alternative to branch away from torchscale)	2024-04-08 20:14:51 -05:00
mrq	91062361af	tweaks	2024-03-01 20:38:06 -06:00
mrq	f3c59c3e7e	cleaner replacement code (because I realized BitNet had an implementation for it too), added calculating gradient norm and performing gradient clipping in local trainer (non-deepspeed)	2024-03-01 20:18:43 -06:00
mrq	3da1518ace	added Mistral (non-Mixtral) backend, useless optimization when not training, proper adjustment of the LR for Prodigyopt through d_coeff (maybe), recurrent sampling for LLaMA/Mistral/Mixtral backends (again, doesn't actually work)	2024-01-31 21:48:36 -06:00
mrq	9c198eb75a	added torchscale XMOE integration (because Mixtral 8x7B seems very promising and I want to see if it works)	2023-12-20 18:45:58 -06:00
mrq	6c51a629cc	resetting step count resets the samples processed and other metrics	2023-10-29 12:11:19 -05:00
mrq	09cda7d3f9	added sampling by speaker group name (might be better to de-emphasize the LibriVox/Audiobooks that are in large numbers, and emphasize the smaller pools), log cleanup	2023-10-16 19:30:38 -05:00
mrq	c0b25541e3	restructured some things with the model to remove dead weights	2023-09-20 19:10:59 -05:00
mrq	5ac119a6e7	added light web UI (need to port the telemetry disabling bandaids from aivc)	2023-09-09 16:17:20 -05:00
mrq	8837bc34d7	added option to specify parameters to freeze per-model in YAML (because I need to see about committing atrocities with converting an AR into an AR+NAR)	2023-09-07 18:19:51 -05:00
mrq	81b05dabb9	accurate epoch metric is now reported (based on samples processed / length of dataset's paths, rather than naive assumptions)	2023-09-03 08:03:36 -05:00
mrq	2f06166ddd	cleanups	2023-09-01 21:33:51 -05:00
mrq	e40c0d34a0	somewhat got recurrent forward working (it's as accurate as chunkwise forward: it's not accurate at all), added option to use AMP instead of blanket setting the weight's dtype	2023-09-01 20:58:29 -05:00
mrq	7f4388e591	added total samples processed and tokens processed (len of text tokens + len of target response tokens)	2023-08-28 11:02:45 -05:00
mrq	87c4bfedba	added ability to mark models as disabled for training, and hotloading them for eval/validation (useful if training only one model, or training a model per GPU)	2023-08-27 12:26:12 -05:00
mrq	0517d620b8	fixes with the local backend	2023-08-24 17:05:56 -05:00
mrq	736c077282	ops	2023-08-20 13:42:18 -05:00
mrq	b105f6211e	added ability to export weights mid-training to avoid CBT to yank the weights while the training script is running	2023-08-20 13:39:58 -05:00
mrq	fc576010ce	wrapped saving the checkpoint in a try/catch so I can stop waking up to the damn trainer crashing because it ran out of disk space; I'd much rather it keep training to give me time to eventually clear up disk space rather than it silently restarting on its own	2023-08-20 06:29:17 -05:00
mrq	2d1a9f10c0	nightmare of spaghetti that might break compat; mechanism to increase RVQ bins of an existing model without retraining, keeps sampled proms/resps at max RVQ level and trim off excess levels according to what model receives them, some other things I already forgot (I really hope no one else has weights being baked right now)	2023-08-19 15:06:33 -05:00
mrq	03872b823f	why did I type rglob, another 10 bucks down the drain...	2023-08-17 00:11:29 -05:00
mrq	b5f247aa11	just nuked about 9 hours of progress because I didn't make sure it pruned only on the global leader	2023-08-16 23:37:52 -05:00
mrq	d7152fc7b9	added pruning of old checkpoints if specified (cfg.trainer.keep_last_checkpoints)	2023-08-16 20:12:12 -05:00
mrq	d7deaf6def	distributed training works now (hopefully)	2023-08-13 22:07:45 -05:00
mrq	d89568a96e	some fixes for the local framework	2023-08-05 03:22:15 +00:00
mrq	5970f254e3	some fixes for the local framework	2023-08-05 02:17:30 +00:00
mrq	0a524f1d59	reticulating splines	2023-08-03 21:39:00 -05:00
mrq	608c1970eb	ops	2023-08-03 20:36:19 -05:00
mrq	c85101403f	big cleanup	2023-08-03 20:26:36 -05:00

45 Commits