vall-e

mrq/vall-e

Author	SHA1	Message	Date
mrq	2dd80a03ff	stuff for interfacing with the loss scaler value (because I want to cap it)	2025-03-06 17:07:29 -06:00
mrq	f4f435d7f5	when you already had these ideas to stabilize training but you just ignored them	2025-02-27 23:39:20 -06:00
mrq	6634d07576	added muon optimizer through kludge hacks because it necessitates a second optimizer in tandum that seems to only sometimes work with deepspeed	2025-02-23 11:22:13 -06:00
mrq	d4a6709fb4	stopgap cringe to get this training session working (it does not seem fruitful)	2025-02-11 13:45:09 -06:00
mrq	8515038968	imagine my disappointment when the epoch finished just for it to throw an exception	2024-12-16 18:28:01 -06:00
mrq	3dd31e74d1	finally figured out a clean way to handle "resuming" the tqdm bar	2024-12-14 18:44:43 -06:00
mrq	64c67160a3	tweaks	2024-12-13 19:00:35 -06:00
mrq	f41251f648	more fixes for local engine backend	2024-12-12 14:38:42 -06:00
mrq	34a66e1052	agnostified KD	2024-12-06 23:53:46 -06:00
mrq	42fafbaaca	actually fixed knowledge distillation because of errant -inf logits causing problems and needed to be filtered (and splitting text language / output audio language because it helps)	2024-12-06 21:55:20 -06:00
mrq	dcaf38b359	fixed training tqdm being stubborn	2024-11-23 09:45:23 -06:00
mrq	88d840218d	default set cfg strength to 3.0 since the reference model is updated	2024-11-17 10:23:40 -06:00
mrq	b2eca271a8	ugh	2024-11-13 10:35:44 -06:00
mrq	ad7cfffc00	NAR-len RVQ-0 was being trained causally.............	2024-11-13 09:43:50 -06:00
mrq	976ee87f6f	resume iteration step in tqdm trainer, warn to logger if the sampler state dict was invalidated	2024-11-13 09:09:28 -06:00
mrq	0f2584eba7	new meme sampler PogChamp new meme sampler PogChamp (it sort of helps?)	2024-11-12 22:30:09 -06:00
mrq	32287710a2	moved prints to use logger, edited readme (fused_attn doesnt seem stable for training)	2024-08-29 13:27:16 -05:00
mrq	ab673e0426	add cap for NAR-len training, to avoid any weird cases in early training where it'll just mess up and generate long lengths	2024-08-03 21:00:32 -05:00
mrq	4d2b88b164	throw exception if training, but no model is set to train (because i ran into this wondering what the hell was happening)	2024-08-03 20:51:23 -05:00
mrq	06e948aec1	suppress warning on exit about distributed not being cleaned up (because I updated my system)	2024-07-25 16:50:47 -05:00
mrq	8fffb94964	backport fix from tortoise_tts with local trainer + loading state when training lora	2024-06-25 13:41:29 -05:00
mrq	726a4b613f	naive, rudimentary DeepSpeed support (just live with the LoRA weights living with the original weights, they can be split later)	2024-06-17 13:17:24 -05:00
mrq	31f71fa134	sampler update (some brainworm just never actually had a sampler for sample_type=path)	2024-06-14 16:55:40 -05:00
mrq	132a02c48b	sanity cleanup, backup config yaml for each log file	2024-06-09 11:22:52 -05:00
mrq	4ade2b60ee	ugh	2024-06-06 21:57:11 -05:00
mrq	fcac9503e2	cleanup	2024-06-06 13:08:02 -05:00
mrq	880b4ecd1b	cleanup, putting some thoughts in comments before I forget about them	2024-06-05 19:50:06 -05:00
mrq	3cfc8a96bb	oops	2024-06-05 10:30:04 -05:00
mrq	c1fcd889d5	reverted automatically disabling split loss calc, since it seems that it's actually cacling loss on prom causes the oddities, maybe	2024-06-01 12:34:59 -05:00
mrq	8cf176ab46	ugh	2024-06-01 10:46:42 -05:00
mrq	d0ebce6bac	ugh	2024-06-01 10:30:13 -05:00
mrq	39bc019142	actually save per-rank sampler states	2024-06-01 09:46:32 -05:00
mrq	85f9684720	some cleanup	2024-05-25 17:46:52 -05:00
mrq	0b6499601b	sanitizing	2024-05-11 16:31:05 -05:00
mrq	bd0a36ba8d	I swear I keep seeing tqdm flicker back a number	2024-05-10 18:36:01 -05:00
mrq	277dcec484	apparently I got an error for trying to serialize an errant tensor that made its way into the json, this could be remedied easily with recursively traversing the dict and coercing any objects to primitives, but I'm tired and I just want to start training and nap	2024-05-04 12:33:43 -05:00
mrq	0427d8d076	logger broke for some reason, added flag to just tqdm.write instead, make cfg.bitsandbytes.bitnet==True yamls denoted since I'm sure they're not interoperable	2024-03-01 10:32:35 -06:00
mrq	cce929e136	nasty hotfix for transformer's Mixtral throwing an error when batch sizes > 1	2024-01-26 19:41:12 -06:00
mrq	32d4271ca8	fixed issue with training from scratch (oops)	2023-10-21 09:55:38 -05:00
mrq	09cda7d3f9	added sampling by speaker group name (might be better to de-emphasize the LibriVox/Audiobooks that are in large numbers, and emphasize the smaller pools), log cleanup	2023-10-16 19:30:38 -05:00
mrq	893a610fad	cleanup, use deepspeed inferencing pathway if requested	2023-10-09 15:24:04 -05:00
mrq	3db7e7dea1	implicitly load checkpoint if deepspeed checkpoint not found, updated setup script to grab the diskcached dataloader things	2023-10-06 10:02:45 -05:00
mrq	4abd6564d1	fixed training stats not loading from exported weights, a bit of a readme cleanup, updated example training yaml	2023-09-23 19:59:00 -05:00
mrq	9384900ce6	revert the frankensteined "train one model but hotload the other" since it kept loading the last exported weights and I'm not supporting this usecase anymore anyways	2023-09-22 13:04:17 -05:00
mrq	c0b25541e3	restructured some things with the model to remove dead weights	2023-09-20 19:10:59 -05:00
mrq	5ac119a6e7	added light web UI (need to port the telemetry disabling bandaids from aivc)	2023-09-09 16:17:20 -05:00
mrq	67617d7d69	also cull frozen_params in the params optimizer receives to reduce VRAM it consumes	2023-09-07 18:27:02 -05:00
mrq	8837bc34d7	added option to specify parameters to freeze per-model in YAML (because I need to see about committing atrocities with convering an AR into an AR+NAR)	2023-09-07 18:19:51 -05:00
mrq	ab5134f385	tweaks and fixes	2023-09-07 17:08:38 -05:00
mrq	e7a67410d1	oops	2023-09-07 09:14:03 -05:00


71 Commits