Commit Graph

112 Commits

Author SHA1 Message Date
mrq
fc576010ce wrapped saving the checkpoint in a try/catch so I can stop waking up to the damn trainer crashing because it ran out of disk space; I'd much rather it keep training to give me time to eventually clear up disk space rather than it silently restarting on its own 2023-08-20 06:29:17 -05:00
mrq
2d1a9f10c0 nightmare of spaghetti that might break compat; mechanism to increase RVQ bins of an existing model without retraining, keeps sampled proms/resps at max RVQ level and trim off excess levels according to what model receives them, some other things I already forgot (I really hope no one else has weights being baked right now) 2023-08-19 15:06:33 -05:00
mrq
03872b823f why did I type rglob, another 10 bucks down the drain... 2023-08-17 00:11:29 -05:00
mrq
b5f247aa11 just nuked about 9 hours of progress because I didn't make sure it pruned only on the global leader 2023-08-16 23:37:52 -05:00
mrq
d7152fc7b9 added pruning of old checkpoints if specified (cfg.trainer.keep_last_checkpoints) 2023-08-16 20:12:12 -05:00
mrq
d7deaf6def distributed training works now (hopefully) 2023-08-13 22:07:45 -05:00
mrq
d89568a96e some fixes for the local framework 2023-08-05 03:22:15 +00:00
mrq
5970f254e3 some fixes for the local framework 2023-08-05 02:17:30 +00:00
mrq
012f54b7f1 another classic commit so i can copy it to another machine to gut out things and use the trainer bits for a side project that I should really get around to working on sooner than later 2023-08-04 14:21:30 -05:00
mrq
0a524f1d59 reticulating splines 2023-08-03 21:39:00 -05:00
mrq
608c1970eb ops 2023-08-03 20:36:19 -05:00
mrq
c85101403f big cleanup 2023-08-03 20:26:36 -05:00