Training GPU offer #8
I have 2 Threadripper nodes, one with 2x4090 and one with 2x3090 NVLink. Would like to offer this for as long as is necessary to train a good model on LibriLight. I don't care if it's months, I just want a high-quality VALL-E model and it's frustrating that we don't have one yet. Let me know if interested.
desu, I would first see if:
For the latter, it's as simple as doing (and these steps are what I'd end up doing anyways):
- `./training/valle/`
- `deepspeed --module vall_e.train yaml="./training/valle/config.yaml"`
I can make a simple shell script to easily have everything set up in place after updating the provided dataset (I think the currently provided one is missing LibriLight's `medium` and `duplicate` datasets); a rough sketch is below.

I do not necessarily mind supervising training, but all that I would be needed for is just keeping an eye on the metrics and adjusting the LR (as I still refuse to touch an LR scheduler). I haven't used multiple nodes to train before, so I don't have a good idea on setting that up.
All that's left for training just seems to be having it run to help tidy things up and make it more consistent for zero-shot (it seems "okay" under the HuggingFace Space). However, it's currently about an epoch a week on either my 4070Ti or 7900XTX (and utilizing both isn't all that easy without multi-noding), and I don't have a good idea of how many epochs would still help before it's a fool's errand to continue training.
Yeah...
All the bullshit experiments should be out and done with, so this model should be the last one needed for an actual VALL-E model, for sure. No more sidetracking with different architectures or layouts, or gluing shit on in an attempt to save time.
I don't mind being the one to handle it at all. If there's no LR scheduler, I might want some guidance on tweaking the LR.
VALL-E X isn't quite there yet with that checkpoint, so yeah. Straight VALL-E is ok with me.
On the training side, I looked through the lifeiteng repo and it looks really straightforward to use for training on multiple GPUs with DDP, if your current checkpoint is compatible? I don't even think it's more than a few hours' work to get it multi-node. If you aren't thrilled with that plan, we can discuss.
My thought is also to pump up the dataset with LibriLight medium. Have you been training on your side with medium added?
Adjusting the LR is as simple as entering, for example, `lr 0.05` into the training window. The only caveat is having to remember to edit the training YAML.

With prodigyopt, I'm not too sure how imperative it is to touch the LR. Documentation is a bit scarce outside of "use a cosine annealing scheduler" (the issue with that being that I don't have a good idea of an optimal LR schedule) and people using it for LoRAs. I feel setting it smaller would help, like with a normal optimizer, but with my mini-test trainer the LR didn't have to be touched for it to overfit.
I will forewarn that, as the optimizer states "warm up", the loss will jump up a bit and should settle down afterwards. I might look into a way to add warm starting into the YAML, since the workaround for it right now would be to use the old weights from DeepSpeed with the latest optimizer states.
Distributed training (single node, multi-GPU) with my repo already works with no additional tweaks required, but DeepSpeed's sparse documentation on how to set up multi-node training isn't very helpful.
I'm also not sure how well additional things like my speaker sampler's state would fare under multi-node, so it may or may not be an issue.
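For what it's worth, the usual DeepSpeed multi-node setup is just a hostfile listing each node and its GPU count, passed to the launcher; a minimal sketch, assuming passwordless SSH between the two nodes (hostnames and slot counts below are only examples):

```bash
# hostfile: one line per node, "<hostname> slots=<number of GPUs>"
# (hostnames here are just examples)
cat > hostfile <<'EOF'
node-4090 slots=2
node-3090 slots=2
EOF

# launch from one node; DeepSpeed dispatches to the other over SSH
deepspeed --hostfile=hostfile --module vall_e.train yaml="./training/valle/config.yaml"
```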
Nope. There are quite a lot of fundamental differences. Even with subjugating the trainer to use my model's code, the last crux would be the dataset format it loads from, which was a bit of a pain to deal with when I was initially evaluating it as a base to expand off of.
It shouldn't take that much time; it's just a matter of whether you're hoping I'd have some wisdom on setting up DeepSpeed for multi-node training.
Mhm.
My current dataset is composed of `small` + `medium` + `duplicated`.
The current `data.h5` on the HuggingFace model repo, if I remember right, is composed of LibriTTS-R and a portion of LibriLight's `small` + `medium` (due to a dataset processing oversight). I recreated a `data.h5` that contains the "libre" portions of the dataset (LibriTTS-R and LibriLight sans `large`), but the gzipped file is still uploading to HuggingFace (something like ~95GiB => 40GiB), so that should have the dataset updated for anyone to use. After that's done, I'll have the setup script updated, and all that would be left is to finagle with multi-node training.

Alright, after a painful day of trying to upload the gzipped dataset twice, the latest libre dataset has been uploaded and the setup script is good to go, so all that's needed to get ready to resume training is: