Training GPU offer #8

Open
opened 2023-10-04 21:23:41 +00:00 by plasmator · 4 comments

I have 2 Threadripper nodes, one with 2x4090 and one with 2x3090 NVLink. I would like to offer these for as long as is necessary to train a good model on LibriLight. I don't care if it's months; I just want a high-quality VALL-E model and it's frustrating that we don't have one yet. Let me know if you're interested.

Owner

desu, I would first see if:

  • [VALL-E X](https://huggingface.co/spaces/Plachta/VALL-E-X/) is serviceable enough for you (I personally have my issues with it, but that's neither here nor there), as it's "complete".
  • you're able to handle resuming training yourself.

For the latter, it's as simple as doing (and these steps are what I'd end up doing anyways):

  • setting up a venv and installing this repo as a package
  • downloading the necessary files to prepare the training environment under `./training/valle/`
      + download [the latest weights](https://huggingface.co/ecker/vall-e/blob/main/ckpt/ar%2Bnar-retnet-8/fp32.pth), its [YAML](https://huggingface.co/ecker/vall-e/blob/main/config.ar_nar.yaml), and the latest release of the [libre dataset](https://huggingface.co/ecker/vall-e/blob/main/data.h5), then edit the YAML to increase the batch size
  • invoking the training script with `deepspeed --module vall_e.train yaml="./training/valle/config.yaml"`
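Put together, a rough sketch of those steps might look like the following (hedged: the editable `pip install -e .` and the exact file placement are my assumptions; the files themselves are the ones linked above):

```sh
# Rough sketch of the manual setup described above; assumptions flagged inline.
git clone https://git.ecker.tech/mrq/vall-e
cd vall-e
python -m venv venv && source ./venv/bin/activate
pip install -e .  # "installing this repo as a package"; editable install is an assumption
mkdir -p ./training/valle/
# Drop the linked fp32.pth, the YAML (saved as config.yaml, batch size bumped),
# and data.h5 under ./training/valle/.
deepspeed --module vall_e.train yaml="./training/valle/config.yaml"
```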

I can make a simple shell script to easily have everything set up in place after updating the provided dataset (I think the currently provided one is missing LibriLight's `medium` and `duplicate` datasets).

I do not necessarily mind supervising training, but all I would be needed for is keeping an eye on the metrics and adjusting the LR (as I still refuse to touch an LR scheduler). I haven't used multiple nodes to train before, so I don't have any idea how to set that up.

All that seems to be left for training is letting it run to tidy things up and make it more consistent for zero-shot (it seems "okay" under the [HuggingFace Space](https://huggingface.co/spaces/ecker/vall-e)). However, it's currently about an epoch a week on either my 4070Ti or 7900XTX (and utilizing both isn't all that easy without multi-noding), and I don't have an idea of how many epochs would help before it's a fool's errand to continue training.

> and it's frustrating that we don't have one yet

Yeah.............................................................................................

All the bullshit experiments should be out and done with, so this last model should be the last one for an actual VALL-E model, for sure. No more sidetracking with different architectures or layouts, or gluing shit on in an attempt to save time.

Author

I don't mind being the one to handle it at all. If there's no LR scheduler, I might want some guidance on tweaking LR.

VALL-E X isn't quite there yet with that checkpoint, so yeah. Straight VALL-E is ok with me.

On the training side, I looked through the lifeiteng repo, and it looks really straightforward to use for multi-GPU training with DDP, if your current checkpoint is compatible? I don't even think it's more than a few hours of work to get it multi-node. If you aren't thrilled with that plan, we can discuss.

My thought is also to pump up the dataset with LibriLight medium. Have you been training on your side with the medium added?

Owner

> I might want some guidance on tweaking LR

Adjusting the LR is as simple as entering, for example, `lr 0.05` into the training window. The only caveat is having to remember to edit the training YAML.

With [prodigyopt](https://github.com/konstmish/prodigy/), I'm not too sure how imperative it is to touch the LR. Documentation is a bit scarce outside of "use a cosine annealing scheduler" (the issue with that being I don't have a good idea of an optimal LR schedule) and people using it for LoRAs. I *feel* setting it smaller would help, like with a normal optimizer, but with my mini-test trainer the LR didn't have to be touched for it to overfit.
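For reference, a minimal sketch of the "Prodigy plus cosine annealing" pattern its README suggests (this is not this repo's trainer; the model, loss, and step count are placeholders):

```py
# Minimal sketch: Prodigy with cosine annealing, per Prodigy's own suggestion.
# Placeholder model and dummy loss; not the vall_e trainer.
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(16, 16)            # placeholder model
opt = Prodigy(model.parameters(), lr=1.0)  # Prodigy adapts the step size; lr=1.0 is its default
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10_000)

for step in range(10_000):
    opt.zero_grad()
    loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy objective, for illustration
    loss.backward()
    opt.step()
    sched.step()
```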

I will forewarn that, as the optimizer states "warm up", the loss will jump up a bit, and it should settle down afterwards. I might look into a way to add warm starting to the YAML, since the workaround for it right now would be to use the old weights from DeepSpeed with the latest optimizer states.
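As an illustration only (not how this repo wires it up), DeepSpeed's own `load_checkpoint` exposes flags for restoring the module weights while skipping optimizer/scheduler state, which is one way a warm start could be approximated:

```py
# Illustration only, not the repo's wiring: restore a checkpoint's module
# weights but let optimizer/scheduler state start fresh. Run this under the
# deepspeed launcher; the model, config, and checkpoint path are placeholders.
import torch
import deepspeed

model = torch.nn.Linear(16, 16)  # placeholder model
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={"train_batch_size": 8, "optimizer": {"type": "Adam", "params": {"lr": 1e-4}}},
)
engine.load_checkpoint(
    "./training/valle/ckpt",      # checkpoint directory (path assumed)
    load_module_only=True,        # take the old weights...
    load_optimizer_states=False,  # ...but let the optimizer states start fresh
    load_lr_scheduler_states=False,
)
```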

> I looked through the lifeiteng repo, and it looks really straightforward to use for multi-GPU training with DDP

Distributed training (single node, multi-GPU) with my repo already works with no additional tweaks required, but DeepSpeed's sparse documentation on how to set up multi-node training isn't very helpful.
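For what it's worth, the route DeepSpeed's docs describe is a hostfile plus its launcher; a sketch with placeholder hostnames (it also assumes passwordless SSH between the nodes and identical repo/venv paths on each):

```sh
# Sketch of DeepSpeed's documented multi-node launch; hostnames are placeholders.
# Assumes passwordless SSH between nodes and the same paths on each machine.
cat > hostfile <<EOF
node-4090 slots=2
node-3090 slots=2
EOF
deepspeed --hostfile=hostfile --module vall_e.train yaml="./training/valle/config.yaml"
```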

I'm also not sure how well additional things like my speaker sampler's state would fare under multi-node, so that may or may not be an issue.

> if your current checkpoint is compatible?

Nope. There are quite a lot of fundamental differences. Even if I subjugated the trainer into using my model's code, the last crux would be the dataset format it loads from, and that was a bit of a pain to deal with when I was initially evaluating it as a base to expand off of.

> I don't even think it's more than a few hours of work to get it multi-node.

It shouldn't take that much time; it's more a matter of whether you're hoping I'd have some wisdom on setting up DeepSpeed for multi-node training.

> My thought is also to pump up the dataset with LibriLight medium. Have you been training on your side with the medium added?

Mhm.

My current dataset is composed of:

  • LibriTTS-R
  • LibriLight's `small` + `medium` + `duplicated`
  • 406 donated audiobooks from a kind anon
  • some vidya voice clip rips, from whatever I could think of and what was suggested to me in the past (which honestly pales in comparison to everything else)

The current `data.h5` on the HuggingFace model repo, if I remember right, is composed of LibriTTS-R and a portion of LibriLight's `small` + `medium` (due to a dataset processing oversight).

I recreated a `data.h5` dataset that contains the "libre" portions of the dataset (LibriTTS-R and LibriLight sans `large`), but the gzipped file is still uploading to HuggingFace (something like ~95GiB => ~40GiB), so that should have the dataset updated for anyone to use. After that's done, I'll have the [setup script](https://git.ecker.tech/mrq/vall-e/src/branch/master/scripts/setup.sh) updated, and all that would be left is to finagle with multi-node training.
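If anyone wants to grab it by hand before the setup script is updated, something like the following should do (hedged: the exact filename and compression on the repo may differ from what's described above, so adjust accordingly):

```sh
# Hedged sketch: manually fetching the libre dataset from the model repo.
# Filename/compression assumed; prefer the setup script once it's updated.
wget "https://huggingface.co/ecker/vall-e/resolve/main/data.h5" -O ./training/valle/data.h5
```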

Owner

Alright, after a painful day of trying to upload the gzipped dataset twice, the latest libre dataset has been uploaded and the setup script is good to go, so all that needs to be done to get ready to resume training is:

```sh
git clone https://git.ecker.tech/mrq/vall-e
cd vall-e
./scripts/setup-training.sh
source ./venv/bin/activate
# <whatever extra setup for multi-node here>
deepspeed --module vall_e.train yaml="./training/valle/config.yaml" # and additional flags for multi-node
```