Curious about other strains #1

Closed
opened 2023-04-10 05:36:46 +00:00 by psr · 10 comments

I noticed that this fork is based off of https://github.com/enhuiz/vall-e

Trying to see what all strains are out there I could only find this other one: https://github.com/lifeiteng/vall-e

I like how he automated the LibriTTS data acquisition, but it's interesting that this one does not use DeepSpeed while the other does.

Anyway just figured I'd ask.

Owner

I'm a bit fried so bear with me if I'm incoherent.

The lifeiteng implementation leverages Lhotse + k2 + icefall to prepare its dataset. I can't quite recall the pros and cons I gleaned from that approach besides:

  • compressing the training data (with gzip or hdf5, I don't remember which layer has what) reduces the IO bottleneck when hosting the data on spinning rust (though it isn't much of a gain on NVMe)
  • there's a way to explicitly provide the training/validation split, rather than it being up in the air as it is in the enhuiz implementation/my fork (which just takes the last 5% of each speaker)

but:

  • it's a dependency plague to set up; k2 is CBT to try and get installed right, even with the precompiled wheels
  • there's a lot of cruft in the prepared dataset that I'm sure doesn't get used by the implementation; Lhotse/k2/icefall might do something with it under the hood, but it's pretty much superfluous

I sort of have a simple script that just strictly takes a LibriTTS dataset and spits out phonemes and quantized audio without needing AIVC, but I honestly can't be assed to maintain it. One side of it is that it was only for testing on runpod.io rentals to quickly get a dataset together (and even then, I sort of already have a repo with that data prepared anyway), and the dataset it pulls from is too small to train a feasible model on.
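For reference, that script boils down to something like this: a minimal sketch assuming phonemizer + EnCodec (the same libraries the implementations lean on); the `process()` function and its return format here are hypothetical, not what my script actually emits:

```
# Minimal sketch, assuming phonemizer + EnCodec; process() and its
# return format are hypothetical, not what my script actually emits.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from phonemizer import phonemize

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps => 8 codebooks at 24kHz

def process(wav_path: str, text: str):
    # load and resample the utterance to what EnCodec expects
    wav, sr = torchaudio.load(wav_path)
    wav = convert_audio(wav, sr, model.sample_rate, model.channels)
    # quantize: encode() returns a list of (codes, scale) frames
    with torch.no_grad():
        frames = model.encode(wav.unsqueeze(0))
    codes = torch.cat([codebook for codebook, _ in frames], dim=-1)  # (1, n_q, T)
    # phonemize the transcript
    phones = phonemize(text, language="en-us", backend="espeak", strip=True)
    return phones, codes
```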

Outside of data preparation, I think I mentioned before that that implementation has some rather odd quirks with how it trains that make me wary of using it.


But yeah, as far as I'm aware, those are the only two implementations of VALL-E out in the wild. I don't count my fork, since it's in its own bubble here rather than on GitHub.

Author

I appreciate your analysis and response.

I am following along and trying to train too, but I am very new to this.

I'll close this for now instead of derailing the conversation.

psr closed this issue 2023-04-11 20:29:31 +00:00
Author

Hey @mrq hope you've had the chance to get some rest since last we spoke. I have some experience to report for what it may be worth:

While you've been busy hacking on your version here, I trained a model using enhuiz's version, in order to catch up to your starting point on this project.

It took 8 days to train the NAR and AR in parallel using a pair of 4090s (yes, I am a lucky boy lol) on the LibriTTS <2GB 'dev-clean' dataset, using the default parameters:

```
(base) user@fae6122ef50a:~/vall-e$ cat config/LibriTTS/ar.yml
data_dirs: [data/LibriTTS/]
spkr_name_getter: "lambda p: p.parts[-3]"

model: ar
batch_size: 24
eval_batch_size: 24
eval_every: 10_000

sampling_temperature: 1.0
(base) user@fae6122ef50a:~/vall-e$ cat config/LibriTTS/nar.yml
data_dirs: [data/LibriTTS/]
spkr_name_getter: "lambda p: p.parts[-3]"

model: nar
batch_size: 24
eval_batch_size: 24
eval_every: 1_000

sampling_temperature: 0.2
(base) user@fae6122ef50a:~/vall-e$
```

Finally, today, I was able to try it out, but man, it sounds like something went very wrong. I went back to your message and realized perhaps it is related to the issues with the dataset that you alluded to, which are causing you to have to do so much extra work.

psr reopened this issue 2023-04-21 23:12:28 +00:00
Owner

> hope you've had the chance to get some rest

Hah.

> I trained a model using enhuiz's version

Yeesh. I probably should have made training with the repo more palatable to save you time. It *should* be at a good spot to use for training, but I still *technically* can't say that, as I don't have a fully trained model yet.

The base implementation as-is has issues from a lack of polish, like:

  • all tensors being int64, rather than optimal formats (text: uint8, audio: int16); see the sketch after this list
  • none of DeepSpeed's niceties are utilized, like bfloat16, quantization, or ZeRO
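To illustrate the dtype point (illustrative shapes only, not the repo's actual loader):

```
# Illustrative only: store token tensors compactly, widen at embedding time.
import torch

text = torch.randint(0, 256, (50,), dtype=torch.uint8)       # phoneme IDs fit in a byte
codes = torch.randint(0, 1024, (8, 500), dtype=torch.int16)  # EnCodec indices fit in int16

emb = torch.nn.Embedding(1024, 256)
hidden = emb(codes.long())  # nn.Embedding needs int64 indices only at lookup time
```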

> It took 8 days to train the NAR and AR in parallel using a pair of 4090s (yes, I am a lucky boy lol) on the LibriTTS <2GB 'dev-clean' dataset

I suppose 8 days isn't that much of a waste of time in the grand scheme of things.

I'm guessing you just spawned two training instances so one GPU trains one model? The base implementation also doesn't have a nice way to train both at the same time.
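If so, a throwaway launcher like this is about all the base repo affords (hypothetical script; it assumes the `python -m vall_e.train yaml=...` entry point from the base README):

```
# Hypothetical launcher: pin one training process to each GPU.
import os
import subprocess

def launch(yaml_path: str, gpu: int) -> subprocess.Popen:
    # restrict each child process to a single GPU
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    return subprocess.Popen(
        ["python", "-m", "vall_e.train", f"yaml={yaml_path}"], env=env)

procs = [
    launch("config/LibriTTS/ar.yml", 0),
    launch("config/LibriTTS/nar.yml", 1),
]
for p in procs:
    p.wait()
```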

> I was able to try it out, but man, it sounds like something went very wrong

There's some weird glaring issue with the base implementation's inference code.

*Technically* my fork still has it broken, as I keep forgetting to copy over the portion that does AR + NAR inferencing during evaluation/validation and plant it in the actual inferencing code.

For now, though, you'll have to backport the [`run_eval`](https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/train.py#L108) block and rely on the evaluation/validation pass to get outputs.

> perhaps it is related to the issues with the dataset

mmm

I think the LibriTTS dataset itself (or rather, its subsets) is actually fine for training; it's just a lot of other issues with the base implementation itself that cropped up. The biggest crux was inferencing not being quite right, and the evaluation/validation having an odd way of going about it. If I went back and used the previous test models (if I still had them), I'm pretty sure they would have decent output.

Author

Would the trained models be useful for you to help test inference? Let me know and I'll put them up for you.

Owner

I guess I never sent my reply here, oops.

> Would the trained models be useful for you to help test inference?

mmm, it shouldn't be necessary. I just need to spend half an hour at most finagling with it, and I'd have to get the original implementation set up again if I wanted to load a model trained on it originally.

Author

Okay, sounds good. I'll keep them on hand in case they become useful, considering the amount of time and energy spent creating them.

Owner

Finally got off my ass and "fixed" inferencing in my fork. You have two options for using the model you trained:

## Using the original implementation

Replace the line at [`./vall-e/vall_e/__main__.py#L27`](https://github.com/enhuiz/vall-e/blob/main/vall_e/__main__.py#L27) with:

```
proms = proms[0].t()
```

I only remember that einops `rearrange` being the crux of getting actual output from the models when I was working on this fork, so that backport should be enough.

Inferencing then works as documented in its README.

## Using my fork

Replace the line at [`./vall-e/vall_e/inference.py#L70`](https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/inference.py#L70) with:

```
prom = to_device(prom, self.device)
phns = to_device(phns, self.device)
```

in order to avoid having the input tensors downcast (I'm not sure how smart Torch would be; you might not even need to do this). You then have two other options:

### Using the un-exported DeepSpeed checkpoint

This will require modifying your training configuration YAML to conform with my fork. I think the only fundamental difference is unifying the AR and NAR configs into one file and having `model: name` be `model: [ ar, nar ]`.

You can then inference by just passing `yaml=./path/to/your/config.yaml`.
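Roughly, based on the two YAMLs you posted, the unified config would look something like this (a sketch; double-check the exact keys against my repo's example config):

```
# hypothetical unified config, extrapolated from the AR/NAR YAMLs above
data_dirs: [data/LibriTTS/]
spkr_name_getter: "lambda p: p.parts[-3]"

model: [ ar, nar ]  # instead of a separate `model: ar` / `model: nar` per file

batch_size: 24
eval_batch_size: 24
eval_every: 10_000
```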

### Using the exported fp32 PyTorch weight

This is just specifying the individual pickled models by passing `--ar-ckpt ./path/to/your/ar.pt --nar-ckpt ./path/to/your/nar.pt` (which should be the same as what's documented in the README).


I at least tested my implementation's inferencing, and it sounds about on par with the output I get during training's evaluation/validation pass, and not the mess I remember hearing before (even when the AR and NAR individually sounded semi-discernible), so it's working to the best of my knowledge.

Author

Thanks for your effort to try to help me salvage my model. I should mention that I got lifeiteng's version generating pretty nice audio. You can get through the dependency hell easily if you just use Docker; you can refer to my PR to your voice-cloning project as a basis for that if you want.

Regarding your advice to salvage enhuiz's version: I tried the first option, but I get errors. I tried inserting it before and after the call to `rearrange`, but I get errors regarding shape.

I can't really understand the second option, I'm afraid, but I appreciate the effort. At this point I'm just going to drop pursuit of the enhuiz-based strain and focus on getting the lifeiteng version more refined, because it's working pretty well.

Thank you

Thanks for your effort to try to help me salvage my model. I should mention that I got lifeitang's version generating pretty nice audio. You can get through the dependency hell easily if you just use Docker, you can refer to my PR to your voice cloning project as a basis for that if you want. Regarding your advice to salvage enhuiz's, I tried the first option, but I get errors. I tried to convert it before and after the call to `rearrange` but I get errors regarding shape. I can't really understand the second option, I'm afraid, but I appreciate the effort. At this point I am just going to drop pursuit of the enhuiz based strain and focus on trying to get the lifeitang version more refined because it's working pretty well. Thank you
psr closed this issue 2023-04-28 17:55:45 +00:00
Owner

Oh, my bad. I meant replace the linked line with the new line, for both of them.
