Curious about other strains #1
I noticed that this fork is based off of https://github.com/enhuiz/vall-e
Trying to see what all strains are out there, I could only find this other one: https://github.com/lifeiteng/vall-e
I like how he automated the LibriTTS data acquisition, but it's interesting that this one does not use DeepSpeed while the other does.
Anyway just figured I'd ask.
I'm a bit fried so bear with me if I'm incoherent.
The lifeiteng implementation leverages Lhotse + k2 + icefall to prepare its dataset. I can't quite recall the pros and cons I've gleaned from that approach besides:
but:
I sort of have a simple script to just strictly take a LibriTTS dataset and spit out phonemes and quantized audio without needing AIVC, but I honestly can't be assed to maintain that script. One side of it is that it was only for testing on runpod.io rentals to quickly get a dataset together (and even then I sort of already have a repo with that data prepared anyways), and the dataset it pulls from is too small to even use for a feasible model.
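If it helps, the gist of that kind of script is something like this (a rough sketch only, assuming phonemizer for the phoneme side and EnCodec for the quantized audio; the paths and on-disk format here are made up, not what my script or AIVC actually does):

```python
# Rough sketch only: phonemizer for text -> phonemes, EnCodec for audio -> codes.
# The output .pt format is made up for illustration.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from phonemizer import phonemize

codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)  # 6 kbps => 8 codebooks at 24 kHz

def prepare(text: str, wav_path: str, out_stem: str):
    # text side: grapheme-to-phoneme via espeak
    phones = phonemize(text, language="en-us", backend="espeak", strip=True)

    # audio side: resample/convert to the codec's format, then quantize
    wav, sr = torchaudio.load(wav_path)
    wav = convert_audio(wav, sr, codec.sample_rate, codec.channels)
    with torch.no_grad():
        frames = codec.encode(wav.unsqueeze(0))
    codes = torch.cat([codebook for codebook, _ in frames], dim=-1)  # [1, n_q, T]

    torch.save({"phonemes": phones, "codes": codes}, f"{out_stem}.pt")
```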
Outside of data preparation, I think I mentioned before that that implementation has some rather odd quirks with how it trains that make me wary of using it.
But yeah, as far as I'm aware, those are the only two implementations of VALL-E out in the wild. I don't count my fork, since it's in its own bubble here and not on GitHub.
I appreciate your analysis and response.
I am following along and trying to train too, but I am very new to this.
I'll close this for now instead of derailing the conversation.
Hey @mrq hope you've had the chance to get some rest since last we spoke. I have some experience to report for what it may be worth:
While you've been busy hacking on your version here, I trained a model using enhuiz's version, in order to catch up to your starting point on this project.
It took 8 days to train the NAR and AR in parallel using a pair of 4090s (yes I am a lucky boy lol), on the LibriTTS <2GB 'dev-clean' dataset, using the default parameters:
Finally, today, I was able to try it out, but man, it sounds like something went very wrong. I went back to your message and realized it is perhaps related to the dataset issues you alluded to, the ones causing you to have to do so much extra work.
Hah.
Yeesh. I probably should have made training with the repo more palatable to save you some time. It should be at a good spot to use for training, but I technically still can't say that, as I don't have a fully trained model yet.
The base implementation as-is has issues from lack of polishing, like:
I suppose 8 days isn't that much of a waste of time in the grand scheme of things.
I'm guessing you just spawned two training instances so one GPU trains one model? The base implementation also doesn't have a nice way to train both at the same time.
There's some weird glaring issue with the base implementation's inference code.
Technically my fork still has it broken, as I keep forgetting to copy over the portion that does AR + NAR inferencing during evaluation/validation, and plant it in the actual inferencing code.
For now though, you'll have to backport the [run_eval](https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/train.py#L108) block and rely on the evaluation/validation outputs to get outputs.
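If it helps, the rough shape of that flow is: the AR spits out the first quantizer level, the NAR fills in the remaining levels, and the codes get rearranged and decoded back to audio. Something like this (illustrative names and call signatures only; the real calls live in `run_eval` and the model classes):

```python
# Sketch of the AR + NAR flow only: call signatures are illustrative,
# not the fork's actual API.
import torch
from einops import rearrange
from encodec import EncodecModel

codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)

def synthesize(ar, nar, phones: torch.Tensor, prompt_codes: torch.Tensor) -> torch.Tensor:
    # AR: autoregressively generate the first quantizer level -> [T]
    first_level = ar(text_list=[phones], proms_list=[prompt_codes])[0]
    # NAR: fill in the remaining quantizer levels -> [T, n_q]
    all_levels = nar(text_list=[phones], proms_list=[prompt_codes],
                     resps_list=[first_level])[0]
    # EnCodec wants [B, n_q, T], hence the rearrange before decoding
    codes = rearrange(all_levels, "t q -> 1 q t")
    with torch.no_grad():
        wav = codec.decode([(codes, None)])  # [1, channels, samples]
    return wav
```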
mmm, I think the LibriTTS dataset itself (or rather, its subsets) is actually fine for training; it's just a lot of other issues with the base implementation itself that cropped up. The biggest, hugest crux was inferencing not being quite right, and the evaluation/validation having an odd way of going about it. I think if I went back and used the previous test models (if I still had them), they would have decent output.
Would the trained models be useful for you to help test inference? Let me know and I'll put them up for you.
I guess I never sent my reply here, oops.
mmm, it shouldn't be necessary. I just need to spend half an hour at most to finagle with it, and I'd have to get the original implementation set up again if I want to load a model trained on it originally.
Okay, sounds good. I'll keep it on hand in case it becomes useful, considering the amount of time and energy spent creating it.
Finally got off my ass and "fixed" inferencing in my fork. You have two options for using the model you trained:
**Using the original implementation**

In `./vall-e/vall_e/__main__.py#L27`, replace the linked line with:

I only remember that einops `rearrange` being the crux of getting actual output from the models when I was working on this fork, so that backport should be enough.
Inferencing will then work as documented in its README.
**Using my fork**

In `./vall-e/vall_e/inference.py#L70`, replace the linked line with:

in order to avoid having the input tensors downcast (I'm not sure how smart Torch would be; you might not even need to do this. There's a rough sketch of what I mean at the end of this post.). You then have two other options:
**Using the un-exported DeepSpeed checkpoint**

This will require modifying your training configuration YAML to conform with my fork. I think the only fundamental difference is unifying the AR and NAR configs into one file and having `model: name` be `model: [ ar, nar ]`. You can then inference by just passing `yaml=./path/to/your/config.yaml`.
**Using the exported fp32 PyTorch weights**

This is just specifying the individual pickled models by passing `--ar-ckpt ./path/to/your/ar.pt --nar-ckpt ./path/to/your/nar.pt` (which should be the same as what's documented in the README).

I at least tested my implementation's inferencing, and it sounds about on par with the output I get during training's evaluation/validation pass, not the mess I remember hearing before (even when the AR and NAR individually sounded semi-discernible), so it's working to the best of my knowledge.
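On the dtype note above, this is the kind of thing I mean by keeping the inputs from being downcast (a sketch only; the names are illustrative, not my fork's actual variables): explicitly match the inputs to whatever dtype the model's parameters are in, rather than trusting Torch to sort it out.

```python
# Illustrative only: cast floating-point inputs to whatever dtype the model's
# parameters use, so nothing gets silently downcast along the way.
import torch

def match_model_dtype(model: torch.nn.Module, *tensors: torch.Tensor) -> list[torch.Tensor]:
    dtype = next(model.parameters()).dtype
    return [t.to(dtype) if t.is_floating_point() else t for t in tensors]
```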
Thanks for your effort in trying to help me salvage my model. I should mention that I got lifeiteng's version generating pretty nice audio. You can get through the dependency hell easily if you just use Docker; you can refer to my PR to your voice cloning project as a basis for that if you want.
Regarding your advice to salvage enhuiz's, I tried the first option, but I get errors. I tried to convert it before and after the call to `rearrange`, but I get errors regarding shape.

I can't really understand the second option, I'm afraid, but I appreciate the effort. At this point I'm just going to drop pursuit of the enhuiz-based strain and focus on getting the lifeiteng version more refined, because it's working pretty well.
Thank you
Oh my bad, I meant replace the linked line with the new line for both of them.