Any tips for getting the fastest inference physically possible? #363
Context:
Hey Mrq, what I'm after is a model that can capture the prosody/tone/cadence of a voice well. I don't care too much about the quality of the actual audio (as long as it doesn't contain a ton of static and distortion), because I'm going to be taking the output and shoving it into RVC to match the voice pitch and quality really nicely. I'm trying to create a chatbot app for fun, so low inference times are important to me.
Testing:
I've gone down a huge rabbit hole of testing your tortoise fork, the fast tortoise fork, and the original repo. I quickly realized the original repo wouldn't be that useful for speed.
I cloned the fast fork and edited autoregressive.py to use DeepSpeed. I saw some speedups that were pretty nice, sometimes as much as 7-10 seconds (compared to your fork), but on average it was only a 5-6 second speedup (and keep in mind I'm on a mid-range 3070 Ti with only 8 GB of VRAM).
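For reference, the edit boils down to wrapping the GPT inference model with DeepSpeed's inference engine, something roughly like this (a sketch, not the exact diff; the kwargs are the commonly documented ones and vary between DeepSpeed versions):

```python
import torch
import deepspeed

def wrap_for_inference(model: torch.nn.Module) -> torch.nn.Module:
    """Wrap a model (e.g. the GPT2 inference model built in autoregressive.py)
    with DeepSpeed's inference engine."""
    return deepspeed.init_inference(
        model,
        dtype=torch.float16,             # fp16 inference kernels
        replace_with_kernel_inject=True, # swap supported modules for fused kernels
    )
```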
What's odd is that the AR checkpoint .pth files produced by your repo don't seem to work with the fast fork. They load and the code runs, but the voices don't sound anything like they do with your fork, which I'm assuming is because your implementation is better and has higher quality.
I tried using the UI with https://github.com/152334H/DL-Art-School and training a model to get an AR checkpoint. I used this AR checkpoint with fast tortoise and DeepSpeed, and still the quality of the pitch and tone of the voice was not what I was hoping for. The actual audio quality was the same between your fork and the fast fork, but the pitch and tone were much better with your implementation.
I also tried adding DeepSpeed to your autoregressive.py, but it seemed to slow it down a ton, so I took it out.
Next steps:
So that leads me to: what are the ways I can get your fork running as fast as possible? Right now here are the settings I'm using for max speed:
You could get even faster with `cond_free=False` and `half_p=True`, but quality significantly suffers for only a 3-4 second gain in speed. The audio is absolutely low quality, but the static is not too bad and it sounds great when shoved into RVC. Examples:
Any other ideas on things to try for speedups? I was planning on eventually running it in the cloud with 3-4 GPUs. I was going to try something like pytriton to use 3-4 3090s/V100s on one machine to see if I can get inference on 20 seconds of audio in under 10 seconds, but I'm not sure if that's even possible.
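For concreteness, here's roughly what passing the speed-focused flags mentioned above looks like through the standard TorToiSe Python API (a sketch: `cond_free`, `num_autoregressive_samples`, and `diffusion_iterations` are regular `tts()` / `tts_with_preset()` kwargs, while `half_p` is specific to mrq's fork and handled at model load, so it's only noted in a comment):

```python
from tortoise.api import TextToSpeech

# In mrq's fork, half precision (the half_p setting) is toggled when the
# models are loaded / in the web UI rather than per tts() call.
tts = TextToSpeech()

pcm = tts.tts_with_preset(
    "Speed-focused settings for RVC post-processing.",
    preset="ultra_fast",           # smallest preset to start from
    num_autoregressive_samples=1,  # fewer AR samples = less work for the CLVP
    diffusion_iterations=30,       # raise this if the waveform gets too noisy
    cond_free=False,               # skips conditioning-free diffusion guidance
)
```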
I suppose I can look into utilizing DeepSpeed. I remember looking at how it was implemented and it seems rather easy to incorporate.
Ah. I wonder if BitsAndBytes has anything to do with it. Another user mentioned that there was some black magic being done with BitsAndBytes from even just being loaded, so it could have complications with DeepSpeed.
How strange, DLAS should be agnostic to whatever TorToiSe flavor uses it. The only difference between my fork and the other flavors is that I'm doing something different with generating the AR / diffusion conditioning latents, which in reality shouldn't be that much of a difference.
If you want to verify, you can take the latents generated from my fork / web UI and have them loaded in lieu of voice wavs; just make sure it's something like `./tortoise-tts/voices/{voicename}/cond_latents.pth`, or whatever the root voice folder is.

That's what I was going to initially suggest: a low sample count, because with a finetune, generating more samples to pick the best of with the CLVP shouldn't really matter all that much.
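To make the latents suggestion concrete, loading them from a script instead of going through the web UI would look roughly like this (a sketch: the voice name is a placeholder, and exactly what the saved .pth contains can differ between versions, so check what `torch.load` gives back):

```python
import torch
from tortoise.api import TextToSpeech

tts = TextToSpeech()

# cond_latents.pth as saved by the web UI; typically a tuple of the AR and
# diffusion conditioning latents ("myvoice" is a placeholder voice folder).
conditioning_latents = torch.load(
    "./tortoise-tts/voices/myvoice/cond_latents.pth", map_location="cpu"
)

# Pass the latents directly in lieu of voice wavs.
pcm = tts.tts_with_preset(
    "Testing with reused conditioning latents.",
    preset="ultra_fast",
    conditioning_latents=conditioning_latents,
)
```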
There should be a theoretical speedup if I were to skip the CLVP altogether, but I remember trying that and there were some gripes with the script.
I think you can up the diffusion iterations, as this will directly affect the actual quality of the waveform, and I don't remember it being that slow.
I remember `cond_free` mattering quite a bit in terms of quality. I'll be honest and say `half_p` is quite a mess with how it's implemented, and if I were to take the time and meddle I could probably get faster throughput with it, although maybe at the cost of accuracy (in theory there's an accuracy hit, but in my tests with VALL-E any perceived accuracy hit just seems to be random chance).

I haven't looked much into trying to distribute inferencing over multiple GPUs with TorToiSe, but I remember when I tried it didn't seem possible, although I was rather stupid back then.
There's still a lot of other bandaids I haven't gotten around to trying, sure, like triton or xtransformers or flash attention or proper quantized inferencing with DeepSpeed or BitsAndBytes, but the reality is I feel it's a bit of a dead end trying to squeeze the most out of TorToiSe.
I would also suggest trying the VALL-E backend, but despite it being much more lightweight and oodles snappier, I feel it doesn't have the maturity/consistency that TorToiSe does at the moment. In all my testing with the VALL-E backend, it's quite a chore to wrangle it into something decent, and it's still very immature.
Yeah, I was shocked at how easy it was, only a few lines. I was really expecting to be debugging for a while to see if I could get it to work. One of the downsides with it, though, is installing it. I'm on Linux under WSL, but I think installing it directly on Windows (even in a conda env) can be a pain, though I'm not 100% sure.
You were on the money with this one. I was initially using wav files and it was causing the voice to sound odd compared to the outputs I was getting from your fork. I used the latents and that fixed it, thanks a lot!
I'm pretty happy with the performance I'm getting out of the fast repo now with the latents + DeepSpeed, and I'm only on a 3070 Ti with 8 GB of VRAM. I can imagine that on a 3090 or a multi-GPU machine you could get some really fast inference with this, which is awesome.
I noticed that when firing up the Python script with DeepSpeed enabled, you get met with a message like this:

`use_triton` caught my eye. No idea if it's even going to be possible, but that's my next journey: figuring out how to shove tortoise inference into pytriton. If it's possible and not horribly painful I'll be sure to report back, and I'll probably open a repo with a Dockerfile and all the code on how I did it, because fast tortoise inference + RVC seems to be the best local TTS compared to 11labs.
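What I have in mind is something like the sketch below, using pytriton's `Triton.bind()` to expose a text-in / audio-out endpoint (all of this is an assumption about how it would fit together, not something I've run yet; clips come back padded to a rectangular batch since they differ in length):

```python
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

from tortoise.api import TextToSpeech

tts = TextToSpeech()

@batch
def infer_fn(text):
    # pytriton delivers text as a (batch, 1) array of byte strings
    wavs = []
    for t in text:
        wav = tts.tts_with_preset(t[0].decode("utf-8"), preset="ultra_fast")
        wavs.append(wav.squeeze().cpu().numpy().astype(np.float32))
    # pad to a rectangular (batch, max_len) array since clip lengths differ
    max_len = max(w.shape[0] for w in wavs)
    audio = np.zeros((len(wavs), max_len), dtype=np.float32)
    for i, w in enumerate(wavs):
        audio[i, : w.shape[0]] = w
    return {"audio": audio}

with Triton() as triton:
    triton.bind(
        model_name="tortoise",
        infer_func=infer_fn,
        inputs=[Tensor(name="text", dtype=bytes, shape=(1,))],
        outputs=[Tensor(name="audio", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=4),
    )
    triton.serve()
```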
Hey @mrq, did some more testing.

I found out that DeepSpeed, in its init function (https://deepspeed.readthedocs.io/en/stable/inference-init.html), has a config (https://deepspeed.readthedocs.io/en/stable/inference-init.html#deepspeed.inference.config.DeepSpeedTPConfig) which allows you to enable tensor parallelism. I'm going to test this with multiple GPUs on a cloud machine to see if it actually speeds anything up or just introduces more overhead. Gotta do some more testing on that.
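The rough shape of what I'm planning to test, assuming the `tensor_parallel` argument maps onto that DeepSpeedTPConfig (exact keys seem to vary by DeepSpeed version, and the script has to be started with the `deepspeed` launcher so each rank exists):

```python
import torch
import deepspeed

def wrap_with_tensor_parallelism(model: torch.nn.Module, tp_size: int = 2) -> torch.nn.Module:
    """Shard the model across tp_size GPUs for inference.
    Launch with e.g. `deepspeed --num_gpus 2 script.py` so each rank gets a shard."""
    return deepspeed.init_inference(
        model,
        dtype=torch.float16,
        tensor_parallel={"tp_size": tp_size},  # maps onto DeepSpeedTPConfig
        replace_with_kernel_inject=True,
    )
```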
The generation of autoregressive samples and the transformation of AR outputs into audio have been the slowest parts, but the biggest bottleneck really seems to be transforming the AR outputs into audio. So I may go down a rabbit hole of seeing if I can speed up the diffusion, but I'm unsure how easy that will be.
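One crude way to confirm where the time goes is to sweep `diffusion_iterations` and time the whole call (just a sketch around the public API): if the total time scales strongly with the sweep, the AR-outputs-to-audio (diffusion) stage dominates; if not, the AR sampling does.

```python
import time
from tortoise.api import TextToSpeech

tts = TextToSpeech()
text = "Timing the two main stages indirectly."

# Sweep the diffusion iteration count and time the full synthesis call.
for iters in (30, 100, 200):
    start = time.perf_counter()
    tts.tts_with_preset(text, preset="ultra_fast", diffusion_iterations=iters)
    print(f"diffusion_iterations={iters}: {time.perf_counter() - start:.1f}s")
```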