Any tips for getting the fastest inference physically possible? #363

Open
opened 2023-09-01 18:35:00 +00:00 by drew · 3 comments

Context:


Hey Mrq, what I'm after is a model that can capture the prosody/tone/cadence of a voice well. I don't care too much about the quality of the actual audio (as long as it doesn't contain a ton of static and distortion) because I'm going to be taking the output and shoving it into RVC to match the voice pitch and quality really nicely. I'm trying to create a chatbot app for fun, so low inference times are important to me.

Testing:


I've gone down a huge rabbit hole of testing your tortoise fork, the fast tortoise fork, and the original repo. I quickly realized that the original repo wouldn't be that useful for speed.

I cloned the fast fork and edited autoregressive.py to use DeepSpeed. I saw some speedups that were pretty nice, sometimes as much as 7-10 seconds (compared to your fork), but on average it was only a 5-6 second speedup (and keep in mind I'm on a mid-range 3070 Ti with only 8 GB of VRAM).
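
For reference, the edit amounts to roughly this (a sketch, not my exact patch; the helper name is a placeholder, and the kwargs follow the `deepspeed.init_inference` signature I was using):

```python
import torch
import deepspeed

def wrap_ar_with_deepspeed(gpt_model):
    # Placeholder helper: in practice this wrapping happens inside
    # autoregressive.py where the AR transformer is built/loaded.
    return deepspeed.init_inference(
        gpt_model,
        mp_size=1,                        # single GPU
        dtype=torch.float32,              # matches the config dump further down
        replace_with_kernel_inject=True,  # inject DeepSpeed's fused inference kernels
    )
```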

What's odd is that the AR checkpoint .pth files produced by your repo don't seem to work with the fast fork. The code runs and they produce output, but the voices don't sound anything like they do with your fork, which I assume is because your implementation is better and has higher quality.

I tried using the UI with https://github.com/152334H/DL-Art-School and training a model to get an AR checkpoint. I used this AR checkpoint with fast tortoise and DeepSpeed, and the pitch and tone of the voice still wasn't what I was hoping for. The actual audio quality was the same between your fork and the fast fork, but the pitch and tone were much better with your implementation.

I also tried adding DeepSpeed to your autoregressive.py, but it seemed to slow things down a ton so I took it out.

Next steps:


So that leads me to my question: what are ways I can get your fork running as fast as possible? Right now, here are the settings I'm using for max speed:

num_autoregressive_samples=1
diffusion_iterations=10
cond_free=True
half_p=False

You could get even faster with `cond_free=False` and `half_p=True`, but quality suffers significantly for only a 3-4 second gain in speed.
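
For context, here's roughly how I'm passing those settings through the stock TorToiSe API (just a sketch; `num_autoregressive_samples`, `diffusion_iterations`, and `cond_free` are upstream `tts()` kwargs, `half_p` is fork-specific so I've left it out here, and the voice name is made up):

```python
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()
# "myvoice" is a made-up example voice folder
voice_samples, conditioning_latents = load_voice("myvoice")

gen = tts.tts(
    "Text to speak.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    num_autoregressive_samples=1,  # a single AR candidate, so the CLVP has nothing to rank
    diffusion_iterations=10,       # very few diffusion steps; raw quality suffers, RVC cleans it up
    cond_free=True,                # keep conditioning-free guidance on for quality
)
```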

The audio is definitely low quality, but the static is not too bad and it sounds great once shoved into RVC. Examples:

- Tortoise output (inference time was roughly 15-17 seconds): https://voca.ro/1nCc2kqid4PA
- After being shoved into RVC: https://voca.ro/1oxrptpAAU0B

Any other ideas on things to try for speedups? I'm planning on eventually running it in the cloud with 3-4 GPUs. I was going to try something like pytriton to use 3-4 3090s/V100s on one machine to see if I can get inference on 20 seconds of audio in under 10 seconds, but I'm not sure if that's even possible.

Owner

> I cloned the fast fork and edited autoregressive.py to use DeepSpeed. I saw some speedups that were pretty nice, sometimes as much as 7-10 seconds (compared to your fork), but on average it was only a 5-6 second speedup (and keep in mind I'm on a mid-range 3070 Ti with only 8 GB of VRAM).

I suppose I can look into utilizing DeepSpeed. I remember looking at how it was implemented and it seems rather easy to incorporate.

> I also tried adding DeepSpeed to your autoregressive.py, but it seemed to slow things down a ton so I took it out.

Ah. I wonder if BitsAndBytes has anything to do with it. Another user mentioned that BitsAndBytes does some black magic just from being loaded, so it could have complications with DeepSpeed.

> What's odd is that the AR checkpoint .pth files produced by your repo don't seem to work with the fast fork. The code runs and they produce output, but the voices don't sound anything like they do with your fork, which I assume is because your implementation is better and has higher quality.
>
> I tried using the UI with https://github.com/152334H/DL-Art-School and training a model to get an AR checkpoint. I used this AR checkpoint with fast tortoise and DeepSpeed, and the pitch and tone of the voice still wasn't what I was hoping for. The actual audio quality was the same between your fork and the fast fork, but the pitch and tone were much better with your implementation.

How strange; DLAS should be agnostic to whichever TorToiSe flavor uses it. The only difference between my fork and the other flavors is that I'm doing something different when generating the AR / diffusion conditioning latents, which, in reality, shouldn't make that much of a difference.

***If*** you want to verify, you can take the latents generated from my fork / web UI and have them loaded in lieu of voice wavs; just make sure the path is something like `./tortoise-tts/voices/{voicename}/cond_latents.pth`, or whatever the root voice folder is.
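
In script form that's roughly the following (a sketch; the voice name is a placeholder, and it assumes the .pth holds the conditioning latents in the shape `tts()` expects for `conditioning_latents`):

```python
import torch
from tortoise.api import TextToSpeech

tts = TextToSpeech()
# "myvoice" is a placeholder; point this at wherever the exported latents live
conditioning_latents = torch.load("./tortoise-tts/voices/myvoice/cond_latents.pth")

gen = tts.tts(
    "Text to speak.",
    voice_samples=None,                         # no wavs needed
    conditioning_latents=conditioning_latents,  # use the precomputed latents directly
)
```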

> So that leads me to my question: what are ways I can get your fork running as fast as possible? Right now, here are the settings I'm using for max speed:

That's what I was going to suggest initially: a low sample count, because with a finetune, generating more samples for the CLVP to pick the best of shouldn't really matter all that much.

There should be a theoretical speedup if I were to skip the CLVP altogether, but I remember trying that and there were some gripes with the script.

I think you can up the diffusion iterations, as this directly affects the actual quality of the waveform, and I don't remember it being that slow.

> You could get even faster with `cond_free=False` and `half_p=True`, but quality suffers significantly for only a 3-4 second gain in speed.

I remember `cond_free` mattering quite a bit in terms of quality. I'll be honest and say `half_p` is quite a mess with how it's implemented; if I were to take the time and meddle with it, I could probably get faster throughput, although maybe at the cost of accuracy (in theory there's an accuracy hit, but in my tests with VALL-E any perceived accuracy hits just seem to be random chance).

> Any other ideas on things to try for speedups? I'm planning on eventually running it in the cloud with 3-4 GPUs. I was going to try something like pytriton to use 3-4 3090s/V100s on one machine to see if I can get inference on 20 seconds of audio in under 10 seconds, but I'm not sure if that's even possible.

I haven't looked much into distributing inferencing over multiple GPUs with TorToiSe, but I remember that when I tried, it didn't seem possible, although I was rather stupid back then.

There are still a lot of other bandaids I haven't gotten around to trying, sure, like triton or xtransformers or flash attention or proper quantized inferencing with DeepSpeed or BitsAndBytes, but the reality is that it feels like a bit of a dead end trying to squeeze the most out of TorToiSe.

I would also suggest trying the VALL-E backend, but despite it being much more lightweight and oodles snappier, I feel it doesn't have the maturity/consistency that TorToiSe does at the moment. In all my testing with the VALL-E backend, it's quite a chore to wrangle it into something decent, and it's still very immature.

Author

> I suppose I can look into utilizing DeepSpeed. I remember looking at how it was implemented and it seems rather easy to incorporate.

Yeah, I was shocked how easy it was, only a few lines. I was really expecting to be debugging for a while to get it to work. One of the downsides, though, is installing it. I'm on Linux via WSL, but I think installing it directly on Windows (even in a conda env) can be a pain, though I'm not 100% sure.

> ***If*** you want to verify, you can take the latents generated from my fork / web UI and have them loaded in lieu of voice wavs; just make sure the path is something like `./tortoise-tts/voices/{voicename}/cond_latents.pth`, or whatever the root voice folder is.

You were on the money with this one. I was initially using wav files, and it was causing the voice to sound odd compared to the outputs I was getting from your fork. I used the latents instead and that fixed it, thanks a lot!

I'm pretty happy with the performance I'm getting out of the fast repo now with the latents + DeepSpeed, and I'm only on a 3070 Ti with 8 GB of VRAM. I can imagine that on a 3090 or a multi-GPU machine you could get some really fast inference with this, which is awesome.

I noticed that when firing up the Python script with DeepSpeed enabled, you get met with a message like this:

DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 1024, 'intermediate_size': 4096, 'heads': 16, 'num_hidden_layers': -1, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False}

`use_triton` caught my eye. No idea if it's even going to be possible, but my next journey is figuring out how to shove tortoise inference into pytriton. If it's possible and not horribly painful, I'll be sure to report back, and I'll probably open a repo with a Dockerfile and all the code on how I did it, because fast tortoise inference + RVC seems to be the best local TTS compared to 11labs.
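
For what it's worth, here's a sketch of what flipping those flags on might look like, assuming they can be passed straight through as `init_inference` kwargs (they show up as fields in the config dump above); I haven't verified this actually works:

```python
import torch
import deepspeed

def wrap_ar_with_deepspeed_triton(gpt_model):
    # Same wrapping as before, but with the Triton kernel flags from the
    # config dump enabled; the Triton kernels target fp16 inference.
    return deepspeed.init_inference(
        gpt_model,
        dtype=torch.float16,
        replace_with_kernel_inject=True,
        use_triton=True,       # use Triton-compiled kernels where available
        triton_autotune=True,  # autotune kernel configs at load time
    )
```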

Author

Hey @mrq, did some more testing.

I found out that DeepSpeed's init function (https://deepspeed.readthedocs.io/en/stable/inference-init.html) has a config (https://deepspeed.readthedocs.io/en/stable/inference-init.html#deepspeed.inference.config.DeepSpeedTPConfig) which lets you enable tensor parallelism. I'm going to test this with multiple GPUs on a cloud machine to see if it actually speeds anything up or just introduces more overhead. Gotta do some more testing on that.
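
Something like this is what I'm planning to try (a sketch based on the DeepSpeedTPConfig docs linked above; whether the TorToiSe AR model actually shards cleanly across GPUs is exactly what I still need to test):

```python
import torch
import deepspeed

def wrap_ar_with_tensor_parallel(gpt_model, num_gpus):
    # Shard the AR transformer's weights across GPUs via DeepSpeedTPConfig.
    return deepspeed.init_inference(
        gpt_model,
        dtype=torch.float16,
        tensor_parallel={"tp_size": num_gpus},  # tensor-parallel degree
        replace_with_kernel_inject=True,
    )
```

The script would then need to be launched through the DeepSpeed launcher (something like `deepspeed --num_gpus 2 script.py`) so each rank gets its shard.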

The generation of autoregressive samples and the transformation of AR outputs into audio have been the slowest steps, but really the biggest bottleneck seems to be turning the AR outputs into audio. So I may go down a rabbit hole of seeing if I can speed up the diffusion, but I'm unsure how easy that will be.
