generating voice clip is so much slower compared to using original Tortoise TTS #183

Open
opened 2023-03-28 21:39:18 +00:00 by embanot · 9 comments

I decided to try this out after playing around with the original Tortoise TTS and wanting to do more fine tuning and creating my own models.

I've got it up and running but when testing it out, it is taking significantly more time to generate a short one-sentence voice clip compared to the original Tortoise. I'm using a 3080ti and even on the Fast preset, it takes like 5 mins to generate a short sentence. In Tortoise, it would only take me a bit less than a min on the Fast preset. Why is this?

I've checked that my GPU is being utilized during the generation and not my CPU. I've kept all the settings at default. Any ideas what might be causing the slow down?


Post your console log.

Author

Here it is.

I did notice that when I changed the setting to Low VRAM, it sped things up. But is that lowering the quality quite a bit? As I mentioned, I'm using a 3080ti, so I don't know if that qualifies as low VRAM or not and how it compares to the original Tortoise TTS.


Hmm. I see the first time it took extremely long because it had to generate the latents for that voice and model, but that doesn't explain why it took so long the second time. Try changing "Sample Batch Size" in the Settings tab to something like 8 or 12. (I use 8 on an RTX 3060 and it would probably take me about 30 seconds to generate a line like that.)

Edit: Ehh, I got 53s cold-start, 48s warm (on Ultra Fast). Close enough.

sneed@FMRLYCHKS:~/ai-voice-cloning$ ./start.sh --listen 172.20.8.87:8080

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/sneed/ai-voice-cloning/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/lib/wsl/lib: did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
/home/sneed/ai-voice-cloning/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('unix')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/sneed/ai-voice-cloning/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Running on local URL:  http://172.20.8.87:8080

To create a public link, set `share=True` in `launch()`.
Loading TorToiSe... (AR: ./training/Veronica/finetune/models/1500_gpt.pth, vocoder: bigvgan_24khz_100band)
Hardware acceleration found: cuda
Loading tokenizer JSON: ./modules/tortoise-tts/tortoise/data/tokenizer.json
Loaded tokenizer
Loading autoregressive model: ./training/Veronica/finetune/models/1500_gpt.pth
Loaded autoregressive model
Loaded diffusion model
Loading vocoder model: bigvgan_24khz_100band
Loading vocoder model: bigvgan_24khz_100band.pth
Removing weight norm...
Loaded vocoder model
Loaded TorToiSe, ready for generation.
[1/1] Generating line: [I am really sad,] But their legacy lives on, and their tales is still told to this day, as a reminder of the power of courage and determination, and the importance of hope and love.
Loading voice: Veronica with model 724a9bd5
Reading from latent: ./voices/Veronica/cond_latents_724a9bd5.pth
Generating autoregressive samples
Computing best candidates using CLVP
Transforming autoregressive outputs into audio..
Generating line took 46.11549711227417 seconds
/home/sneed/ai-voice-cloning/venv/lib/python3.10/site-packages/torchaudio/functional/functional.py:1458: UserWarning: "kaiser_window" resampling method name is being deprecated and replaced by "sinc_interp_kaiser" in the next release. The default behavior remains unchanged.
  warnings.warn(
Loading Voicefixer
Loaded Voicefixer
Generation took 53.15977644920349 seconds, saved to './results//Veronica//Veronica_00043_fixed.wav'

Unloaded Voicefixer
[1/1] Generating line: [I am really sad,] But their legacy lives on, and their tale is still told to this day, as a reminder of the power of courage and determination, and the importance of hope and love.
Loading voice: Veronica with model 724a9bd5
Reading from latent: ./voices/Veronica/cond_latents_724a9bd5.pth
Generating autoregressive samples
Computing best candidates using CLVP
Transforming autoregressive outputs into audio..
Generating line took 42.6276969909668 seconds
Loading Voicefixer
Loaded Voicefixer
Generation took 48.25649571418762 seconds, saved to './results//Veronica//Veronica_00044_fixed.wav'

https://vocaroo.com/1cVIRssr40wb
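
As an aside on the latent step: in stock TorToiSe the conditioning latents can be precomputed once and cached, so only the first generation for a voice pays that cost. A minimal sketch, assuming the upstream `tortoise-tts` package (`TextToSpeech`, `load_voice`, `get_conditioning_latents`); the cache path is illustrative, not the path this repo actually writes:

```python
# Sketch: precompute and cache conditioning latents for a voice so repeat
# generations skip that step. Assumes the upstream tortoise-tts package; the
# cache filename below is illustrative, not the path this repo uses.
import os

import torch
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()
voice = "Veronica"
cache_path = f"./voices/{voice}/cond_latents.pth"  # hypothetical cache location

if os.path.exists(cache_path):
    conditioning_latents = torch.load(cache_path)
else:
    voice_samples, _ = load_voice(voice)
    conditioning_latents = tts.get_conditioning_latents(voice_samples)
    torch.save(conditioning_latents, cache_path)

# Subsequent generations reuse the cached latents instead of recomputing them.
audio = tts.tts_with_preset(
    "A short test sentence.",
    conditioning_latents=conditioning_latents,
    preset="ultra_fast",
)
```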

Owner

Yeah, a fresh install with fresh settings will take ages on the initial run. All these things will definitely eat up time:

  • downloading several models (the AR, the diffusion, the CLVP, BigVGAN, Voicefixer), which depends on your internet speed.
  • computing the latents for your voice (and I'll still admit its defaults are not sane, and I don't think they can ever be sane).
  • longer sentences will take much longer to generate.
  • more samples will take more sampling time AND more CLVP candidate picking time.
  • more iterations and more candidates will take more time too.

but that shouldn't explain the generation times.

I would:

  • try using the Ultra Fast preset (the disparity between it and Fast is pretty large in terms of compute time; I'm not sure why I've kept it that way, and Ultra Fast is fine enough with BigVGAN + Voicefixer to make up for things).
  • under Settings > Sample Batch Size, set it to 16.
    • you can check Unsqueeze Sample Batches and instead set Sample Batch Size to something like 32.
    • check your VRAM usage when you're generating lines. If you notice you still have free VRAM, you can increase this value more (see the sketch below).

Aside from that, I'm not sure what would make it take so long. My 2060 was able to do short sentences rather fast at sample size 1 just from using torch2.0.0.
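
For that VRAM check, a quick way to watch usage from Python while a line is generating is to poll torch's CUDA counters in a background thread (or just run `nvidia-smi` in another terminal). A minimal standalone sketch, not code from this repo:

```python
# Sketch: poll CUDA memory stats in a background thread while a generation
# runs, to see whether there's headroom to raise Sample Batch Size further.
# Standalone illustration; `watch -n 1 nvidia-smi` gives the same information.
import threading
import time

import torch

def watch_vram(stop_event, interval=1.0):
    total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    while not stop_event.is_set():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"VRAM: {allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved, {total:.1f} GiB total")
        time.sleep(interval)

stop = threading.Event()
threading.Thread(target=watch_vram, args=(stop,), daemon=True).start()
# ... kick off a generation here, then:
stop.set()
```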


I am finding this very slow as well. It seems the same file, using the same voice folder and model, generates significantly faster in DLAS than here in AI-Voice-Cloning with the same Ultra Fast default setting. The DLAS output is also much cleaner. I'm not sure the settings are exactly equivalent between the two, but I wanted to know if there was some way of leveling the playing field, as the interface for AI-Voice-Cloning is so much better...

I will post some comparison info when I get back to my machine.


> Aside from that, I'm not sure what would make it take so long. My 2060 was able to do short sentences rather fast at sample size 1 just from using torch2.0.0.

Could you please quantify this: how fast in seconds, and what do you consider a short sentence? We are also hitting an issue with inference speed, and wondering if you have had thoughts about where we could dig in (in terms of code/logic) to understand what the bottleneck is. Would be happy to do a PR once we do. Thanks for your work.


The bottleneck is largely at the sample generation, afaik. Higher quality outputs necessarily require more inference time; that's the precise trade-off, and cutting corners, short of some novel approach, probably won't get you the faster inference time you're looking for.

However, something you could try is creating a great model on tortoise and using that to generate lines for a platform like 11labs or Resemble (bear in mind they have a strong American prosody bias to their base models) and get faster inferences that way.
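
On the "where to dig in" question, wrapping a single generation in torch.profiler shows which stage dominates. A minimal sketch, assuming the upstream tortoise-tts API rather than this repo's wrapper:

```python
# Sketch: profile one generation to see where the GPU time goes (AR sampling
# vs. CLVP scoring vs. diffusion vs. vocoder). Uses the upstream tortoise-tts
# API as an assumption, not this repo's wrapper.
from torch.profiler import ProfilerActivity, profile
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("Veronica")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    tts.tts_with_preset(
        "A short test sentence.",
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="ultra_fast",
    )

# Top operators by total CUDA time; the autoregressive sampling loop is
# expected to dominate for short lines.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```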



From researching a little, it seems like 11labs actually did start off by forking TorToiSe; I wonder what they changed and updated to speed it up so much.

In terms of using 11labs or Resemble, unfortunately that does not work for our use case; we want to build something in-house without reliance on 3rd-party APIs.


Well, the basic idea still stands: train a great model on tortoise, and use that to generate a corpus for a model that is less accurate but has faster inference. The thought is that by capturing the essence of a voice with a HQ model, a smaller model can then be trained on that synthetic data, with higher quality results than just training the lower quality model on its own. This is a quick and dirty way of sparsifying ML models in other domains.
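
To make the corpus step concrete, a minimal sketch assuming the upstream tortoise-tts API and an illustrative list of prompt lines; the smaller, faster student model and its training loop are out of scope here:

```python
# Sketch: use a fine-tuned TorToiSe voice to synthesize a (text, wav) corpus
# that a smaller, faster TTS model could later be trained on. Assumes the
# upstream tortoise-tts package; paths and the prompt list are illustrative.
import os

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("Veronica")

prompts = [
    "But their legacy lives on, and their tale is still told to this day.",
    "A reminder of the power of courage and determination.",
]

os.makedirs("corpus", exist_ok=True)
with open("corpus/metadata.csv", "w") as meta:
    for i, line in enumerate(prompts):
        gen = tts.tts_with_preset(
            line,
            voice_samples=voice_samples,
            conditioning_latents=conditioning_latents,
            preset="ultra_fast",
        )
        wav_path = f"corpus/{i:05d}.wav"
        # TorToiSe outputs 24 kHz audio; squeeze the batch dim for torchaudio.
        torchaudio.save(wav_path, gen.squeeze(0).cpu(), 24000)
        meta.write(f"{wav_path}|{line}\n")
```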

Reference: mrq/ai-voice-cloning#183