generating voice clip is so much slower compared to using original Tortoise TTS #183
I decided to try this out after playing around with the original Tortoise TTS and wanting to do more fine-tuning and create my own models.
I've got it up and running, but when testing it out, it takes significantly more time to generate a short one-sentence voice clip compared to the original Tortoise. I'm using a 3080 Ti, and even on the Fast preset it takes like 5 minutes to generate a short sentence. In Tortoise, it would only take me a bit less than a minute on the Fast preset. Why is this?
I've checked that my GPU is being utilized during generation and not my CPU. I've kept all the settings at default. Any ideas what might be causing the slowdown?
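(For anyone hitting the same thing: a common cause of very slow generation is a CPU-only torch install, so it's worth double-checking what torch itself reports before digging further. This is a generic check, not something specific to this repo:

```python
# Quick sanity check that torch was built with CUDA and can see the GPU.
import torch

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```

If `CUDA available` prints `False`, generation will run on the CPU no matter what the UI settings say.)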
Post your console log.
Here it is.
I did notice that when I changed the setting to Low VRAM, it sped things up. But does that lower the quality quite a bit? As I mentioned, I'm using a 3080 Ti, so I don't know whether that qualifies as low VRAM or how it compares to the original Tortoise TTS.
Hmm. I see the first time took extremely long because it had to generate the latents for that voice and model, but that doesn't explain why it took so long the second time. Try changing "Sample Batch Size" in the Settings tab to something like 8 or 12. (I use 8 on an RTX 3060, and it would probably take me about 30 seconds to generate a line like that.)
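(For reference, "Sample Batch Size" corresponds, as far as I can tell, to the autoregressive batch size in upstream tortoise-tts, and the voice latents only need to be computed once per voice, which is why the first run pays a cold-start penalty. A minimal sketch against the upstream `tortoise` API, where `"my_voice"` is a placeholder voice folder:

```python
# Minimal sketch against upstream tortoise-tts. "my_voice" is a
# placeholder for a folder of clips under tortoise/voices/.
import time
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

# "Sample Batch Size" ~ autoregressive_batch_size: how many candidate
# clips are sampled per forward pass. Raise it until VRAM runs out.
tts = TextToSpeech(autoregressive_batch_size=8)

# Latents are derived from the voice clips once; reusing them skips the
# slow cold-start the first generation pays.
voice_samples, conditioning_latents = load_voice("my_voice")

start = time.time()
audio = tts.tts_with_preset(
    "A short test sentence.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="ultra_fast",
)
print(f"generated in {time.time() - start:.1f}s")
```

)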
Edit: Ehh, I get 53s cold-start, 48s warm (on Ultra Fast). Close enough.
https://vocaroo.com/1cVIRssr40wb
Yeah, a fresh install with fresh settings will take ages on the initial run; all of that will definitely eat up time, but it shouldn't explain the generation times.
I would:
- use the `Ultra Fast` preset (the disparity between it and `Fast` is pretty large in terms of compute time, I'm not sure why I've kept it as such, and `Ultra Fast` is fine enough with BigVGAN + Voicefixer to make up for things; see the sketch after this list for what the presets roughly cost).
- under `Settings` > `Sample Batch Size`, set it to 16.
- enable `Unsqueeze Sample Batches` and instead set `Sample Batch Size` to something like 32.

Aside from that, I'm not sure what would make it take so long. My 2060 was able to do short sentences rather fast at sample size 1 just from using torch 2.0.0.
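(For context on the `Ultra Fast` vs `Fast` disparity: the preset names map to sampler settings inside upstream tortoise-tts (`tortoise/api.py`). The values below are paraphrased from the library from memory and may differ between versions, so treat them as approximate:

```python
# Approximate preset settings from upstream tortoise-tts (tortoise/api.py).
# Exact values may differ between versions; shown only to illustrate the
# scale of the Ultra Fast -> Fast jump.
presets = {
    "ultra_fast":   {"num_autoregressive_samples": 16,  "diffusion_iterations": 30},
    "fast":         {"num_autoregressive_samples": 96,  "diffusion_iterations": 80},
    "standard":     {"num_autoregressive_samples": 256, "diffusion_iterations": 200},
    "high_quality": {"num_autoregressive_samples": 256, "diffusion_iterations": 400},
}
```

The jump from `ultra_fast` to `fast` is roughly 6x the autoregressive samples, which is where most of the wall-clock time goes.)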
I am finding this very slow as well. The same file, using the same voice folder and model, generates significantly faster in DLAS than here in AI-Voice-Cloning with the same ultra-fast default setting, and the DLAS output is also much cleaner. I'm not sure the settings are exactly equivalent between the two, but I wanted to know if there is some way of leveling the playing field, as the interface for AI-Voice-Cloning is so much better...
I will post some comparison info when I get back to my machine.
Could you please quantify this: how fast in seconds, and what do you consider a short sentence? We are also hitting an issue with inference speed and are wondering if you have thoughts about where we could dig in (in terms of code/logic) to understand what the bottleneck is. We would be happy to do a PR once we do. Thanks for your work.
The bottleneck is largely at the sample generation stage, afaik. Higher-quality outputs necessarily require more inference time; that's the precise trade-off, and short of some novel approach, cutting corners probably won't get you the faster inference you're looking for.
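(If you want to confirm that yourself before digging into the code, one simple experiment is to hold everything else fixed and scale only the number of autoregressive samples; wall-clock time should track it almost linearly. A sketch against upstream tortoise-tts, with `"my_voice"` as a placeholder:

```python
# Scale only num_autoregressive_samples to see how strongly the sampling
# stage drives total generation time. "my_voice" is a placeholder.
import time
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()
voice_samples, latents = load_voice("my_voice")

for n in (16, 64, 256):
    start = time.time()
    tts.tts(
        "A short test sentence.",
        voice_samples=voice_samples,
        conditioning_latents=latents,
        num_autoregressive_samples=n,
        diffusion_iterations=30,  # held constant to isolate the AR stage
    )
    print(f"{n:3d} AR samples: {time.time() - start:.1f}s")
```

)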
However, something you could try is creating a great model on Tortoise and using that to generate lines on a platform like 11labs or Resemble (bear in mind they have a strong American prosody bias in their base models) and get faster inference that way.
From researching a little, it seems 11labs actually did start off by forking TorToiSe; I wonder what they changed and updated to speed it up so much.
In terms of using 11labs or Resemble, unfortunately that does not work for our use case; we want to build something in-house without reliance on third-party APIs.
Well, the basic idea still stands: train a great model on Tortoise, and use that to generate a corpus for a model that is less accurate but has faster inference. The thought is that by capturing the essence of a voice with a high-quality model, a smaller model can then be trained on that synthetic data with higher-quality results than just training on the lower-quality model. This is a quick and dirty way of sparsifying ML models in other domains.
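(As a concrete illustration of the corpus-generation half of that idea: batch-render lines with a finetuned TorToiSe voice and save wav/transcript pairs, e.g. in an LJSpeech-style `metadata.csv`, for training a faster student model. A hedged sketch where the voice name, line list, and paths are all placeholders:

```python
# Generate a synthetic training corpus from a high-quality TorToiSe voice.
# "my_finetuned_voice", LINES, and the output paths are placeholders.
import os
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

LINES = [
    "First sentence of the synthetic corpus.",
    "Second sentence, and so on.",
]

tts = TextToSpeech()
voice_samples, latents = load_voice("my_finetuned_voice")
os.makedirs("corpus", exist_ok=True)

with open("corpus/metadata.csv", "w") as meta:
    for i, line in enumerate(LINES):
        audio = tts.tts_with_preset(
            line,
            voice_samples=voice_samples,
            conditioning_latents=latents,
            preset="high_quality",  # spend compute here, save it at inference
        )
        # TorToiSe outputs 24 kHz audio; write wav + transcript pair.
        torchaudio.save(f"corpus/{i:05d}.wav", audio.squeeze(0).cpu(), 24000)
        meta.write(f"{i:05d}.wav|{line}\n")
```

The slow, high-quality generation is a one-time cost; the student model trained on the resulting corpus is what serves fast inference.)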