(QoL improvements for) a multi-voice TTS system trained with an emphasis on quality

AI Voice Cloning for Retards and Savants

This rentry aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows (with an Nvidia GPU), as well as a stepping stone for anons that genuinely want to play around with TorToiSe.

Similar to my own findings for Stable Diffusion image generation, this rentry may appear a little disheveled as I note my new findings with TorToiSe. Please keep this in mind if the guide seems to shift a bit or sound confusing.

>B-but what about the colab notebook/hugging space instance??

I link those a bit later on as alternatives for Windows+AMD users. You're free to skip the installation section and jump after that.

> Wheres the love for Linux abloobloo

I'm extremely lazy and can't be assed to install Arch Linux again, much less create shell script equivalents. The commands should be almost 1:1 with what's in the batch file, save for the line to activate the venv.

I leave this as an exercise to the Linux reader.

>Ugh... why bother when I can just abuse 11.AI?

I very much encourage (You) to use 11.AI while it's still viable to use. For the layman, it's easier to go through the hoops of coughing up the $5 or abusing the free trial over actually setting up a TorToiSe environment and dealing with its quirks.

However, I also encourage your own experimentation with TorToiSe, as it's very, very promising; it just takes a little love and elbow grease.

Installing

Below is a very retard-proof guide for getting the software set up. In the future, I'll include a batch script to use for those that don't need tight handholding.

For setting up on Linux, the general framework should be the same, but left as an exercise to the reader.

For Windows users with an AMD GPU, tough luck, as ROCm drivers are not (easily) available for Windows, and require inane patches with PyTorch. Consider using the Colab notebook, or the Hugging Face space, for tortoise-tts.

Pre-Requirements

Python 3.9: https://www.python.org/downloads/release/python-3913/

Git (optional): https://git-scm.com/download/win

Setup

Download Python and run the installer.

After installing python, open the Start Menu and search for Command Prompt. Type cd , then drag and drop the folder you want to work in (experienced users can just cd <path> directly).

Paste git clone https://git.ecker.tech/mrq/tortoise-tts to download TorToiSe and additional scripts. Inexperienced users can just download the repo as a ZIP, and extract.

Afterwards, run setup.bat to automatically set things up.

If you've done everything right, you shouldn't have any errors.

Updating

To check for updates, simply run update.bat. It should pull from the repo, as well as fetch any new dependencies.

Pitfalls You May Encounter

I'll try and make a list of "common" (or what I feel may be common that I experience) issues with getting TorToiSe set up:

failed reading zip archive: failed finding central directory: A file failed to download completely during the model downloading initialization phase. Please open either .\models\tortoise\ or .\models\transformers\, and delete the offending file. You can deduce which file it is by reading the stack trace: a few lines above the last line, there will be a line trying to read a model path.
torch.cuda.OutOfMemoryError: CUDA out of memory.: You most likely have a GPU with low VRAM (~4GiB), and the small optimizations that keep data on the GPU are enough to OOM. Please open the start.bat file and add --low-vram to the command (for example: py app.py --low-vram) to disable those small optimizations.
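
If reading the stack trace is a chore, the first pitfall can also be hunted down with a few lines of Python. This is only a rough sketch, not something shipped with the repo: find_broken_archives is a made-up helper name, and it assumes the model files ending in .pth were saved in PyTorch's zip-based format (which the "zip archive" wording of the error suggests). Files in other formats are skipped rather than judged.

```python
import zipfile
from pathlib import Path


def find_broken_archives(model_dir, suffixes=(".pth",)):
    """Return files under model_dir that do not look like complete zip archives.

    Hypothetical helper: assumes the checkpoints are zip-based .pth files.
    """
    root = Path(model_dir)
    if not root.is_dir():
        return []
    broken = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # A truncated download is missing the zip end-of-central-directory
        # record, so is_zipfile() returns False for it.
        if not zipfile.is_zipfile(path):
            broken.append(path)
    return broken


# Example use (paths taken from the pitfall above):
# for folder in ("./models/tortoise", "./models/transformers"):
#     for path in find_broken_archives(folder):
#         print("possibly incomplete download:", path)
```

Anything it prints is a candidate for deletion, so that it gets downloaded again on the next start.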

Preparing Voice Samples

Now that the tough part is dealt with, it's time to prepare voice clips to use.

Unlike training embeddings for AI image generation, preparing a "dataset" for voice cloning is very simple. While the repo suggests using short clips of about ten seconds each, you aren't required to manually snip them up. I'm not sure which way is "better", as some voices work perfectly fine with two clips, each with minutes' worth of audio, while other voices work better with ten short clips.

As a general rule of thumb, try to source clips that aren't noisy, and are entirely just the subject you are trying to clone. If you must, run your source through a background music/noise remover (how to is an exercise left to the reader). It isn't entirely a detriment if you're unable to provide clean audio, however. Just be wary that you might have some headaches with getting acceptable output.

After sourcing some clips, here are some considerations on whether or not you should narrow down the pool you use:

if you're aiming for a specific delivery (for example, having a line re-read but with word(s) replaced), use just that clip with the line. If you want to err on the side of caution, you can add one more similar clip for safety.
if your source clips are all delivered in a similar manner (for example, the Patrick Bateman example provided later), it's not necessary to cull.
if you're hoping to generate something non-specific, you're free to just use your entire pool.

There are no hard specifics on how many, or how long, your sources should be.

After sourcing your clips, there are some considerations on how to narrow down your voice clips, if needed:

if you're aiming for a specific delivery (for example, having a line re-read but with word(s) replaced), use just that clip with the line isolated.
if you're aiming to generate a wide range of lines, you shouldn't have to worry about culling for similar clips, and you can just dump them all in for use. To me, there's no noticeable difference between combining them into one file, or keeping them all separated (outside of the initial load for a ton of files).

If you're looking to trim your clips, in my opinion, ~~Audacity~~ Tenacity works well enough, as you can easily output your clips in the proper format (22050 Hz sampling rate). Some of the time, however, the software will print out a (sometimes harmless, sometimes harmful) warning message (WavFileWarning: Chunk (non-data) not understood, skipping it.). When that happens, it's safe to assume you need to properly remux the clip with ffmpeg, simply with ffmpeg -i [input] -ar 22050 -c:a pcm_f32le [output].wav. Power users can use that command instead of relying on Tenacity to remux.
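
For a whole folder of clips, the ffmpeg command above can be wrapped in a small Python script. This is a sketch under a couple of assumptions: ffmpeg is on your PATH, and build_remux_command and remux_folder are names I made up for illustration. The flags themselves are exactly the ones from the command above.

```python
import subprocess
from pathlib import Path


def build_remux_command(src, dst, sample_rate=22050):
    """Build the ffmpeg call from the guide as an argument list."""
    return [
        "ffmpeg",
        "-i", str(src),
        "-ar", str(sample_rate),  # resample to 22050 Hz
        "-c:a", "pcm_f32le",      # 32-bit float PCM, same as the command above
        str(dst),
    ]


def remux_folder(src_dir, dst_dir):
    """Remux every .wav in src_dir into dst_dir (requires ffmpeg on PATH)."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.wav")):
        # check=True raises if ffmpeg exits with an error
        subprocess.run(build_remux_command(src, dst_dir / src.name), check=True)
```

Passing the arguments as a list (rather than one big string) sidesteps quoting problems with spaces in file names.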

After preparing your clips as WAV files at a sample rate of 22050 Hz, open up the tortoise-tts folder you're working in, navigate to ./tortoise/voices/, create a new folder with whatever name you want, then dump your clips into that folder. While you're in the voices folder, you can take a look at the other provided voices.
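
If you add voices often, the folder shuffling can be scripted. A minimal sketch, assuming your prepared clips sit in one folder as .wav files; add_voice is a hypothetical helper, and voices_root is whatever voice folder you navigated to in the step above.

```python
import shutil
from pathlib import Path


def add_voice(clips_dir, voices_root, voice_name):
    """Copy every .wav from clips_dir into voices_root/voice_name.

    Returns the names of the copied clips.
    """
    target = Path(voices_root) / voice_name
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for clip in sorted(Path(clips_dir).glob("*.wav")):
        shutil.copy2(clip, target / clip.name)  # copy2 keeps modification times
        copied.append(clip.name)
    return copied
```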

Using the Software

Now you're ready to generate clips. With the command prompt still open, simply enter start.bat, and wait for it to print out a URL to open in your browser, something like http://127.0.0.1:7860.

If you're looking to access your copy of TorToiSe from outside your local network, pass --share into the command (for example, python app.py --share). You'll get a temporary gradio link to use.

You'll be presented with a bunch of options, but do not be overwhelmed, as most of the defaults are sane. Below is a rough explanation of which input does what:

Prompt: text you want to be read. You wrap text in [brackets] for "prompt engineering", where it'll affect the output, but those words won't actually be read.
Line Delimiter: String to split the prompt into pieces. The stitched clip will be stored as combined.wav
- Setting this to \n will generate each line as one clip before stitching it. Leave blank to disable this.
Emotion: the "emotion" used for the delivery. This is a shortcut to utilizing "prompt engineering" by starting with [I am really <emotion>,] in your prompt. This is merely a suggestion, not a guarantee.
Custom Emotion + Prompt: a non-preset "emotion" used for the delivery. This is a shortcut to utilizing "prompt engineering" by starting with [<emotion>] in your prompt.
Voice: the voice you want to clone. You can select microphone if you want to use input from your microphone.
Microphone Source: Use your own voice from a line-in source.
Candidates: number of outputs to generate, starting from the best candidate. Depending on your iteration steps, generating the final sound files could be cheap, but they only offer alternatives to the samples generated to pull from (in other words, the later candidates perform worse), so don't be compelled to generate a ton of candidates.
Seed: initializes the PRNG to this value. Use this if you want to reproduce a generated voice.
Preset: shortcut values for sample count and iteration steps. Clicking a preset will update its corresponding values. Higher presets result in better quality at the cost of computation time.
Samples: analogous to samples in image generation. More samples = better resemblance / clone quality, at the cost of performance. This strictly affects clone quality.
Iterations: influences audio sound quality in the final output. More iterations = higher quality sound. This step is relatively cheap, so do not be discouraged from increasing this. This strictly affects quality in the actual sound.
Temperature: how much randomness to introduce to the generated samples. Lower values = better resemblance to the source samples, but some temperature is still required for great output. This value is very inconsistent and entirely depends on the input voice. In other words, some voices will be receptive to playing with this value, while others won't make much of a difference.
Pause Size: Governs how large pauses are at the end of a clip (in token size, not seconds). Increase this if your output gets cut off at the end.
Diffusion Sampler: sampler method during the diffusion pass. Currently, only P and DDIM are added, but they do not seem to offer any substantial differences in my short tests. P refers to the default, vanilla sampling method in diffusion.py. To reiterate, this is only useful for the diffusion decoding path, after the autoregressive outputs are generated.

Below is an explanation of the experimental flags. Messing with these might impact performance, as these are exposed only if you know what you are doing.

Half-Precision: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, due to it making computations slower in most cases.
Conditional Free: a quality boosting improvement at the cost of some performance. Enabled by default, as I think the penalty is negligible in the end.
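
To make the Line Delimiter, Emotion, and Custom Emotion options a bit more concrete, here is a rough sketch of how the text handed to the model could be assembled from them. It is not lifted from app.py: build_prompts is a made-up name, and prefixing every piece with the bracketed hint (rather than only the first one) is my assumption.

```python
def build_prompts(text, delimiter="", emotion=None, custom=False):
    """Split text on delimiter and prefix each piece with a bracketed hint.

    Illustrative only; the real web UI may assemble its prompts differently.
    """
    # A blank delimiter disables splitting, as described for Line Delimiter.
    pieces = text.split(delimiter) if delimiter else [text]
    pieces = [piece.strip() for piece in pieces if piece.strip()]
    if not emotion:
        return pieces
    # Preset emotions use "[I am really <emotion>,]", custom ones "[<emotion>]".
    prefix = f"[{emotion}]" if custom else f"[I am really {emotion},]"
    return [f"{prefix} {piece}" for piece in pieces]


print(build_prompts("Hello there.\nGoodbye.", delimiter="\n", emotion="sad"))
```

Each returned string would be generated as its own clip before stitching; the bracketed part steers the delivery without being read aloud.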

After you fill everything out, click Run, and wait for your output in the output window. The sampled voice is also returned, but if you're using multiple files, it'll return the first file, rather than a combined file.

All outputs are saved under ./result/[voice name]/[timestamp]/ as result.wav, and the settings in input.txt. There doesn't seem to be an inherent way to add a Download button in Gradio, so keep that folder in mind.

To save you from headaches, I strongly recommend playing around with shorter sentences first to find the right values for the voice you're using before generating longer sentences.

As a quick optimization, I modified the script so that the conditional_latents are saved after loading voice samples, and subsequent uses will load that file directly (at the cost of not returning the Sample voice to the web UI). If there are voice samples with a modification time newer than this cached file, it'll skip the cached file and load the normal WAVs instead.
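
The modification time check described above boils down to something like the sketch below. This is not the actual code from the repo: cache_is_fresh is a hypothetical helper, and the cache file name is passed in by the caller since the real name is not spelled out here.

```python
from pathlib import Path


def cache_is_fresh(voice_dir, cache_name):
    """Return True if the cached latents are at least as new as every .wav.

    cache_name is the (assumed) file name of the cached latents inside
    voice_dir; a missing cache counts as stale.
    """
    voice_dir = Path(voice_dir)
    cache = voice_dir / cache_name
    if not cache.is_file():
        return False
    cache_mtime = cache.stat().st_mtime
    # Any clip modified after the cache was written invalidates it.
    return all(
        wav.stat().st_mtime <= cache_mtime for wav in voice_dir.glob("*.wav")
    )
```

When this returns False, the WAVs would be loaded as usual and the latents recomputed; when it returns True, the cached file can be loaded directly.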

!NOTE!: cached latents.pth files generated before 2023.02.05 will be ignored, due to a change in computing the conditional latents. This should help bump up voice cloning quality. Apologies for the inconvenience.

Example(s)

Below are some outputs I deem substantial enough to share. As I continue delving into TorToiSe, I'll supply more examples and the values I use.

Source (Patrick Bateman):

https://files.catbox.moe/skzumo.zip

Output (My name is Patrick Bateman., fast preset):

I trimmed up some of the samples to end up with ten short clips of about 10 seconds each. With a 2060, it took a hair over a minute to generate the initial samples, then five to ten seconds for each clip of a total of three. Not too bad for something running on consumer grade shitware.

Source (Harry Mason):

Output (The McDonalds building creepypasta, custom preset of 128 samples, 256 iterations):

https://voca.ro/16XSgdlcC5uT

This took quite a while, over the course of a day half-paying-attention at the command prompt to generate the next piece. I only had to regenerate one section that sounded funny, but compared to 11.AI requiring tons of regenerations for something usable, this is nice to just let run and forget. Initially he sounds rather passable as Harry Mason, but as it goes on it seems to kinda falter. Sound effects and music are added in post and aren't generated by TorToiSe.

Caveats (and Upsides)

To me, I find a few problems:

a voice's "clonability" depends on its "compatibility" with the model TorToiSe was initially trained on. It's pretty much a gamble on what plays nicely. Patrick Bateman and Harry Mason will work nicely, while James Sunderland, SA2 Shadow, and Mitsuru will refuse to give anything consistently decent.
generation takes quite a while on cards with low compute power (for example, a 2060) for substantial texts, and gets worse for voices with "low compatibility" as more samples are required. For me personally, if it bothered me, I could rent out a Paperspace instance again and nab the non-pay-as-you-go A100 to crank out audio clips. My 2060 is my secondary card, so it might as well get some use. There are performance gains to be reaped, however, so this may dwindle away.
the content of your text could greatly affect the delivery for the entire text. For example, if you lose the die roll and the wrong emotion gets deduced, then it'll throw off the entire clip and subsequent candidates. For example, just having the James Sunderland voice say "Mary?" will have it generate as a female voice some of the time. This appears to be predicated on how "prompt engineering" works with changing emotions, so it's understandable.
the lack of an obvious analog to the "stability" and "similarity" sliders kind of sucks, but it's not the end of the world. However, the temperature option seems to prove to be a proper analog to either of these.
I'm not sure if this is specifically an """algorithm""" problem, or is just the nature of sampling, but the GPU is grossly underutilized for compute. I could be wrong and I actually have something misconfigured.

However, I can look past these as TorToiSe offers, in comparison to 11.AI:

the "speaking too fast" issue does not exist with TorToiSe. I don't need to fight with it by pretending I'm a Gaia user in the early 2000s by sprinkling ellipses.
the overall delivery seems very natural: sometimes small, dramatic pauses get added at the legitimately most convenient moments, and the inhales tend to be more natural. Many of the vocaroos from 11.AI just do not seem properly delivered.
being able to run it locally means I do not have to worry about some Polack seeing me use the "dick" word.