diff --git a/README.md b/README.md index 55b83ac..e833036 100755 --- a/README.md +++ b/README.md @@ -74,8 +74,8 @@ After sourcing some clips, here are some considerations whether you should narro There's no hard specifics on how many, or how long, your sources should be. After sourcing your clips, there are some considerations on how to narrow down your voice clips, if needed: -* if you're aiming for a specific delivery (for example, having a line re-read but with word(s) replaced), use just that clip with the line. If you want to err on the side of caution, you can add one more similar clip for safety. -* if you're aiming to generate a wide range of lines, you shouldn't have to worry about culling for similar clips, and you can just dump them all in for use +* if you're aiming for a specific delivery (for example, having a line re-read but with word(s) replaced), use just that clip with the line isolated. +* if you're aiming to generate a wide range of lines, you shouldn't have to worry about culling for similar clips, and you can just dump them all in for use. To me, there's no noticeable difference between combining them into one file, or keeping them all separated (outside of the initial load for a ton of files). If you're looking to trim your clips, in my opinion, ~~Audacity~~ Tenacity works good enough, as you can easily output your clips into the proper format (22050 Hz sampling rate), but some of the time, the software will print out some (sometimes harmless, sometimes harmful) warning message (`WavFileWarning: Chunk (non-data) not understood, skipping it.`), it's safe to assume you need to properly remux it with `ffmpeg`, simply with `ffmpeg -i [input] -ar 22050 -c:a pcm_f32le [output].wav`. Power users can use the previous command instead of relying on Tenacity to remux. @@ -106,6 +106,9 @@ You'll be presented with a bunch of options, but do not be overwhelmed, as most * `Diffusion Sampler`: sampler method during the diffusion pass.
Currently, only `P` and `DDIM` are added, but does not seem to offer any substantial differences in my short tests. `P` refers to the default, vanilla sampling method in `diffusion.py`. To reiterate, this ***only*** is useful for the diffusion decoding path, after the autoregressive outputs are generated. +Below is an explanation of the experimental flags. Messing with these might impact performance; they are exposed only for users who know what they are doing. +* `Half-Precision`: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, as it makes computations slower in most cases. +* `Conditioning-Free`: a quality-boosting improvement at the cost of some performance. Enabled by default, as I think the penalty is negligible in the end. After you fill everything out, click `Run`, and wait for your output in the output window. The sampled voice is also returned, but if you're using multiple files, it'll return the first file, rather than a combined file. @@ -138,7 +141,7 @@ Source (Harry Mason): Output (The McDonalds building creepypasta, custom preset of 128 samples, 256 iterations): * https://voca.ro/16XSgdlcC5uT -This took quite a while, over the course of a day half-paying-attention at the command prompt to generate the next piece. I only had to regenerate one section that sounded funny, but compared to 11.AI requiring tons of regenerations for something usable, this is nice to just let run and forget. Initially he sounds rather passable as Harry Mason, but as it goes on it seems to kinda falter. **!**NOTE**!**: sound effects and music are added in post and aren't generated by TorToiSe. +This took quite a while, over the course of a day half-paying-attention at the command prompt to generate the next piece. I only had to regenerate one section that sounded funny, but compared to 11.AI requiring tons of regenerations for something usable, this is nice to just let run and forget.
Initially he sounds rather passable as Harry Mason, but as it goes on it seems to kinda falter. Sound effects and music are added in post and aren't generated by TorToiSe. ## Caveats (and Upsides) @@ -147,6 +150,7 @@ To me, I find a few problems: It's pretty much a gamble on what plays nicely. Patrick Bateman and Harry Mason will work nice, while James Sunderland, SA2 Shadow, and Mitsuru will refuse to get anything consistently decent. * generation time takes quite a while on cards with low compute power (for example, a 2060) for substantial texts, and gets worse for voices with "low compatability" as more samples are required. For me personally, if it bothered me, I could rent out a Paperspace instance again and nab the non-pay-as-you-go A100 to crank out audio clips. My 2060 is my secondary card, so it might as well get some use. + There are performance gains to be reaped, however, so this may dwindle away. * the content of your text could ***greatly*** affect the delivery for the entire text. For example, if you lose the die roll and the wrong emotion gets deduced, then it'll throw off the entire clip and subsequent candidates. For example, just having the James Sunderland voice say "Mary?" will have it generate as a female voice some of the time. @@ -158,4 +162,4 @@ To me, I find a few problems: However, I can look past these as TorToiSe offers, in comparison to 11.AI: * the "speaking too fast" issue does not exist with TorToiSe. I don't need to fight with it by pretending I'm a Gaia user in the early 2000s by sprinkling ellipses. * the overall delivery seems very natural, sometimes small, dramatic pauses gets added at the legitimately most convenient moments, and the inhales tend to be more natural. Many of vocaroos from 11.AI where it just does not seem properly delivered. -* being able to run it locally means I do not have to worry about some Polack seeing me use the "dick" word. 
\ No newline at end of file +* being able to run it locally means I do not have to worry about some Polack seeing me use the "dick" word. diff --git a/app.py b/app.py index 0f38687..742ebdd 100755 --- a/app.py +++ b/app.py @@ -11,8 +11,6 @@ from tortoise.utils.audio import load_audio, load_voice, load_voices from tortoise.utils.text import split_and_recombine_text def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, candidates, num_autoregressive_samples, diffusion_iterations, temperature, diffusion_sampler, breathing_room, experimentals, progress=gr.Progress()): - print(experimentals) - if voice != "microphone": voices = [voice] else: @@ -30,7 +28,7 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c if voice_samples is not None: sample_voice = voice_samples[0] conditioning_latents = tts.get_conditioning_latents(voice_samples, progress=progress) - torch.save(conditioning_latents, os.path.join(f'./tortoise/voices/{voice}/', f'latents.pth')) + torch.save(conditioning_latents, os.path.join(f'./tortoise/voices/{voice}/', 'cond_latents.pth')) voice_samples = None else: sample_voice = None @@ -214,13 +212,13 @@ def main(): temperature = gr.Slider(value=0.2, minimum=0, maximum=1, step=0.1, label="Temperature") breathing_room = gr.Slider(value=12, minimum=1, maximum=32, step=1, label="Pause Size") diffusion_sampler = gr.Radio( - ["P", "DDIM"], + ["P", "DDIM"], # + ["K_Euler_A", "DPM++2M"], value="P", label="Diffusion Samplers", type="value", ) - experimentals = gr.CheckboxGroup(["Half Precision", "Conditioning-Free"], value=[False, True], label="Experimental Flags") + experimentals = gr.CheckboxGroup(["Half Precision", "Conditioning-Free"], value=["Conditioning-Free"], label="Experimental Flags") preset.change(fn=update_presets, inputs=preset, diff --git a/tortoise/utils/audio.py b/tortoise/utils/audio.py index b2a5cfd..e7cd3a8 100755 --- a/tortoise/utils/audio.py +++ b/tortoise/utils/audio.py @@ -108,7 +108,7 @@
def load_voice(voice, extra_voice_dirs=[], load_latents=True): voices = [] latent = None for file in paths: - if file == "cond_latents.pth": + if file.endswith("cond_latents.pth"): latent = file elif file[-4:] == ".pth":
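The `load_voice` change above swaps an exact-name comparison for a suffix check, since entries in `paths` can carry directory components. A minimal illustration of the suffix check (`is_latents_file` is a hypothetical name, not a function from the repo):

```python
def is_latents_file(path: str) -> bool:
    # Same effect as the fixed-width slice `path[-16:] == "cond_latents.pth"`,
    # but it won't silently break if the target filename ever changes length.
    return path.endswith("cond_latents.pth")

# is_latents_file("./tortoise/voices/harry/cond_latents.pth")  # → True
# is_latents_file("latents.pth")                               # → False
```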
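As a sketch of what the `Half-Precision` flag described in the README hunk amounts to, assuming PyTorch's `torch.autocast` (`autocast_context` is a hypothetical helper, not code from this patch), with a no-op fallback when torch is unavailable:

```python
import contextlib

def autocast_context(half_precision: bool):
    # When the Half-Precision flag is checked, hint PyTorch to auto-cast
    # compute to float16; otherwise (or if torch isn't installed) return
    # a do-nothing context manager.
    if half_precision:
        try:
            import torch
            return torch.autocast("cuda", dtype=torch.float16)
        except ImportError:
            pass
    return contextlib.nullcontext()

# usage: wrap the generation call
# with autocast_context(half_precision=True):
#     ...run the autoregressive and diffusion passes...
```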
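The README's suggested `ffmpeg` remux can also be scripted for a whole folder of clips; a sketch using only the standard library (`remux_command` and `remux_folder` are hypothetical helpers, and the flags mirror the command quoted in the README):

```python
import subprocess
from pathlib import Path

def remux_command(src, dst):
    # Mirrors `ffmpeg -i [input] -ar 22050 -c:a pcm_f32le [output].wav`.
    return ["ffmpeg", "-i", str(src), "-ar", "22050", "-c:a", "pcm_f32le", str(dst)]

def remux_folder(folder, run=subprocess.run):
    # Remux every non-WAV clip in `folder` to a 22050 Hz float32 WAV
    # alongside the original file.
    for src in sorted(Path(folder).iterdir()):
        if src.suffix.lower() != ".wav":
            run(remux_command(src, src.with_suffix(".wav")), check=True)
```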