forked from mrq/tortoise-tts
Added a flag to enable/disable having voicefixer use CUDA (because I'll OOM on my 2060), and changed from naively subdividing evenly (2, 4, 8, 16 pieces) to just incrementing the divisor by 1 (1, 2, 3, 4) when trying to subdivide within the constraints of the max chunk size for computing voice latents.
This commit is contained in:
parent b648186691
commit 48275899e8
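A minimal sketch of the divisor-increment fit described in the commit message (standalone illustration only; the function name and arguments are hypothetical, not the repo's actual helper):

```python
def fit_chunk_size(total_size: int, max_chunk_size: int) -> int:
    """Largest chunk size that still fits under max_chunk_size.

    The old behavior repeatedly halved the size (effective divisors 2, 4, 8, 16, ...),
    which can land far below the limit; stepping the divisor by 1 (1, 2, 3, 4, ...)
    keeps chunks as close to the limit as possible.
    """
    divisions = 1
    while total_size // divisions > max_chunk_size:
        divisions += 1
    return total_size // divisions

# e.g. with total_size=2_100_000 and max_chunk_size=1_000_000:
#   halving            -> 525_000 (divisor 4)
#   incrementing by 1  -> 700_000 (divisor 3)
```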
@@ -187,7 +187,8 @@ You'll be presented with a bunch of options in the default `Generate` tab, but d
* `Microphone Source`: Use your own voice from a line-in source.
* `Reload Voice List`: refreshes the voice list and updates. ***Click this*** after adding or removing a new voice.
* `(Re)Compute Voice Latents`: regenerates a voice's cached latents.
* `Experimental Compute Latents Mode`: this mode will combine all voice samples into one file, then split it evenly (if under the maximum allowed chunk size under `Settings`)
* `Experimental Compute Latents Mode`: this mode will adjust the behavior for computing voice latents. Leave this checked if you're unsure.
- I've left my comments on either mode in `./tortoise/api.py`, if you're curious

Below is a list of generation settings:
* `Candidates`: number of outputs to generate, starting from the best candidate. Depending on your iteration steps, generating the final sound files could be cheap, but they only offer alternatives to the samples generated to pull from (in other words, the later candidates perform worse), so don't be compelled to generate a ton of candidates.
@@ -262,6 +263,7 @@ Below are settings that override the default launch arguments. Some of these req
* `Embed Output Metadata`: enables embedding the settings and latents used to generate that audio clip inside that audio clip. Metadata is stored as a JSON string in the `lyrics` tag.
* `Slimmer Computed Latents`: falls back to the original, 12.9KiB way of storing latents (without the extra bits required for using the CVVP model).
* `Voice Fixer`: runs each generated audio clip through `voicefixer`, if available and installed.
* `Use CUDA for Voice Fixer`: if available, hints to `voicefixer` to use hardware acceleration. This flag exists specifically because I'll OOM on my 2060, since the models for `voicefixer` do not leave the GPU and are heavily fragmented, I presume.
* `Voice Latent Max Chunk Size`: during the voice latents calculation pass, this limits how large, in bytes, a chunk can be. Large values can run into VRAM OOM errors.
* `Sample Batch Size`: sets the batch size when generating autoregressive samples. Bigger batches result in faster compute, at the cost of increased VRAM consumption. Leave it at 0 to calculate a "best" fit.
* `Concurrency Count`: how many Gradio events the queue can process at once. Leave this over 1 if you want to modify settings in the UI that update other settings while generating audio clips.
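A speculative sketch for the `Embed Output Metadata` option above: reading the JSON settings back out of a clip's `lyrics` tag, assuming the tag ends up as an ID3 `USLT` frame inside the WAV (this is not necessarily how the web UI itself writes or reads it; `mutagen` and the frame lookup here are purely illustrative):

```python
import json
from mutagen.wave import WAVE  # pip install mutagen

def read_embedded_metadata(path: str) -> dict:
    # Assumption: the generation settings were serialized to JSON and written
    # into the lyrics (USLT) tag of the WAV; adjust the lookup if your files
    # store it differently.
    tags = WAVE(path).tags
    if not tags:
        raise ValueError(f"{path} has no tags")
    for key, frame in tags.items():
        if key.startswith("USLT"):  # ID3 keeps lyrics in USLT frames
            return json.loads(frame.text)
    raise ValueError(f"no lyrics metadata found in {path}")
```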
@@ -339,20 +339,35 @@ class TextToSpeech:
diffusion_conds = []
chunks = []

# new behavior: combine all samples, and divide accordingly
# doesn't work, need to fix
# below are two behaviors while i try and figure out how I should gauge the "best" method
# there's too many little variables to consider, like:
# does it matter if there's a lot of silence (from expanding to largest size)
# how detrimental is it to slice a waveform mid-sentence/word/phoneme
# is it "more accurate" to use one large file to compute the latents across
# is it "more accurate" to compute latents across each individual sample (or sentence) and then average them
# averaging latents is how tortoise can voice mix, so it most likely will just average a speaker's range
# do any of these considerations even matter? they don't really seem to

# new behavior:
# combine all samples
# divide until each chunk fits under the requested max chunk size
if calculation_mode == 1:
concat = torch.cat(samples, dim=-1)
if chunk_size is None:
chunk_size = concat.shape[-1]

if max_chunk_size is not None and chunk_size > max_chunk_size:
while chunk_size > max_chunk_size:
chunk_size = int(chunk_size / 2)
divisions = 1
while int(chunk_size / divisions) > max_chunk_size:
divisions = divisions + 1
chunk_size = int(chunk_size / divisions)

print(f"Using method 1: size of best fit: {chunk_size}")
chunks = torch.chunk(concat, int(concat.shape[-1] / chunk_size), dim=1)
# default new behavior: use the smallest voice sample as a common chunk size

# old new behavior:
# if chunking tensors: use the smallest voice sample as a common size of best fit
# if not chunking tensors: use the largest voice sample as a common size of best fit
else:
if chunk_size is None:
for sample in tqdm_override(samples, verbose=verbose and len(samples) > 1, progress=progress if len(samples) > 1 else None, desc="Calculating size of best fit..."):
@@ -374,6 +389,8 @@ class TextToSpeech:
else:
chunks = samples

# expand / truncate samples to match the common size
# required, as tensors need to be of the same length
for chunk in tqdm_override(chunks, verbose=verbose, progress=progress, desc="Computing conditioning latents..."):
chunk = pad_or_truncate(chunk, chunk_size)
cond_mel = wav_to_univnet_mel(chunk.to(device), do_normalization=False, device=device)
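For readability, here is the `calculation_mode == 1` path from the hunk above re-assembled with indentation restored (an illustrative excerpt, not the full `TextToSpeech` method; the indentation is approximate and the surrounding method body is elided):

```python
if calculation_mode == 1:
    # combine all samples, then divide until each chunk fits under the
    # requested max chunk size
    concat = torch.cat(samples, dim=-1)
    if chunk_size is None:
        chunk_size = concat.shape[-1]

    if max_chunk_size is not None and chunk_size > max_chunk_size:
        divisions = 1
        while int(chunk_size / divisions) > max_chunk_size:
            divisions = divisions + 1
        chunk_size = int(chunk_size / divisions)

    print(f"Using method 1: size of best fit: {chunk_size}")
    chunks = torch.chunk(concat, int(concat.shape[-1] / chunk_size), dim=1)
```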
webui.py
@@ -272,7 +272,7 @@ def generate(
voicefixer.restore(
input=path,
output=path,
cuda=get_device_name() == "cuda",
cuda=get_device_name() == "cuda" and args.voice_fixer_use_cuda,
#mode=mode,
)

@@ -475,7 +475,7 @@ def get_voice_list(dir=get_voice_dir()):
def update_voices():
return gr.Dropdown.update(choices=get_voice_list())

def export_exec_settings( listen, share, check_for_updates, models_from_local_only, low_vram, embed_output_metadata, latents_lean_and_mean, voice_fixer, cond_latent_max_chunk_size, sample_batch_size, concurrency_count, output_sample_rate, output_volume ):
def export_exec_settings( listen, share, check_for_updates, models_from_local_only, low_vram, embed_output_metadata, latents_lean_and_mean, voice_fixer, voice_fixer_use_cuda, cond_latent_max_chunk_size, sample_batch_size, concurrency_count, output_sample_rate, output_volume ):
args.listen = listen
args.share = share
args.check_for_updates = check_for_updates
@@ -486,6 +486,7 @@ def export_exec_settings( listen, share, check_for_updates, models_from_local_on
args.embed_output_metadata = embed_output_metadata
args.latents_lean_and_mean = latents_lean_and_mean
args.voice_fixer = voice_fixer
args.voice_fixer_use_cuda = voice_fixer_use_cuda
args.concurrency_count = concurrency_count
args.output_sample_rate = output_sample_rate
args.output_volume = output_volume
@@ -501,6 +502,7 @@ def export_exec_settings( listen, share, check_for_updates, models_from_local_on
'embed-output-metadata': args.embed_output_metadata,
'latents-lean-and-mean': args.latents_lean_and_mean,
'voice-fixer': args.voice_fixer,
'voice-fixer-use-cuda': args.voice_fixer_use_cuda,
'concurrency-count': args.concurrency_count,
'output-sample-rate': args.output_sample_rate,
'output-volume': args.output_volume,
@@ -520,6 +522,7 @@ def setup_args():
'embed-output-metadata': True,
'latents-lean-and-mean': True,
'voice-fixer': True,
'voice-fixer-use-cuda': True,
'cond-latent-max-chunk-size': 1000000,
'concurrency-count': 2,
'output-sample-rate': 44100,
@@ -541,6 +544,7 @@ def setup_args():
parser.add_argument("--no-embed-output-metadata", action='store_false', default=not default_arguments['embed-output-metadata'], help="Disables embedding output metadata into resulting WAV files for easily fetching its settings used with the web UI (data is stored in the lyrics metadata tag)")
parser.add_argument("--latents-lean-and-mean", action='store_true', default=default_arguments['latents-lean-and-mean'], help="Exports the bare essentials for latents.")
parser.add_argument("--voice-fixer", action='store_true', default=default_arguments['voice-fixer'], help="Uses python module 'voicefixer' to improve audio quality, if available.")
parser.add_argument("--voice-fixer-use-cuda", action='store_true', default=default_arguments['voice-fixer-use-cuda'], help="Hints to voicefixer to use CUDA, if available.")
parser.add_argument("--cond-latent-max-chunk-size", default=default_arguments['cond-latent-max-chunk-size'], type=int, help="Sets an upper limit to audio chunk size when computing conditioning latents")
parser.add_argument("--sample-batch-size", default=default_arguments['sample-batch-size'], type=int, help="Sets an upper limit to audio chunk size when computing conditioning latents")
parser.add_argument("--concurrency-count", type=int, default=default_arguments['concurrency-count'], help="How many Gradio events to process at once")
@@ -824,6 +828,7 @@ def setup_gradio():
gr.Checkbox(label="Embed Output Metadata", value=args.embed_output_metadata),
gr.Checkbox(label="Slimmer Computed Latents", value=args.latents_lean_and_mean),
gr.Checkbox(label="Voice Fixer", value=args.voice_fixer),
gr.Checkbox(label="Use CUDA for Voice Fixer", value=args.voice_fixer_use_cuda),
]
gr.Button(value="Check for Updates").click(check_for_updates)
gr.Button(value="Reload TTS").click(reload_tts)