Added a flag to enable/disable using CUDA for voicefixer (because I'll OOM on my 2060), and changed the voice-latent subdivision from naively splitting evenly into powers of two (2, 4, 8, 16 pieces) to just incrementing the divisor by 1 (1, 2, 3, 4) until the chunks fit within the max chunk size when computing voice latents.

mrq 2023-02-14 16:47:34 +00:00
parent b648186691
commit 48275899e8
3 changed files with 33 additions and 9 deletions

README.md

@@ -187,7 +187,8 @@ You'll be presented with a bunch of options in the default `Generate` tab, but d
* `Microphone Source`: Use your own voice from a line-in source.
* `Reload Voice List`: refreshes the voice list and updates. ***Click this*** after adding or removing a new voice.
* `(Re)Compute Voice Latents`: regenerates a voice's cached latents.
- * `Experimental Compute Latents Mode`: this mode will combine all voice samples into one file, then split it evenly (if under the maximum allowed chunk size under `Settings`)
+ * `Experimental Compute Latents Mode`: this mode will adjust the behavior for computing voice latents. Leave this checked if you're unsure.
+   - I've left my comments on either mode in `./tortoise/api.py`, if you're curious.

Below is a list of generation settings:
* `Candidates`: number of outputs to generate, starting from the best candidate. Depending on your iteration steps, generating the final sound files could be cheap, but they only offer alternatives to the samples generated to pull from (in other words, the later candidates perform worse), so don't be compelled to generate a ton of candidates.
@@ -262,6 +263,7 @@ Below are settings that override the default launch arguments. Some of these req
* `Embed Output Metadata`: enables embedding the settings and latents used to generate that audio clip inside that audio clip. Metadata is stored as a JSON string in the `lyrics` tag.
* `Slimmer Computed Latents`: falls back to the original, 12.9KiB way of storing latents (without the extra bits required for using the CVVP model).
* `Voice Fixer`: runs each generated audio clip through `voicefixer`, if available and installed.
+ * `Use CUDA for Voice Fixer`: if available, hints to `voicefixer` to use hardware acceleration. This flag exists because I'll OOM on my 2060; I presume the `voicefixer` models do not leave the GPU and are heavily fragmented.
* `Voice Latent Max Chunk Size`: during the voice latents calculation pass, this limits how large, in bytes, a chunk can be. Large values can run into VRAM OOM errors.
* `Sample Batch Size`: sets the batch size when generating autoregressive samples. Bigger batches result in faster compute, at the cost of increased VRAM consumption. Leave at 0 to calculate a "best" fit.
* `Concurrency Count`: how many Gradio events the queue can process at once. Leave this over 1 if you want to modify settings in the UI that update other settings while generating audio clips.
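
As a concrete illustration of how `Voice Latent Max Chunk Size` interacts with the new divisor-based splitting (the numbers below are made up for the example, not defaults from the repo):

```python
# Hypothetical case: all voice samples concatenated are 4,500,000 samples long,
# and the max chunk size is set to 1,000,000.
total_length = 4_500_000
max_chunk_size = 1_000_000

# Old behavior: halve until under the limit (2, 4, 8, 16 ... pieces).
old_chunk_size = total_length
while old_chunk_size > max_chunk_size:
    old_chunk_size = int(old_chunk_size / 2)
print(old_chunk_size)  # 562500 -> the audio gets cut into 8 pieces

# New behavior: increment the divisor by 1 until each chunk fits.
divisions = 1
while int(total_length / divisions) > max_chunk_size:
    divisions += 1
new_chunk_size = int(total_length / divisions)
print(new_chunk_size)  # 900000 -> 5 pieces, much closer to the allowed maximum
```

Incrementing the divisor keeps each chunk just under the limit, whereas the old halving approach could overshoot to chunks far smaller than the budget allows.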

tortoise/api.py

@@ -339,20 +339,35 @@ class TextToSpeech:
diffusion_conds = []
chunks = []
- # new behavior: combine all samples, and divide accordingly
- # doesn't work, need to fix
+ # below are two behaviors while I try and figure out how I should gauge the "best" method
+ # there's too many little variables to consider, like:
+ # does it matter if there's a lot of silence (from expanding to largest size)
+ # how detrimental is it to slice a waveform mid-sentence/word/phoneme
+ # is it "more accurate" to use one large file to compute the latents across
+ # is it "more accurate" to compute latents across each individual sample (or sentence) and then average them
+ # averaging latents is how tortoise can voice mix, so it most likely will just average a speaker's range
+ # do any of these considerations even matter? they don't really seem to
+ # new behavior:
+ # combine all samples
+ # divide until each chunk fits under the requested max chunk size
if calculation_mode == 1:
concat = torch.cat(samples, dim=-1)
if chunk_size is None:
chunk_size = concat.shape[-1]
if max_chunk_size is not None and chunk_size > max_chunk_size:
- while chunk_size > max_chunk_size:
- chunk_size = int(chunk_size / 2)
+ divisions = 1
+ while int(chunk_size / divisions) > max_chunk_size:
+ divisions = divisions + 1
+ chunk_size = int(chunk_size / divisions)
print(f"Using method 1: size of best fit: {chunk_size}")
chunks = torch.chunk(concat, int(concat.shape[-1] / chunk_size), dim=1)
- # default new behavior: use the smallest voice sample as a common chunk size
+ # old new behavior:
+ # if chunking tensors: use the smallest voice sample as a common size of best fit
+ # if not chunking tensors: use the largest voice sample as a common size of best fit
else:
if chunk_size is None:
for sample in tqdm_override(samples, verbose=verbose and len(samples) > 1, progress=progress if len(samples) > 1 else None, desc="Calculating size of best fit..."):
@@ -374,6 +389,8 @@ class TextToSpeech:
else:
chunks = samples

+ # expand / truncate samples to match the common size
+ # required, as tensors need to be of the same length
for chunk in tqdm_override(chunks, verbose=verbose, progress=progress, desc="Computing conditioning latents..."):
chunk = pad_or_truncate(chunk, chunk_size)
cond_mel = wav_to_univnet_mel(chunk.to(device), do_normalization=False, device=device)
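
For reference, here is a self-contained sketch of the mode-1 path above: concatenate all samples, grow the divisor until each chunk fits, split, then pad/truncate to a common length. The `chunk_samples` helper and its padding are simplified illustrations, not the repo's exact code.

```python
import torch

def chunk_samples(samples: list[torch.Tensor], max_chunk_size: int) -> list[torch.Tensor]:
    """Combine all voice samples, then split into equal chunks whose size sits
    just under max_chunk_size (the divisor-increment behavior)."""
    concat = torch.cat(samples, dim=-1)  # shape [1, total_length]
    chunk_size = concat.shape[-1]

    # Increase the divisor by 1 until each chunk fits under the limit.
    divisions = 1
    while int(chunk_size / divisions) > max_chunk_size:
        divisions += 1
    chunk_size = int(chunk_size / divisions)

    chunks = list(torch.chunk(concat, int(concat.shape[-1] / chunk_size), dim=1))

    # Pad or truncate every chunk to the common length so they stack cleanly.
    out = []
    for chunk in chunks:
        if chunk.shape[-1] < chunk_size:
            chunk = torch.nn.functional.pad(chunk, (0, chunk_size - chunk.shape[-1]))
        out.append(chunk[..., :chunk_size])
    return out

# Example: three mono clips of different lengths, max chunk of 60,000 samples.
clips = [torch.randn(1, 80_000), torch.randn(1, 50_000), torch.randn(1, 30_000)]
print([c.shape for c in chunk_samples(clips, 60_000)])  # three chunks of 53,333 samples each
```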

webui.py

@@ -272,7 +272,7 @@ def generate(
voicefixer.restore(
input=path,
output=path,
- cuda=get_device_name() == "cuda",
+ cuda=get_device_name() == "cuda" and args.voice_fixer_use_cuda,
#mode=mode,
)
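
Stripped of the surrounding web UI, the gating above amounts to something like the following minimal sketch; the `restore_clip` helper and its arguments are illustrative, not the repo's function.

```python
import torch
from voicefixer import VoiceFixer

voicefixer = VoiceFixer()

def restore_clip(path: str, allow_cuda: bool = True) -> None:
    # Only request CUDA when a GPU is actually present *and* the user
    # hasn't disabled it (e.g. to avoid OOM on smaller cards like a 2060).
    use_cuda = torch.cuda.is_available() and allow_cuda
    voicefixer.restore(input=path, output=path, cuda=use_cuda, mode=0)
```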
@@ -475,7 +475,7 @@ def get_voice_list(dir=get_voice_dir()):
def update_voices():
return gr.Dropdown.update(choices=get_voice_list())
- def export_exec_settings( listen, share, check_for_updates, models_from_local_only, low_vram, embed_output_metadata, latents_lean_and_mean, voice_fixer, cond_latent_max_chunk_size, sample_batch_size, concurrency_count, output_sample_rate, output_volume ):
+ def export_exec_settings( listen, share, check_for_updates, models_from_local_only, low_vram, embed_output_metadata, latents_lean_and_mean, voice_fixer, voice_fixer_use_cuda, cond_latent_max_chunk_size, sample_batch_size, concurrency_count, output_sample_rate, output_volume ):
args.listen = listen
args.share = share
args.check_for_updates = check_for_updates
@@ -486,6 +486,7 @@ def export_exec_settings( listen, share, check_for_updates, models_from_local_on
args.embed_output_metadata = embed_output_metadata
args.latents_lean_and_mean = latents_lean_and_mean
args.voice_fixer = voice_fixer
+ args.voice_fixer_use_cuda = voice_fixer_use_cuda
args.concurrency_count = concurrency_count
args.output_sample_rate = output_sample_rate
args.output_volume = output_volume
@@ -501,6 +502,7 @@ def export_exec_settings( listen, share, check_for_updates, models_from_local_on
'embed-output-metadata': args.embed_output_metadata,
'latents-lean-and-mean': args.latents_lean_and_mean,
'voice-fixer': args.voice_fixer,
+ 'voice-fixer-use-cuda': args.voice_fixer_use_cuda,
'concurrency-count': args.concurrency_count,
'output-sample-rate': args.output_sample_rate,
'output-volume': args.output_volume,
@@ -520,6 +522,7 @@ def setup_args():
'embed-output-metadata': True,
'latents-lean-and-mean': True,
'voice-fixer': True,
+ 'voice-fixer-use-cuda': True,
'cond-latent-max-chunk-size': 1000000,
'concurrency-count': 2,
'output-sample-rate': 44100,
@@ -541,6 +544,7 @@ def setup_args():
parser.add_argument("--no-embed-output-metadata", action='store_false', default=not default_arguments['embed-output-metadata'], help="Disables embedding output metadata into resulting WAV files for easily fetching its settings used with the web UI (data is stored in the lyrics metadata tag)")
parser.add_argument("--latents-lean-and-mean", action='store_true', default=default_arguments['latents-lean-and-mean'], help="Exports the bare essentials for latents.")
parser.add_argument("--voice-fixer", action='store_true', default=default_arguments['voice-fixer'], help="Uses python module 'voicefixer' to improve audio quality, if available.")
+ parser.add_argument("--voice-fixer-use-cuda", action='store_true', default=default_arguments['voice-fixer-use-cuda'], help="Hints to voicefixer to use CUDA, if available.")
parser.add_argument("--cond-latent-max-chunk-size", default=default_arguments['cond-latent-max-chunk-size'], type=int, help="Sets an upper limit to audio chunk size when computing conditioning latents")
parser.add_argument("--sample-batch-size", default=default_arguments['sample-batch-size'], type=int, help="Sets an upper limit to audio chunk size when computing conditioning latents")
parser.add_argument("--concurrency-count", type=int, default=default_arguments['concurrency-count'], help="How many Gradio events to process at once")
@@ -824,6 +828,7 @@ def setup_gradio():
gr.Checkbox(label="Embed Output Metadata", value=args.embed_output_metadata),
gr.Checkbox(label="Slimmer Computed Latents", value=args.latents_lean_and_mean),
gr.Checkbox(label="Voice Fixer", value=args.voice_fixer),
+ gr.Checkbox(label="Use CUDA for Voice Fixer", value=args.voice_fixer_use_cuda),
]
gr.Button(value="Check for Updates").click(check_for_updates)
gr.Button(value="Reload TTS").click(reload_tts)