forked from mrq/tortoise-tts
added a flag to enable/disable voicefixer using CUDA (because I'll OOM on my 2060), and changed the voice-latent chunking from naively subdividing by powers of two (2, 4, 8, 16 pieces) to incrementing the divisor by 1 (1, 2, 3, 4) when subdividing within the constraints of the max chunk size for computing voice latents
This commit is contained in:
parent b648186691
commit 48275899e8
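As context for the chunking change described above, here is a minimal sketch (not the repo's code) contrasting the old power-of-two subdivision with the new increment-the-divisor search. The sample lengths are made up; the 1,000,000 cap matches the default `cond-latent-max-chunk-size` further down.

```python
# Illustration only: how the old and new chunk-size searches differ when
# fitting the concatenated voice samples under the max chunk size.

def old_fit(total_len: int, max_chunk_size: int) -> int:
    # old behavior: halve until it fits, so the piece count jumps 2, 4, 8, 16...
    chunk_size = total_len
    while chunk_size > max_chunk_size:
        chunk_size = int(chunk_size / 2)
    return chunk_size

def new_fit(total_len: int, max_chunk_size: int) -> int:
    # new behavior: bump the divisor by 1 until each piece fits (1, 2, 3, 4...)
    divisions = 1
    while int(total_len / divisions) > max_chunk_size:
        divisions += 1
    return int(total_len / divisions)

print(old_fit(2_100_000, 1_000_000))  # 525000 -> the concatenated audio gets cut into 4 pieces
print(new_fit(2_100_000, 1_000_000))  # 700000 -> only 3 pieces, each closer to the cap
```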

@@ -187,7 +187,8 @@ You'll be presented with a bunch of options in the default `Generate` tab, but d
 * `Microphone Source`: Use your own voice from a line-in source.
 * `Reload Voice List`: refreshes the voice list and updates. ***Click this*** after adding or removing a new voice.
 * `(Re)Compute Voice Latents`: regenerates a voice's cached latents.
-* `Experimental Compute Latents Mode`: this mode will combine all voice samples into one file, then split it evenly (if under the maximum allowed chunk size under `Settings`)
+* `Experimental Compute Latents Mode`: this mode will adjust the behavior for computing voice latents. leave this checked if you're unsure
+  - I've left my comments on either modes in `./tortoise/api.py`, if you're curious
 
 Below are a list of generation settings:
 * `Candidates`: number of outputs to generate, starting from the best candidate. Depending on your iteration steps, generating the final sound files could be cheap, but they only offer alternatives to the samples generated to pull from (in other words, the later candidates perform worse), so don't be compelled to generate a ton of candidates.

@@ -262,6 +263,7 @@ Below are settings that override the default launch arguments. Some of these req
 * `Embed Output Metadata`: enables embedding the settings and latents used to generate that audio clip inside that audio clip. Metadata is stored as a JSON string in the `lyrics` tag.
 * `Slimmer Computed Latents`: falls back to the original, 12.9KiB way of storing latents (without the extra bits required for using the CVVP model).
 * `Voice Fixer`: runs each generated audio clip through `voicefixer`, if available and installed.
+* `Use CUDA for Voice Fixer`: if available, hints to `voicefixer` to use hardware acceleration. this flag is specifically because I'll OOM on my 2060, since the models for `voicefixer` do not leave the GPU and are heavily fragmented, I presume.
 * `Voice Latent Max Chunk Size`: during the voice latents calculation pass, this limits how large, in bytes, a chunk can be. Large values can run into VRAM OOM errors.
 * `Sample Batch Size`: sets the batch size when generating autoregressive samples. Bigger batches result in faster compute, at the cost of increased VRAM consumption. Leave to 0 to calculate a "best" fit.
 * `Concurrency Count`: how many Gradio events the queue can process at once. Leave this over 1 if you want to modify settings in the UI that updates other settings while generating audio clips.

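Since `Embed Output Metadata` above stores its JSON in the `lyrics` tag, here is a hedged sketch of reading that blob back with `mutagen`. The WAV-with-ID3 layout, the USLT (lyrics) frame, and the `read_embedded_metadata` helper are my assumptions for illustration; this commit doesn't show the tagging code itself.

```python
# Hedged sketch: pull the JSON that "Embed Output Metadata" stores in the
# lyrics tag of a generated clip. Assumes the clip is a WAV carrying ID3 tags
# and the JSON sits in a USLT (lyrics) frame; the fork may tag differently.
import json
from typing import Optional

from mutagen.wave import WAVE

def read_embedded_metadata(path: str) -> Optional[dict]:
    audio = WAVE(path)
    if not audio.tags:
        return None
    for key, frame in audio.tags.items():
        if key.startswith("USLT"):  # lyrics frames carry the JSON string
            return json.loads(str(frame))
    return None
```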

tortoise/api.py

@@ -339,20 +339,35 @@ class TextToSpeech:
     diffusion_conds = []
     chunks = []
 
-    # new behavior: combine all samples, and divide accordingly
-    # doesn't work, need to fix
+    # below are two behaviors while i try and figure out how I should gauge the "best" method
+    # there's too many little variables to consider, like:
+    # does it matter if there's a lot of silence (from expanding to largest size)
+    # how detrimental is it to slice a waveform mid-sentence/word/phoneme
+    # is it "more accurate" to use one large file to compute the latents across
+    # is it "more accurate" to compute latents across each individual sample (or sentence) and then average them
+    # averaging latents is how tortoise can voice mix, so it most likely will just average a speaker's range
+    # do any of these considerations even matter? they don't really seem to
+
+    # new behavior:
+    # combine all samples
+    # divide until each chunk fits under the requested max chunk size
     if calculation_mode == 1:
         concat = torch.cat(samples, dim=-1)
         if chunk_size is None:
             chunk_size = concat.shape[-1]
 
         if max_chunk_size is not None and chunk_size > max_chunk_size:
-            while chunk_size > max_chunk_size:
-                chunk_size = int(chunk_size / 2)
+            divisions = 1
+            while int(chunk_size / divisions) > max_chunk_size:
+                divisions = divisions + 1
+            chunk_size = int(chunk_size / divisions)
 
         print(f"Using method 1: size of best fit: {chunk_size}")
         chunks = torch.chunk(concat, int(concat.shape[-1] / chunk_size), dim=1)
-    # default new behavior: use the smallest voice sample as a common chunk size
+
+    # old new behavior:
+    # if chunkning tensors: use the smallest voice sample as a common size of best fit
+    # if not chunking tensors: use the largest voice sample as a common size of best fit
     else:
         if chunk_size is None:
             for sample in tqdm_override(samples, verbose=verbose and len(samples) > 1, progress=progress if len(samples) > 1 else None, desc="Calculating size of best fit..."):

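To make the `calculation_mode == 1` path above concrete, a small runnable sketch with toy tensors standing in for the loaded voice samples (the lengths are made up):

```python
# Toy walk-through of the method-1 path: concatenate the samples, shrink
# chunk_size by growing the divisor, then split with torch.chunk.
import torch

samples = [torch.randn(1, 700_000), torch.randn(1, 900_000), torch.randn(1, 500_000)]
max_chunk_size = 1_000_000

concat = torch.cat(samples, dim=-1)       # shape (1, 2_100_000)
chunk_size = concat.shape[-1]

divisions = 1
while int(chunk_size / divisions) > max_chunk_size:
    divisions += 1
chunk_size = int(chunk_size / divisions)  # 700_000, i.e. 3 divisions

chunks = torch.chunk(concat, int(concat.shape[-1] / chunk_size), dim=1)
print(len(chunks), [c.shape[-1] for c in chunks])  # 3 [700000, 700000, 700000]
```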
@@ -374,6 +389,8 @@ class TextToSpeech:
     else:
         chunks = samples
 
+    # expand / truncate samples to match the common size
+    # required, as tensors need to be of the same length
     for chunk in tqdm_override(chunks, verbose=verbose, progress=progress, desc="Computing conditioning latents..."):
         chunk = pad_or_truncate(chunk, chunk_size)
         cond_mel = wav_to_univnet_mel(chunk.to(device), do_normalization=False, device=device)

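`pad_or_truncate` above forces every chunk to the common length before the mel conversion; a minimal stand-in for it (my own sketch, not the repo's helper) looks like this:

```python
# Minimal stand-in for pad_or_truncate(): zero-pad on the right or cut the
# last dimension so every chunk ends up exactly `length` samples long.
import torch
import torch.nn.functional as F

def pad_or_truncate(t: torch.Tensor, length: int) -> torch.Tensor:
    if t.shape[-1] < length:
        return F.pad(t, (0, length - t.shape[-1]))
    return t[..., :length]

chunk = torch.randn(1, 650_000)
print(pad_or_truncate(chunk, 700_000).shape)  # torch.Size([1, 700000])
```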
webui.py
@@ -272,7 +272,7 @@ def generate(
     voicefixer.restore(
         input=path,
         output=path,
-        cuda=get_device_name() == "cuda",
+        cuda=get_device_name() == "cuda" and args.voice_fixer_use_cuda,
         #mode=mode,
     )
 

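For reference, the standalone `voicefixer` usage that the gated call above corresponds to; the file paths are placeholders, and passing `cuda=False` is exactly what the new flag buys you when VRAM is tight:

```python
# Standalone voicefixer usage mirroring the call above; "in.wav"/"out.wav"
# are placeholder paths. cuda=False keeps the restoration models on the CPU.
from voicefixer import VoiceFixer

vf = VoiceFixer()
vf.restore(input="in.wav", output="out.wav", cuda=False, mode=0)
```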
@@ -475,7 +475,7 @@ def get_voice_list(dir=get_voice_dir()):
 def update_voices():
     return gr.Dropdown.update(choices=get_voice_list())
 
-def export_exec_settings( listen, share, check_for_updates, models_from_local_only, low_vram, embed_output_metadata, latents_lean_and_mean, voice_fixer, cond_latent_max_chunk_size, sample_batch_size, concurrency_count, output_sample_rate, output_volume ):
+def export_exec_settings( listen, share, check_for_updates, models_from_local_only, low_vram, embed_output_metadata, latents_lean_and_mean, voice_fixer, voice_fixer_use_cuda, cond_latent_max_chunk_size, sample_batch_size, concurrency_count, output_sample_rate, output_volume ):
     args.listen = listen
     args.share = share
     args.check_for_updates = check_for_updates

@@ -486,6 +486,7 @@ def export_exec_settings( listen, share, check_for_updates, models_from_local_on
     args.embed_output_metadata = embed_output_metadata
     args.latents_lean_and_mean = latents_lean_and_mean
     args.voice_fixer = voice_fixer
+    args.voice_fixer_use_cuda = voice_fixer_use_cuda
     args.concurrency_count = concurrency_count
     args.output_sample_rate = output_sample_rate
     args.output_volume = output_volume

@@ -501,6 +502,7 @@ def export_exec_settings( listen, share, check_for_updates, models_from_local_on
     'embed-output-metadata': args.embed_output_metadata,
     'latents-lean-and-mean': args.latents_lean_and_mean,
     'voice-fixer': args.voice_fixer,
+    'voice-fixer-use-cuda': args.voice_fixer_use_cuda,
     'concurrency-count': args.concurrency_count,
     'output-sample-rate': args.output_sample_rate,
     'output-volume': args.output_volume,

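The settings dict above presumably gets written out so these flags survive a restart; a hedged sketch of that persistence step (the `./config/exec.json` path is my guess for illustration, not something this diff confirms):

```python
# Hedged sketch: persist the exported settings dict as JSON. The path is a
# guess for illustration; only a few of the keys are shown.
import json

settings = {
    'voice-fixer': True,
    'voice-fixer-use-cuda': True,
    'concurrency-count': 2,
}

with open('./config/exec.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(settings, indent='\t'))
```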
@@ -520,6 +522,7 @@ def setup_args():
     'embed-output-metadata': True,
     'latents-lean-and-mean': True,
     'voice-fixer': True,
+    'voice-fixer-use-cuda': True,
     'cond-latent-max-chunk-size': 1000000,
     'concurrency-count': 2,
     'output-sample-rate': 44100,

@@ -541,6 +544,7 @@ def setup_args():
     parser.add_argument("--no-embed-output-metadata", action='store_false', default=not default_arguments['embed-output-metadata'], help="Disables embedding output metadata into resulting WAV files for easily fetching its settings used with the web UI (data is stored in the lyrics metadata tag)")
     parser.add_argument("--latents-lean-and-mean", action='store_true', default=default_arguments['latents-lean-and-mean'], help="Exports the bare essentials for latents.")
     parser.add_argument("--voice-fixer", action='store_true', default=default_arguments['voice-fixer'], help="Uses python module 'voicefixer' to improve audio quality, if available.")
+    parser.add_argument("--voice-fixer-use-cuda", action='store_true', default=default_arguments['voice-fixer-use-cuda'], help="Hints to voicefixer to use CUDA, if available.")
     parser.add_argument("--cond-latent-max-chunk-size", default=default_arguments['cond-latent-max-chunk-size'], type=int, help="Sets an upper limit to audio chunk size when computing conditioning latents")
     parser.add_argument("--sample-batch-size", default=default_arguments['sample-batch-size'], type=int, help="Sets an upper limit to audio chunk size when computing conditioning latents")
     parser.add_argument("--concurrency-count", type=int, default=default_arguments['concurrency-count'], help="How many Gradio events to process at once")

@@ -824,6 +828,7 @@ def setup_gradio():
     gr.Checkbox(label="Embed Output Metadata", value=args.embed_output_metadata),
     gr.Checkbox(label="Slimmer Computed Latents", value=args.latents_lean_and_mean),
     gr.Checkbox(label="Voice Fixer", value=args.voice_fixer),
+    gr.Checkbox(label="Use CUDA for Voice Fixer", value=args.voice_fixer_use_cuda),
 ]
 gr.Button(value="Check for Updates").click(check_for_updates)
 gr.Button(value="Reload TTS").click(reload_tts)

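A hedged sketch of how the new checkbox can be wired through a Gradio click handler; the layout and the `save_settings` handler below are illustrative stand-ins, not the actual structure of `setup_gradio()` / `export_exec_settings()`:

```python
# Illustrative wiring only: two of the settings checkboxes feeding a save
# handler, standing in for the full export_exec_settings() signature.
import gradio as gr

def save_settings(voice_fixer: bool, voice_fixer_use_cuda: bool):
    return f"voice-fixer={voice_fixer}, voice-fixer-use-cuda={voice_fixer_use_cuda}"

with gr.Blocks() as demo:
    vf = gr.Checkbox(label="Voice Fixer", value=True)
    vf_cuda = gr.Checkbox(label="Use CUDA for Voice Fixer", value=True)
    status = gr.Textbox(label="Status")
    gr.Button(value="Save Settings").click(save_settings, inputs=[vf, vf_cuda], outputs=[status])

demo.launch()
```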