Added settings page, added checking for updates (disabled by default), some other things that I don't remember

mrq 2023-02-06 21:43:01 +00:00
parent d1172ead36
commit d8c88078f3
3 changed files with 247 additions and 100 deletions


@ -97,7 +97,9 @@ Now you're ready to generate clips. With the command prompt still open, simply e
If you're looking to access your copy of TorToiSe from outside your local network, pass `--share` into the command (for example, `python app.py --share`). You'll get a temporary gradio link to use.
### Generate
You'll be presented with a bunch of options in the default `Generate` tab, but do not be overwhelmed, as most of the defaults are sane. Below is a rough explanation of which input does what:
* `Prompt`: text you want to be read. You can wrap text in `[brackets]` for "prompt engineering", where it'll affect the output, but those words won't actually be read.
* `Line Delimiter`: String to split the prompt into pieces. The stitched clip will be stored as `combined.wav`
- Setting this to `\n` will generate each line as one clip before stitching it. Leave blank to disable this.
@ -115,9 +117,6 @@ You'll be presented with a bunch of options, but do not be overwhelmed, as most
* `Diffusion Sampler`: sampler method during the diffusion pass. Currently, only `P` and `DDIM` are added, but they do not seem to offer any substantial differences in my short tests.
`P` refers to the default, vanilla sampling method in `diffusion.py`.
To reiterate, this is ***only*** useful for the diffusion decoding path, after the autoregressive outputs are generated.
Below is an explanation of the experimental flags. Messing with these might impact performance; they are exposed only for those who know what they are doing.
* `Half Precision`: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, due to it making computations slower.
* `Conditioning-Free`: a quality-boosting improvement at the cost of some performance. Enabled by default, as I think the penalty is negligible in the end.
After you fill everything out, click `Run`, and wait for your output in the output window. The sampled voice is also returned, but if you're using multiple files, it'll return the first file, rather than a combined file.
@ -129,6 +128,29 @@ As a quick optimization, I modified the script to have the `conditional_latents`
**!**NOTE**!**: cached `latents.pth` files generated before 2023.02.05 will be ignored, due to a change in computing the conditioning latents. This *should* help bump up voice cloning quality. Apologies for the inconvenience.
### Utilities
In this tab, you can find some helper utilities that might be of assistance.
For now, an analog to the PNG info found in Voldy's Stable Diffusion Web UI resides here. With it, you can upload an audio file generated with this web UI to view the settings used to generate that output. Additionally, the voice latents used to generate the uploaded audio clip can be extracted.
If you want to reuse its generation settings, simply click "Copy Settings".
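As a rough sketch of what gets embedded: `app.py` dumps the generation settings to JSON, adds the voice latents as a base64 string under a `latents` key, and writes the result into the audio file's `lyrics` tag through `music_tag`. The snippet below only illustrates the encode/decode round trip with the standard library; the helper names `pack_settings` and `unpack_settings` are made up for this example and do not exist in the repo.

```python
import base64
import json
from typing import Optional, Tuple


def pack_settings(info: dict, latents: Optional[bytes] = None) -> str:
    """Serialize generation settings, plus optional raw latents, to one JSON string."""
    payload = dict(info)  # copy, so the caller's dict is left untouched
    if latents is not None:
        # Binary data cannot live in JSON directly, so it is stored as base64 text.
        payload["latents"] = base64.b64encode(latents).decode("ascii")
    return json.dumps(payload)


def unpack_settings(blob: str) -> Tuple[dict, Optional[bytes]]:
    """Reverse of pack_settings: return (settings, latents-or-None)."""
    payload = json.loads(blob)
    encoded = payload.pop("latents", None)
    latents = base64.b64decode(encoded) if encoded is not None else None
    return payload, latents


if __name__ == "__main__":
    blob = pack_settings({"text": "Hello there.", "seed": 1234}, latents=b"\x00\x01\x02")
    settings, latents = unpack_settings(blob)
    print(settings, latents)
```

Keeping the latents inside the same JSON blob means a single tag carries everything needed to reproduce a clip, at the cost of a larger file.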
### Settings
This tab (should) hold a bunch of other settings, from tunables that shouldn't be tampered with, to settings pertaining to the web UI itself.
Below are settings that override the default launch arguments. Some of these require restarting to work.
* `Public Share Gradio`: overrides `--share`. Tells Gradio to generate a public URL for the web UI.
* `Check For Updates`: checks for updates on page load and notifies in console. Only works if you pulled this repo from a Gitea instance.
* `Low VRAM`: disables optimizations in TorToiSe that increase VRAM consumption. Suggested if your GPU has under 6GiB.
* `Voice Latents Max Chunk Size`: during the voice latents calculation pass, this limits how large, in bytes, a chunk can be. Large values can run into VRAM OOM errors.
* `Concurrency Count`: how many Gradio events the queue can process at once. Leave this over 1 if you want to modify settings in the UI that update other settings while generating audio clips.
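How these settings override the launch arguments can be sketched as follows: the values are saved to `./config/exec.json`, and on the next start they are fed to `argparse` as defaults, so anything passed on the command line still wins. This is a minimal sketch, not a copy of `app.py`: the names `BUILTIN_DEFAULTS`, `load_defaults` and `build_parser` are hypothetical, and unlike the script (which replaces the defaults wholesale with the file's contents) it merges the file over the built-in values so a missing key falls back to the default.

```python
import argparse
import json
import os

# Same keys and values as the defaults listed in app.py.
BUILTIN_DEFAULTS = {
    "share": False,
    "check-for-updates": False,
    "low-vram": False,
    "cond-latent-max-chunk-size": 1000000,
    "concurrency-count": 3,
}


def load_defaults(path="./config/exec.json"):
    """Merge saved settings (if the file exists) over the built-in defaults."""
    defaults = dict(BUILTIN_DEFAULTS)
    if os.path.isfile(path):
        with open(path, "r", encoding="utf-8") as f:
            defaults.update(json.load(f))
    return defaults


def build_parser(defaults):
    """Build a parser whose defaults come from the saved settings."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--share", action="store_true", default=defaults["share"])
    parser.add_argument("--check-for-updates", action="store_true", default=defaults["check-for-updates"])
    parser.add_argument("--low-vram", action="store_true", default=defaults["low-vram"])
    parser.add_argument("--cond-latent-max-chunk-size", type=int, default=defaults["cond-latent-max-chunk-size"])
    parser.add_argument("--concurrency-count", type=int, default=defaults["concurrency-count"])
    return parser


if __name__ == "__main__":
    args = build_parser(load_defaults()).parse_args()
    print(vars(args))
```

One consequence of using `store_true` flags this way: once a flag is saved as `true`, it cannot be switched off from the command line, only from the Settings tab or by editing the JSON file.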
Below is an explanation of the experimental flags. Messing with these might impact performance; they are exposed only for those who know what they are doing.
* `Half Precision`: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, due to it making computations slower.
* `Conditioning-Free`: a quality-boosting improvement at the cost of some performance. Enabled by default, as I think the penalty is negligible in the end.
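As for why the update check listed earlier in this tab only works with a Gitea remote: it reads the local commit hash and remote URL out of `.git/FETCH_HEAD`, then asks the host's `/api/v1/repos/{owner}/{repo}/branches/` endpoint for the newest commit. Below is a sketch of just the parsing and comparison, with no network access. The regular expression and the `["commit"]["id"]` lookup mirror what `app.py` does; the function names `parse_fetch_head` and `has_update`, and the sample host in the usage example, are assumptions for illustration.

```python
import re

# Commit hash at the start of the line, then the first https:// URL split
# into host, owner and repository name.
FETCH_HEAD_PATTERN = re.compile(r"^([a-f0-9]+).+?https://(.+?)/(.+?)/(.+?)\n")


def parse_fetch_head(head):
    """Return the local hash and the branches API URL, or None if the text does not parse."""
    match = FETCH_HEAD_PATTERN.search(head)
    if match is None:
        return None
    local, host, owner, repo = match.groups()
    return {
        "local": local,
        "branches_url": f"https://{host}/api/v1/repos/{owner}/{repo}/branches/",
    }


def has_update(local, branches):
    """Compare the local hash against the first branch in the decoded API response."""
    if not branches:
        return False  # nothing fetched, so nothing to compare against
    return branches[0]["commit"]["id"] != local


if __name__ == "__main__":
    sample = "0123abcd" * 5 + "\t\tbranch 'main' of https://git.example.com/someone/somerepo\n"
    print(parse_fetch_head(sample))
```

Since only the first entry of the branch list is compared, the check is a hint rather than a guarantee when the remote has several branches.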
## Example(s)
Below are some (rather outdated) outputs I deem substantial enough to share. As I continue delving into TorToiSe, I'll supply more examples and the values I use.

283
app.py

@ -1,20 +1,24 @@
import os import os
import argparse import argparse
import gradio as gr
import torch
import torchaudio
import time import time
import json import json
import base64 import base64
import re
import urllib.request
import torch
import torchaudio
import music_tag
import gradio as gr
from datetime import datetime from datetime import datetime
from tortoise.api import TextToSpeech from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices from tortoise.utils.audio import load_audio, load_voice, load_voices
from tortoise.utils.text import split_and_recombine_text from tortoise.utils.text import split_and_recombine_text
import music_tag
def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, candidates, num_autoregressive_samples, diffusion_iterations, temperature, diffusion_sampler, breathing_room, experimentals, progress=gr.Progress()): def generate(text, delimiter, emotion, prompt, voice, mic_audio, seed, candidates, num_autoregressive_samples, diffusion_iterations, temperature, diffusion_sampler, breathing_room, experimentals, progress=gr.Progress(track_tqdm=True)):
if voice != "microphone": if voice != "microphone":
voices = [voice] voices = [voice]
else: else:
@ -33,7 +37,7 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
sample_voice = voice_samples[0] sample_voice = voice_samples[0]
conditioning_latents = tts.get_conditioning_latents(voice_samples, progress=progress, max_chunk_size=args.cond_latent_max_chunk_size) conditioning_latents = tts.get_conditioning_latents(voice_samples, progress=progress, max_chunk_size=args.cond_latent_max_chunk_size)
if voice != "microphone": if voice != "microphone":
torch.save(conditioning_latents, os.path.join(f'./tortoise/voices/{voice}/', f'cond_latents.pth')) torch.save(conditioning_latents, f'./tortoise/voices/{voice}/cond_latents.pth')
voice_samples = None voice_samples = None
else: else:
sample_voice = None sample_voice = None
@ -41,8 +45,6 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
if seed == 0: if seed == 0:
seed = None seed = None
print(conditioning_latents)
start_time = time.time() start_time = time.time()
settings = { settings = {
@ -82,9 +84,10 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
audio_cache = {} audio_cache = {}
for line, cut_text in enumerate(texts): for line, cut_text in enumerate(texts):
if emotion == "Custom" and prompt.strip() != "": if emotion == "Custom":
if prompt.strip() != "":
cut_text = f"[{prompt},] {cut_text}" cut_text = f"[{prompt},] {cut_text}"
elif emotion != "None": else:
cut_text = f"[I am really {emotion.lower()},] {cut_text}" cut_text = f"[I am really {emotion.lower()},] {cut_text}"
print(f"[{str(line+1)}/{str(len(texts))}] Generating line: {cut_text}") print(f"[{str(line+1)}/{str(len(texts))}] Generating line: {cut_text}")
@ -100,15 +103,15 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
'text': cut_text, 'text': cut_text,
} }
os.makedirs(os.path.join(outdir, f'candidate_{j}'), exist_ok=True) os.makedirs(f'{outdir}/candidate_{j}', exist_ok=True)
torchaudio.save(os.path.join(outdir, f'candidate_{j}/result_{line}.wav'), audio, 24000) torchaudio.save(f'{outdir}/candidate_{j}/result_{line}.wav', audio, 24000)
else: else:
audio = gen.squeeze(0).cpu() audio = gen.squeeze(0).cpu()
audio_cache[f"result_{line}.wav"] = { audio_cache[f"result_{line}.wav"] = {
'audio': audio, 'audio': audio,
'text': cut_text, 'text': cut_text,
} }
torchaudio.save(os.path.join(outdir, f'result_{line}.wav'), audio, 24000) torchaudio.save(f'{outdir}/result_{line}.wav', audio, 24000)
output_voice = None output_voice = None
if len(texts) > 1: if len(texts) > 1:
@ -120,17 +123,26 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
else: else:
audio = audio_cache[f'result_{line}.wav']['audio'] audio = audio_cache[f'result_{line}.wav']['audio']
audio_clips.append(audio) audio_clips.append(audio)
audio_clips = torch.cat(audio_clips, dim=-1)
torchaudio.save(os.path.join(outdir, f'combined_{candidate}.wav'), audio_clips, 24000) audio_clips = torch.cat(audio_clips, dim=-1).squeeze(0).cpu()
torchaudio.save(f'{outdir}/combined_{candidate}.wav', audio_clips, 24000)
audio_cache[f'combined_{candidate}.wav'] = {
'audio': audio,
'text': cut_text,
}
if output_voice is None: if output_voice is None:
output_voice = (24000, audio_clips.squeeze().cpu().numpy()) output_voice = audio_clips
else: else:
if isinstance(gen, list): if isinstance(gen, list):
output_voice = gen[0] output_voice = gen[0]
else: else:
output_voice = gen output_voice = gen
output_voice = (24000, output_voice.squeeze().cpu().numpy())
if output_voice is not None:
output_voice = (24000, output_voice.numpy())
info = { info = {
'text': text, 'text': text,
@ -139,7 +151,6 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
'prompt': prompt, 'prompt': prompt,
'voice': voice, 'voice': voice,
'mic_audio': mic_audio, 'mic_audio': mic_audio,
'preset': preset,
'seed': seed, 'seed': seed,
'candidates': candidates, 'candidates': candidates,
'num_autoregressive_samples': num_autoregressive_samples, 'num_autoregressive_samples': num_autoregressive_samples,
@ -151,27 +162,31 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
'time': time.time()-start_time, 'time': time.time()-start_time,
} }
with open(os.path.join(outdir, f'input.txt'), 'w', encoding="utf-8") as f: with open(f'{outdir}/input.json', 'w', encoding="utf-8") as f:
f.write(json.dumps(info, indent='\t') ) f.write(json.dumps(info, indent='\t') )
if voice is not None and conditioning_latents is not None: if voice is not None and conditioning_latents is not None:
with open(os.path.join(f'./tortoise/voices/{voice}/', f'cond_latents.pth'), 'rb') as f: with open(f'./tortoise/voices/{voice}/cond_latents.pth', 'rb') as f:
info['latents'] = base64.b64encode(f.read()).decode("ascii") info['latents'] = base64.b64encode(f.read()).decode("ascii")
print(f"Saved to '{outdir}'")
for path in audio_cache: for path in audio_cache:
info['text'] = audio_cache[path]['text'] info['text'] = audio_cache[path]['text']
metadata = music_tag.load_file(os.path.join(outdir, path)) metadata = music_tag.load_file(f"{outdir}/{path}")
metadata['lyrics'] = json.dumps(info) metadata['lyrics'] = json.dumps(info)
metadata.save() metadata.save()
if sample_voice is not None: if sample_voice is not None:
sample_voice = (22050, sample_voice.squeeze().cpu().numpy()) sample_voice = (22050, sample_voice.squeeze().cpu().numpy())
audio_clips = [] print(f"Saved to '{outdir}'")
info['seed'] = settings['use_deterministic_seed']
info.pop('latents', None) # 'latents' is only present when the conditioning latents were embedded above
with open(f'./config/generate.json', 'w', encoding="utf-8") as f:
f.write(json.dumps(info, indent='\t') )
return ( return (
sample_voice, sample_voice,
output_voice, output_voice,
@ -192,14 +207,18 @@ def update_presets(value):
else: else:
return (gr.update(), gr.update()) return (gr.update(), gr.update())
def read_metadata(file, save_latents=True): def read_generate_settings(file, save_latents=True):
j = None j = None
latents = None latents = None
if file is not None: if file is not None:
if hasattr(file, 'name'):
metadata = music_tag.load_file(file.name) metadata = music_tag.load_file(file.name)
if 'lyrics' in metadata: if 'lyrics' in metadata:
j = json.loads(str(metadata['lyrics'])) j = json.loads(str(metadata['lyrics']))
elif file[-5:] == ".json":
with open(file, 'r') as f:
j = json.load(f)
if 'latents' in j and save_latents: if 'latents' in j and save_latents:
latents = base64.b64decode(j['latents']) latents = base64.b64decode(j['latents'])
@ -207,42 +226,107 @@ def read_metadata(file, save_latents=True):
if latents and save_latents: if latents and save_latents:
outdir='/voices/.temp/' outdir='/voices/.temp/'
os.makedirs(os.path.join(outdir), exist_ok=True) os.makedirs(outdir, exist_ok=True)
with open(os.path.join(outdir, 'cond_latents.pth'), 'wb') as f: with open(f'{outdir}/cond_latents.pth', 'wb') as f:
f.write(latents) f.write(latents)
latents = os.path.join(outdir, 'cond_latents.pth') latents = f'{outdir}/cond_latents.pth'
return ( return (
j, j,
latents latents
) )
def copy_settings(file): def import_generate_settings(file="./config/generate.json"):
metadata, latents = read_metadata(file, save_latents=False) settings, _ = read_generate_settings(file, save_latents=False)
if metadata is None: if settings is None:
return None return None
return ( return (
metadata['text'], settings['text'],
metadata['delimiter'], settings['delimiter'],
metadata['emotion'], settings['emotion'],
metadata['prompt'], settings['prompt'],
metadata['voice'], settings['voice'],
metadata['mic_audio'], settings['mic_audio'],
metadata['preset'], settings['seed'],
metadata['seed'], settings['candidates'],
metadata['candidates'], settings['num_autoregressive_samples'],
metadata['num_autoregressive_samples'], settings['diffusion_iterations'],
metadata['diffusion_iterations'], settings['temperature'],
metadata['temperature'], settings['diffusion_sampler'],
metadata['diffusion_sampler'], settings['breathing_room'],
metadata['breathing_room'], settings['experimentals'],
metadata['experimentals'],
) )
def curl(url):
try:
req = urllib.request.Request(url, headers={'User-Agent': 'Python'})
conn = urllib.request.urlopen(req)
data = conn.read()
data = data.decode()
data = json.loads(data)
conn.close()
return data
except Exception as e:
print(e)
return None
def check_for_updates():
if not os.path.isfile('./.git/FETCH_HEAD'):
print("Cannot check for updates: not from a git repo")
return False
with open(f'./.git/FETCH_HEAD', 'r', encoding="utf-8") as f:
head = f.read()
match = re.findall(r"^([a-f0-9]+).+?https:\/\/(.+?)\/(.+?)\/(.+?)\n", head)
if match is None or len(match) == 0:
print("Cannot check for updates: cannot parse FETCH_HEAD")
return False
match = match[0]
local = match[0]
host = match[1]
owner = match[2]
repo = match[3]
res = curl(f"https://{host}/api/v1/repos/{owner}/{repo}/branches/") #this only works for gitea instances
if res is None or len(res) == 0:
print("Cannot check for updates: cannot fetch from remote")
return False
remote = res[0]["commit"]["id"]
if remote != local:
print(f"New version found: {local[:8]} => {remote[:8]}")
return True
return False
def update_voices(): def update_voices():
return gr.Dropdown.update(choices=os.listdir(os.path.join("tortoise", "voices")) + ["microphone"]) return gr.Dropdown.update(choices=os.listdir("./tortoise/voices") + ["microphone"])
def export_exec_settings( share, check_for_updates, low_vram, cond_latent_max_chunk_size, concurrency_count ):
args.share = share
args.low_vram = low_vram
args.check_for_updates = check_for_updates
args.cond_latent_max_chunk_size = cond_latent_max_chunk_size
args.concurrency_count = concurrency_count
settings = {
'share': args.share,
'low-vram':args.low_vram,
'check-for-updates':args.check_for_updates,
'cond-latent-max-chunk-size': args.cond_latent_max_chunk_size,
'concurrency-count': args.concurrency_count,
}
with open(f'./config/exec.json', 'w', encoding="utf-8") as f:
f.write(json.dumps(settings, indent='\t') )
def main(): def main():
with gr.Blocks() as webui: with gr.Blocks() as webui:
@ -253,15 +337,15 @@ def main():
delimiter = gr.Textbox(lines=1, label="Line Delimiter", placeholder="\\n") delimiter = gr.Textbox(lines=1, label="Line Delimiter", placeholder="\\n")
emotion = gr.Radio( emotion = gr.Radio(
["None", "Happy", "Sad", "Angry", "Disgusted", "Arrogant", "Custom"], ["Happy", "Sad", "Angry", "Disgusted", "Arrogant", "Custom"],
value="None", value="Custom",
label="Emotion", label="Emotion",
type="value", type="value",
interactive=True interactive=True
) )
prompt = gr.Textbox(lines=1, label="Custom Emotion + Prompt (if selected)") prompt = gr.Textbox(lines=1, label="Custom Emotion + Prompt (if selected)")
voice = gr.Dropdown( voice = gr.Dropdown(
os.listdir(os.path.join("tortoise", "voices")) + ["microphone"], os.listdir("./tortoise/voices") + ["microphone"],
label="Voice", label="Voice",
type="value", type="value",
) )
@ -289,8 +373,7 @@ def main():
seed = gr.Number(value=0, precision=0, label="Seed") seed = gr.Number(value=0, precision=0, label="Seed")
preset = gr.Radio( preset = gr.Radio(
["Ultra Fast", "Fast", "Standard", "High Quality", "None"], ["Ultra Fast", "Fast", "Standard", "High Quality"],
value="None",
label="Preset", label="Preset",
type="value", type="value",
) )
@ -306,8 +389,6 @@ def main():
type="value", type="value",
) )
experimentals = gr.CheckboxGroup(["Half Precision", "Conditioning-Free"], value=["Conditioning-Free"], label="Experimental Flags")
preset.change(fn=update_presets, preset.change(fn=update_presets,
inputs=preset, inputs=preset,
outputs=[ outputs=[
@ -322,6 +403,47 @@ def main():
submit = gr.Button(value="Generate") submit = gr.Button(value="Generate")
#stop = gr.Button(value="Stop") #stop = gr.Button(value="Stop")
with gr.Tab("Utilities"):
with gr.Row():
with gr.Column():
audio_in = gr.File(type="file", label="Audio Input", file_types=["audio"])
copy_button = gr.Button(value="Copy Settings")
with gr.Column():
metadata_out = gr.JSON(label="Audio Metadata")
latents_out = gr.File(type="binary", label="Voice Latents")
audio_in.upload(
fn=read_generate_settings,
inputs=audio_in,
outputs=[
metadata_out,
latents_out
]
)
with gr.Tab("Settings"):
with gr.Row():
with gr.Column():
with gr.Box():
exec_arg_share = gr.Checkbox(label="Public Share Gradio", value=args.share)
exec_check_for_updates = gr.Checkbox(label="Check For Updates", value=args.check_for_updates)
exec_arg_low_vram = gr.Checkbox(label="Low VRAM", value=args.low_vram)
exec_arg_cond_latent_max_chunk_size = gr.Number(label="Voice Latents Max Chunk Size", precision=0, value=args.cond_latent_max_chunk_size)
exec_arg_concurrency_count = gr.Number(label="Concurrency Count", precision=0, value=args.concurrency_count)
experimentals = gr.CheckboxGroup(["Half Precision", "Conditioning-Free"], value=["Conditioning-Free"], label="Experimental Flags")
check_updates_now = gr.Button(value="Check for Updates")
exec_inputs = [exec_arg_share, exec_check_for_updates, exec_arg_low_vram, exec_arg_cond_latent_max_chunk_size, exec_arg_concurrency_count]
for i in exec_inputs:
i.change(
fn=export_exec_settings,
inputs=exec_inputs
)
check_updates_now.click(check_for_updates)
input_settings = [ input_settings = [
text, text,
@ -330,7 +452,6 @@ def main():
prompt, prompt,
voice, voice,
mic_audio, mic_audio,
preset,
seed, seed,
candidates, candidates,
num_autoregressive_samples, num_autoregressive_samples,
@ -346,38 +467,42 @@ def main():
outputs=[selected_voice, output_audio, usedSeed], outputs=[selected_voice, output_audio, usedSeed],
) )
#stop.click(fn=None, inputs=None, outputs=None, cancels=[submit_event]) copy_button.click(import_generate_settings,
with gr.Tab("Utilities"):
with gr.Row():
with gr.Column():
audio_in = gr.File(type="file", label="Audio Input", file_types=["audio"])
copy_button = gr.Button(value="Copy Settings")
with gr.Column():
metadata_out = gr.JSON(label="Audio Metadata")
latents_out = gr.File(type="binary", label="Voice Latents")
audio_in.upload(
fn=read_metadata,
inputs=audio_in,
outputs=[
metadata_out,
latents_out
]
)
copy_button.click(copy_settings,
inputs=audio_in, # JSON elements cannot be used as inputs
outputs=input_settings outputs=input_settings
) )
webui.queue().launch(share=args.share) if os.path.isfile('./config/generate.json'):
webui.load(import_generate_settings, inputs=None, outputs=input_settings)
if args.check_for_updates:
webui.load(check_for_updates)
#stop.click(fn=None, inputs=None, outputs=None, cancels=[submit_event])
webui.queue(concurrency_count=args.concurrency_count).launch(share=args.share)
if __name__ == "__main__": if __name__ == "__main__":
default_arguments = {
'share': False,
'check-for-updates': False,
'low-vram': False,
'cond-latent-max-chunk-size': 1000000,
'concurrency-count': 3,
}
if os.path.isfile('./config/exec.json'):
with open(f'./config/exec.json', 'r', encoding="utf-8") as f:
default_arguments = json.load(f)
parser = argparse.ArgumentParser() parser = argparse.ArgumentParser()
parser.add_argument("--share", action='store_true', help="Lets Gradio return a public URL to use anywhere") parser.add_argument("--share", action='store_true', default=default_arguments['share'], help="Lets Gradio return a public URL to use anywhere")
parser.add_argument("--low-vram", action='store_true', help="Disables some optimizations that increases VRAM usage") parser.add_argument("--check-for-updates", action='store_true', default=default_arguments['check-for-updates'], help="Checks for update on startup")
parser.add_argument("--cond-latent-max-chunk-size", type=int, default=1000000, help="Sets an upper limit to audio chunk size when computing conditioning latents") parser.add_argument("--low-vram", action='store_true', default=default_arguments['low-vram'], help="Disables some optimizations that increases VRAM usage")
parser.add_argument("--cond-latent-max-chunk-size", default=default_arguments['cond-latent-max-chunk-size'], type=int, help="Sets an upper limit to audio chunk size when computing conditioning latents")
parser.add_argument("--concurrency-count", type=int, default=default_arguments['concurrency-count'], help="How many Gradio events to process at once")
args = parser.parse_args() args = parser.parse_args()
print("Initializing TorToiSe...") print("Initializing TorToiSe...")

0
config/.gitkeep Executable file