Added settings page, added checking for updates (disabled by default), some other things that I don't remember

remotes/1710189933836426429/master
mrq 2023-02-06 21:43:01 +07:00
parent d1172ead36
commit d8c88078f3
3 changed files with 247 additions and 100 deletions

README.md

@@ -97,7 +97,9 @@ Now you're ready to generate clips. With the command prompt still open, simply e
If you're looking to access your copy of TorToiSe from outside your local network, pass `--share` into the command (for example, `python app.py --share`). You'll get a temporary gradio link to use.
You'll be presented with a bunch of options, but do not be overwhelmed; most of the defaults are sane. Below is a rough explanation of what each input does:
### Generate
You'll be presented with a bunch of options in the default `Generate` tab, but do not be overwhelmed; most of the defaults are sane. Below is a rough explanation of what each input does:
* `Prompt`: text you want to be read. You can wrap text in `[brackets]` for "prompt engineering", where it'll affect the output, but those words won't actually be read (see the example after this list).
* `Line Delimiter`: String to split the prompt into pieces. The stitched clip will be stored as `combined.wav`
- Setting this to `\n` will generate each line as one clip before stitching it. Leave blank to disable this.
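For instance, with `Line Delimiter` set to `\n`, a prompt like the following (contents hypothetical) generates each line as its own clip before stitching them into `combined.wav`; the bracketed text steers the delivery but is never spoken:

```
[I am really sad,] Is anyone there?
This second line becomes its own clip.
```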
@@ -115,9 +117,6 @@ You'll be presented with a bunch of options, but do not be overwhelmed, as most
* `Diffusion Sampler`: sampler method during the diffusion pass. Currently, only `P` and `DDIM` are added, but neither seems to offer any substantial difference in my short tests.
`P` refers to the default, vanilla sampling method in `diffusion.py`.
To reiterate, this is ***only*** useful for the diffusion decoding path, after the autoregressive outputs are generated.
Below is an explanation of the experimental flags. Messing with these might impact performance, and they're exposed only for those who know what they're doing.
* `Half-Precision`: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, as it makes computations slower.
* `Conditional Free`: a quality-boosting improvement at the cost of some performance. Enabled by default, as I think the penalty is negligible in the end.
After you fill everything out, click `Run`, and wait for your output in the output window. The sampled voice is also returned, but if you're using multiple files, it'll return the first file, rather than a combined file.
@@ -129,6 +128,29 @@ As a quick optimization, I modified the script to have the `conditional_latents`
**!**NOTE**!**: cached `latents.pth` files generated before 2023.02.05 will be ignored, due to a change in computing the conditioning latents. This *should* help bump up voice cloning quality. Apologies for the inconvenience.
### Utilities
In this tab, you can find some helper utilities.
For now, an analog to the PNG info found in Voldy's Stable Diffusion Web UI resides here. With it, you can upload an audio file generated with this web UI to view the settings used to generate that output. Additionally, the voice latents used to generate the uploaded audio clip can be extracted.
If you want to reuse its generation settings, simply click "Copy Settings".
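Under the hood, those settings travel inside the audio file's tags. Below is a minimal sketch of reading them back outside the web UI, mirroring `read_generate_settings` in `app.py`; the file path is hypothetical:

```python
import json
import music_tag

# Generation settings are stored as JSON in the file's 'lyrics' tag.
metadata = music_tag.load_file("./results/random/result_0.wav")  # example path
settings = json.loads(str(metadata['lyrics']))

# Voice latents, if present, ride along as a base64-encoded blob.
latents_b64 = settings.pop('latents', None)
print(settings['text'], settings['seed'])
```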
### Settings
This tab (should) hold a bunch of other settings, from tunables that shouldn't be tampered with to settings pertaining to the web UI itself.
Below are settings that override the default launch arguments; changes are persisted to `./config/exec.json` (a sample follows the list). Some of these require a restart to take effect.
* `Public Share Gradio`: overrides `--share`; tells Gradio to generate a public URL for the web UI.
* `Check For Updates`: checks for updates on page load and notifies in the console. Only works if you pulled this repo from a Gitea instance.
* `Low VRAM`: disables optimizations in TorToiSe that increase VRAM consumption. Suggested if your GPU has under 6GiB.
* `Voice Latents Max Chunk Size`: during the voice latents calculation pass, this limits how large, in bytes, a chunk can be. Large values can run into VRAM OOM errors.
* `Concurrency Count`: how many Gradio events the queue can process at once. Leave this over 1 if you want to modify settings in the UI that update other settings while generating audio clips.
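For reference, a freshly written `./config/exec.json` with the stock defaults would look like this (key names and default values taken from `app.py`):

```json
{
	"share": false,
	"check-for-updates": false,
	"low-vram": false,
	"cond-latent-max-chunk-size": 1000000,
	"concurrency-count": 3
}
```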
Below is an explanation of the experimental flags. Messing with these might impact performance, and they're exposed only for those who know what they're doing (a sketch of what `Half-Precision` hints at follows the list).
* `Half-Precision`: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, as it makes computations slower.
* `Conditional Free`: a quality-boosting improvement at the cost of some performance. Enabled by default, as I think the penalty is negligible in the end.
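As a rough illustration, and not necessarily this web UI's exact code path, the `Half-Precision` hint amounts to wrapping inference in PyTorch's autocast context:

```python
import torch

# Assumed sketch: auto-cast eligible ops to float16 on CUDA during inference.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    ...  # the autoregressive and diffusion passes would run here
```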
## Example(s)
Below are some (rather outdated) outputs I deem substantial enough to share. As I continue delving into TorToiSe, I'll supply more examples and the values I use.

app.py (317 changed lines)

@@ -1,20 +1,24 @@
import os
import argparse
import gradio as gr
import torch
import torchaudio
import time
import json
import base64
import re
import urllib.request
import torch
import torchaudio
import music_tag
import gradio as gr
from datetime import datetime
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices
from tortoise.utils.text import split_and_recombine_text
import music_tag
def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, candidates, num_autoregressive_samples, diffusion_iterations, temperature, diffusion_sampler, breathing_room, experimentals, progress=gr.Progress()):
def generate(text, delimiter, emotion, prompt, voice, mic_audio, seed, candidates, num_autoregressive_samples, diffusion_iterations, temperature, diffusion_sampler, breathing_room, experimentals, progress=gr.Progress(track_tqdm=True)):
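# Core entry point: compute (or load cached) voice latents, synthesize each delimited line,
# cache per-line clips, then stitch candidates into combined outputs and tag them with their settings.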
if voice != "microphone":
voices = [voice]
else:
@@ -33,7 +37,7 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
sample_voice = voice_samples[0]
conditioning_latents = tts.get_conditioning_latents(voice_samples, progress=progress, max_chunk_size=args.cond_latent_max_chunk_size)
if voice != "microphone":
torch.save(conditioning_latents, os.path.join(f'./tortoise/voices/{voice}/', f'cond_latents.pth'))
torch.save(conditioning_latents, f'./tortoise/voices/{voice}/cond_latents.pth')
voice_samples = None
else:
sample_voice = None
@@ -41,8 +45,6 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
if seed == 0:
seed = None
print(conditioning_latents)
start_time = time.time()
settings = {
@@ -82,9 +84,10 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
audio_cache = {}
for line, cut_text in enumerate(texts):
if emotion == "Custom" and prompt.strip() != "":
cut_text = f"[{prompt},] {cut_text}"
elif emotion != "None":
if emotion == "Custom":
if prompt.strip() != "":
cut_text = f"[{prompt},] {cut_text}"
else:
cut_text = f"[I am really {emotion.lower()},] {cut_text}"
print(f"[{str(line+1)}/{str(len(texts))}] Generating line: {cut_text}")
@@ -100,15 +103,15 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
'text': cut_text,
}
os.makedirs(os.path.join(outdir, f'candidate_{j}'), exist_ok=True)
torchaudio.save(os.path.join(outdir, f'candidate_{j}/result_{line}.wav'), audio, 24000)
os.makedirs(f'{outdir}/candidate_{j}', exist_ok=True)
torchaudio.save(f'{outdir}/candidate_{j}/result_{line}.wav', audio, 24000)
else:
audio = gen.squeeze(0).cpu()
audio_cache[f"result_{line}.wav"] = {
'audio': audio,
'text': cut_text,
}
torchaudio.save(os.path.join(outdir, f'result_{line}.wav'), audio, 24000)
torchaudio.save(f'{outdir}/result_{line}.wav', audio, 24000)
output_voice = None
if len(texts) > 1:
@@ -120,17 +123,26 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
else:
audio = audio_cache[f'result_{line}.wav']['audio']
audio_clips.append(audio)
audio_clips = torch.cat(audio_clips, dim=-1)
torchaudio.save(os.path.join(outdir, f'combined_{candidate}.wav'), audio_clips, 24000)
audio_clips = torch.cat(audio_clips, dim=-1).squeeze(0).cpu()
torchaudio.save(f'{outdir}/combined_{candidate}.wav', audio_clips, 24000)
audio_cache[f'combined_{candidate}.wav'] = {
'audio': audio,
'text': cut_text,
}
if output_voice is None:
output_voice = (24000, audio_clips.squeeze().cpu().numpy())
output_voice = audio_clips
else:
if isinstance(gen, list):
output_voice = gen[0]
else:
output_voice = gen
output_voice = (24000, output_voice.squeeze().cpu().numpy())
if output_voice is not None:
output_voice = (24000, output_voice.numpy())
info = {
'text': text,
@@ -139,7 +151,6 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
'prompt': prompt,
'voice': voice,
'mic_audio': mic_audio,
'preset': preset,
'seed': seed,
'candidates': candidates,
'num_autoregressive_samples': num_autoregressive_samples,
@@ -151,27 +162,31 @@ def generate(text, delimiter, emotion, prompt, voice, mic_audio, preset, seed, c
'time': time.time()-start_time,
}
with open(os.path.join(outdir, f'input.txt'), 'w', encoding="utf-8") as f:
with open(f'{outdir}/input.json', 'w', encoding="utf-8") as f:
f.write(json.dumps(info, indent='\t') )
if voice is not None and conditioning_latents is not None:
with open(os.path.join(f'./tortoise/voices/{voice}/', f'cond_latents.pth'), 'rb') as f:
with open(f'./tortoise/voices/{voice}/cond_latents.pth', 'rb') as f:
info['latents'] = base64.b64encode(f.read()).decode("ascii")
print(f"Saved to '{outdir}'")
for path in audio_cache:
info['text'] = audio_cache[path]['text']
metadata = music_tag.load_file(os.path.join(outdir, path))
metadata = music_tag.load_file(f"{outdir}/{path}")
metadata['lyrics'] = json.dumps(info)
metadata.save()
if sample_voice is not None:
sample_voice = (22050, sample_voice.squeeze().cpu().numpy())
audio_clips = []
print(f"Saved to '{outdir}'")
info['seed'] = settings['use_deterministic_seed']
del info['latents']
with open('./config/generate.json', 'w', encoding="utf-8") as f:
f.write(json.dumps(info, indent='\t') )
return (
sample_voice,
output_voice,
@@ -192,57 +207,126 @@ def update_presets(value):
else:
return (gr.update(), gr.update())
def read_metadata(file, save_latents=True):
def read_generate_settings(file, save_latents=True):
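# Read generation settings from an audio file's tags (or from a plain JSON file);
# optionally extract any embedded voice latents to a temporary cond_latents.pth.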
j = None
latents = None
if file is not None:
metadata = music_tag.load_file(file.name)
if 'lyrics' in metadata:
j = json.loads(str(metadata['lyrics']))
if 'latents' in j and save_latents:
latents = base64.b64decode(j['latents'])
del j['latents']
if hasattr(file, 'name'):
metadata = music_tag.load_file(file.name)
if 'lyrics' in metadata:
j = json.loads(str(metadata['lyrics']))
elif file[-5:] == ".json":
with open(file, 'r') as f:
j = json.load(f)
if 'latents' in j and save_latents:
latents = base64.b64decode(j['latents'])
del j['latents']
if latents and save_latents:
outdir='./voices/.temp/'
os.makedirs(os.path.join(outdir), exist_ok=True)
with open(os.path.join(outdir, 'cond_latents.pth'), 'wb') as f:
os.makedirs(outdir, exist_ok=True)
with open(f'{outdir}/cond_latents.pth', 'wb') as f:
f.write(latents)
latents = os.path.join(outdir, 'cond_latents.pth')
latents = f'{outdir}/cond_latents.pth'
return (
j,
latents
)
def copy_settings(file):
metadata, latents = read_metadata(file, save_latents=False)
def import_generate_settings(file="./config/generate.json"):
settings, _ = read_generate_settings(file, save_latents=False)
if metadata is None:
if settings is None:
return None
return (
metadata['text'],
metadata['delimiter'],
metadata['emotion'],
metadata['prompt'],
metadata['voice'],
metadata['mic_audio'],
metadata['preset'],
metadata['seed'],
metadata['candidates'],
metadata['num_autoregressive_samples'],
metadata['diffusion_iterations'],
metadata['temperature'],
metadata['diffusion_sampler'],
metadata['breathing_room'],
metadata['experimentals'],
settings['text'],
settings['delimiter'],
settings['emotion'],
settings['prompt'],
settings['voice'],
settings['mic_audio'],
settings['seed'],
settings['candidates'],
settings['num_autoregressive_samples'],
settings['diffusion_iterations'],
settings['temperature'],
settings['diffusion_sampler'],
settings['breathing_room'],
settings['experimentals'],
)
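# Fetch a URL and parse the response as JSON; returns None on any failure.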
def curl(url):
try:
req = urllib.request.Request(url, headers={'User-Agent': 'Python'})
conn = urllib.request.urlopen(req)
data = conn.read()
data = data.decode()
data = json.loads(data)
conn.close()
return data
except Exception as e:
print(e)
return None
def check_for_updates():
if not os.path.isfile('./.git/FETCH_HEAD'):
print("Cannot check for updates: not from a git repo")
return False
with open('./.git/FETCH_HEAD', 'r', encoding="utf-8") as f:
head = f.read()
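# A FETCH_HEAD line looks roughly like: <commit hash> ... branch 'master' of https://host/owner/repo
# (assumes an HTTPS remote, e.g. a Gitea instance)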
match = re.findall(r"^([a-f0-9]+).+?https:\/\/(.+?)\/(.+?)\/(.+?)\n", head)
if match is None or len(match) == 0:
print("Cannot check for updates: cannot parse FETCH_HEAD")
return False
match = match[0]
local = match[0]
host = match[1]
owner = match[2]
repo = match[3]
res = curl(f"https://{host}/api/v1/repos/{owner}/{repo}/branches/") # this only works for Gitea instances
if res is None or len(res) == 0:
print("Cannot check for updates: cannot fetch from remote")
return False
remote = res[0]["commit"]["id"]
if remote != local:
print(f"New version found: {local[:8]} => {remote[:8]}")
return True
return False
def update_voices():
return gr.Dropdown.update(choices=os.listdir(os.path.join("tortoise", "voices")) + ["microphone"])
return gr.Dropdown.update(choices=os.listdir("./tortoise/voices") + ["microphone"])
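# Apply settings changed in the web UI to the runtime arguments, then persist them
# to ./config/exec.json so they survive restarts.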
def export_exec_settings( share, check_for_updates, low_vram, cond_latent_max_chunk_size, concurrency_count ):
args.share = share
args.low_vram = low_vram
args.check_for_updates = check_for_updates
args.cond_latent_max_chunk_size = cond_latent_max_chunk_size
args.concurrency_count = concurrency_count
settings = {
'share': args.share,
'low-vram': args.low_vram,
'check-for-updates': args.check_for_updates,
'cond-latent-max-chunk-size': args.cond_latent_max_chunk_size,
'concurrency-count': args.concurrency_count,
}
with open('./config/exec.json', 'w', encoding="utf-8") as f:
f.write(json.dumps(settings, indent='\t') )
def main():
with gr.Blocks() as webui:
@@ -253,15 +337,15 @@ def main():
delimiter = gr.Textbox(lines=1, label="Line Delimiter", placeholder="\\n")
emotion = gr.Radio(
["None", "Happy", "Sad", "Angry", "Disgusted", "Arrogant", "Custom"],
value="None",
["Happy", "Sad", "Angry", "Disgusted", "Arrogant", "Custom"],
value="Custom",
label="Emotion",
type="value",
interactive=True
)
prompt = gr.Textbox(lines=1, label="Custom Emotion + Prompt (if selected)")
voice = gr.Dropdown(
os.listdir(os.path.join("tortoise", "voices")) + ["microphone"],
os.listdir("./tortoise/voices") + ["microphone"],
label="Voice",
type="value",
)
@@ -289,8 +373,7 @@ def main():
seed = gr.Number(value=0, precision=0, label="Seed")
preset = gr.Radio(
["Ultra Fast", "Fast", "Standard", "High Quality", "None"],
value="None",
["Ultra Fast", "Fast", "Standard", "High Quality"],
label="Preset",
type="value",
)
@@ -306,8 +389,6 @@ def main():
type="value",
)
experimentals = gr.CheckboxGroup(["Half Precision", "Conditioning-Free"], value=["Conditioning-Free"], label="Experimental Flags")
preset.change(fn=update_presets,
inputs=preset,
outputs=[
@@ -322,31 +403,6 @@ def main():
submit = gr.Button(value="Generate")
#stop = gr.Button(value="Stop")
input_settings = [
text,
delimiter,
emotion,
prompt,
voice,
mic_audio,
preset,
seed,
candidates,
num_autoregressive_samples,
diffusion_iterations,
temperature,
diffusion_sampler,
breathing_room,
experimentals,
]
submit_event = submit.click(generate,
inputs=input_settings,
outputs=[selected_voice, output_audio, usedSeed],
)
#stop.click(fn=None, inputs=None, outputs=None, cancels=[submit_event])
with gr.Tab("Utilities"):
with gr.Row():
with gr.Column():
@@ -357,27 +413,96 @@ def main():
latents_out = gr.File(type="binary", label="Voice Latents")
audio_in.upload(
fn=read_metadata,
fn=read_generate_settings,
inputs=audio_in,
outputs=[
metadata_out,
latents_out
]
)
with gr.Tab("Settings"):
with gr.Row():
with gr.Column():
with gr.Box():
exec_arg_share = gr.Checkbox(label="Public Share Gradio", value=args.share)
exec_check_for_updates = gr.Checkbox(label="Check For Updates", value=args.check_for_updates)
exec_arg_low_vram = gr.Checkbox(label="Low VRAM", value=args.low_vram)
exec_arg_cond_latent_max_chunk_size = gr.Number(label="Voice Latents Max Chunk Size", precision=0, value=args.cond_latent_max_chunk_size)
exec_arg_concurrency_count = gr.Number(label="Concurrency Count", precision=0, value=args.concurrency_count)
copy_button.click(copy_settings,
inputs=audio_in, # JSON elements cannot be used as inputs
outputs=input_settings
)
webui.queue().launch(share=args.share)
experimentals = gr.CheckboxGroup(["Half Precision", "Conditioning-Free"], value=["Conditioning-Free"], label="Experimental Flags")
check_updates_now = gr.Button(value="Check for Updates")
exec_inputs = [exec_arg_share, exec_check_for_updates, exec_arg_low_vram, exec_arg_cond_latent_max_chunk_size, exec_arg_concurrency_count]
for i in exec_inputs:
i.change(
fn=export_exec_settings,
inputs=exec_inputs
)
check_updates_now.click(check_for_updates)
input_settings = [
text,
delimiter,
emotion,
prompt,
voice,
mic_audio,
seed,
candidates,
num_autoregressive_samples,
diffusion_iterations,
temperature,
diffusion_sampler,
breathing_room,
experimentals,
]
submit_event = submit.click(generate,
inputs=input_settings,
outputs=[selected_voice, output_audio, usedSeed],
)
copy_button.click(import_generate_settings,
inputs=audio_in, # JSON elements cannot be used as inputs
outputs=input_settings
)
if os.path.isfile('./config/generate.json'):
webui.load(import_generate_settings, inputs=None, outputs=input_settings)
if args.check_for_updates:
webui.load(check_for_updates)
#stop.click(fn=None, inputs=None, outputs=None, cancels=[submit_event])
webui.queue(concurrency_count=args.concurrency_count).launch(share=args.share)
if __name__ == "__main__":
default_arguments = {
'share': False,
'check-for-updates': False,
'low-vram': False,
'cond-latent-max-chunk-size': 1000000,
'concurrency-count': 3,
}
if os.path.isfile('./config/exec.json'):
with open('./config/exec.json', 'r', encoding="utf-8") as f:
default_arguments = json.load(f)
parser = argparse.ArgumentParser()
parser.add_argument("--share", action='store_true', help="Lets Gradio return a public URL to use anywhere")
parser.add_argument("--low-vram", action='store_true', help="Disables some optimizations that increases VRAM usage")
parser.add_argument("--cond-latent-max-chunk-size", type=int, default=1000000, help="Sets an upper limit to audio chunk size when computing conditioning latents")
parser.add_argument("--share", action='store_true', default=default_arguments['share'], help="Lets Gradio return a public URL to use anywhere")
parser.add_argument("--check-for-updates", action='store_true', default=default_arguments['check-for-updates'], help="Checks for update on startup")
parser.add_argument("--low-vram", action='store_true', default=default_arguments['low-vram'], help="Disables some optimizations that increases VRAM usage")
parser.add_argument("--cond-latent-max-chunk-size", default=default_arguments['cond-latent-max-chunk-size'], type=int, help="Sets an upper limit to audio chunk size when computing conditioning latents")
parser.add_argument("--concurrency-count", type=int, default=default_arguments['concurrency-count'], help="How many Gradio events to process at once")
args = parser.parse_args()
print("Initializating TorToiSe...")