Initial refactor

2023-02-17 00:08:27 +00:00 · 2023-02-17 00:08:27 +00:00 · 3a078df95e
commit 3a078df95e
parent 0456f71ec3
22 changed files with 1816 additions and 3 deletions
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,141 @@
 # ignores user files
 /tortoise-venv/
 /tortoise/voices/
 /models/
 /config/*
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
 *$py.class
 # C extensions
 *.so
 # Distribution / packaging
 .Python
 build/
 develop-eggs/
 dist/
 downloads/
 eggs/
 .eggs/
 lib/
 lib64/
 parts/
 sdist/
 var/
 wheels/
 pip-wheel-metadata/
 share/python-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 MANIFEST
 # PyInstaller
 #  Usually these files are written by a python script from a template
 #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 *.manifest
 *.spec
 # Installer logs
 pip-log.txt
 pip-delete-this-directory.txt
 # Unit test / coverage reports
 htmlcov/
 .tox/
 .nox/
 .coverage
 .coverage.*
 .cache
 nosetests.xml
 coverage.xml
 *.cover
 *.py,cover
 .hypothesis/
 .pytest_cache/
 # Translations
 *.mo
 *.pot
 # Django stuff:
 *.log
 local_settings.py
 db.sqlite3
 db.sqlite3-journal
 # Flask stuff:
 instance/
 .webassets-cache
 # Scrapy stuff:
 .scrapy
 # Sphinx documentation
 docs/_build/
 # PyBuilder
 target/
 # Jupyter Notebook
 .ipynb_checkpoints
 # IPython
 profile_default/
 ipython_config.py
 # pyenv
 .python-version
 # pipenv
 #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 #   install all needed dependencies.
 #Pipfile.lock
 # PEP 582; used by e.g. github.com/David-OConnor/pyflow
 __pypackages__/
 # Celery stuff
 celerybeat-schedule
 celerybeat.pid
 # SageMath parsed files
 *.sage.py
 # Environments
 .env
 .venv
 env/
 venv/
 ENV/
 env.bak/
 venv.bak/
 # Spyder project settings
 .spyderproject
 .spyproject
 # Rope project settings
 .ropeproject
 # mkdocs documentation
 /site
 # mypy
 .mypy_cache/
 .dmypy.json
 dmypy.json
 # Pyre type checker
 .pyre/
 .idea/*
 .models/*
 .custom/*
 results/*
 debug_states/*
--- a/0
+++ b/0
--- a/README.md
+++ b/README.md
@ -1,3 +1,299 @@
-# ai-voice-cloning
+# AI Voice Cloning
-Collection of utilities aimed to voice clone through AI
+This [repo](https://git.ecker.tech/mrq/ai-voice-cloning)/[rentry](https://rentry.org/AI-Voice-Cloning/) aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows, as well as a stepping stone for anons that genuinely want to play around with [TorToiSe](https://github.com/neonbjb/tortoise-tts).
 Similar to my own findings for Stable Diffusion image generation, this rentry may appear a little disheveled as I note my new findings with TorToiSe. Please keep this in mind if the guide seems to shift a bit or sound confusing.
 >\>Ugh... why bother when I can just abuse 11.AI?
 I very much encourage (You) to use 11.AI while it's still viable to use. For the layman, it's easier to go through the hoops of coughing up the $5 or abusing the free trial over actually setting up a TorToiSe environment and dealing with its quirks.
 However, I also encourage your own experimentation with TorToiSe, as it's very, very promising, it just takes a little love and elbow grease.
 This is not endorsed by [neonbjb](https://github.com/neonbjb/). I do not expect this to run into any ethical issues, as it seems (like me), this is mostly for making funny haha vidya characters say funny lines.
 ## Glossary
 To try and keep the terminology used here (somewhat) consistent and coherent, below is a list of terms and their definitions (or at least, the way I'm using them):
 * `voice cloning`: synthesizing speech to accurately replicate a subject's voice.
 * `input clips` / `voice clips` / `audio input` / `voice samples` : the original voice source of the subject you're trying to clone.
 * `waveform`: the raw audio.
 * `sampling rate`: the number of samples per second in a given waveform; a waveform can represent frequencies up to half its sampling rate.
 * `voice latents` / `conditional latents` / `latents`: computed traits of a voice.
 * `autoregressive samples` (`samples` / `tokens`): the initial generation pass to output tokens, and (usually) the most computationally expensive. More samples = better "cloning".
 * `CLVP`: Contrastive Language-Voice Pretraining: an analog to CLIP, but for voices. After the autoregressive samples pass, those samples/tokens are compared against the CLVP to find the best candidates.
 * `CVVP`: Contrastive Voice-Voice Pretraining: a (deprecated) model that can be used, weighted, in conjunction with the CLVP.
 * `candidates`: results from comparing against the CLVP/CVVP models. (Assumed to be) ordered from best to worst.
 * `diffusion decoder` / `vocoder`: these passes are responsible for decoding the tokens into a MEL spectrogram, and then into a waveform.
 * `diffusion iterations`: how many passes to put into generating the output waveform. More iterations = better audio quality.
 * `diffusion sampler` / `sampler`: the sampling method used during the diffusion decoding pass, albeit a bit of a misnomer. Currently, only two samplers are implemented.
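As a quick illustration of the `sampling rate` entry: by the Nyquist criterion, a waveform sampled at a given rate can only represent frequencies up to half that rate. Below is a minimal sketch; the helper name `nyquist_frequency` is hypothetical and is not part of TorToiSe.

```python
def nyquist_frequency(sampling_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    if sampling_rate_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return sampling_rate_hz / 2


if __name__ == "__main__":
    # 22050 Hz is the sampling rate this guide asks voice clips to use.
    print(nyquist_frequency(22050))
```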
 ## Modifications
 My fork boasts the following additions, fixes, and optimizations:
 * a competent web UI made in Gradio to expose a lot of tunables and options
 * cleaned up output structure of resulting audio files
 * caching computed conditional latents for faster re-runs
 	- additionally, regenerating them if the script detects they're out of date
 * uses the entire audio sample instead of the first four seconds of each sound file for better reproduction
 * activated unused DDIM sampler
 * use of some optimizations like `kv_cache`ing for the autoregression sample pass, and keeping data on GPU 
 * compatibility with DirectML
 * easy install scripts
 * and more!
 ## Colab Notebook
 A colab-ready notebook to quickly set up and use this repo is included and available [here](https://git.ecker.tech/mrq/ai-voice-cloning/raw/branch/master/notebook.ipynb): https://git.ecker.tech/mrq/ai-voice-cloning/raw/branch/master/notebook.ipynb
 Simply go [here](https://colab.research.google.com/) and upload the file.
 For the unfortunate using Paperspace, this notebook should also work for it.
 ## Installing
 Outside of the very small prerequisites, everything needed to get TorToiSe working is included in the repo.
 ### Pre-Requirements
 Windows:
 * Python 3.9: https://www.python.org/downloads/release/python-3913/
 * Git (optional): https://git-scm.com/download/win
 * CUDA drivers, if NVIDIA
 Linux:
 * python3.x (tested with 3.10)
 * git
 * ROCm for AMD, CUDA for NVIDIA
 ### Setup
 #### Windows
 Download Python and Git and run their installers.
 After installing Python, open the Start Menu and search for `Command Prompt`. Type `cd `, then drag and drop the folder you want to work in (experienced users can just `cd <path>` directly), then hit Enter.
 Paste `git clone https://git.ecker.tech/mrq/tortoise-tts` to download TorToiSe and additional scripts, then hit Enter. Inexperienced users can just download the repo as a ZIP, and extract.
 Afterwards, run the setup script, depending on your GPU, to automatically set things up.
 * AMD: `setup-directml.bat`
 * NVIDIA: `setup-cuda.bat`
 If you've done everything right, you shouldn't have any errors.
 ##### Note on DirectML Support
 PyTorch-DirectML is very, very experimental and is still not production quality. There are some headaches with the need for hairy, kludgy patches.
 These patches rely on transferring the tensor between the GPU and CPU as a hotfix, so performance is definitely harmed.
 Both the conditional latent computation and the vocoder pass have to be done on the CPU entirely because of some quirks with DirectML.
 On my 6800XT, VRAM usage climbs to almost the entire 16GiB, so be wary if you OOM somehow. Low VRAM flags may NOT have any additional impact from the constant copying anyways.
 For AMD users, I still might suggest using Linux+ROCm as it's (relatively) headache free, but I had stability problems.
 #### Linux
 First, make sure you have both `python3.x` and `git` installed, as well as the required compute platform according to your GPU (ROCm or CUDA).
 Simply run the following block:
 ```
 git clone https://git.ecker.tech/mrq/tortoise-tts
 cd tortoise-tts
 chmod +x *.sh
 ```
 Then, depending on your GPU:
 * AMD: `./setup-rocm.sh`
 * NVIDIA: `./setup-cuda.sh`
 And you should be done!
 ### Updating
 To check for updates, simply run `update.bat` (or `update.sh`). It should pull from the repo, as well as fetch any new dependencies.
 ### Pitfalls You May Encounter
 I'll try and make a list of "common" (or what I feel may be common that I experience) issues with getting TorToiSe set up:
 * `CUDA is NOT available for use.`: If you're on Linux, you failed to set up CUDA (if NVIDIA) or ROCm (if AMD). Please make sure you have these installed on your system.
 	If you're on Windows with an AMD card, you're out of luck, as ROCm is not available on Windows (without jumping through major hoops). If you're on an NVIDIA GPU, then I'm not sure what went wrong.
 * `failed reading zip archive: failed finding central directory`: You had a file fail to download completely during the model downloading initialization phase. Please open either `.\models\tortoise\` or `.\models\transformers\`, and delete the offending file.
 	You can deduce what that file is by reading the stack trace. A few lines above the last line will be a line trying to read a model path.
 * `torch.cuda.OutOfMemoryError: CUDA out of memory.`: You most likely have a GPU with low VRAM (~4GiB), and the small optimizations with keeping data on the GPU are enough to OOM. Please open the `start.bat` file and add `--low-vram` to the command (for example: `py app.py --low-vram`) to disable those small optimizations.
 * `WavFileWarning: Chunk (non-data) not understood, skipping it.`: something about your WAVs is funny, and it's best to remux your audio files with FFMPEG (included batch file in `.\convert\`).
 	Honestly, I don't know if this does impact output quality, as I feel it's placebo when I do try and correct this.
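For the `failed reading zip archive` pitfall, the offending file can also be hunted down without reading the stack trace. The sketch below is an assumption-laden helper, not something shipped in the repo: it assumes the model files are zip-based checkpoints with the suffixes listed (a guess), and `find_broken_archives` is a hypothetical name. Checkpoints saved in an older, non-zip format would be flagged too, so treat the output as a hint rather than a verdict.

```python
import zipfile
from pathlib import Path


def find_broken_archives(models_dir, suffixes=(".pth", ".pt", ".bin")):
    """Return model files under models_dir that are not readable zip archives."""
    broken = []
    for path in sorted(Path(models_dir).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            # is_zipfile() looks for the end-of-central-directory record,
            # which a truncated download will be missing.
            if not zipfile.is_zipfile(path):
                broken.append(path)
    return broken


if __name__ == "__main__":
    for path in find_broken_archives("./models"):
        print(f"possibly incomplete download: {path}")
```

Delete whatever it reports and restart, so the file gets downloaded again.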
 ## Preparing Voice Samples
 Now that the tough part is dealt with, it's time to prepare voice clips to use.
 Unlike training embeddings for AI image generations, preparing a "dataset" for voice cloning is very simple.
 As a general rule of thumb, try to source clips that aren't noisy, contain solely the subject you are trying to clone, and don't contain any non-words (like yells, guttural noises, etc.). If you must, run your source through a background music/noise remover (how to do so is an exercise left to the reader). It isn't entirely a detriment if you're unable to provide clean audio, however. Just be wary that you might have some headaches with getting acceptable output.
 Nine times out of ten, you should be fine using as many clips as possible. There's (now) no preference between combining your audio into one file, or leaving it split. However, if you're aiming for a specific delivery, it *should* be best for you to narrow down to just using that as your provided source (for example, changing one word in a line).
 There's no hard specifics on how many, or how long, your sources should be.
 If you're looking to trim your clips, in my opinion, ~~Audacity~~ Tenacity works well enough, as you can easily output your clips into the proper format (22050 Hz sampling rate).
 Power users with FFMPEG already installed can simply use the provided conversion script in `.\convert\`.
 After preparing your clips as WAV files at a sample rate of 22050 Hz, open up the `tortoise-tts` folder you're working in, navigate to the `voices` folder, create a new folder in whatever name you want, then dump your clips into that folder. While you're in the `voices` folder, you can take a look at the other provided voices.
 **!**NOTE**!**: Before 2023.02.10, voices used to be stored under `.\tortoise\voices\`, but have been moved up one folder. Compatibility is maintained with the old voice folder, but it will take priority.
 **!**NOTE**!**: The speed at which a voice's conditional latents are computed will greatly depend on the size of the smallest file.
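If you'd rather script the conversion yourself, the idea boils down to asking FFMPEG for a WAV at 22050 Hz. Below is a minimal sketch that only builds the command; it is not the script shipped in `.\convert\`, and the function name, file names and folder layout are placeholders.

```python
from pathlib import Path


def build_ffmpeg_command(src, dst_dir, sample_rate=22050):
    """Build an ffmpeg command that re-encodes src as a WAV at sample_rate."""
    src = Path(src)
    dst = Path(dst_dir) / (src.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite the destination if it exists
        "-i", str(src),           # input clip in any format ffmpeg can read
        "-ar", str(sample_rate),  # output sampling rate in Hz
        str(dst),
    ]


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to
    # subprocess.run(..., check=True) to actually convert the clip.
    print(" ".join(build_ffmpeg_command("clip.mp3", "voices/my_voice")))
```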
 ## Using the Software
 Now you're ready to generate clips. With the command prompt still open, simply enter `start.bat` (or `start.sh`), and wait for it to print out a URL to open in your browser, something like `http://127.0.0.1:7860`.
 If you're looking to access your copy of TorToiSe from outside your local network, tick the `Public Share Gradio` button in the `Settings` tab, then restart.
 ### Generate
 You'll be presented with a bunch of options in the default `Generate` tab, but do not be overwhelmed, as most of the defaults are sane. Below is a rough explanation of which input does what:
 * `Prompt`: text you want to be read. You wrap text in `[brackets]` for "prompt engineering", where it'll affect the output, but those words won't actually be read.
 * `Line Delimiter`: String to split the prompt into pieces. The stitched clip will be stored as `combined.wav`
 	- Setting this to `\n` will generate each line as one clip before stitching it. Leave blank to disable this.
 * `Emotion`: the "emotion" used for the delivery. This is a shortcut to utilizing "prompt engineering" by starting with `[I am really <emotion>,]` in your prompt. This is merely a suggestion, not a guarantee.
 * `Custom Emotion + Prompt`: a non-preset "emotion" used for the delivery. This is a shortcut to utilizing "prompt engineering" by starting with `[<emotion>]` in your prompt.
 * `Voice`: the voice you want to clone. You can select `microphone` if you want to use input from your microphone.
 * `Microphone Source`: Use your own voice from a line-in source.
 * `Candidates`: number of outputs to generate, starting from the best candidate. Depending on your iteration steps, generating the final sound files could be cheap, but they only offer alternatives to the samples generated to pull from (in other words, the later candidates perform worse), so don't be compelled to generate a ton of candidates.
 * `Seed`: initializes the PRNG to this value. Use this if you want to reproduce a generated voice.
 * `Preset`: shortcut values for sample count and iteration steps. Clicking a preset will update its corresponding values. Higher presets result in better quality at the cost of computation time.
 * `Samples`: analogous to samples in image generation. More samples = better resemblance / clone quality, at the cost of performance. This strictly affects clone quality.
 * `Iterations`: influences audio sound quality in the final output. More iterations = higher quality sound. This step is relatively cheap, so do not be discouraged from increasing this. This strictly affects quality in the actual sound.
 * `Temperature`: how much randomness to introduce to the generated samples. Lower values = better resemblance to the source samples, but some temperature is still required for great output.
 	- **!**NOTE**!**: This value is very inconsistent and entirely depends on the input voice. In other words, some voices will be receptive to playing with this value, while others won't make much of a difference.
 	- **!**NOTE**!**: some voices will be very receptive to this, where it speaks slowly at low temperatures, but nudging it a hair and it speaks too fast.
 * `Pause Size`: Governs how large pauses are at the end of a clip (in token size, not seconds). Increase this if your output gets cut off at the end.
 	- **!**NOTE**!**: too large of a pause size can lead to unexpected behavior.
 * `Diffusion Sampler`: sampler method during the diffusion pass. Currently, only `P` and `DDIM` are added, but they do not seem to offer any substantial differences in my short tests.
 	`P` refers to the default, vanilla sampling method in `diffusion.py`.
 	To reiterate, this ***only*** is useful for the diffusion decoding path, after the autoregressive outputs are generated.
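To make the `Line Delimiter` and `Emotion` options above a bit more concrete, here is a rough sketch of the idea: split the prompt on the delimiter, then prefix each piece with a bracketed hint. This is only an illustration of the behavior described, not the web UI's actual code, and `build_prompts` is a hypothetical name.

```python
def build_prompts(text, delimiter="\n", emotion=None):
    """Split text on delimiter and prefix each piece with an emotion hint."""
    # A blank delimiter disables splitting, as described for `Line Delimiter`.
    pieces = text.split(delimiter) if delimiter else [text]
    pieces = [piece.strip() for piece in pieces if piece.strip()]
    if emotion:
        # Bracketed text steers the delivery but is not meant to be read aloud.
        pieces = [f"[I am really {emotion},] {piece}" for piece in pieces]
    return pieces


if __name__ == "__main__":
    for prompt in build_prompts("Hello there.\nGoodbye.", emotion="sad"):
        print(prompt)
```

Each returned piece would be generated as its own clip before stitching.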
 Below is an explanation of experimental flags. Messing with these might impact performance, as these are exposed only if you know what you are doing.
 * `Half-Precision`: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, due to it making computations slower.
 * `Conditional Free`: a quality boosting improvement at the cost of some performance. Enabled by default, as I think the penalty is negligible in the end.
 * `CVVP Weight`: governs how much the CVVP model should influence candidates. The original documentation mentions this is deprecated as it does not really influence things, but you're still free to play around with it.
 	Currently, setting this requires regenerating your voice latents, as I forgot to have it return some extra data that weighing against the CVVP model uses. Oops.
 	Setting this to 1 leads to bad behavior.
 * `Top P`: P value used in nucleus sampling; lower values mean the decoder produces more "likely" (aka boring) outputs.
 * `Diffusion Temperature`: the variance of the noise fed into the diffusion model; values at 0 are the "mean" prediction of the diffusion network and will sound bland and smeared.
 * `Length Penalty`: a length penalty applied to the autoregressive decoder; higher settings cause the model to produce more terse outputs.
 * `Repetition Penalty`: a penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc.
 * `Conditioning-Free K`: determines the balance between the conditioning-free signal and the conditioning-present signal.
 After you fill everything out, click `Run`, and wait for your output in the output window. The sampled voice is also returned, but if you're using multiple files, it'll return the first file, rather than a combined file.
 All outputs are saved under `./results/[voice name]/[timestamp]/` as `result.wav`, and the settings in `input.txt`. There doesn't seem to be an inherent way to add a Download button in Gradio, so keep that folder in mind.
 To save you from headaches, I strongly recommend playing around with shorter sentences first to find the right values for the voice you're using before generating longer sentences.
 As a quick optimization, I modified the script so that the `conditional_latents` are saved after loading voice samples, and subsequent uses will load that file directly (at the cost of not returning the `Sample voice` to the web UI). Additionally, these `conditional_latents` are also computed in a way to use the entire clip, rather than the first four seconds the original tortoise-tts uses. If there are voice samples that have a modification time newer than this cached file, it'll skip loading it and load the normal WAVs instead.
 **!**NOTE**!**: cached `latents.pth` files generated before 2023.02.05 will be ignored, due to a change in computing the conditional latents. This *should* help bump up voice cloning quality. Apologies for the inconvenience.
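The "is the cache out of date" check described above amounts to comparing modification times. Below is a minimal sketch of that idea, assuming the cache sits next to the clips as `latents.pth`; the function name and the list of audio extensions are my own placeholders, and the real check may differ in its details.

```python
from pathlib import Path


def cached_latents_usable(voice_dir, cache_name="latents.pth"):
    """Return True if the cache exists and no audio clip is newer than it."""
    voice_dir = Path(voice_dir)
    cache = voice_dir / cache_name
    if not cache.is_file():
        return False
    cache_mtime = cache.stat().st_mtime
    clips = [p for p in voice_dir.iterdir()
             if p.is_file() and p.suffix.lower() in (".wav", ".mp3", ".flac")]
    # Any clip modified after the cache was written makes the cache stale.
    return all(p.stat().st_mtime <= cache_mtime for p in clips)


if __name__ == "__main__":
    print(cached_latents_usable("./voices/my_voice"))
```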
 ### History
 This tab offers a rudimentary way of viewing past results.
 With it, you just select a voice, then you can quickly view their generation settings.
 To play a file, select a specific file with the second dropdown list.
 To reuse a voice file's settings, click `Copy Settings`.
 ### Utilities
 In this tab, you can find some helper utilities that might be of assistance.
 For now, an analog to the PNG info found in Voldy's Stable Diffusion Web UI resides here. With it, you can upload an audio file generated with this web UI to view the settings used to generate that output. Additionally, the voice latents used to generate the uploaded audio clip can be extracted.
 If you want to reuse its generation settings, simply click `Copy Settings`.
 To import a voice, click `Import Voice`. Remember to click `Refresh Voice List` in the `Generate` panel afterwards, if it's a new voice.
 ### Settings
 This tab (should) hold a bunch of other settings, from tunables that shouldn't be tampered with, to settings pertaining to the web UI itself.
 Below are settings that override the default launch arguments. Some of these require restarting to work.
 * `Listen`: sets the hostname, port, and/or path for the web UI to listen on.
 	- For example, `0.0.0.0:80` will have the web UI accept all connections on port 80
 	- For example, `10.0.0.1:8008/gradio` will have the web UI only accept connections through `10.0.0.1`, at the path `/gradio`
 * `Public Share Gradio`: Tells Gradio to generate a public URL for the web UI. Ignored if specifying a path through the `Listen` setting.
 * `Check for Updates`: checks for updates on page load and notifies in console. Only works if you pulled this repo from a gitea instance.
 * `Only Load Models Locally`: enforces offline mode for loading models. This is the equivalent of setting the env var: `TRANSFORMERS_OFFLINE`
 * `Low VRAM`: disables optimizations in TorToiSe that increase VRAM consumption. Suggested if your GPU has under 6GiB.
 * `Embed Output Metadata`: enables embedding the settings and latents used to generate that audio clip inside that audio clip. Metadata is stored as a JSON string in the `lyrics` tag.
 * `Slimmer Computed Latents`: falls back to the original, 12.9KiB way of storing latents (without the extra bits required for using the CVVP model).
 * `Voice Fixer`: runs each generated audio clip through `voicefixer`, if available and installed.
 * `Voice Latent Max Chunk Size`: during the voice latents calculation pass, this limits how large, in bytes, a chunk can be. Large values can run into VRAM OOM errors.
 * `Sample Batch Size`: sets the batch size when generating autoregressive samples. Bigger batches result in faster compute, at the cost of increased VRAM consumption. Leave to 0 to calculate a "best" fit.
 * `Concurrency Count`: how many Gradio events the queue can process at once. Leave this over 1 if you want to modify settings in the UI that update other settings while generating audio clips.
 * `Output Sample Rate`: the sample rate to save the generated audio as. It provides a slight bump in quality.
 * `Output Volume`: adjusts the volume through amplitude scaling
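To illustrate how a `Listen` value like the examples above breaks down into a host, a port and a path, here is a rough sketch. It is not the repo's actual parser, `parse_listen` is a hypothetical name, and IPv6 addresses are deliberately not handled.

```python
def parse_listen(value):
    """Split 'host:port/path' into (host, port, path); missing parts are None."""
    host, port, path = value, None, None
    if "/" in host:
        # Everything from the first slash onwards is the mount path.
        host, _, rest = host.partition("/")
        path = "/" + rest
    if ":" in host:
        host, _, port_text = host.rpartition(":")
        port = int(port_text)
    return (host or None, port, path)


if __name__ == "__main__":
    print(parse_listen("0.0.0.0:80"))
    print(parse_listen("10.0.0.1:8008/gradio"))
```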
 ## Example(s)
 Below are some (rather outdated) outputs I deem substantial enough to share. As I continue delving into TorToiSe, I'll supply more examples and the values I use.
 Source (Patrick Bateman): 
 * https://files.catbox.moe/skzumo.zip
 Output (`My name is Patrick Bateman.`, `fast` preset):
 * https://files.catbox.moe/cw88t5.wav
 * https://files.catbox.moe/bwunfo.wav
 * https://files.catbox.moe/ppxprv.wav
 I trimmed up some of the samples to end up with ten short clips of about 10 seconds each. With a 2060, it took a hair over a minute to generate the initial samples, then five to ten seconds for each clip of a total of three. Not too bad for something running on consumer grade shitware.
 Source (Harry Mason):
 * https://files.catbox.moe/n2xor1.mp3
 * https://files.catbox.moe/bbfke3.mp3
 Output (The McDonalds building creepypasta, custom preset of 128 samples, 256 iterations):
 * https://voca.ro/16XSgdlcC5uT
 This took quite a while, over the course of a day half-paying-attention at the command prompt to generate the next piece. I only had to regenerate one section that sounded funny, but compared to 11.AI requiring tons of regenerations for something usable, this is nice to just let run and forget. Initially he sounds rather passable as Harry Mason, but as it goes on it seems to kinda falter. Sound effects and music are added in post and aren't generated by TorToiSe.
 Source (James Sunderland):
 * https://files.catbox.moe/ynoeld.mp3
 * https://files.catbox.moe/lxgbsm.mp3
 Output (The McDonalds building creepypasta, 256 samples, 256 iterations, 0.1 temp, pause size 8, DDIM, conditioning free, seed 1675690127):
 * https://vocaroo.com/1nXmip0oJu8Z
 This took a while to generate while I slept (and even managed to wake up before it finished). Using the batch function, this took 6.919 hours on my 2060 to generate the 27 pieces with zero editing on my end.
 I'm providing this even with its nasty warts to highlight the quirks: the weird gaps where there's a strange sound instead, the random pauses for "thought", etc.
 I think this also highlights how just combining your entire source sample gung-ho isn't a good idea, as he's not as high of a pitch in his delivery compared to how he usually is throughout most of the game (a sort of average between his two ranges). I can't gauge how well it did in reproducing it, since my ears are pretty much burnt out from listening to so many clips, but I believe he's pretty believable as a James Sunderland.
 Output (`Is that really you, Mary?`, Ultra Fast preset, settings and latents embedded)
 * https://files.catbox.moe/gy1jvz.wav
 This was just a quick test for an adjustable setting, but this one turned out really nice (for being a quick test) on the off chance. It's not the original delivery, and it definitely sounds robotic still, but it's on the Ultra Fast preset, as expected.
 ## Caveats (and Upsides)
 To me, I find a few problems with TorToiSe over 11.AI:
 * computation time is quite an issue. Despite Stable Diffusion proving to be adequate on my 2060, TorToiSe takes quite some time with modest settings.
 	- However, on my 6800XT, performance was drastically uplifted due to having more VRAM for larger batch sizes (at the cost of Krashing).
 * reproducibility in a voice depends on the "compatibility" with the model TorToiSe was trained on.
 	- However, this also appears to be similar to 11.AI, where it was mostly trained on audiobook readings.
 * the lack of an obvious analog to the "stability" and "similarity" sliders kind of sucks, but it's not the end of the world.
 	However, the `temperature` option seems to prove to be a proper analog to either of these.
 Although, I can look past these as TorToiSe offers, in comparison to 11.AI:
 * the "speaking too fast" issue does not exist with TorToiSe. I don't need to fight with it by pretending I'm a Gaia user in the early 2000s by sprinkling ellipses.
 * the overall delivery seems very natural: sometimes small, dramatic pauses get added at the legitimately most convenient moments, and the inhales tend to be more natural. Many of the vocaroos from 11.AI just do not seem properly delivered.
 * being able to run it locally means I do not have to worry about some Polack seeing me use the "dick" word.
--- a/bin/.gitkeep
+++ b/bin/.gitkeep
--- a/notebook.ipynb
+++ b/notebook.ipynb
@ -0,0 +1,123 @@
 {
   "nbformat":4,
   "nbformat_minor":0,
   "metadata":{
      "colab":{
         "private_outputs":true,
         "provenance":[
         ]
      },
      "kernelspec":{
         "name":"python3",
         "display_name":"Python 3"
      },
      "language_info":{
         "name":"python"
      },
      "accelerator":"GPU",
      "gpuClass":"standard"
   },
   "cells":[
      {
         "cell_type":"markdown",
         "source":[
            "## Initialization"
         ],
         "metadata":{
            "id":"ni41hmE03DL6"
         }
      },
      {
         "cell_type":"code",
         "execution_count":null,
         "metadata":{
            "id":"FtsMKKfH18iM"
         },
         "outputs":[
         ],
         "source":[
            "!git clone https://git.ecker.tech/mrq/ai-voice-cloning/\n",
            "%cd ai-voice-cloning\n",
            "!python -m pip install --upgrade pip\n",
            "!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116\n",
            "!python -m pip install -r ./requirements.txt"
         ]
      },
      {
         "cell_type":"code",
         "source":[
            "# colab requires the runtime to restart before use\n",
            "exit()"
         ],
         "metadata":{
            "id":"FVUOtSASCSJ8"
         },
         "execution_count":null,
         "outputs":[
         ]
      },
      {
         "cell_type":"markdown",
         "source":[
            "## Running"
         ],
         "metadata":{
            "id":"o1gkfw3B3JSk"
         }
      },
      {
         "cell_type":"code",
         "source":[
            "%cd ai-voice-cloning\n",
            "import src.webui as mrq\n",
            "import sys\n",
            "sys.argv = [\"\"]\n",
            "\n",
            "mrq.args = mrq.setup_args()\n",
            "mrq.webui = mrq.setup_gradio()\n",
            "mrq.webui.launch(share=True, prevent_thread_lock=True, height=1000)\n",
            "mrq.tts = mrq.setup_tortoise()\n",
            "mrq.webui.block_thread()"
         ],
         "metadata":{
            "id":"c_EQZLTA19c7"
         },
         "execution_count":null,
         "outputs":[
         ]
      },
      {
         "cell_type":"markdown",
         "source":[
            "## Exporting"
         ],
         "metadata":{
            "id":"2AnVQxEJx47p"
         }
      },
      {
         "cell_type":"code",
         "source":[
            "!apt install -y p7zip-full\n",
            "from datetime import datetime\n",
            "timestamp = datetime.now().strftime('%m-%d-%Y_%H:%M:%S')\n",
            "!mkdir -p \"../{timestamp}\"\n",
            "!mv ./results/* \"../{timestamp}/.\"\n",
            "!7z a -t7z -m0=lzma2 -mx=9 -mfb=64 -md=32m -ms=on \"../{timestamp}.7z\" \"../{timestamp}/\"\n",
            "!ls ~/\n",
            "!echo \"Finished zipping, archive is available at {timestamp}.7z\""
         ],
         "metadata":{
            "id":"YOACiDCXx72G"
         },
         "execution_count":null,
         "outputs":[
         ]
      }
   ]
 }
--- a/requirements.txt
+++ b/requirements.txt
@ -0,0 +1,7 @@
 git+https://git.ecker.tech/mrq/tortoise-tts.git
 # git+https://git.ecker.tech/mrq/DL-Art-School.git
 openai-whisper
 gradio
 music-tag
 voicefixer
--- a/results/.gitkeep
+++ b/results/.gitkeep
--- a/setup-cuda.bat
+++ b/setup-cuda.bat
@ -0,0 +1,7 @@
 python -m venv venv
 call .\venv\Scripts\activate.bat
 python -m pip install --upgrade pip
 python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
 python -m pip install -r ./requirements.txt
 deactivate
 pause
--- a/setup-cuda.sh
+++ b/setup-cuda.sh
@ -0,0 +1,6 @@
 python -m venv venv
 source ./venv/bin/activate
 python -m pip install --upgrade pip
 pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
 python -m pip install -r ./requirements.txt
 deactivate
--- a/setup-directml.bat
+++ b/setup-directml.bat
@ -0,0 +1,7 @@
 python -m venv venv
 call .\venv\Scripts\activate.bat
 python -m pip install --upgrade pip
 python -m pip install torch torchvision torchaudio torch-directml==0.1.13.1.dev230119
 python -m pip install -r ./requirements.txt
 deactivate
 pause
--- a/setup-rocm.sh
+++ b/setup-rocm.sh
@ -0,0 +1,7 @@
 python -m venv venv
 source ./venv/bin/activate
 python -m pip install --upgrade pip
 # ROCM
 pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.1.1 # 5.2 does not work for me desu
 python -m pip install -r ./requirements.txt
 deactivate
--- a/src/list_devices.py
+++ b/src/list_devices.py
@ -0,0 +1,5 @@
 import torch
 devices = [f"cuda:{i} => {torch.cuda.get_device_name(i)}" for i in range(torch.cuda.device_count())]
 print(devices)
--- a/src/main.py
+++ b/src/main.py
@ -0,0 +1,36 @@
 import os
 from utils import *
 from webui import *
 if 'TORTOISE_MODELS_DIR' not in os.environ:
 	os.environ['TORTOISE_MODELS_DIR'] = os.path.realpath(os.path.join(os.getcwd(), './models/tortoise/'))
 if 'TRANSFORMERS_CACHE' not in os.environ:
 	os.environ['TRANSFORMERS_CACHE'] = os.path.realpath(os.path.join(os.getcwd(), './models/transformers/'))
 if __name__ == "__main__":
 	args = setup_args()
 	if args.listen_path is not None and args.listen_path != "/":
 		import uvicorn
 		uvicorn.run("main:app", host=args.listen_host, port=args.listen_port if args.listen_port is not None else 8000)
 	else:
 		webui = setup_gradio()
 		tts = setup_tortoise()
 		webui.launch(share=args.share, prevent_thread_lock=True, show_error=True, server_name=args.listen_host, server_port=args.listen_port)
 		webui.block_thread()
 elif __name__ == "main":
 	from fastapi import FastAPI
 	import gradio as gr
 	import sys
 	sys.argv = [sys.argv[0]]
 	app = FastAPI()
 	args = setup_args()
 	webui = setup_gradio()
 	app = gr.mount_gradio_app(app, webui, path=args.listen_path)
 	tts = setup_tortoise()
--- a/src/utils.py
+++ b/src/utils.py
@ -0,0 +1,434 @@
 import os
 if 'XDG_CACHE_HOME' not in os.environ:
 	os.environ['XDG_CACHE_HOME'] = os.path.realpath(os.path.join(os.getcwd(), './models/'))
 if 'TORTOISE_MODELS_DIR' not in os.environ:
 	os.environ['TORTOISE_MODELS_DIR'] = os.path.realpath(os.path.join(os.getcwd(), './models/tortoise/'))
 if 'TRANSFORMERS_CACHE' not in os.environ:
 	os.environ['TRANSFORMERS_CACHE'] = os.path.realpath(os.path.join(os.getcwd(), './models/transformers/'))
 import argparse
 import time
 import json
 import base64
 import re
 import urllib.request
 import torch
 import torchaudio
 import music_tag
 import gradio as gr
 import gradio.utils
 from datetime import datetime
 from tortoise.api import TextToSpeech
 from tortoise.utils.audio import load_audio, load_voice, load_voices, get_voice_dir
 from tortoise.utils.text import split_and_recombine_text
 from tortoise.utils.device import get_device_name, set_device_name
 args = None
 tts = None
 webui = None
 voicefixer = None
 whisper = None
 dlas = None
 def get_args():
 	global args
 	return args
 def setup_args():
 	global args
 	default_arguments = {
 		'share': False,
 		'listen': None,
 		'check-for-updates': False,
 		'models-from-local-only': False,
 		'low-vram': False,
 		'sample-batch-size': None,
 		'embed-output-metadata': True,
 		'latents-lean-and-mean': True,
 		'voice-fixer': True,
 		'voice-fixer-use-cuda': True,
 		'force-cpu-for-conditioning-latents': False,
 		'device-override': None,
 		'concurrency-count': 2,
 		'output-sample-rate': 44100,
 		'output-volume': 1,
 	}
 	if os.path.isfile('./config/exec.json'):
 		with open(f'./config/exec.json', 'r', encoding="utf-8") as f:
 			overrides = json.load(f)
 			for k in overrides:
 				default_arguments[k] = overrides[k]
 	parser = argparse.ArgumentParser()
 	parser.add_argument("--share", action='store_true', default=default_arguments['share'], help="Lets Gradio return a public URL to use anywhere")
 	parser.add_argument("--listen", default=default_arguments['listen'], help="Path for Gradio to listen on")
 	parser.add_argument("--check-for-updates", action='store_true', default=default_arguments['check-for-updates'], help="Checks for update on startup")
 	parser.add_argument("--models-from-local-only", action='store_true', default=default_arguments['models-from-local-only'], help="Only loads models from disk, does not check for updates for models")
 	parser.add_argument("--low-vram", action='store_true', default=default_arguments['low-vram'], help="Disables some optimizations that increase VRAM usage")
 	parser.add_argument("--no-embed-output-metadata", action='store_true', default=not default_arguments['embed-output-metadata'], help="Disables embedding output metadata into resulting WAV files for easily fetching its settings used with the web UI (data is stored in the lyrics metadata tag)")
 	parser.add_argument("--latents-lean-and-mean", action='store_true', default=default_arguments['latents-lean-and-mean'], help="Exports the bare essentials for latents.")
 	parser.add_argument("--voice-fixer", action='store_true', default=default_arguments['voice-fixer'], help="Uses python module 'voicefixer' to improve audio quality, if available.")
 	parser.add_argument("--voice-fixer-use-cuda", action='store_true', default=default_arguments['voice-fixer-use-cuda'], help="Hints to voicefixer to use CUDA, if available.")
 	parser.add_argument("--force-cpu-for-conditioning-latents", default=default_arguments['force-cpu-for-conditioning-latents'], action='store_true', help="Forces computing conditional latents to be done on the CPU (if you constantly OOM on low chunk counts)")
 	parser.add_argument("--device-override", default=default_arguments['device-override'], help="A device string to pass to Torch, overriding the auto-detected device")
 	parser.add_argument("--sample-batch-size", default=default_arguments['sample-batch-size'], type=int, help="Sets how many batches to use during the autoregressive samples pass")
 	parser.add_argument("--concurrency-count", type=int, default=default_arguments['concurrency-count'], help="How many Gradio events to process at once")
 	parser.add_argument("--output-sample-rate", type=int, default=default_arguments['output-sample-rate'], help="Sample rate to resample the output to (from 24KHz)")
 	parser.add_argument("--output-volume", type=float, default=default_arguments['output-volume'], help="Adjusts volume of output")
 	args = parser.parse_args()
 	args.embed_output_metadata = not args.no_embed_output_metadata
 	set_device_name(args.device_override)
 	args.listen_host = None
 	args.listen_port = None
 	args.listen_path = None
 	if args.listen:
 		try:
 			match = re.findall(r"^(?:(.+?):(\d+))?(\/.+?)?$", args.listen)[0]
 			args.listen_host = match[0] if match[0] != "" else "127.0.0.1"
 			args.listen_port = match[1] if match[1] != "" else None
 			args.listen_path = match[2] if match[2] != "" else "/"
 		except Exception as e:
 			pass
 	if args.listen_port is not None:
 		args.listen_port = int(args.listen_port)
 	return args
 def generate(
 	text,
 	delimiter,
 	emotion,
 	prompt,
 	voice,
 	mic_audio,
 	voice_latents_chunks,
 	seed,
 	candidates,
 	num_autoregressive_samples,
 	diffusion_iterations,
 	temperature,
 	diffusion_sampler,
 	breathing_room,
 	cvvp_weight,
 	top_p,
 	diffusion_temperature,
 	length_penalty,
 	repetition_penalty,
 	cond_free_k,
 	experimental_checkboxes,
 	progress=None
 ):
 	global args
 	global tts
 	# tts is a module-level global initialized to None, so a NameError is never raised; check for None instead
 	if tts is None:
 		raise gr.Error("TTS is still initializing...")
 	if voice != "microphone":
 		voices = [voice]
 	else:
 		voices = []
 	if voice == "microphone":
 		if mic_audio is None:
 			raise gr.Error("Please provide audio from mic when choosing `microphone` as a voice input")
 		mic = load_audio(mic_audio, tts.input_sample_rate)
 		voice_samples, conditioning_latents = [mic], None
 	elif voice == "random":
 		voice_samples, conditioning_latents = None, tts.get_random_conditioning_latents()
 	else:
 		progress(0, desc="Loading voice...")
 		voice_samples, conditioning_latents = load_voice(voice)
 	if voice_samples is not None:
 		sample_voice = torch.cat(voice_samples, dim=-1).squeeze().cpu()
 		conditioning_latents = tts.get_conditioning_latents(voice_samples, return_mels=not args.latents_lean_and_mean, progress=progress, slices=voice_latents_chunks, force_cpu=args.force_cpu_for_conditioning_latents)
 		if len(conditioning_latents) == 4:
 			conditioning_latents = (conditioning_latents[0], conditioning_latents[1], conditioning_latents[2], None)
 		if voice != "microphone":
 			torch.save(conditioning_latents, f'{get_voice_dir()}/{voice}/cond_latents.pth')
 		voice_samples = None
 	else:
 		if conditioning_latents is not None:
 			sample_voice, _ = load_voice(voice, load_latents=False)
 			sample_voice = torch.cat(sample_voice, dim=-1).squeeze().cpu()
 		else:
 			sample_voice = None
 	if seed == 0:
 		seed = None
 	if conditioning_latents is not None and len(conditioning_latents) == 2 and cvvp_weight > 0:
 		print("Requesting weighing against CVVP weight, but voice latents are missing some extra data. Please regenerate your voice latents.")
 		cvvp_weight = 0
 	settings = {
 		'temperature': float(temperature),
 		'top_p': float(top_p),
 		'diffusion_temperature': float(diffusion_temperature),
 		'length_penalty': float(length_penalty),
 		'repetition_penalty': float(repetition_penalty),
 		'cond_free_k': float(cond_free_k),
 		'num_autoregressive_samples': num_autoregressive_samples,
 		'sample_batch_size': args.sample_batch_size,
 		'diffusion_iterations': diffusion_iterations,
 		'voice_samples': voice_samples,
 		'conditioning_latents': conditioning_latents,
 		'use_deterministic_seed': seed,
 		'return_deterministic_state': True,
 		'k': candidates,
 		'diffusion_sampler': diffusion_sampler,
 		'breathing_room': breathing_room,
 		'progress': progress,
 		'half_p': "Half Precision" in experimental_checkboxes,
 		'cond_free': "Conditioning-Free" in experimental_checkboxes,
 		'cvvp_amount': cvvp_weight,
 	}
 	if delimiter == "\\n":
 		delimiter = "\n"
 	if delimiter != "" and delimiter in text:
 		texts = text.split(delimiter)
 	else:
 		texts = split_and_recombine_text(text)
 	full_start_time = time.time()
 	outdir = f"./results/{voice}/"
 	os.makedirs(outdir, exist_ok=True)
 	audio_cache = {}
 	resampler = None
 	# not a ternary, in case I later want to rely on librosa's upsampling interpolator rather than torchaudio's
 	if tts.output_sample_rate != args.output_sample_rate:
 		resampler = torchaudio.transforms.Resample(
 			tts.output_sample_rate,
 			args.output_sample_rate,
 			lowpass_filter_width=16,
 			rolloff=0.85,
 			resampling_method="kaiser_window",
 			beta=8.555504641634386,
 		)
 	volume_adjust = torchaudio.transforms.Vol(gain=args.output_volume, gain_type="amplitude") if args.output_volume != 1 else None
 	idx = 0
 	idx_cache = {}
 	for i, file in enumerate(os.listdir(outdir)):
 		filename = os.path.basename(file)
 		extension = os.path.splitext(filename)[1]
 		if extension != ".json" and extension != ".wav":
 			continue
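 		match = re.findall(rf"^{re.escape(voice)}_(\d+)(?:.+?)?{re.escape(extension)}$", filename)
 		# skip files that do not follow the "<voice>_<index>..." naming scheme
 		if not match:
 			continue
 		key = int(match[0])
 		idx_cache[key] = True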
 	if len(idx_cache) > 0:
 		keys = sorted(list(idx_cache.keys()))
 		idx = keys[-1] + 1
 	# zero-pad the index to five digits
 	idx = f"{idx:05d}"
 	def get_name(line=0, candidate=0, combined=False):
 		name = f"{idx}"
 		if combined:
 			name = f"{name}_combined"
 		elif len(texts) > 1:
 			name = f"{name}_{line}"
 		if candidates > 1:
 			name = f"{name}_{candidate}"
 		return name
 	for line, cut_text in enumerate(texts):
 		if emotion == "Custom":
 			if prompt.strip() != "":
 				cut_text = f"[{prompt},] {cut_text}"
 		else:
 			cut_text = f"[I am really {emotion.lower()},] {cut_text}"
 		progress.msg_prefix = f'[{str(line+1)}/{str(len(texts))}]'
 		print(f"{progress.msg_prefix} Generating line: {cut_text}")
 		start_time = time.time()
 		gen, additionals = tts.tts(cut_text, **settings )
 		seed = additionals[0]
 		run_time = time.time()-start_time
 		print(f"Generating line took {run_time} seconds")
 		if not isinstance(gen, list):
 			gen = [gen]
 		for j, g in enumerate(gen):
 			audio = g.squeeze(0).cpu()
 			name = get_name(line=line, candidate=j)
 			audio_cache[name] = {
 				'audio': audio,
 				'text': cut_text,
 				'time': run_time
 			}
 			# save here in case some error happens mid-batch
 			torchaudio.save(f'{outdir}/{voice}_{name}.wav', audio, tts.output_sample_rate)
 	for k in audio_cache:
 		audio = audio_cache[k]['audio']
 		if resampler is not None:
 			audio = resampler(audio)
 		if volume_adjust is not None:
 			audio = volume_adjust(audio)
 		audio_cache[k]['audio'] = audio
 		torchaudio.save(f'{outdir}/{voice}_{k}.wav', audio, args.output_sample_rate)
 	output_voices = []
 	for candidate in range(candidates):
 		if len(texts) > 1:
 			audio_clips = []
 			for line in range(len(texts)):
 				name = get_name(line=line, candidate=candidate)
 				audio = audio_cache[name]['audio']
 				audio_clips.append(audio)
 			name = get_name(candidate=candidate, combined=True)
 			audio = torch.cat(audio_clips, dim=-1)
 			torchaudio.save(f'{outdir}/{voice}_{name}.wav', audio, args.output_sample_rate)
 			audio = audio.squeeze(0).cpu()
 			audio_cache[name] = {
 				'audio': audio,
 				'text': text,
 				'time': time.time()-full_start_time,
 				'output': True
 			}
 		else:
 			name = get_name(candidate=candidate)
 			audio_cache[name]['output'] = True
 	info = {
 		'text': text,
 		'delimiter': '\\n' if delimiter == "\n" else delimiter,
 		'emotion': emotion,
 		'prompt': prompt,
 		'voice': voice,
 		'seed': seed,
 		'candidates': candidates,
 		'num_autoregressive_samples': num_autoregressive_samples,
 		'diffusion_iterations': diffusion_iterations,
 		'temperature': temperature,
 		'diffusion_sampler': diffusion_sampler,
 		'breathing_room': breathing_room,
 		'cvvp_weight': cvvp_weight,
 		'top_p': top_p,
 		'diffusion_temperature': diffusion_temperature,
 		'length_penalty': length_penalty,
 		'repetition_penalty': repetition_penalty,
 		'cond_free_k': cond_free_k,
 		'experimentals': experimental_checkboxes,
 		'time': time.time()-full_start_time,
 	}
 	# kludgy yucky codesmells
 	for name in audio_cache:
 		if 'output' not in audio_cache[name]:
 			continue
 		output_voices.append(f'{outdir}/{voice}_{name}.wav')
 		with open(f'{outdir}/{voice}_{name}.json', 'w', encoding="utf-8") as f:
 			f.write(json.dumps(info, indent='\t') )
 	if args.voice_fixer and voicefixer:
 		fixed_output_voices = []
 		for path in progress.tqdm(output_voices, desc="Running voicefix..."):
 			fixed = path.replace(".wav", "_fixed.wav")
 			voicefixer.restore(
 				input=path,
 				output=fixed,
 				cuda=get_device_name() == "cuda" and args.voice_fixer_use_cuda,
 				#mode=mode,
 			)
 			fixed_output_voices.append(fixed)
 		output_voices = fixed_output_voices
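 	# "random" and "microphone" voices have no saved latents file, so only embed it when it exists
 	latents_path = f'{get_voice_dir()}/{voice}/cond_latents.pth'
 	if voice is not None and conditioning_latents is not None and os.path.isfile(latents_path):
 		with open(latents_path, 'rb') as f:
 			info['latents'] = base64.b64encode(f.read()).decode("ascii")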
 	if args.embed_output_metadata:
 		for name in progress.tqdm(audio_cache, desc="Embedding metadata..."):
 			info['text'] = audio_cache[name]['text']
 			info['time'] = audio_cache[name]['time']
 			metadata = music_tag.load_file(f"{outdir}/{voice}_{name}.wav")
 			metadata['lyrics'] = json.dumps(info) 
 			metadata.save()
 	if sample_voice is not None:
 		sample_voice = (tts.input_sample_rate, sample_voice.numpy())
 	print(f"Generation took {info['time']} seconds, saved to '{output_voices[0]}'\n")
 	info['seed'] = settings['use_deterministic_seed']
 	if 'latents' in info:
 		del info['latents']
 	with open(f'./config/generate.json', 'w', encoding="utf-8") as f:
 		f.write(json.dumps(info, indent='\t') )
 	stats = [
 		[ seed, "{:.3f}".format(info['time']) ]
 	]
 	return (
 		sample_voice,
 		output_voices,
 		stats,
 	)
 def setup_tortoise(restart=False):
 	global args
 	global tts
 	global voicefixer
 	if args.voice_fixer and not restart:
 		try:
 			from voicefixer import VoiceFixer
 			print("Initializing voice-fixer")
 			voicefixer = VoiceFixer()
 			print("Initialized voice-fixer")
 		except Exception as e:
 			print(f"Error occurred while trying to initialize voicefixer: {e}")
 	print("Initializing TorToiSe...")
 	tts = TextToSpeech(minor_optimizations=not args.low_vram)
 	print("TorToiSe initialized, ready for generation.")
 	return tts
--- a/src/webui.py
+++ b/src/webui.py
@ -0,0 +1,718 @@
 import os
 import argparse
 import time
 import json
 import base64
 import re
 import urllib.request
 import torch
 import torchaudio
 import music_tag
 import gradio as gr
 import gradio.utils
 from datetime import datetime
 import tortoise.api
 from tortoise.utils.audio import get_voice_dir
 from utils import *
 args = setup_args()
 def compute_latents(voice, voice_latents_chunks, progress=gr.Progress(track_tqdm=True)):
 	global tts
 	global args
 	try:
 		tts
 	except NameError:
 		raise gr.Error("TTS is still initializing...")
 	voice_samples, conditioning_latents = load_voice(voice, load_latents=False)
 	if voice_samples is None:
 		return
 	conditioning_latents = tts.get_conditioning_latents(voice_samples, return_mels=not args.latents_lean_and_mean, progress=progress, slices=voice_latents_chunks, force_cpu=args.force_cpu_for_conditioning_latents)
 	if len(conditioning_latents) == 4:
 		conditioning_latents = (conditioning_latents[0], conditioning_latents[1], conditioning_latents[2], None)
 	torch.save(conditioning_latents, f'{get_voice_dir()}/{voice}/cond_latents.pth')
 	return voice
 def update_presets(value):
 	PRESETS = {
 		'Ultra Fast': {'num_autoregressive_samples': 16, 'diffusion_iterations': 30, 'cond_free': False},
 		'Fast': {'num_autoregressive_samples': 96, 'diffusion_iterations': 80},
 		'Standard': {'num_autoregressive_samples': 256, 'diffusion_iterations': 200},
 		'High Quality': {'num_autoregressive_samples': 256, 'diffusion_iterations': 400},
 	}
 	if value in PRESETS:
 		preset = PRESETS[value]
 		return (gr.update(value=preset['num_autoregressive_samples']), gr.update(value=preset['diffusion_iterations']))
 	else:
 		return (gr.update(), gr.update())
 def read_generate_settings(file, read_latents=True, read_json=True):
 	j = None
 	latents = None
 	if file is not None:
 		if hasattr(file, 'name'):
 			file = file.name
 		if file[-4:] == ".wav":
 			metadata = music_tag.load_file(file)
 			if 'lyrics' in metadata:
 				j = json.loads(str(metadata['lyrics']))
 		elif file[-5:] == ".json":
 			with open(file, 'r', encoding="utf-8") as f:
 				j = json.load(f)
 	if j is None:
 		# callers rely on a (None, None) return, so report the problem instead of raising
 		print("No metadata found in audio file to read")
 	else:
 		if 'latents' in j:
 			if read_latents:
 				latents = base64.b64decode(j['latents'])
 			del j['latents']
 		if "time" in j:
 			j["time"] = "{:.3f}".format(j["time"])
 	return (
 		j,
 		latents,
 	)
 def import_voice(file, saveAs = None):
 	global args
 	j, latents = read_generate_settings(file, read_latents=True)
 	if j is not None and saveAs is None:
 		saveAs = j['voice']
 	if saveAs is None or saveAs == "":
 		raise gr.Error("Specify a voice name")
 	outdir = f'{get_voice_dir()}/{saveAs}/'
 	os.makedirs(outdir, exist_ok=True)
 	if latents:
 		with open(f'{outdir}/cond_latents.pth', 'wb') as f:
 			f.write(latents)
 		latents = f'{outdir}/cond_latents.pth'
 		print(f"Imported latents to {latents}")
 	else:
 		filename = file.name
 		if filename[-4:] != ".wav":
 			raise gr.Error("Please convert to a WAV first")
 		path = f"{outdir}/{os.path.basename(filename)}"
 		waveform, sampling_rate = torchaudio.load(filename)
 		if args.voice_fixer:
 			# resample to best bandwidth since voicefixer will do it anyways through librosa
 			if sampling_rate != 44100:
 				print(f"Resampling imported voice sample: {path}")
 				resampler = torchaudio.transforms.Resample(
 					sampling_rate,
 					44100,
 					lowpass_filter_width=16,
 					rolloff=0.85,
 					resampling_method="kaiser_window",
 					beta=8.555504641634386,
 				)
 				waveform = resampler(waveform)
 				sampling_rate = 44100
 			torchaudio.save(path, waveform, sampling_rate)
 			print(f"Running 'voicefixer' on voice sample: {path}")
 			voicefixer.restore(
 				input = path,
 				output = path,
 				cuda=get_device_name() == "cuda" and args.voice_fixer_use_cuda,
 				#mode=mode,
 			)
 		else:
 			torchaudio.save(path, waveform, sampling_rate)
 		print(f"Imported voice to {path}")
 def import_generate_settings(file="./config/generate.json"):
 	settings, _ = read_generate_settings(file, read_latents=False)
 	if settings is None:
 		return None
 	return (
 		None if 'text' not in settings else settings['text'],
 		None if 'delimiter' not in settings else settings['delimiter'],
 		None if 'emotion' not in settings else settings['emotion'],
 		None if 'prompt' not in settings else settings['prompt'],
 		None if 'voice' not in settings else settings['voice'],
 		None,
 		None,
 		None if 'seed' not in settings else settings['seed'],
 		None if 'candidates' not in settings else settings['candidates'],
 		None if 'num_autoregressive_samples' not in settings else settings['num_autoregressive_samples'],
 		None if 'diffusion_iterations' not in settings else settings['diffusion_iterations'],
 		0.8 if 'temperature' not in settings else settings['temperature'],
 		"DDIM" if 'diffusion_sampler' not in settings else settings['diffusion_sampler'],
 		8   if 'breathing_room' not in settings else settings['breathing_room'],
 		0.0 if 'cvvp_weight' not in settings else settings['cvvp_weight'],
 		0.8 if 'top_p' not in settings else settings['top_p'],
 		1.0 if 'diffusion_temperature' not in settings else settings['diffusion_temperature'],
 		1.0 if 'length_penalty' not in settings else settings['length_penalty'],
 		2.0 if 'repetition_penalty' not in settings else settings['repetition_penalty'],
 		2.0 if 'cond_free_k' not in settings else settings['cond_free_k'],
 		None if 'experimentals' not in settings else settings['experimentals'],
 	)
 def curl(url):
 	try:
 		req = urllib.request.Request(url, headers={'User-Agent': 'Python'})
 		conn = urllib.request.urlopen(req)
 		data = conn.read()
 		data = data.decode()
 		data = json.loads(data)
 		conn.close()
 		return data
 	except Exception as e:
 		print(e)
 		return None
 def check_for_updates():
 	if not os.path.isfile('./.git/FETCH_HEAD'):
 		print("Cannot check for updates: not from a git repo")
 		return False
 	with open(f'./.git/FETCH_HEAD', 'r', encoding="utf-8") as f:
 		head = f.read()
 	match = re.findall(r"^([a-f0-9]+).+?https:\/\/(.+?)\/(.+?)\/(.+?)\n", head)
 	if match is None or len(match) == 0:
 		print("Cannot check for updates: cannot parse FETCH_HEAD")
 		return False
 	match = match[0]
 	local = match[0]
 	host = match[1]
 	owner = match[2]
 	repo = match[3]
 	res = curl(f"https://{host}/api/v1/repos/{owner}/{repo}/branches/") #this only works for gitea instances
 	if res is None or len(res) == 0:
 		print("Cannot check for updates: cannot fetch from remote")
 		return False
 	remote = res[0]["commit"]["id"]
 	if remote != local:
 		print(f"New version found: {local[:8]} => {remote[:8]}")
 		return True
 	return False
 def reload_tts():
 	global tts
 	del tts
 	tts = setup_tortoise(restart=True)
 def cancel_generate():
 	tortoise.api.STOP_SIGNAL = True
 def get_voice_list(dir=get_voice_dir()):
 	os.makedirs(dir, exist_ok=True)
 	return sorted([d for d in os.listdir(dir) if os.path.isdir(os.path.join(dir, d)) and len(os.listdir(os.path.join(dir, d))) > 0 ]) + ["microphone", "random"]
 def update_voices():
 	return (
 		gr.Dropdown.update(choices=get_voice_list()),
 		gr.Dropdown.update(choices=get_voice_list("./results/")),
 	)
 def export_exec_settings( listen, share, check_for_updates, models_from_local_only, low_vram, embed_output_metadata, latents_lean_and_mean, voice_fixer, voice_fixer_use_cuda, force_cpu_for_conditioning_latents, device_override, sample_batch_size, concurrency_count, output_sample_rate, output_volume ):
 	global args
 	args.listen = listen
 	args.share = share
 	args.check_for_updates = check_for_updates
 	args.models_from_local_only = models_from_local_only
 	args.low_vram = low_vram
 	args.force_cpu_for_conditioning_latents = force_cpu_for_conditioning_latents
 	args.device_override = device_override
 	args.sample_batch_size = sample_batch_size
 	args.embed_output_metadata = embed_output_metadata
 	args.latents_lean_and_mean = latents_lean_and_mean
 	args.voice_fixer = voice_fixer
 	args.voice_fixer_use_cuda = voice_fixer_use_cuda
 	args.concurrency_count = concurrency_count
 	args.output_sample_rate = output_sample_rate
 	args.output_volume = output_volume
 	settings = {
 		'listen': args.listen if args.listen else None,
 		'share': args.share,
 		'low-vram':args.low_vram,
 		'check-for-updates':args.check_for_updates,
 		'models-from-local-only':args.models_from_local_only,
 		'force-cpu-for-conditioning-latents': args.force_cpu_for_conditioning_latents,
 		'device-override': args.device_override,
 		'sample-batch-size': args.sample_batch_size,
 		'embed-output-metadata': args.embed_output_metadata,
 		'latents-lean-and-mean': args.latents_lean_and_mean,
 		'voice-fixer': args.voice_fixer,
 		'voice-fixer-use-cuda': args.voice_fixer_use_cuda,
 		'concurrency-count': args.concurrency_count,
 		'output-sample-rate': args.output_sample_rate,
 		'output-volume': args.output_volume,
 	}
 	with open(f'./config/exec.json', 'w', encoding="utf-8") as f:
 		f.write(json.dumps(settings, indent='\t') )
 def setup_gradio():
 	global args
 	global ui
 	if not args.share:
 		def noop(function, return_value=None):
 			def wrapped(*args, **kwargs):
 				return return_value
 			return wrapped
 		gradio.utils.version_check = noop(gradio.utils.version_check)
 		gradio.utils.initiated_analytics = noop(gradio.utils.initiated_analytics)
 		gradio.utils.launch_analytics = noop(gradio.utils.launch_analytics)
 		gradio.utils.integration_analytics = noop(gradio.utils.integration_analytics)
 		gradio.utils.error_analytics = noop(gradio.utils.error_analytics)
 		gradio.utils.log_feature_analytics = noop(gradio.utils.log_feature_analytics)
 		#gradio.utils.get_local_ip_address = noop(gradio.utils.get_local_ip_address, 'localhost')
 	if args.models_from_local_only:
 		os.environ['TRANSFORMERS_OFFLINE']='1'
 	with gr.Blocks() as ui:
 		with gr.Tab("Generate"):
 			with gr.Row():
 				with gr.Column():
 					text = gr.Textbox(lines=4, label="Prompt")
 			with gr.Row():
 				with gr.Column():
 					delimiter = gr.Textbox(lines=1, label="Line Delimiter", placeholder="\\n")
 					emotion = gr.Radio(
 						["Happy", "Sad", "Angry", "Disgusted", "Arrogant", "Custom"],
 						value="Custom",
 						label="Emotion",
 						type="value",
 						interactive=True
 					)
 					prompt = gr.Textbox(lines=1, label="Custom Emotion + Prompt (if selected)")
 					voice = gr.Dropdown(
 						get_voice_list(),
 						label="Voice",
 						type="value",
 					)
 					mic_audio = gr.Audio(
 						label="Microphone Source",
 						source="microphone",
 						type="filepath",
 					)
 					refresh_voices = gr.Button(value="Refresh Voice List")
 					voice_latents_chunks = gr.Slider(label="Voice Chunks", minimum=1, maximum=64, value=1, step=1)
 					recompute_voice_latents = gr.Button(value="(Re)Compute Voice Latents")
 					recompute_voice_latents.click(compute_latents,
 						inputs=[
 							voice,
 							voice_latents_chunks,
 						],
 						outputs=voice,
 					)
 					prompt.change(fn=lambda value: gr.update(value="Custom"),
 						inputs=prompt,
 						outputs=emotion
 					)
 					mic_audio.change(fn=lambda value: gr.update(value="microphone"),
 						inputs=mic_audio,
 						outputs=voice
 					)
 				with gr.Column():
 					candidates = gr.Slider(value=1, minimum=1, maximum=6, step=1, label="Candidates")
 					seed = gr.Number(value=0, precision=0, label="Seed")
 					preset = gr.Radio(
 						["Ultra Fast", "Fast", "Standard", "High Quality"],
 						label="Preset",
 						type="value",
 					)
 					num_autoregressive_samples = gr.Slider(value=128, minimum=0, maximum=512, step=1, label="Samples")
 					diffusion_iterations = gr.Slider(value=128, minimum=0, maximum=512, step=1, label="Iterations")
 					temperature = gr.Slider(value=0.2, minimum=0, maximum=1, step=0.1, label="Temperature")
 					breathing_room = gr.Slider(value=8, minimum=1, maximum=32, step=1, label="Pause Size")
 					diffusion_sampler = gr.Radio(
 						["P", "DDIM"], # + ["K_Euler_A", "DPM++2M"],
 						value="P",
 						label="Diffusion Samplers",
 						type="value",
 					)
 					preset.change(fn=update_presets,
 						inputs=preset,
 						outputs=[
 							num_autoregressive_samples,
 							diffusion_iterations,
 						],
 					)
 					show_experimental_settings = gr.Checkbox(label="Show Experimental Settings")
 					reset_generation_settings_button = gr.Button(value="Reset to Default")
 				with gr.Column(visible=False) as col:
 					experimental_column = col
 					experimental_checkboxes = gr.CheckboxGroup(["Half Precision", "Conditioning-Free"], value=["Conditioning-Free"], label="Experimental Flags")
 					cvvp_weight = gr.Slider(value=0, minimum=0, maximum=1, label="CVVP Weight")
 					top_p = gr.Slider(value=0.8, minimum=0, maximum=1, label="Top P")
 					diffusion_temperature = gr.Slider(value=1.0, minimum=0, maximum=1, label="Diffusion Temperature")
 					length_penalty = gr.Slider(value=1.0, minimum=0, maximum=8, label="Length Penalty")
 					repetition_penalty = gr.Slider(value=2.0, minimum=0, maximum=8, label="Repetition Penalty")
 					cond_free_k = gr.Slider(value=2.0, minimum=0, maximum=4, label="Conditioning-Free K")
 					show_experimental_settings.change(
 						fn=lambda x: gr.update(visible=x),
 						inputs=show_experimental_settings,
 						outputs=experimental_column
 					)
 				with gr.Column():
 					submit = gr.Button(value="Generate")
 					stop = gr.Button(value="Stop")
 					generation_results = gr.Dataframe(label="Results", headers=["Seed", "Time"], visible=False)
 					source_sample = gr.Audio(label="Source Sample", visible=False)
 					output_audio = gr.Audio(label="Output")
 					candidates_list = gr.Dropdown(label="Candidates", type="value", visible=False)
 					output_pick = gr.Button(value="Select Candidate", visible=False)
 		with gr.Tab("History"):
 			with gr.Row():
 				with gr.Column():
 					headers = {
 						"Name": "",
 						"Samples": "num_autoregressive_samples",
 						"Iterations": "diffusion_iterations",
 						"Temp.": "temperature",
 						"Sampler": "diffusion_sampler",
 						"CVVP": "cvvp_weight",
 						"Top P": "top_p",
 						"Diff. Temp.": "diffusion_temperature",
 						"Len Pen": "length_penalty",
 						"Rep Pen": "repetition_penalty",
 						"Cond-Free K": "cond_free_k",
 						"Time": "time",
 					}
 					history_info = gr.Dataframe(label="Results", headers=list(headers.keys()))
 			with gr.Row():
 				with gr.Column():
 					history_voices = gr.Dropdown(
 						get_voice_list("./results/"),
 						label="Voice",
 						type="value",
 					)
 					history_view_results_button = gr.Button(value="View Files")
 				with gr.Column():
 					history_results_list = gr.Dropdown(label="Results",type="value", interactive=True)
 					history_view_result_button = gr.Button(value="View File")
 				with gr.Column():
 					history_audio = gr.Audio()
 					history_copy_settings_button = gr.Button(value="Copy Settings")
 				def history_view_results( voice ):
 					results = []
 					files = []
 					outdir = f"./results/{voice}/"
 					for i, file in enumerate(sorted(os.listdir(outdir))):
 						if file[-4:] != ".wav":
 							continue
 						metadata, _ = read_generate_settings(f"{outdir}/{file}", read_latents=False)
 						if metadata is None:
 							continue
 						values = []
 						for k in headers:
 							v = file
 							if k != "Name":
 								v = metadata[headers[k]]
 							values.append(v)
 						files.append(file)
 						results.append(values)
 					return (
 						results,
 						gr.Dropdown.update(choices=sorted(files))
 					)
 				history_view_results_button.click(
 					fn=history_view_results,
 					inputs=history_voices,
 					outputs=[
 						history_info,
 						history_results_list,
 					]
 				)
 				history_view_result_button.click(
 					fn=lambda voice, file: f"./results/{voice}/{file}",
 					inputs=[
 						history_voices,
 						history_results_list,
 					],
 					outputs=history_audio
 				)
 		with gr.Tab("Utilities"):
 			with gr.Row():
 				with gr.Column():
 					audio_in = gr.File(type="file", label="Audio Input", file_types=["audio"])
 					copy_button = gr.Button(value="Copy Settings")
 					import_voice_name = gr.Textbox(label="Voice Name")
 					import_voice_button = gr.Button(value="Import Voice")
 				with gr.Column():
 					metadata_out = gr.JSON(label="Audio Metadata")
 					latents_out = gr.File(type="binary", label="Voice Latents")
 					def read_generate_settings_proxy(file, saveAs='.temp'):
 						j, latents = read_generate_settings(file)
 						if latents:
 							outdir = f'{get_voice_dir()}/{saveAs}/'
 							os.makedirs(outdir, exist_ok=True)
 							with open(f'{outdir}/cond_latents.pth', 'wb') as f:
 								f.write(latents)
 							latents = f'{outdir}/cond_latents.pth'
 						return (
 							j,
 							gr.update(value=latents, visible=latents is not None),
 							None if j is None else j['voice']
 						)
 					audio_in.upload(
 						fn=read_generate_settings_proxy,
 						inputs=audio_in,
 						outputs=[
 							metadata_out,
 							latents_out,
 							import_voice_name
 						]
 					)
 				import_voice_button.click(
 					fn=import_voice,
 					inputs=[
 						audio_in,
 						import_voice_name,
 					]
 				)
 		with gr.Tab("Settings"):
 			with gr.Row():
 				exec_inputs = []
 				with gr.Column():
 					exec_inputs = exec_inputs + [
 						gr.Textbox(label="Listen", value=args.listen, placeholder="127.0.0.1:7860/"),
 						gr.Checkbox(label="Public Share Gradio", value=args.share),
 						gr.Checkbox(label="Check For Updates", value=args.check_for_updates),
 						gr.Checkbox(label="Only Load Models Locally", value=args.models_from_local_only),
 						gr.Checkbox(label="Low VRAM", value=args.low_vram),
 						gr.Checkbox(label="Embed Output Metadata", value=args.embed_output_metadata),
 						gr.Checkbox(label="Slimmer Computed Latents", value=args.latents_lean_and_mean),
 						gr.Checkbox(label="Voice Fixer", value=args.voice_fixer),
 						gr.Checkbox(label="Use CUDA for Voice Fixer", value=args.voice_fixer_use_cuda),
 						gr.Checkbox(label="Force CPU for Conditioning Latents", value=args.force_cpu_for_conditioning_latents),
 						gr.Textbox(label="Device Override", value=args.device_override),
 					]
 					gr.Button(value="Check for Updates").click(check_for_updates)
 					gr.Button(value="Reload TTS").click(reload_tts)
 				with gr.Column():
 					exec_inputs = exec_inputs + [
 						gr.Number(label="Sample Batch Size", precision=0, value=args.sample_batch_size),
 						gr.Number(label="Concurrency Count", precision=0, value=args.concurrency_count),
 						gr.Number(label="Output Sample Rate", precision=0, value=args.output_sample_rate),
 						gr.Slider(label="Output Volume", minimum=0, maximum=2, value=args.output_volume),
 					]
 				for i in exec_inputs:
 					i.change(
 						fn=export_exec_settings,
 						inputs=exec_inputs
 					)
 		input_settings = [
 			text,
 			delimiter,
 			emotion,
 			prompt,
 			voice,
 			mic_audio,
 			voice_latents_chunks,
 			seed,
 			candidates,
 			num_autoregressive_samples,
 			diffusion_iterations,
 			temperature,
 			diffusion_sampler,
 			breathing_room,
 			cvvp_weight,
 			top_p,
 			diffusion_temperature,
 			length_penalty,
 			repetition_penalty,
 			cond_free_k,
 			experimental_checkboxes,
 		]
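 		# Wrapper around generate(): Gradio passes the input_settings values positionally,
 		# so each setting is listed explicitly here and forwarded in the same order.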
 		def run_generation(
 			text,
 			delimiter,
 			emotion,
 			prompt,
 			voice,
 			mic_audio,
 			voice_latents_chunks,
 			seed,
 			candidates,
 			num_autoregressive_samples,
 			diffusion_iterations,
 			temperature,
 			diffusion_sampler,
 			breathing_room,
 			cvvp_weight,
 			top_p,
 			diffusion_temperature,
 			length_penalty,
 			repetition_penalty,
 			cond_free_k,
 			experimental_checkboxes,
 			progress=gr.Progress(track_tqdm=True)
 		):
 			try:
 				sample, outputs, stats = generate(
 					text,
 					delimiter,
 					emotion,
 					prompt,
 					voice,
 					mic_audio,
 					voice_latents_chunks,
 					seed,
 					candidates,
 					num_autoregressive_samples,
 					diffusion_iterations,
 					temperature,
 					diffusion_sampler,
 					breathing_room,
 					cvvp_weight,
 					top_p,
 					diffusion_temperature,
 					length_penalty,
 					repetition_penalty,
 					cond_free_k,
 					experimental_checkboxes,
 					progress
 				)
 			except Exception as e:
 				message = str(e)
 				if message == "Kill signal detected":
 					reload_tts()
 				raise gr.Error(message)
 			return (
 				outputs[0],
 				gr.update(value=sample, visible=sample is not None),
 				gr.update(choices=outputs, value=outputs[0], visible=len(outputs) > 1, interactive=True),
 				gr.update(visible=len(outputs) > 1),
 				gr.update(value=stats, visible=True),
 			)
 		refresh_voices.click(update_voices,
 			inputs=None,
 			outputs=[
 				voice,
 				history_voices
 			]
 		)
 		output_pick.click(
 			lambda x: x,
 			inputs=candidates_list,
 			outputs=output_audio,
 		)
 		submit.click(
 			lambda: (gr.update(visible=False), gr.update(visible=False), gr.update(visible=False), gr.update(visible=False)),
 			outputs=[source_sample, candidates_list, output_pick, generation_results],
 		)
 		submit_event = submit.click(run_generation,
 			inputs=input_settings,
 			outputs=[output_audio, source_sample, candidates_list, output_pick, generation_results],
 		)
 		copy_button.click(import_generate_settings,
 			inputs=audio_in, # JSON elements cannot be used as inputs
 			outputs=input_settings
 		)
 		def reset_generation_settings():
 			with open('./config/generate.json', 'w', encoding="utf-8") as f:
 				f.write(json.dumps({}, indent='\t'))
 			return import_generate_settings()
 		reset_generation_settings_button.click(
 			fn=reset_generation_settings,
 			inputs=None,
 			outputs=input_settings
 		)
 		def history_copy_settings( voice, file ):
 			settings = import_generate_settings( f"./results/{voice}/{file}" )
 			return settings
 		history_copy_settings_button.click(history_copy_settings,
 			inputs=[
 				history_voices,
 				history_results_list,
 			],
 			outputs=input_settings
 		)
 		if os.path.isfile('./config/generate.json'):
 			ui.load(import_generate_settings, inputs=None, outputs=input_settings)
 		if args.check_for_updates:
 			ui.load(check_for_updates)
 		stop.click(fn=cancel_generate, inputs=None, outputs=None, cancels=[submit_event])
 	ui.queue(concurrency_count=args.concurrency_count)
 	webui = ui
 	return webui
--- a/start.bat
+++ b/start.bat
@ -0,0 +1,4 @@
 call .\venv\Scripts\activate.bat
 python .\src\main.py
 deactivate
 pause
--- a/start.sh
+++ b/start.sh
@ -0,0 +1,3 @@
 source ./venv/bin/activate
 python3 ./src/main.py
 deactivate
--- a/update-force.bat
+++ b/update-force.bat
@ -0,0 +1,3 @@
 git fetch --all
 git reset --hard origin/main
 call .\update.bat
--- a/update-force.sh
+++ b/update-force.sh
@ -0,0 +1,3 @@
 git fetch --all
 git reset --hard origin/main
 ./update.sh
--- a/update.bat
+++ b/update.bat
@ -0,0 +1,7 @@
 git pull
 python -m venv venv
 call .\venv\Scripts\activate.bat
 python -m pip install --upgrade pip
 python -m pip install -r ./requirements.txt
 deactivate
 pause
--- a/update.sh
+++ b/update.sh
@ -0,0 +1,6 @@
 git pull
 python3 -m venv venv
 source ./venv/bin/activate
 python -m pip install --upgrade pip
 python -m pip install -r ./requirements.txt
 deactivate
--- a/voices/.gitkeep
+++ b/voices/.gitkeep