Wiki squash

master
mrq 2023-02-17 19:08:06 +07:00 committed by mrq
commit da17c022ad
11 changed files with 551 additions and 0 deletions

@ -0,0 +1,31 @@
## Sourcing
Now that the easy part of setting up is done, it's time for the hard part: sourcing your voice samples.
As this fork was worked on entirely in the scope of videogame characters, the sources below pertain mostly to that. There might be some overlap, given some of these are sourced from other /v/irgins, so you might as well take a peek in case your not-videogames voice is in there:
* [Sounds Resource](https://www.sounds-resource.com/) has a plethora of sound content, including voice lines.
* [Silent Hill Media](http://silenthillmedia.net/home.htm) has voice lines for Silent Hill characters, albeit given the nature, some of them might be mixed together (for example, exchanges between Maria and James).
* [the AIVoiceStuff rentry from /v/](https://rentry.org/AIVoiceStuff) during the prime of AI voice cloning threads there has a good repository of sources, not just for vidya. However, since that rentry is scoped around 11.AI, the provided output examples are from 11.AI. I suppose you can use those to compare against TorToiSe's quality. Given cloning has been removed from free tier (and the threads pretty much died down), that rentry isn't getting updated.
* [a mega.nz collection from /v/](https://mega.nz/folder/AHtCyYRa#WoWv9ug6vg27XfXOjfga-Q), also compiled during the prime of AI voice cloning threads, has a decent collection of samples, not just for vidya.
**!**NOTE**!**: I'm merely providing easy directions to acquire sources.
## Preparing
Now that the tough part is dealt with, it's time to prepare voice clips to use.
Unlike training embeddings for AI image generations, preparing a "dataset" for voice cloning is very simple.
As a general rule of thumb, try to source clips that aren't noisy, contain only the subject you're trying to clone, and don't contain any non-words (like yells, guttural noises, etc.). If you must, run your source through a background music/noise remover (how to do so is an exercise left to the reader). It isn't entirely a detriment if you're unable to provide clean audio, however. Just be wary that you might have some headaches getting acceptable output.
Nine times out of ten, you should be fine using as many clips as possible. There's (now) no preference between combining your audio into one file, or leaving it split. However, if you're aiming for a specific delivery, it *should* be best for you to narrow down to just using that as your provided source (for example, changing one word in a line).
There's no hard specifics on how many, or how long, your sources should be.
If you're looking to trim your clips, in my opinion, ~~Audacity~~ Tenacity works well enough. Power users with FFMPEG already installed can simply use the provided conversion script in `.\tortoise\convert\`.
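If you'd rather script the conversion yourself, below is a minimal sketch of the same idea in Python, driving `ffmpeg` directly. The folder names and the 22.05 kHz mono target are my own assumptions for illustration, not necessarily what the provided conversion script does.
```
# Hedged sketch: batch-convert raw clips to mono WAVs with ffmpeg.
# The paths and the 22050 Hz target are assumptions, not the repo's own script.
import subprocess
from pathlib import Path

SOURCE_DIR = Path("./sources/my-voice")   # hypothetical folder of raw clips
TARGET_DIR = Path("./voices/my-voice")    # folder the web UI reads voices from
TARGET_DIR.mkdir(parents=True, exist_ok=True)

for clip in SOURCE_DIR.iterdir():
    if clip.suffix.lower() not in (".mp3", ".ogg", ".flac", ".m4a", ".wav"):
        continue
    out = TARGET_DIR / (clip.stem + ".wav")
    # -ac 1 = mono, -ar 22050 = a common rate for TorToiSe conditioning clips
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip), "-ac", "1", "-ar", "22050", str(out)],
        check=True,
    )
```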
After preparing your clips as WAVs, you can now add in your new voice source:
* open up the `voices` folder
* create a new folder in whatever name you want
* dump your clips into that folder

@ -0,0 +1,59 @@
## Generate
### Quick Usage
If you're easily overwhelmed with the below section, fear not, as it's very simple. Just:
* input your text
* select your voice
* increase the `Voice Chunks` slider as needed
* click `Ultra Fast`
* click `Generate`
If you're not too happy with the outputted result, you're free to play around with the below settings to dial in what you want.
### Detailed Usage
You'll be presented with a bunch of options in the default `Generate` tab, but do not be overwhelmed, as most of the defaults are sane. Below is a rough explanation of what each input does:
* `Prompt`: text you want to be read. You wrap text in `[brackets]` for "prompt engineering", where it'll affect the output, but those words won't actually be read.
* `Line Delimiter`: String to split the prompt into pieces. The stitched clip will be stored as `combined.wav`. To split by a new line, enter `\n`.
* `Emotion`: the "emotion" used for the delivery. This is a shortcut to utilizing "prompt engineering" by starting with `[I am really <emotion>,]` in your prompt. This is merely a suggestion, not a guarantee.
* `Custom Emotion + Prompt`: a non-preset "emotion" used for the delivery. This is a shortcut to utilizing "prompt engineering" by starting with `[<emotion>]` in your prompt.
* `Voice`: the voice you want to clone. You can select `microphone` if you want to use input from your microphone. You can also use `random` for it to use a randomly generated voice.
* `Microphone Source`: Use your own voice from a line-in source.
* `Voice Chunks`: a slider to determine how many pieces your voice samples get split into when computing the conditional latents. Fewer chunks mean bigger pieces and more VRAM needed; more chunks mean smaller pieces and less VRAM, but a higher chance of slicing mid-phoneme. Playing around with this will most definitely affect the output of your cloning, as some datasets work better with different values (a conceptual sketch follows this list).
* `Refresh Voice List`: updates the voice list
* `(Re)Compute Voice Latents`: (re)computes the conditional latents for a given voice.
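To make the `Voice Chunks` slider above a bit more concrete, here's a tiny conceptual sketch of splitting a voice sample into pieces before computing latents. This is purely an illustration of the idea, not the repo's actual implementation.
```
# Conceptual sketch of the "Voice Chunks" idea: split the combined voice audio
# into N pieces before computing conditional latents (illustration only).
import torch

def chunk_waveform(waveform: torch.Tensor, chunks: int) -> list:
    """Split a mono waveform of shape (samples,) into `chunks` pieces."""
    if chunks <= 1:
        return [waveform]
    size = waveform.shape[0] // chunks
    return [waveform[i * size:(i + 1) * size] for i in range(chunks)]

# Fewer chunks -> bigger pieces -> more VRAM needed per piece.
# More chunks  -> smaller pieces -> less VRAM, but a cut may land mid-phoneme.
fake_voice = torch.randn(22050 * 60)           # pretend: 60 seconds at 22.05 kHz
pieces = chunk_waveform(fake_voice, chunks=4)  # each piece is ~15 seconds
```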
Below are generation settings, which affect the technical aspects of how your inputs are processed:
* `Candidates`: number of outputs to generate, starting from the best candidate. Depending on your iteration steps, generating the final sound files could be cheap, but candidates only offer alternatives pulled from the samples already generated (in other words, later candidates perform worse), so don't feel compelled to generate a ton of them.
* `Seed`: initializes the PRNG to this value. Use this if you want to reproduce a generated voice.
* `Preset`: shortcut values for sample count and iteration steps. Clicking a preset will update its corresponding values. Higher presets result in better quality at the cost of computation time.
* `Samples`: analogous to samples in image generation. More samples = better resemblance / clone quality, at the cost of performance. This strictly affects clone quality.
* `Iterations`: influences audio sound quality in the final output. More iterations = higher quality sound. This step is relatively cheap, so do not be discouraged from increasing this. This strictly affects quality in the actual sound.
* `Temperature`: how much randomness to introduce to the generated samples. Lower values = better resemblance to the source samples, but some temperature is still required for great output.
- **!**NOTE**!**: This value is very inconsistent and entirely depends on the input voice. In other words, some voices will be receptive to playing with this value, while others won't make much of a difference.
- **!**NOTE**!**: some voices will be very receptive to this, where it speaks slowly at low temperatures, but nudging it a hair and it speaks too fast.
* `Pause Size`: Governs how large pauses are at the end of a clip (in token size, not seconds). Increase this if your output gets cut off at the end.
- **!**NOTE**!**: sometimes this is merely a suggestion and not a guarantee. Some generations will be sensitive to this, while others will not.
- **!**NOTE**!**: too large of a pause size can lead to unexpected behavior.
* `Diffusion Sampler`: sampler method during the diffusion pass. Currently, only `P` and `DDIM` are added, but neither seems to offer any substantial difference in my short tests.
`P` refers to the default, vanilla sampling method in `diffusion.py`.
To reiterate, this is ***only*** used for the diffusion decoding path, after the autoregressive outputs are generated. A conceptual sketch of how these settings fit together follows.
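This is a purely conceptual sketch with dummy stand-ins for the real models; the function names are placeholders, not TorToiSe's actual API. `Samples` feeds the autoregressive pass, CLVP ranks those samples into `Candidates`, and `Iterations` drives the diffusion decode of each candidate.
```
import random

def autoregressive_pass(prompt, latents):      # stub: would emit speech tokens
    return [random.randrange(8192) for _ in range(64)]

def clvp_score(prompt, tokens):                # stub: higher = better match to the text
    return random.random()

def diffusion_decode(tokens, latents, steps):  # stub: would return a MEL spectrogram
    return tokens

def vocoder(mel):                              # stub: MEL spectrogram -> waveform
    return mel

def generate(prompt, latents, samples=96, iterations=80, candidates=1):
    all_samples = [autoregressive_pass(prompt, latents) for _ in range(samples)]
    ranked = sorted(all_samples, key=lambda s: clvp_score(prompt, s), reverse=True)
    return [vocoder(diffusion_decode(s, latents, steps=iterations))
            for s in ranked[:candidates]]

clips = generate("Hello there.", latents=None, samples=16, iterations=30, candidates=2)
```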
Below is an explanation of the experimental flags. Messing with these might impact performance or quality; they're only exposed for those who know what they're doing.
* `Half-Precision`: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, due to it making computations slower.
* `Conditional Free`: a quality boosting improvement at the cost of some performance. Enabled by default, as I think the penalty is negligible in the end.
* `CVVP Weight`: governs how much weight the CVVP model should influence candidates. The original documentation mentions this is deprecated as it does not really influence things, but you're still free to play around with it.
Currently, setting this requires regenerating your voice latents, as I forgot to have it return some extra data that weighing against the CVVP model uses. Oops.
Setting this to 1 leads to bad behavior.
* `Top P`: P value used in nucleus sampling; lower values mean the decoder produces more "likely" (aka boring) outputs (a sketch of nucleus sampling follows this list).
* `Diffusion Temperature`: the variance of the noise fed into the diffusion model; values at 0 are the "mean" prediction of the diffusion network and will sound bland and smeared.
* `Length Penalty`: a length penalty applied to the autoregressive decoder; higher settings causes the model to produce more terse outputs.
* `Repetition Penalty`: a penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc.
* `Conditioning-Free K`: determines the balance between the conditioning-free signal and the conditioning-present signal.
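As a reference for the `Top P` flag above, here's a minimal sketch of nucleus (top-p) sampling with a temperature knob; it illustrates the technique in general, not the exact code path TorToiSe uses.
```
# Minimal nucleus (top-p) sampling sketch; illustrative only.
import torch

def top_p_sample(logits: torch.Tensor, top_p: float = 0.8, temperature: float = 1.0) -> int:
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    keep = (cumulative - sorted_probs) < top_p
    keep[0] = True                      # always keep the single most likely token
    filtered = sorted_probs * keep
    filtered = filtered / filtered.sum()
    choice = torch.multinomial(filtered, 1)
    return int(sorted_idx[choice])

# Lower top_p restricts sampling to the most "likely" (aka boring) tokens.
next_token = top_p_sample(torch.randn(256), top_p=0.8, temperature=0.9)
```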
After you fill everything out, click `Run`, and wait for your output in the output window. The sampled voice is also returned, but if you're using multiple files, it'll return the first file, rather than a combined file.
All outputs are saved under `./result/[voice name]/`. On some browsers, you're able to directly download the file with the three-dot menu in the HTML5 audio element.
To save you from headaches, I strongly recommend playing around with shorter sentences first to find the right values for the voice you're using before generating longer sentences.

@ -0,0 +1,9 @@
## History
This tab offers a rudimentary way of viewing past results.
With it, you just select a voice, then you can quickly view their generation settings.
To play a file, select a specific file with the second dropdown list.
To reuse a voice file's settings, click `Copy Settings`.

@ -0,0 +1,55 @@
# AI Voice Cloning
This [repo](https://git.ecker.tech/mrq/ai-voice-cloning)/[rentry](https://rentry.org/AI-Voice-Cloning/) aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows/Linux, as well as a stepping stone for anons that genuinely want to play around with [TorToiSe](https://github.com/neonbjb/tortoise-tts).
Similar to my own findings for Stable Diffusion image generation, this rentry may appear a little disheveled as I note my new findings with TorToiSe. Please keep this in mind if the guide seems to shift a bit or sound confusing.
>\>Ugh... why bother when I can just abuse 11.AI?
You're more than welcome to, but TorToiSe is shaping up to be a very promising tool, especially with finetuning being feasible.
This is not endorsed by [neonbjb](https://github.com/neonbjb/). I do not expect this to run into any ethical issues, as it seems (like me), this is mostly for making funny haha vidya characters say funny lines.
## Glossary
To try and keep the terminology used here (somewhat) consistent and coherent, below are a list of terms, and their definitions (or at least, the way I'm using them):
* `voice cloning`: synthesizing speech to accurately replicate a subject's voice.
* `input clips` / `voice clips` / `audio input` / `voice samples` : the original voice source of the subject you're trying to clone.
* `waveform`: the raw audio.
* `sampling rate`: the number of samples per second in a waveform; by the Nyquist theorem, it can only represent frequencies up to half the sampling rate.
* `voice latents` / `conditional latents` / `latents`: computed traits of a voice.
* `autoregressive samples` (`samples` / `tokens`): the initial generation pass to output tokens, and (usually) the most computationally expensive. More samples = better "cloning".
* `CLVP`: Contrastive Language-Voice Pretraining: an analog to CLIP, but for voices. After the autoregressive samples pass, those samples/tokens are compared against the CLVP to find the best candidates.
* `CVVP`: Contrastive Voice-Voice Pretraining: a (deprecated) model that can be weighed in conjunction with the CLVP.
* `candidates`: results from the comparing against the CLVP/CVVP models. (Assumed to be) ordered from best to worst.
* `diffusion decoder` / `vocoder`: the passes responsible for decoding the tokens into a MEL spectrogram (diffusion decoder) and then into a waveform (vocoder).
* `diffusion iterations`: how many passes to put into generating the output waveform. More iterations = better audio quality.
* `diffusion sampler` / `sampler`: the sampling method used during the diffusion decoding pass, albeit a bit of a misnomer. Currently, only two samplers are implemented.
* `training` / `finetuning`: re-training the model to learn better traits for a given use-case, be it voice traits, accents, or even languages.
* `dataset`: an LJSpeech formatted text file with transcriptions assigned to voice files
* `learning rate`: the rate of change when training a model. Lower values are safer but require more time. Higher values learn faster, at the risk of frying the model.
* `epoch`: a unit related to training, equal to the number of iterations to complete one entire pass over your dataset.
* `iteration` / `step`: one training step, processing one batch's worth of data (a quick arithmetic example follows this list)
* `learning rate schedule`: a list of epoch/iteration points at which to decay the learning rate
* `half precision`: trains at a lower precision to (theoretically) reduce VRAM consumption and increase throughput. Entirely useless when using...
* `bitsandbytes optimizations`: trains at a lower precision (integer8) by leveraging dedicated silicon and other optimizations to achieve large reductions in VRAM consumption. Arguably less stable, due to the nature of quantizing, but the downside is negligible compared to what it offers.
* `loss rate`: a value measuring how much a model's generated output deviates from the source output. Lower values = better, but too low a value results in overfitment and effectively botches your model, as extrapolation becomes tougher. I'd argue a value around `0.1` is fine, as it works for Textual Inversion, but I'm having second thoughts. Just play around with different iterations of your models.
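Tying a few of the training terms above together, here's a quick arithmetic example with made-up numbers:
```
# Quick arithmetic tying the glossary terms together (example numbers only).
dataset_size = 120       # lines in your LJSpeech-formatted dataset
batch_size = 30          # samples processed per iteration/step
iterations_per_epoch = dataset_size // batch_size  # 4 steps = one full pass = one epoch
epochs = 500
total_iterations = epochs * iterations_per_epoch   # 2000 training steps in total
print(iterations_per_epoch, total_iterations)
```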
## Modifications
My fork boasts the following additions, fixes, and optimizations:
* a competent web UI made in Gradio to expose a lot of tunables and options
* cleaned up output structure of resulting audio files
* caching computed conditional latents for faster re-runs
- additionally, regenerating them if the script detects they're out of date
* uses the entire audio sample instead of the first four seconds of each sound file, for better reproduction
* activated unused DDIM sampler
* use of some optimizations like `kv_cache`ing for the autoregression sample pass, and keeping data on GPU
* compatibility with DirectML
* easy install scripts
* integrated training
* very simple training configuration tool
* LJSpeech-formatted dataset creation
* leverages bitsandbytes for training/finetuning, and (albeit unnoticeable) for inferencing
* and more!

@ -0,0 +1,101 @@
## Colab Notebook
A colab-ready notebook to quickly set up and use this repo is included and available [here](https://git.ecker.tech/mrq/ai-voice-cloning/raw/branch/master/notebook.ipynb).
Simply go [here](https://colab.research.google.com/) and upload the file.
For the unfortunate souls using Paperspace, this notebook should also work there.
## Installing
Outside of the very small prerequisites, everything needed to get TorToiSe working is included in the repo.
### Pre-Requirements
Windows:
* Python: https://www.python.org/downloads/windows/
- Tested on python3.9: https://www.python.org/downloads/release/python-3913/
- Briefly tested on python3.10
* Git: https://git-scm.com/download/win
* CUDA drivers, if NVIDIA
* FFMPEG: https://ffmpeg.org/download.html#build-windows
- only needed when preparing datasets for training/finetuning
Linux:
* python3.x (tested with 3.10)
* git
* ROCm for AMD, CUDA for NVIDIA
* FFMPEG:
- only needed when preparing datasets for training/finetuning
#### CUDA Version
For NVIDIA cards, the setup script assumes your card supports CUDA 11.7. If your GPU does not, simply edit the setup script to use the right CUDA version. For example: `pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113` instead.
### Setup
#### Windows
Download Python and Git and run their installers.
After installing Python, open the Start Menu and search for `Command Prompt`. Type `cd `, then drag and drop the folder you want to work in (experienced users can just `cd <path>` directly), then hit Enter.
Paste `git clone https://git.ecker.tech/mrq/ai-voice-cloning` to download TorToiSe and additional scripts, then hit Enter.
Afterwards, run the setup script, depending on your GPU, to automatically set things up.
* AMD: `setup-directml.bat`
* NVIDIA: `setup-cuda.bat`
If you've done everything right, you shouldn't have any errors.
##### Note on DirectML Support
PyTorch-DirectML is very, very experimental and is still not production quality. There are some headaches from the hairy, kludgy patches it needs.
These patches rely on transferring tensors between the GPU and CPU as a hotfix, so performance is definitely harmed.
Both the conditional latent computation and the vocoder pass have to be done on the CPU entirely because of some quirks with DirectML.
On my 6800XT, VRAM usage climbs to almost the entire 16GiB, so be wary if you somehow OOM. The Low VRAM flags may not have any additional impact anyway, given the constant copying.
For AMD users, I still might suggest using Linux+ROCm, as it's (relatively) headache free, although I did have stability problems with it.
Training is currently very, very improbable, due to how integrated it seems to be with CUDA. If you're fiending to train on your AMD card, please use Linux+ROCm, but I have not tested this myself.
#### Linux
First, make sure you have both `python3.x` and `git` installed, as well as the required compute platform according to your GPU (ROCm or CUDA).
Simply run the following block:
```
git clone https://git.ecker.tech/mrq/ai-voice-cloning
cd ai-voice-cloning
chmod +x *.sh
```
Then, depending on your GPU:
* AMD: `./setup-rocm.sh`
* NVIDIA: `./setup-cuda.sh`
And you should be done!
#### Note for AMD users
Due to the nature of ROCm, some little problems may occur.
Additionally, training on AMD cards cannot leverage BitsAndBytes optimizations, as those are tied to CUDA runtimes.
### Updating
To check for updates, simply run `update.bat` (or `update.sh`). It should pull from the repo, as well as fetch any new dependencies.
### Migrating from [mrq/tortoise-tts](https://git.ecker.tech/mrq/tortoise-tts)
If you're migrating from [mrq/tortoise-tts](https://git.ecker.tech/mrq/tortoise-tts), you can simply clone this repo, then move the following folders (a convenience sketch follows this list):
* `./config/`
* `./results/`
* `./voices/`
* `./models/`
then run the setup script.
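If you'd like to script the move, here's a hedged convenience sketch; it copies rather than moves (safer), and the old repo's location is an assumption.
```
# Hedged sketch: copy the listed folders from an old mrq/tortoise-tts checkout
# into this repo. Adjust OLD to wherever your old clone actually lives.
import shutil
from pathlib import Path

OLD = Path("../tortoise-tts")   # assumed location of the old mrq/tortoise-tts clone
NEW = Path(".")                 # this repo's root

for folder in ("config", "results", "voices", "models"):
    src, dst = OLD / folder, NEW / folder
    if src.exists():
        shutil.copytree(src, dst, dirs_exist_ok=True)
        print(f"copied {src} -> {dst}")
```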

@ -0,0 +1,106 @@
## Pitfalls You May Encounter
I'll try and make a list of "common" (or what I feel may be common that I experience) issues with getting TorToiSe set up:
### `No hardware acceleration is available, falling back to CPU...`
Be sure you have used the right script for setup:
* NVIDIA: `setup-cuda.bat` / `setup-cuda.sh`
* AMD: `setup-directml.bat` / `setup-rocm.sh`
On NVIDIA, it seems some users require additional drivers for CUDA capabilities to be exposed. I'm not too sure about this myself, as I have the bare minimum drivers for my 2060, and might have gotten some CUDA runtimes through Nsight.
On Linux + AMD, you also need to ensure you have the ROCm-capable drivers/runtime installed. Please consult your distro's literature on how to install ROCm-capable drivers/runtime.
On Windows + AMD, I'm not too sure how this would be thrown, as DirectML does some DX12 wizardry for compute.
### `failed reading zip archive: failed finding central directory`
You had a file fail to download completely during the model downloading initialization phase.
Please open either `.\models\tortoise\` or `.\models\transformers\`, and delete the offending file.
You can deduce what that file is by reading the stack trace. A few lines above the last line will be one trying to read a model path.
### Voicefixer is taking forever to download
Lately, it seems it just takes way too long to download Voicefixer's models. Just be patient.
### `torch.cuda.OutOfMemoryError: CUDA out of memory.`
#### Generation
You most likely have a GPU with low VRAM (~4GiB), and the small optimizations from keeping data on the GPU are enough to OOM. Please check the `Low VRAM` option under the `Settings` tab.
If you do have a beefy GPU:
* if you have very large voice input files, increase the `Voice Chunks` slider, as the scripts will try to compute a voice's latents in pieces, rather than in one chunk.
* if you're trying to generate a long sentence, please break your sentences into pieces, and set the `Line Delimiter` to `\n`.
* if you're simply trying to generate something small, please reduce your `Sample Batch Size` under the `Settings` tab.
* if you're getting this during a `voicefixer` pass while CUDA is enabled for it, please try disabling `Use CUDA for Voice Fixer` under the `Settings` tab, as voicefixer loads its own model into VRAM.
* if you're trying to create an LJSpeech dataset under `Train` > `Prepare Dataset`, please use a smaller Whisper model size under `Settings`.
#### Training
On Pascal-and-before cards, training is pretty much an impossible feat, as consumer cards lack the VRAM necessary to train, or the dedicated silicon to leverage optimizations like BitsAndBytes.
If you have a Turing (or beyond) card, you may have too large of a batch size, or a mega batch factor. Please try and reduce it before trying again, and ensure TorToiSe is NOT loaded by using the `Do Not Load TTS On Startup` option and restarting the web UI.
If you're in dire need to train, please try to train on a Colab notebook.
### `WavFileWarning: Chunk (non-data) not understood, skipping it.`
This is a rather innocuous error. I don't think generation quality is impacted at all, but if you insist on making it go away, remux your WAVs with something like `ffmpeg`.
### `AttributeError: module 'ffmpeg' has no attribute 'input'`
The Python package `ffmpeg-python` is rather finicky when installed through openai/whisper. In a command prompt, with the current working directory set to the repo, run:
* Windows:
```
call .\venv\Scripts\activate.bat
python -m pip uninstall ffmpeg ffmpeg-python
python -m pip install ffmpeg-python
```
* Linux:
```
source ./venv/bin/activate
pip uninstall ffmpeg ffmpeg-python
pip install ffmpeg-python
```
### `AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'`
You more than likely need to reinstall PyTorch. In a command prompt, with the current working directory set to the repo, run:
* Windows:
```
call .\venv\Scripts\activate.bat
python -m pip uninstall torch torchvision torchaudio
python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
```
* Linux:
```
source ./venv/bin/activate
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
```
### `[WinError 5] Access is denied:`
This occurs when you:
* finetune from zero
* stop a training process
* start finetuning again from zero
* DLAS will attempt to backup and move the old folder, but because of a zombie Python process still having "ownership" of the folder, the folder cannot be removed
Open a command prompt and type `tskill python` to kill all Python processes. Relaunch the web UI, and try to train again.
## Reporting Other Errors
I do not have all possible errors documented, so if you encounter one that you can't resolve, please open an Issue with adequate information, including:
* python version
* GPU
* stack trace (or full console output preferred), wrapped in \`\`\`\[trace\]\`\`\`
* summary of what you were doing
and I'll try my best to remedy it, even if it's something small like not reading the documentation.
***Please, please, please*** provide either a full stack trace of the error (if running the web UI) or the command prompt output (if running a script). I will not know what's wrong if you only provide the error message itself, as errors are heavily predicated on the full state they happened in. Without it, I cannot help you, as I would only be able to make assumptions.

@ -0,0 +1,29 @@
## Settings
This tab (should) hold a bunch of other settings, from tunables that shouldn't be tampered with, to settings pertaining to the web UI itself.
Below are settings that override the default launch arguments. Some of these require restarting to work.
* `Listen`: sets the hostname, port, and/or path for the web UI to listen on.
- For example, `0.0.0.0:80` will have the web UI accept all connections on port 80
- For example, `10.0.0.1:8008/gradio` will have the web UI only accept connections through `10.0.0.1`, at the path `/gradio`
* `Public Share Gradio`: Tells Gradio to generate a public URL for the web UI. Ignored if specifying a path through the `Listen` setting.
* `Check for Updates`: checks for updates on page load and notifies in console. Only works if you pulled this repo from a gitea instance.
* `Only Load Models Locally`: enforces offline mode for loading models. This is the equivalent of setting the env var: `TRANSFORMERS_OFFLINE`
* `Low VRAM`: disables optimizations in TorToiSe that increase VRAM consumption. Suggested if your GPU has under 6GiB of VRAM.
* `Embed Output Metadata`: enables embedding the settings and latents used to generate that audio clip inside that audio clip. Metadata is stored as a JSON string in the `lyrics` tag.
* `Slimmer Computed Latents`: falls back to the original, 12.9KiB way of storing latents (without the extra bits required for using the CVVP model).
* `Voice Fixer`: runs each generated audio clip through `voicefixer`, if available and installed.
* `Use CUDA for Voice Fixer`: allows voicefixer to use CUDA. Speeds up cleaning the output, but at the cost of more VRAM consumed. Disable if you OOM.
* `Do Not Load TTS On Startup`: skips loading TorToiSe on initialization, but will get loaded when anything that requires it needs it. This is useful if you're doing non-TTS functions that require VRAM, but you'll OOM while doing it when the model is loaded (for example, training).
* `Device Override`: overrides the device name passed to PyTorch for hardware acceleration. You can use the accompanying `list_devices.py` script to map valid strings to GPU names (a small sketch of the idea follows this list). You can also pass `cpu` if you want to fall back to software mode.
* `Sample Batch Size`: sets the batch size when generating autoregressive samples. Bigger batches result in faster compute, at the cost of increased VRAM consumption. Leave to 0 to calculate a "best" fit.
* `Gradio Concurrency Count`: how many Gradio events the queue can process at once. Leave this over 1 if you want to modify settings in the UI that updates other settings while generating audio clips.
* `Output Sample Rate`: the sample rate to save the generated audio as. It provides a slight bump in quality.
* `Output Volume`: adjusts the volume through amplitude scaling.
* `Autoregressive Model`: the autoregressive model to use for generating audio output. This will look for models under `./models/finetunes/` and `./training/{voice}-finetune/models/`.
* `Whisper Model`: the specific model to use for Whisper transcription, when preparing a dataset to finetune with.
* `Use Whisper.cpp`: leverages [lightmare/whispercpp.py](https://git.ecker.tech/lightmare/whispercpp.py) for transcription and trimming. **!**NOTE**!** this is highly experimental, and I haven't actually tested this myself. There's some caveats.
* `Refresh Model List`: updates the above dropdown with models
* `Check for Updates`: manually checks for an update for this repo.
* `(Re)Load TTS`: either initializes or reinitializes TorToiSe. You should not need to use this unless you change some settings, like Low VRAM.
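For the `Device Override` setting above, here's a small sketch of the idea behind listing valid device strings with PyTorch; the repo's own `list_devices.py` may differ.
```
# Hedged sketch: enumerate CUDA devices and their names for use as an override.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
else:
    print("no CUDA devices found; 'cpu' is the only safe override")
```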

@ -0,0 +1,127 @@
## Training / Finetuning
This tab will contain a collection of sub-tabs pertaining to training.
### Pre-Requisites
Before continuing, please be sure you have an adequate GPU. The bare minimum requirements depend on your GPU's architecture, VRAM capacity, and OS.
* Windows + AMD: unfortunately, I cannot leverage DirectML to provide compatibility for training on Windows + AMD systems. If you're insistent on training, please use a Colab notebook.
* Linux + AMD: (**!**UNVERIFIED**!**)
- a card with at least 16GiB of VRAM (without bitsandbytes)
- (theoretically) a card with at least 6GiB of VRAM, with [broncotc/bitsandbytes-rocm](https://github.com/broncotc/bitsandbytes-rocm)
* NVIDIA:
- Pascal (10-series) and before: a card with at least 16GiB of VRAM.
- Turing (20-series) and beyond: a card with at least 6GiB of VRAM.
Unfortunately, only Turing cards (and beyond) have the necessary dedicated silicon to do integer8 calculations, an optimization leveraged by BitsAndBytes to let even low-end consumer cards train. However, BitsAndBytes' documentation says this restriction only applies to inferencing, and that the real requirement is instead Kepler and beyond. Unfortunately, I have no real way to test this, as it seems users with Kepler/Pascal cards get esoteric CUDA errors when using BitsAndBytes.
If you're on Windows and using an installation of this software from before 2023.02.24, and you want to (and can) use BitsAndBytes, please consult https://git.ecker.tech/mrq/ai-voice-cloning/issues/25 for a simple guide to copying the right files.
If you're on Windows using an installation after 2023.02.24, the setup should have already taken care of copying the necessary files to use BitsAndBytes.
To check if it works, you should see a message saying it is using BitsAndBytes optimizations on training startup.
### Capabilities
Training/finetuning a model offers a lot of improvements over using the base model. This can range from:
* better matching to a given voice's traits
- for example, getting a better David Hayter Solid Snake
* capturing an accent to generate voice samples from
- personally untested, but has been done
* teaching it an entire new language, like Japanese
- personally tested on a dataset size of 14920 audio clips from a gacha I haven't played in ages, Japanese is replicated pretty decently
If any of the above is of interest, then you're on the right track.
## Prepare Dataset
This section will aid in preparing the dataset for fine-tuning.
Dataset sizes can range from a few sentences to a large collection of lines. However, do note that smaller datasets require more epochs to finetune against, as there are fewer iterations invested per epoch.
Simply put your voice sources in their own folder under `./voices/` (as you normally would when using a voice for generation), specify the language to transcribe to (default: English), then click Prepare.
This utility will leverage [openai/whisper](https://github.com/openai/whisper/) to transcribe the audio. Then, it'll slice the audio into pieces that the transcription found fit. Afterwards, it'll output this transcript as an LJSpeech-formatted text file: `train.txt`.
As whisper uses `ffmpeg` to handle its audio processing, you must have a copy of `ffmpeg` exposed and accessible through your PATH environment variable. On Linux, this is simply having it installed through your package manager. On Windows, you can just download a copy of `ffmpeg.exe` and drop it into the `./bin/` folder.
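For reference, here's a hedged sketch of the core idea: transcribe each clip with openai/whisper and write `filename|transcription` lines. The actual web UI also slices the audio and may use a different file layout, so treat this as illustrative only.
```
# Hedged sketch of dataset preparation: whisper transcription -> LJSpeech-style lines.
from pathlib import Path
import whisper  # pip install openai-whisper

model = whisper.load_model("base")        # smaller Whisper models use less VRAM
voice_dir = Path("./voices/my-voice")     # hypothetical voice folder
lines = []
for clip in sorted(voice_dir.glob("*.wav")):
    result = model.transcribe(str(clip), language="en")
    lines.append(f"{clip.name}|{result['text'].strip()}")

out = Path("./training/my-voice/train.txt")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text("\n".join(lines), encoding="utf-8")
```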
Transcription is not perfect, however. Be sure to manually quality-check the outputted transcription and edit any errors it might have. For Japanese, it's expected that words which would be spoken in katakana get coerced into kanji. In addition, when generating with a finetuned model trained on Japanese:
* some kanji might get coerced into the wrong pronunciation.
* small kana like the `っ` of `あたしって` get coerced into the normal-sized kana.
* some punctuation like `、` may prematurely terminate a sentence.
**!**NOTE**!**: you might get some funky errors; consult this [issue](Issues#user-content-attributeerror-module-ffmpeg-has-no-attribute-input) if you do.
## Generate Configuration
This will generate the YAML necessary to feed into training. Here, you can set some parameters on how training will be done:
* `Epochs`: how many times you want training to loop through your data. This *should* be dependent on your dataset size, as I've had decent results with 500 epochs for a dataset size of about 60.
* `Learning Rate`: the rate that determines how fast a model will "learn". Higher values train faster, but at the risk of frying the model, overfitting, or other problems. The default is "sane" enough for safety, especially in the scope of retraining, but it definitely needs some adjustments. If you want faster training, bump this up to `0.0001` (1e-4), but be wary that you may fry your finetune without tighter scheduling.
* `Text_CE LR Weight`: an experimental setting to govern how much weight to factor in with the provided learning rate. This is ***a highly experimental tunable***, and is only exposed so I don't need to edit it myself when testing it. ***Leave this to the default 0.01 unless you know what you are doing.***
* `Learning Rate Schedule`: a list of epochs on when to decay the learning rate (a sketch of what such a schedule means in practice follows at the end of this section). You really should leave this as the default.
* `Batch Size`: how large of a batch size for training. Larger batch sizes will result in faster training steps, but at the cost of increased VRAM consumption. This value must not exceed the size of your dataset, and your dataset size *should* be evenly divisible by it.
* `Mega Batch Factor`: According to the documentation, `DLAS also supports "mega batching", where multiple forward passes contribute to a single backward pass`. If you can spare the VRAM, I suppose you can bump this to 8. If you're pressed for VRAM, you can lower this down to 1. If you have really small batch sizes, use what the validator gives out.
* `Print Frequency`: how often the trainer should print its training statistics in epochs. Printing takes a little bit of time, but it's a nice way to gauge how a finetune is baking, as it lists your losses and other statistics. This is purely for debugging and babysitting if a model is being trained adequately. The web UI *should* parse the information from stdout and grab the total loss and report it back.
* `Save Frequency`: how often to save a copy of the model during training in epochs. It seems the training will save a normal copy, an `ema` version of the model, *AND* a backup archive containing both to resume from. If you're training on a Colab with your Drive mounted, these can easily rack up and eat your allotted space. You *can* delete older copies from training, but it's wise not to in case you want to resume from an older state.
* `Resume State Path`: the last training state saved to resume from. The general path structure is what the placeholder value is. This will resume from whatever iterations it was last at, and iterate from there until the target step count (for example, resuming from iteration 2500, while requesting 5000 iterations, will iterate 2500 more times).
* `Half-Precision`: setting this will convert the base model to float16 and train at half precision. This *might* be faster, but quality during generation *might* be hindered. I've trained against a small dataset (size 17) of Solid Snake for 3000 epochs, and it *works*, but you *must* enable Half-Precision for generation when using half-precision models. On CUDA systems, this is irrelevant, as everything is secretly trained using integer8 with bitsandbytes' optimizations.
* `BitsAndBytes`: specifies if you want to train with BitsAndBytes optimizations enabled. Enabling this makes the above setting redundant. You ***should*** really leave this enabled unless you absolutely are sure of what you're doing, as this is crucial to reduce VRAM usage.
* `Source Model`: the source model to finetune against. With it, you can re-finetune already finetuned models (for example, taking a Japanese finetune that can speak Japanese well, but you want to refine it for a specific voice). You *should* leave this as the default autoregressive model unless you are sure of what you're doing.
* `Dataset`: a dataset generated from the `Prepare Dataset` tab.
and, some buttons:
* `Refresh Dataset List`: updates the dataset list, required when new datasets are added
* `Import Existing Dataset Settings`: pulls the settings used for a dataset. This will check for an existing training output first, before it checks the actual dataset in the `./training/` folder. This is primarily a shortcut for me when I'm testing settings.
* `Validate Training Configuration`: does some sanity checks to make sure that training won't throw an error, and offers suggested settings. You're free to change them after the fact, as validation is not done on save.
* `Save Training Configuration`: writes the settings to the training YAML, for loading with the training script.
After filling in the values, click `Save Training Configuration`, and it should print a message when it's done.
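For the `Learning Rate Schedule` mentioned above, here's a small sketch of what a milestone-based decay looks like in practice; the milestone epochs and the 0.5 decay factor are assumptions for illustration, not the defaults the UI writes.
```
# Hedged sketch of a milestone (multi-step) learning rate decay.
base_lr = 1e-5
milestones = [200, 300, 400]   # hypothetical epochs at which to decay
gamma = 0.5                    # assumed decay factor per milestone

def lr_at(epoch: int) -> float:
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** passed)

for epoch in (100, 250, 450):
    print(epoch, lr_at(epoch))   # 1e-05, 5e-06, 1.25e-06
```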
### Resuming Training
You can easily resume from a previous training state within the web UI as well.
* select the `Dataset` you want to resume from
* click `Import Dataset`
* it'll pull up the last used settings and grab the last saved state to resume from
- feel free to adjust any other settings, like increasing the epoch count
- **!**NOTE**!**: sometimes-but-not-all-the-time, the numbers might be a bit mismatched, due to some rounding errors when converting back from iterations as a unit to epochs as a unit
* click `Save Training Setting`
- you're free to revalidate your settings, but it shouldn't be necessary if you changed nothing
And you should be good to resume your training.
I've done this plenty of times and haven't had anything nuked or erased. As a safety precaution, DLAS will always move the existing folder as a backup if it's starting from a new training and not resuming. If it resumes, it won't do that, and nothing should be overwritten.
In the future, I'll adjust the "resume state" to provide a dropdown instead when selecting a dataset, rather than requiring to import and deduce the most recent state, to make things easier.
### Changing Base Model
Currently, in the web UI, there's no way to specify picking a different model (such as using a finetune to train from). You must manually edit `train.yaml` and specify the path to the model you want to finetune at line 117.
I have not tested if this is feasible, but I have tested that you can finetune from a model you have already finetuned from. For example, if you were to train a large dataset for a different language (Japanese), but you also want to finetune for a specific voice, you can re-finetune the Japanese model.
## Run Training
After preparing your dataset and configuration file, you are ready to train. Simply select a generated configuration file, click train, then keep an eye on either the console window to the right for output, or console output in your terminal/command prompt.
If you check `Verbose Console Output`, *all* output from the training process gets forwarded to the console window on the right. This output is buffered, up to the `Console Buffer Size` specified (for example, the last eight lines if 8).
If you bump up the `Keep X Previous States` above 0, it will keep the last X number of saved models and training states, and clean up the rest on training start, and every save. **!**NOTE**!** I did not extensively test this, only on test data, and it did not nuke my saves. I don't expect it to happen, but be wary.
If everything is done right, you'll see a progress bar and some helpful metrics. Below that, is a graph of the total GPT loss rate.
After every `print rate` iterations, the loss rate will update and get reported back to you. This will update the graph below with the current loss rate. This is useful to see how "ready" your model/finetune is. The general rule of thumb is the lower, the better. I used to swear by values around `0.15` and `0.1`, but I've had nicer results when it's lower. But be wary, as this *may* be grounds for overfitment, as is the usual problem with training/finetuning.
If something goes wrong, please consult the output; more than likely, you've run out of memory.
After you're done, the process will close itself, and you are now free to use the newly baked model.
You can then head on over to the `Settings` tab, reload the model listings, and select your newly trained model in the `Autoregressive Model` dropdown.
### Training Output
Due to the nature of the interfacing with training, some discrepancies may occur:
* the UI bases its units in epochs, and converts to the unit the training script bases itself in: iterations. Some slight rounding errors may occur. For example, at the last epoch, it might save one iteration short of the requested iteration count.
* the training script calculates what an epoch is slightly differently than the UI does. This might be due to how it determines which lines in the dataset get culled when the dataset size isn't evenly divisible by the batch size. For example, it might think a given amount of iterations will fill 99 epochs instead of 100.
* because I have to reparse the training output, some statistics may seem a little inconsistent. For example, the ETA is extrapolated by the last delta between iterations. I could do better ways for this (like by delta time between epochs, averaging delta time between iterations and extrapolating from it, etc.).
* for long, long generations on a publicly-facing Gradio instance (using `share=True`), the UI may disconnect from the program. This can be remedied with the `Reconnect` button, but the UI will appear to update only every other iteration. This is because it's still trying to "update" the initial connection: it grabs a line of output from stdio and alternates it between the two sessions.

@ -0,0 +1,9 @@
## Utilities
In this tab, you can find some helper utilities that might be of assistance.
For now, an analog to the PNG info found in Voldy's Stable Diffusion Web UI resides here. With it, you can upload an audio file generated with this web UI to view the settings used to generate that output. Additionally, the voice latents used to generate the uploaded audio clip can be extracted.
If you want to reuse its generation settings, simply click `Copy Settings`.
To import a voice, click `Import Voice`. Remember to click `Refresh Voice List` in the `Generate` panel afterwards, if it's a new voice.

@ -0,0 +1,15 @@
## Using the Software
Now you're ready to generate clips. With the command prompt still open, simply enter `start.bat` (or `start.sh`), and wait for it to print out a URL to open in your browser, something like `http://127.0.0.1:7860`.
If you're looking to access your copy of TorToiSe from outside your local network, tick the `Public Share Gradio` button in the `Settings` tab, then restart.
Before actually using the software, please consult the [Collecting Samples](Collecting-Samples) page, as use of this software is under the heavy assumption you're using it to clone a voice, rather than just synthesize one. You can still use it with the `random` voice feature, but that's not its purpose.
TorToiSe, the underlying TTS software, is a zero-shot speech synthesizer: no further training is required, and some voices can get pretty decent output as-is with the default model.
However, some voices (or languages) require some fine-tuning of the base model to get better, stronger output. If you're looking to get better output, consider finetuning with the [Training](Training) tab.
If you're not sure, it does not hurt to play around with the default models, and see what works.
For more information for a given tab, consult the Sidebar.

@ -0,0 +1,10 @@
* [Home](Home)
* [Installation](Installation)
* [Tips on Collecting Samples](Collecting-Samples)
* [Using the Web UI](Web-UI)
- [Generating](Generate)
- [History](History)
- [Utilities](Utilities)
- [Training/Finetuning](Training)
- [Settings](Settings)
* [Issues](Issues)