Update 'Generate'

mrq 2023-03-09 15:11:10 +00:00
parent b3ea59d499
commit 8c51ee6865

@ -16,11 +16,12 @@ You'll be presented with a bunch of options in the default `Generate` tab, but d
* `Prompt`: text you want to be read. You can wrap text in `[brackets]` for "prompt engineering", where it'll affect the output, but those words won't actually be read (see the example after this list).
* `Line Delimiter`: String to split the prompt into pieces. The stitched clip will be stored as `combined.wav`. To split by a new line, enter `\n`.
* `Emotion`: the "emotion" used for the delivery. This is a shortcut to utilizing "prompt engineering" by starting with `[I am really <emotion>,]` in your prompt. This is merely a suggestion, not a guarantee.
* `Custom Emotion + Prompt` (if `Custom` is selected): a non-preset "emotion" used for the delivery. This is a shortcut to utilizing "prompt engineering" by starting with `[<emotion>]` in your prompt.
* `Voice`: the voice you want to clone. You can select `microphone` if you want to use input from your microphone. You can also select `random` to use a randomly generated voice.
* `Microphone Source`: Use your own voice from a line-in source.
* `Voice Chunks`: a slider to determine how many pieces your voice dataset is split into when computing the conditional latents. The lower the number, the bigger the pieces and the more VRAM is needed; the smaller the pieces, the less VRAM is needed, but the more likely you are to slice mid-phoneme. For example, with ~120 seconds of input samples, 4 chunks works out to ~30-second pieces, while 12 chunks works out to ~10-second pieces. Playing around with this will most definitely affect the output of your cloning, as some datasets will work better with different values.
- a "safe" value will be automatically calculated based on the total duration of your voice input samples, to avoid blindly OOMing. You can adjust this factor in `Settings`.
- if you've created an LJSpeech dataset (under `Training` > `Prepare Dataset`), this will automatically be set to 0, hinting for the routine to use the dataset audio and pad the clips to a common size, for slightly more accurate capturing of the latents.
* `Refresh Voice List`: updates the voice list
* `(Re)Compute Voice Latents`: (re)computes the conditional latents for a given voice.
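As an illustration of the prompt fields above, here is what a prompt might look like when combining the `[brackets]` "prompt engineering", the `Emotion` shortcut, and a `\n` line delimiter (the text itself is just a made-up example):

```
[I am really sad,] The first line to be read aloud.
And this second line becomes its own piece.
```

Everything inside `[brackets]` influences the delivery but is not spoken, and each delimited line is generated as a separate clip before being stitched into `combined.wav`.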
@ -33,16 +34,17 @@ Below are generation settings, which affect the technical aspects of how your in
* `Temperature`: how much randomness to introduce to the generated samples. Lower values = better resemblance to the source samples, but some temperature is still required for great output.
- **!**NOTE**!**: This value is very inconsistent and entirely depends on the input voice. In other words, some voices will be receptive to playing with this value, while others won't make much of a difference.
- **!**NOTE**!**: some voices will be very receptive to this, where it speaks slowly at low temperatures, but nudging it up a hair makes it speak too fast.
- **!**NOTE**!**: this appears to "replace" larger autoregressive sample sizes; for example, 16 samples at a temperature of 0.8 yields results similar to something like 64 samples at 0.4 (roughly).
* `Pause Size`: Governs how large pauses are at the end of a clip (in token size, not seconds). Increase this if your output gets cut off at the end.
- **!**NOTE**!**: sometimes this is merely a suggestion and not a guarantee. Some generations will be sensitive to this, while others will not.
- **!**NOTE**!**: too large of a pause size can lead to unexpected behavior.
* `Diffusion Sampler`: the sampler method used during the diffusion pass. Currently, only `P` and `DDIM` are added, but neither seems to offer any substantial difference in my short tests.
`P` refers to the default, vanilla sampling method in `diffusion.py`.
To reiterate, this is ***only*** used for the diffusion decoding path, after the autoregressive outputs are generated.
Below is an explanation of the experimental flags. Messing with these might impact performance, so these are only exposed if you know what you are doing.
* `Half-Precision`: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, due to it making computations slower. With BitsAndBytes optimizations, this seems to offer a little boost in subjective quality. This is required if using a model at half-precision.
* `Conditional Free`: a quality boosting improvement at the cost of some performance. Enabled by default, as I think the penalty is negligible in the end.
* `CVVP Weight`: governs how much the CVVP model should influence candidates. The original documentation mentions this is deprecated as it does not really influence things, but you're still free to play around with it.
Currently, setting this requires regenerating your voice latents, as I forgot to have it return some extra data that weighing against the CVVP model uses. Oops.
Setting this to 1 leads to bad behavior.
@ -50,7 +52,7 @@ Below are an explanation of experimental flags. Messing with these might impact
* `Diffusion Temperature`: the variance of the noise fed into the diffusion model; values at 0 are the "mean" prediction of the diffusion network and will sound bland and smeared.
* `Length Penalty`: a length penalty applied to the autoregressive decoder; higher settings cause the model to produce terser outputs.
* `Repetition Penalty`: a penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc.
* `Conditioning-Free K`: determines the balance between the conditioning-free signal and the conditioning-present signal.
After you fill everything out, click `Run`, and wait for your output in the output window. The sampled voice is also returned, but if you're using multiple files, it'll return the first file, rather than a combined file.
@ -58,6 +60,10 @@ All outputs are saved under `./result/[voice name]/`. On some browsers, you're a
To save you from headaches, I strongly recommend playing around with shorter sentences first to find the right values for the voice you're using before generating longer sentences.
### `random` voice
Due to how the entire generation pass is handled, the random voice will very, very, very much be random, even with the same latents. If you want a persistent random voice, you're free to manually "loopback" the input by treating the generated output as a new voice, as sketched below.
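For instance, a manual loopback could be as simple as copying an output you like into a new voice folder. The file and folder names here are hypothetical, and where voices live depends on your install (e.g. a `./voices/` directory):

```
./result/random/00001.wav  ->  ./voices/my-random-voice/00001.wav
```

After a `Refresh Voice List`, the new voice should show up, and `(Re)Compute Voice Latents` will derive its latents from that fixed output instead of a fresh random roll.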
### Prompt Setting Editing
If you want to procedurally edit any generation settings (for example, switch between voices), you can add to the start of the line a JSON string containing the settings you want to override. For example:
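As a minimal sketch, assuming `voice` is a valid setting key (the actual key names follow the application's settings, so treat these as placeholders):

```
{"voice": "james"} This line is read with the james voice.
{"voice": "sarah"} And this line is read with the sarah voice.
```

The JSON prefix overrides the listed settings for that line, and the rest of the line is treated as the text to read.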