Update 'Training'

master
mrq 2023-03-16 21:09:29 +07:00
parent f1a6d6e78d
commit 4b76470ae1
1 changed file with 23 additions and 3 deletions

@@ -40,7 +40,7 @@ This section will cover how to prepare a dataset for training.
* `Dataset Source`: a valid folder under `./voices/`, the same folder you would select when generating.
* `Language`: language code to transcribe to (leave blank to auto-deduce):
- beware, as specifying the wrong language ***will*** let whisper translate it, which is ultimately pointless if you're trying to train against.
* `Validation Text Length Threshold`: transcription text lengths that are below this value are culled and placed in the validation dataset. Set 0 to ignore.
* `Validation Audio Length Threshold`: audio lengths that are below this value are culled and placed in the validation dataset. Set 0 to ignore.
* `Skip Already Transcribed`: skip transcribing a file if it's already processed and exists in the `whisper.json` file. Perfect if you're adding new files and want already-transcribed ones skipped, while still leaving you the option to re-transcribe files.
@@ -56,7 +56,7 @@ This section will cover how to prepare a dataset for training.
- `openai/whisper`: the default, GPU-backed implementation.
- `lightmare/whispercpp`: an additional implementation that leverages WhisperCPP through Python bindings, offers lighter model sizes, and is CPU-backed.
**!**NOTE**!**: whispercpp is practically Linux only, as it requires a compiling environment that won't kick you in the balls like MSVC would on Windows.
* `Whisper Model`: whisper model to transcribe against. Larger models boast more accuracy, at the cost of longer processing time, and VRAM consumption.
- **!**NOTE**!**: the large model allegedly has problems with timestamps, more so than the medium one.
This tab leverages any voice you have under the `./voices/` folder, transcribing your voice samples with [openai/whisper](https://github.com/openai/whisper) to prepare an LJSpeech-formatted dataset to train against.
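Under the hood, this boils down to something like the following sketch (assuming `openai/whisper` is installed; the voice folder, output filenames, and threshold value here are placeholders for illustration, not the tool's actual defaults):

```python
# Rough sketch of what the transcription step automates: run whisper over each
# sample, emit LJSpeech-style "filename|text" lines, and cull short transcriptions
# into a validation list. Paths and the threshold below are illustrative only.
from pathlib import Path

import whisper  # pip install openai-whisper

model = whisper.load_model("base")   # larger models are more accurate, but slower
text_len_threshold = 12              # stand-in for "Validation Text Length Threshold"

train_lines, validation_lines = [], []
for wav in sorted(Path("./voices/my_voice/").glob("*.wav")):
    result = model.transcribe(str(wav), language="en")
    text = result["text"].strip()
    line = f"{wav.name}|{text}"
    # short transcriptions get culled into the validation dataset
    (validation_lines if len(text) < text_len_threshold else train_lines).append(line)

Path("train.txt").write_text("\n".join(train_lines), encoding="utf-8")
Path("validation.txt").write_text("\n".join(validation_lines), encoding="utf-8")
```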
@@ -83,6 +83,14 @@ A lot of it should be fairly hand-held, but the biggest point is to double check
* **!**NOTE**!**: be very careful about naively trusting how well the audio was segmented; be sure to manually curate the segments.
### Phonemizer
**!**NOTE**!**: use of [`phonemizer`](https://github.com/bootphon/phonemizer) requires `espeak-ng` installed, or an equivalent backend. Any errors thrown from it are an issue with `phonemizer` itself.
As a shortcut, if you've set your `Tokenizer JSON Path` to the provided `ipa.json`, the text will be output as IPA phonemes. This leverages `phonemizer` to convert the transcription text into phonemes (I tried an audio-based approach, and it didn't give favorable results).
With this, you can leverage better speech synthesis by training on the actual phonemes, rather than tokens loosely representing phonemes.
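For a quick idea of what that conversion looks like, here's a minimal, standalone use of `phonemizer` with the `espeak` backend (requires `espeak-ng` installed); the sample sentence and option choices are arbitrary:

```python
# Minimal phonemizer usage: convert transcription text into IPA, the same kind of
# conversion applied when the ipa.json tokenizer is selected. Requires espeak-ng.
from phonemizer import phonemize

text = "The quick brown fox jumps over the lazy dog."
ipa = phonemize(
    text,
    language="en-us",            # match the language of your dataset
    backend="espeak",
    strip=True,                  # drop trailing separators
    preserve_punctuation=True,
    with_stress=True,            # keep stress marks in the IPA output
)
print(ipa)
```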
## Generate Configuration
This will generate the YAML necessary to feed into training. For documentation's sake, below are details for what each parameter does:
@@ -117,6 +125,18 @@ and, some buttons:
After filling in the values, click `Save Training Configuration`, and it should print a message when it's done.
This will also let you use the selected `Tokenizer JSON` under the `Settings` tab, if you wish to replace it.
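If you'd rather inspect or tweak the generated YAML outside the web UI, it's plain YAML; below is a rough sketch using `pyyaml`, where the file path and the overridden key are hypothetical stand-ins, so check the file the UI actually wrote for the real names:

```python
# Sketch of inspecting/tweaking the generated training YAML outside the UI.
# "./training/my_voice/train.yaml" and the overridden key are hypothetical;
# look at the file the UI actually generated for the real path and parameter names.
import yaml  # pip install pyyaml

path = "./training/my_voice/train.yaml"
with open(path, "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(list(config.keys()))    # see which parameters were generated

config["batch_size"] = 64     # hypothetical override, purely illustrative
with open(path, "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```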
### Tokenizer Vocab.
**!**NOTE**!**: training on a different tokenizer vocab. is highly experimental.
The provided tokenizer vocab. is tailored for English, and the base AR model has been heavily trained against it. If you're having pronunciation problems in a given language, you can create a new tokenizer JSON.
Keep in mind, you should replace any tokens *after* `z` (index 39) with whatever additional phonemes you want. You'll also want to provide a good list of merged text like `th`, as well as define each token merge in the `merges` section.
However, you'll need to train with a high text LR ratio, as you're effectively re-defining what a text token means.
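As a rough sketch of what that edit looks like, assuming the tokenizer JSON follows the Hugging Face `tokenizers` BPE layout (a `model.vocab` map plus a `model.merges` list) like the provided files do, and with the replaced phoneme and merge below being examples only:

```python
# Sketch of customizing the tokenizer vocab: repurpose token ids after `z`
# (index 39) for extra phonemes and register the matching merges.
# Assumes the JSON follows the Hugging Face tokenizers BPE layout
# (model.vocab / model.merges); the phoneme and merge below are examples only.
import json

with open("ipa.json", "r", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]
merges = tok["model"]["merges"]

# replace whatever currently occupies index 40 with a new phoneme
old = next(k for k, v in vocab.items() if v == 40)
del vocab[old]
vocab["ç"] = 40

# define the "t" + "h" -> "th" merge; the merged token needs its own vocab entry
merges.append("t h")
old = next(k for k, v in vocab.items() if v == 41)
del vocab[old]
vocab["th"] = 41

with open("ipa_custom.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```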
### Suggested Settings
The following settings are robust enough that I can suggest them, for small or large datasets.
@@ -199,7 +219,7 @@ Typically, a "good model" has the text-loss higher than the mel-loss, and the
The autoregressive model predicts tokens as a `<speech conditioning>:<text tokens>:<MEL tokens>` string (see the sketch after this list), where:
* speech conditioning is a vector representing a voice's latents
- I still need to look into specifically how a voice's latents are computed, but I imagine it's by inferencing given a set of mel tokens.
* text tokens (I believe) represent virtual-phonemes, and in turn, a sequence of virtual-phonemes represents language.
- this governs the language side of the model
- later, these tokens are compared against the CLVP to pick the most likely samples given a sequence of text tokens.
* mel tokens represent the speech (how phonemes sound)
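To make that layout concrete, here's a purely illustrative sketch of the `<speech conditioning>:<text tokens>:<MEL tokens>` ordering; the dimensions, ids, and embedding sizes are made up and are not the model's real values:

```python
# Purely illustrative layout of the autoregressive input, not the actual
# implementation: a conditioning latent, then text tokens, then mel tokens.
# All ids, dimensions, and vocab sizes below are made up.
import torch

d_model = 1024
conditioning_latent = torch.randn(1, 1, d_model)       # stand-in for the voice's latents

text_tokens = torch.tensor([[5, 17, 23, 9]])            # stand-in (virtual-)phoneme ids
mel_tokens = torch.tensor([[301, 88, 412, 97, 560]])    # stand-in MEL codebook ids

text_emb = torch.nn.Embedding(256, d_model)(text_tokens)   # text token embeddings
mel_emb = torch.nn.Embedding(8192, d_model)(mel_tokens)    # mel token embeddings

# the sequence the AR transformer sees: <conditioning> : <text> : <mel>
sequence = torch.cat([conditioning_latent, text_emb, mel_emb], dim=1)
print(sequence.shape)  # (1, 1 + 4 + 5, 1024)
```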