Update 'Training'

master
mrq 2023-03-16 21:09:29 +07:00
parent f1a6d6e78d
commit 4b76470ae1
1 changed file with 23 additions and 3 deletions

@@ -40,7 +40,7 @@ This section will cover how to prepare a dataset for training.
* `Dataset Source`: a valid folder under `./voices/`, the same folder you would select when generating.
* `Language`: language code to transcribe to (leave blank to auto-deduce):
- beware, as specifying the wrong language ***will*** let whisper translate it, which is ultimately pointless if you're trying to train against.
* `Validation Text Length Threshold`: transcription text lengths that are below this value are culled and placed in the validation dataset. Set 0 to ignore.
* `Validation Audio Length Threshold`: audio lengths that are below this value are culled and placed in the validation dataset. Set 0 to ignore.
* `Skip Already Transcribed`: skip transcribing a file if it's already processed and exists in the `whisper.json` file. Perfect if you're adding new files and want already-transcribed ones skipped, while still leaving you the option to re-transcribe files.
@@ -56,7 +56,7 @@ This section will cover how to prepare a dataset for training.
- `openai/whisper`: the default, GPU-backed implementation.
- `lightmare/whispercpp`: an additional implementation that leverages WhisperCPP through Python bindings, offers lighter model sizes, and is CPU-backed.
**!**NOTE**!**: whispercpp is practically Linux only, as it requires a compiling environment that won't kick you in the balls like MSVC would on Windows.
* `Whisper Model`: whisper model to transcribe against. Larger models boast more accuracy, at the cost of longer processing time, and VRAM consumption.
- **!**NOTE**!**: the large model allegedly has problems with timestamps, more so than the medium one.
This tab leverages any voice you have under the `./voices/` folder, transcribing your voice samples with [openai/whisper](https://github.com/openai/whisper) to prepare an LJSpeech-formatted dataset to train against.
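Under the hood, this boils down to something like the following sketch (assuming `openai/whisper` is installed; the voice folder, output filenames, and threshold value here are placeholders for illustration, not the tool's actual defaults):

```python
# Rough sketch of what the transcription step automates: run whisper over each
# sample, emit LJSpeech-style "filename|text" lines, and cull short transcriptions
# into a validation list. Paths and the threshold below are illustrative only.
from pathlib import Path

import whisper  # pip install openai-whisper

model = whisper.load_model("base")   # larger models are more accurate, but slower
text_len_threshold = 12              # stand-in for "Validation Text Length Threshold"

train_lines, validation_lines = [], []
for wav in sorted(Path("./voices/my_voice/").glob("*.wav")):
    result = model.transcribe(str(wav), language="en")
    text = result["text"].strip()
    line = f"{wav.name}|{text}"
    # short transcriptions get culled into the validation dataset
    (validation_lines if len(text) < text_len_threshold else train_lines).append(line)

Path("train.txt").write_text("\n".join(train_lines), encoding="utf-8")
Path("validation.txt").write_text("\n".join(validation_lines), encoding="utf-8")
```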
@@ -83,6 +83,14 @@ A lot of it should be fairly hand-held, but the biggest point is to double check
* **!**NOTE**!**: be very careful about naively trusting how well the audio was segmented; be sure to manually curate the segments.
### Phonemizer
**!**NOTE**!**: use of [`phonemizer`](https://github.com/bootphon/phonemizer) requires `espeak-ng` installed, or an equivalent backend. Any errors thrown from it are an issue with `phonemizer` itself.
As a shortcut, if you've set your `Tokenizer JSON Path` to the provided `ipa.json`, the text will be output as IPA phonemes. This leverages `phonemizer` to convert the transcription text into phonemes (I tried an audio-based approach, and it didn't give favorable results).
With this, you can leverage better speech synthesis by training on the actual phonemes, rather than tokens loosely representing phonemes.
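For a quick idea of what that conversion looks like, here's a minimal, standalone use of `phonemizer` with the `espeak` backend (requires `espeak-ng` installed); the sample sentence and option choices are arbitrary:

```python
# Minimal phonemizer usage: convert transcription text into IPA, the same kind of
# conversion applied when the ipa.json tokenizer is selected. Requires espeak-ng.
from phonemizer import phonemize

text = "The quick brown fox jumps over the lazy dog."
ipa = phonemize(
    text,
    language="en-us",            # match the language of your dataset
    backend="espeak",
    strip=True,                  # drop trailing separators
    preserve_punctuation=True,
    with_stress=True,            # keep stress marks in the IPA output
)
print(ipa)
```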
## Generate Configuration
This will generate the YAML necessary to feed into training. For documentation's sake, below are details for what each parameter does:
@@ -117,6 +125,18 @@ and, some buttons:
After filling in the values, click `Save Training Configuration`, and it should print a message when it's done.
This will also let you use the selected `Tokenizer JSON` under the `Settings` tab, if you wish to replace it.
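If you'd rather inspect or tweak the generated YAML outside the web UI, it's plain YAML; below is a rough sketch using `pyyaml`, where the file path and the overridden key are hypothetical stand-ins, so check the file the UI actually wrote for the real names:

```python
# Sketch of inspecting/tweaking the generated training YAML outside the UI.
# "./training/my_voice/train.yaml" and the overridden key are hypothetical;
# look at the file the UI actually generated for the real path and parameter names.
import yaml  # pip install pyyaml

path = "./training/my_voice/train.yaml"
with open(path, "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(list(config.keys()))    # see which parameters were generated

config["batch_size"] = 64     # hypothetical override, purely illustrative
with open(path, "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```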
### Tokenizer Vocab.
**!**NOTE**!**: training on a different tokenizer vocab. is highly experimental.
The provided tokenizer vocab. is tailored for English, and the base AR model has been heavily trained against it. If you're having pronunciation problems in a given language, you can create a new tokenizer JSON.
Keep in mind, you should replace any tokens *after* `z` (index 39) with whatever additional phonemes you want. You'll also want to provide a good list of merged text like `th`, as well as define each token merge in the `merges` section.
However, you'll need to train with a high text LR ratio, as you're effectively re-defining what a text token means.
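As a rough sketch of what that edit looks like, assuming the tokenizer JSON follows the Hugging Face `tokenizers` BPE layout (a `model.vocab` map plus a `model.merges` list) like the provided files do, and with the replaced phoneme and merge below being examples only:

```python
# Sketch of customizing the tokenizer vocab: repurpose token ids after `z`
# (index 39) for extra phonemes and register the matching merges.
# Assumes the JSON follows the Hugging Face tokenizers BPE layout
# (model.vocab / model.merges); the phoneme and merge below are examples only.
import json

with open("ipa.json", "r", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]
merges = tok["model"]["merges"]

# replace whatever currently occupies index 40 with a new phoneme
old = next(k for k, v in vocab.items() if v == 40)
del vocab[old]
vocab["ç"] = 40

# define the "t" + "h" -> "th" merge; the merged token needs its own vocab entry
merges.append("t h")
old = next(k for k, v in vocab.items() if v == 41)
del vocab[old]
vocab["th"] = 41

with open("ipa_custom.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```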
### Suggested Settings
The following settings are robust enough that I can suggest them, for small or large datasets.
@@ -199,7 +219,7 @@ Typically, a "good model" has the text-loss higher than the mel-loss, and the
The autoregressive model predicts tokens as a `<speech conditioning>:<text tokens>:<MEL tokens>` string (see the sketch after this list), where:
* speech conditioning is a vector representing a voice's latents
- I still need to look into specifically how a voice's latents are computed, but I imagine it's by inferencing given a set of mel tokens.
* text tokens (I believe) represent virtual-phonemes, and in turn, a sequence of virtual-phonemes represents language.
- this governs the language side of the model
- later, these tokens are compared against the CLVP to pick the most likely samples given a sequence of text tokens.
* mel tokens represent the speech (how phonemes sound)
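To make that layout concrete, here's a purely illustrative sketch of the `<speech conditioning>:<text tokens>:<MEL tokens>` ordering; the dimensions, ids, and embedding sizes are made up and are not the model's real values:

```python
# Purely illustrative layout of the autoregressive input, not the actual
# implementation: a conditioning latent, then text tokens, then mel tokens.
# All ids, dimensions, and vocab sizes below are made up.
import torch

d_model = 1024
conditioning_latent = torch.randn(1, 1, d_model)       # stand-in for the voice's latents

text_tokens = torch.tensor([[5, 17, 23, 9]])            # stand-in (virtual-)phoneme ids
mel_tokens = torch.tensor([[301, 88, 412, 97, 560]])    # stand-in MEL codebook ids

text_emb = torch.nn.Embedding(256, d_model)(text_tokens)   # text token embeddings
mel_emb = torch.nn.Embedding(8192, d_model)(mel_tokens)    # mel token embeddings

# the sequence the AR transformer sees: <conditioning> : <text> : <mel>
sequence = torch.cat([conditioning_latent, text_emb, mel_emb], dim=1)
print(sequence.shape)  # (1, 1 + 4 + 5, 1024)
```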