Update 'Training'

master
mrq 2023-03-22 22:14:47 +07:00
parent b942cdd6de
commit 19624fa58c
1 changed file with 5 additions and 0 deletions

@@ -56,11 +56,16 @@ This section will cover how to prepare a dataset for training.
- `openai/whisper`: the default, GPU-backed implementation (a usage sketch follows this list).
- `lightmare/whispercpp`: an additional implementation; leverages WhisperCPP through Python bindings, with lighter model sizes and CPU-backed inference.
+ **!**NOTE**!**: whispercpp is practically Linux only, as it requires a compiling environment that won't kick you in the balls like MSVC would on Windows.
- `m-bain/whisperx`: leverages VAD and phoneme-level ASR for better transcription.
+ **!**NOTE**!**: the improved transcription requires an HF (Hugging Face) token. If you do not provide one within `./config/exec.json`, you're better off just using another backend.
+ **!**NOTE**!**: this is not installed automatically, as it's a dependency parasite. You're on your own for installing this yourself.
* `Whisper Model`: the whisper model to transcribe against. Larger models boast more accuracy, at the cost of longer processing time and higher VRAM consumption.
- **!**NOTE**!**: the large model allegedly has problems with timestamps, more so than the medium one.
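
For reference, below is a minimal sketch of what the default `openai/whisper` backend boils down to. The voice path and model size are placeholders, not what the web UI actually passes; the UI drives this for you with whatever you select above.

```python
# Minimal transcription sketch with openai/whisper; the path and model size
# below are placeholders for illustration only.
import whisper

model = whisper.load_model("medium")  # larger models: more accuracy, more time/VRAM
result = model.transcribe("./voices/myvoice/sample_01.wav")

print(result["text"])                 # the full transcription
for seg in result["segments"]:        # per-segment timestamps used for slicing
    print(f"{seg['start']:.2f}-{seg['end']:.2f}: {seg['text']}")
```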
This tab leverages any voice you have under the `./voices/` folder, transcribing your voice samples with [openai/whisper](https://github.com/openai/whisper) to prepare an LJSpeech-formatted dataset to train against.
After transcription, very short segments are collapsed into the previous segment to avoid harsh cuts; a sketch of this merging is shown below.
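
A rough sketch of that merging, assuming whisper-style segments of `{"start", "end", "text"}`; the half-second threshold is an illustrative guess, not the actual value used:

```python
# Fold segments shorter than min_len into the previous segment to avoid
# harsh cuts. The 0.5s threshold is an assumption for illustration.
def collapse_short_segments(segments, min_len=0.5):
    merged = []
    for seg in segments:
        if merged and (seg["end"] - seg["start"]) < min_len:
            merged[-1]["end"] = seg["end"]     # extend the previous segment
            merged[-1]["text"] += seg["text"]  # and append the short segment's text
        else:
            merged.append(dict(seg))
    return merged
```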
Dedicating a small portion of your dataset for validation purposes isn't required, but it's recommended, as it helps cull lines that are too short to be useful for training. Using a validation dataset also helps measure how well the finetune synthesizes speech from inputs it has not trained against.
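
For illustration, here's a sketch of carving a validation split out of an LJSpeech-style `path|transcription` file; the output file names and the character cutoff are assumptions, not what the web UI actually uses:

```python
# Split an LJSpeech-style metadata file into train/validation sets, routing
# lines whose text is too short to train against into validation instead.
# The 12-character cutoff and output names are illustrative assumptions.
def split_dataset(metadata_path, min_chars=12):
    train, validation = [], []
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            text = line.rstrip("\n").split("|")[1]
            (validation if len(text) < min_chars else train).append(line)
    with open("train.txt", "w", encoding="utf-8") as f:
        f.writelines(train)
    with open("validation.txt", "w", encoding="utf-8") as f:
        f.writelines(validation)
```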
If you're transcribing English speech that's already stored as separate sound files (for example, one sentence per file), there's little concern with using a larger whisper model, as even the smaller models already transcribe English quite decently.