doc update; added automatically deducing the language from a given text; also checks if the input is already phonemized text, to allow direct control over the phonemes (still procrastinating on adding WER/SIM-O)

This commit is contained in:
mrq 2024-12-07 22:34:25 -06:00
parent 5d80a2d0d4
commit a032ff588f
13 changed files with 87 additions and 27 deletions

View File

@ -26,6 +26,7 @@
<th>Text</th>
<th>Prompt</th>
<th>Our VALL-E</th>
<th>F5-TTS</th>
<th>Ground Truth</th>
</tr>
</thead>

View File

@ -34,14 +34,19 @@ However, at this point in time, the implementation is rather divorced from VALL
* [x] clean up the README, and document, document, document.
* [x] extend to multiple languages ([VALL-E X](https://arxiv.org/abs/2303.03926)).
- reference model is trained against English, Japanese, French, and German.
- [ ] improve multi-lingual support
- [ ] improve cross-lingual support
* [ ] extend to additional tasks ([SpeechX](https://arxiv.org/abs/2308.06873)).
- `stt` (Speech-to-Text) seems to be working fine for the most part.
- `stt` (Speech-to-Text) seems to be working fine for the most part, but is very much a second-class feature.
- other tasks seem to require a ton of VRAM......
* [ ] extend using [VALL-E 2](https://arxiv.org/pdf/2406.05370)'s features (grouped code modeling + repetition aware sampling)
- SpeechX tasks might need to be reworked to fit well within the `NAR-len` context to make full use of masking (for example, for speech editing)
- ***possibly*** voice conversion through the `NAR-len` with clever demasking tricks (for example, the tokens that are masked are from the source voice)
* [ ] ~~extend using [VALL-E 2](https://arxiv.org/pdf/2406.05370)'s features (grouped code modeling + repetition aware sampling)~~
- desu these don't seem to be worthwhile improvements, as inferencing is already rather fast, and RAS is just a fancy sampler.
* [ ] audio streaming
- this *technically* can work without any additional architecture changes, just clever tricks with sampling-then-decoding-to-audio.
- something similar to HiFiGAN (or the one for TorToiSe) trained on the last hidden states of the AR *might* also enable an alternate way for streaming.
- desu the `NAR-len` can be fast enough, with short enough utterances, to generate audio at >1x speed
* [ ] speed up inferencing for the AR
- KV caching both yields broken output and is quadratically slower, unless I'm doing something grossly wrong.
* [x] provide a pure NAR model that forgoes most of the inferencing slowdowns a regular AR+NAR model will incur.
@ -58,13 +63,15 @@ However, at this point in time, the implementation is rather divorced from VALL
- a small model trained to handle converting text to phonemes might work, but has its own problems (another model to carry around, only as accurate as the dataset it was trained against, requires training for each language, etc.).
* [ ] smarter/clever inferencing, such as:
* [x] "rolling" context, where the last generated sentence is the prefix for the next sentence.
* for the AR, stop inferencing sequences in the batch that have already hit their stop token
* [ ] explore exotic features like:
* using a pure text vocab rather than IPA phonemes (as a transformer should be "smart" enough to map text tokens)
* interleaving by using summed embedding tokens:
* for example, `<RVQ 0-7><RVQ 0>` => `<RVQ 0-7><RVQ 0-1>` => `<RVQ 0-7><RVQ 0-2>` (etc.)
* however, I imagine the sequences to train for this are *too* exotic.
* mixing multiple speakers through summing input prompt embeddings
* I do not expect this to work, but you never know...
* [ ] objective metrics such as WER / SIM-O
* [ ] WER simply requires transcribing the output audio, then computing the word error rate between the transcription and the input text (see the sketch after this list)
* this does require wrangling an STT model though (like Whisper(X))
* [ ] SIM-O requires passing the raw waveform through a speaker-similarity model
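
A rough sketch of what both metrics could look like, assuming `openai-whisper` and `jiwer` for the WER half; the speaker-embedding model for SIM-O is a placeholder, not a specific dependency:

```python
# Sketch only: whisper + jiwer handle WER; `embed` stands in for any pretrained
# speaker-verification model (e.g. WavLM-based) and is an assumption here.
import whisper
import jiwer
import torch.nn.functional as F

def word_error_rate(input_text: str, audio_path: str) -> float:
    # transcribe the generated audio with an off-the-shelf STT model,
    # then compare the transcription against the original input text
    model = whisper.load_model("base")
    hypothesis = model.transcribe(audio_path)["text"]
    return jiwer.wer(input_text, hypothesis)

def sim_o(embed, generated_wav, ground_truth_wav) -> float:
    # cosine similarity between speaker embeddings of the generated
    # and ground-truth waveforms
    a, b = embed(generated_wav), embed(ground_truth_wav)
    return F.cosine_similarity(a, b, dim=-1).item()
```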
## "Postmortem"
@ -86,7 +93,9 @@ However, while this solution boasts being lightweight, there are some caveats fo
* guidance distillation would be nice, but distillation in general harms finetuning (so this presumably would as well)
* rolling context/prefix does solve this
* VALL-E Continuous (prefixing with the input prompt) could also fix this, but technically makes it one-shot and not zero-shot
* multi-lingual support is a bit of an afterthought
* supported non-English languages exhibit the same per-speaker confidence problem, but exacerbated
* there seems to be a regression with an increase in the word error rate, although it might only be inherent to the `NAR-len`
## Notices and Citations

View File

@ -68,6 +68,8 @@ This class defines the hyperparameters to use during training.
For the most part, when using `prodigyopt`, the only dials to care about are `batch_size` and `gradient_accumulation_step`.
For knowledge distillation, its corresponding hyperparameters live here, rather than alongside a given model's configuration.
## `Evaluation`
This class governs the behavior of the evaluation / validation pass during training.

View File

@ -17,6 +17,8 @@ By default, punctuation, stress markers, and stripping are enabled by default, b
To avoid memory leaking through `phonemizer`, backends and instances are cached for further reuse.
The language for a given text can be automatically deduced with `langdetect` by passing `auto` as a language.
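
A minimal sketch of that path, mirroring the cached `detect_language` helper this commit adds:

```python
# mirrors the added detect_language helper: a cached langdetect call
from functools import cache
import langdetect

@cache
def detect_language(text: str) -> str:
    return langdetect.detect(text)  # returns codes like "en", "ja", "fr", "de"

print(detect_language("Bonjour tout le monde"))  # -> "fr"
```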
### Text Tokens
Despite being an audio LM, the model still needs some form of text as the input prompt.

View File

@ -11,7 +11,8 @@ For invoking this model in another Python package, refer to `webui.py` and `demo
To synthesize speech: `python -m vall_e <text> <ref_path> <out_path> --yaml=<yaml_path>` (or `--model=<model_path>`)
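
For invoking it from Python instead, a minimal sketch is below; the module path, constructor argument, and method name are assumptions, so consult `webui.py` and `demo.py` for the actual interface:

```python
# Hypothetical sketch; webui.py / demo.py show the real interface.
from vall_e.inference import TTS  # assumed module path

tts = TTS(config="./model.yaml")  # assumed constructor argument

tts.inference(                       # assumed method name
    text="Hello world.",
    references=["./reference.wav"],  # prompt audio for zero-shot cloning
    language="auto",                 # deduced via langdetect when "auto"
    out_path="./output.wav",
)
```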
Some additional flags you can pass to the CLI are:
* `--language`: specifies the language for phonemizing the text, and helps guide inferencing when the model is trained against that language.
* `--language`: specifies the language used to guide inferencing when the model is trained against that language. Use `auto` to automatically deduce this.
* `--text-language`: the language to phonemize the input text under. Leave blank to tie it to the above value.
* `--task`: task to perform. Defaults to `tts`, but accepts `stt` for transcriptions.
* `--max-duration`: maximum token-duration for inferencing through the AR aspect of the model. Every second corresponds to 75 steps.
* `--max-steps`: maximum steps for inferencing through the NAR-len aspect of the model.
@ -44,13 +45,6 @@ And some experimental sampling flags you can use too (your mileage will ***defin
* `--dry-multiplier`: (AR only) performs DRY sampling, the scalar factor.
* `--dry-base`: (AR only) for DRY sampling, the base of the exponent factor.
* `--dry-allowed-length`: (AR only) for DRY sampling, the window to perform DRY sampling within.
* `--layer-skip` enables early-exit layer skipping if the model is confident enough (for compatible models)
* `--layer-skip-exit-layer`: maximum layer to use
* `--layer-skip-entropy-threshold`: the maximum the logits' entropy (confidence) needs to be before exiting early
* `--layer-skip-varentropy-threshold`: the maximum the logits' varentropy (confidence spread) needs to be before exiting early
* `--refine-on-stop`: (AR only) uses the last steps' logits for the entire final output sequence, rather than the step-by-step iterative sequence.
+ This needs experimenting with to see if there's any downside.
+ to-do: compare the probability scores with the original output sequence, and pick the best one.
Some arguments can be prefixed with `ar-` or `nar-` to apply that setting only to the respective pass. At the moment, through the CLI, this includes:
* `temperature`
@ -60,4 +54,5 @@ Some arguments are able to be prefixed with `ar-` and `nar-` to only use that se
The `ar+nar-tts+stt-llama-8` (now the reference model) model has received additional training for a speech-to-text task against EnCodec-encoded audio.
Currently, the model only transcribes back into the IPA phonemes it was trained against, as an additional model or external program is required to translate the IPA phonemes back into text.
* this does make a model that can phonemize and unphonemize text more desirable in the future as a replacement for espeak (handling this as an additional task requires additional embeddings and output heads, and may harm the model, as raw text is not a modality the model is trained on).
* it seems to really want to only transcribe the first sentence for a given utterance. I imagine this is simply a problem with how it was trained.

View File

@ -12,6 +12,8 @@ The inputs are automatically sequenced in a way that a given task requires, and
While the original paper called for a separate AR model and a NAR model, and by treating the AR and the NAR as unique tasks, you can actually train a unified model (`AR+NAR`) for effectively free, as the internal states of the two should overlap quite a lot.
* Additionally, you can even train a `NAR-len` model on top of an existing model.
Later papers for discrete TTS solutions work around the multiple-codebook problem by introducing exotic interleaving patterns. For all intents and purposes, these aren't necessary, as the current sequencing prioritizes the first codebook (RVQ level 0), and the remaining RVQ levels can easily be deduced from the prior level in parallel.
## The AR (Autoregressive) Model
The AR is responsible for generating the first RVQ level of the audio codes for a given output. References to "outputs from the AR" refer to this level, as it contributes to the final waveform the most.
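
Conceptually, the AR pass just samples RVQ level 0 one token at a time until a stop token appears; the sketch below is purely illustrative, and the model call and its arguments are hypothetical rather than the actual API:

```python
import torch

def ar_generate(model, text_tokens, prompt_codes, stop_token, max_steps=900):
    # hypothetical sketch: greedily sample RVQ level 0 token-by-token
    sequence = []
    for _ in range(max_steps):  # roughly 75 steps per second of audio
        logits = model(text=text_tokens, prompt=prompt_codes, resp=sequence, level=0)
        token = int(torch.argmax(logits[-1], dim=-1))
        if token == stop_token:
            break
        sequence.append(token)
    return sequence  # the remaining RVQ levels are filled in by the NAR in parallel
```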

View File

@ -70,6 +70,15 @@ As training under `deepspeed` and Windows is not (easily) supported, under your
Creature comforts like `float16`, `amp`, and multi-GPU training *should* work under the `local` backend, but extensive testing still needs to be done to ensure it all functions.
## Knowledge Distillation
Performing knowledge distillation from a teacher to a student is simple. All that's needed is to reference the teacher model under `cfg.models` and mark it with `teacher: True`; the student model will then automatically reference the teacher.
Additional hyperparameters can be tuned to what you want under `cfg.hyperparameters`, but the defaults are sane:
* `teacher_alpha`: the alpha used to blend between the normal loss and the soft-target loss derived from comparing the student's probability distribution against the teacher's. `0.5` works well enough.
* `teacher_temperature`: the temperature to apply to the logits for both the student and the teacher, that is then also applied to the soft targets. `1.0` seems fine.
* `teacher_loss_fn`: the type of loss function to use. `kl` will use `kl_div` on the probability distributions, while `mse_loss` will apply to the raw logits before applying softmax. Either is fine: `kl` is commonly used, while some literature swears by `mse_loss` for a trivial gain.
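
As a hedged sketch of how these three knobs typically combine (standard knowledge distillation, not necessarily the exact implementation here):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      teacher_alpha=0.5, teacher_temperature=1.0, teacher_loss_fn="kl"):
    # hard-target loss against the ground-truth tokens
    hard = F.cross_entropy(student_logits, targets)

    if teacher_loss_fn == "kl":
        # soften both distributions with the shared temperature, then take the KL divergence
        soft = F.kl_div(
            F.log_softmax(student_logits / teacher_temperature, dim=-1),
            F.softmax(teacher_logits / teacher_temperature, dim=-1),
            reduction="batchmean",
        ) * (teacher_temperature ** 2)
    else:
        # "mse": compare the raw logits before any softmax
        soft = F.mse_loss(student_logits, teacher_logits)

    # blend the hard and soft losses with teacher_alpha
    return (1.0 - teacher_alpha) * hard + teacher_alpha * soft
```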
# `train.py`
This script handles the VALL-E specific training code.

View File

@ -78,7 +78,8 @@ setup(
# for the web UI
"gradio",
"nltk",
"nltk", # for parsing text inputs down to pieces
"langdetect", # for detecting the language of a text
],
extras_require = {
"all": [

View File

@ -12,8 +12,8 @@ def main():
parser = argparse.ArgumentParser("VALL-E TTS")
parser.add_argument("text")
parser.add_argument("references", type=path_list, default=None)
parser.add_argument("--language", type=str, default="auto")
parser.add_argument("--text-language", type=str, default=None)
parser.add_argument("--language", type=str, default="en")
parser.add_argument("--task", type=str, default="tts")
parser.add_argument("--modality", type=str, default="auto")
parser.add_argument("--out-path", type=Path, default=None)

View File

@ -70,7 +70,7 @@ def main():
parser.add_argument("--yaml", type=Path, default=None)
parser.add_argument("--model", type=Path, default=None)
parser.add_argument("--batch-size", type=int, default=0)
parser.add_argument("--batch-size", type=int, default=cfg.inference.batch_size)
parser.add_argument("--demo-dir", type=Path, default=None)
parser.add_argument("--skip-existing", action="store_true")
@ -83,7 +83,7 @@ def main():
parser.add_argument("--preamble", type=str, default=None)
parser.add_argument("--output-filename", type=str, default="index.html")
parser.add_argument("--language", type=str, default="en")
parser.add_argument("--language", type=str, default="auto")
parser.add_argument("--task", type=str, default="tts")
parser.add_argument("--modality", type=str, default="auto")
parser.add_argument("--out-path", type=Path, default=None)
@ -324,7 +324,7 @@ def main():
samples = []
speakers = [ dir for dir in sample_dir.iterdir() if dir.is_dir() ]
sources = [ "ms_valle", "f5" ]
sources = [ "ms_valle", "f5" ] if k == "librispeech" else ["f5"]
# generate demo output
for dir in tqdm(speakers, desc=f"Generating demo for {k}"):

View File

@ -13,15 +13,32 @@ from tqdm import tqdm
try:
import pykakasi
except Exception as e:
pykakasi = None
print(f'Error while importing pykakasi: {str(e)}')
pass
try:
import langdetect
except Exception as e:
langdetect = None
print(f'Error while importing langdetect: {str(e)}')
@cache
def detect_language( text ):
if langdetect is None:
raise Exception('langdetect is not installed.')
return langdetect.detect( text )
def _get_graphs(path):
with open(path, "r") as f:
graphs = f.read()
return graphs
@cache
def romanize( runes, sep="" ):
if pykakasi is None:
raise Exception('pykakasi is not installed.')
kks = pykakasi.kakasi()
result = kks.convert( runes )
return sep.join([ res['hira'] for res in result ])
@ -52,7 +69,10 @@ def _get_backend( language="en-us", backend="espeak", punctuation=True, stress=T
return phonemizer
def encode(text: str, language="en-us", backend="auto", punctuation=True, stress=True, strip=True) -> list[str]:
def encode(text: str, language="auto", backend="auto", punctuation=True, stress=True, strip=True) -> list[str]:
if language == "auto":
language = detect_language( text )
language = coerce_language( language )
# Convert to kana because espeak does not like kanji...

View File

@ -94,11 +94,17 @@ class TTS():
def disable_lora( self ):
return self.enable_lora( enabled=False )
def encode_text( self, text, language="en" ):
def encode_text( self, text, language="auto", precheck=True ):
# already a tensor, return it
if isinstance( text, Tensor ):
return text
# check if the text tokenizes without any unks (for example, if already-phonemized text is passed)
if precheck and "<unk>" in self.symmap:
tokens = tokenize( text )
if self.symmap["<unk>"] not in tokens:
return torch.tensor( tokens )
content = g2p.encode(text, language=language)
tokens = tokenize( content )
@ -210,6 +216,9 @@ class TTS():
dtype = sampling_kwargs.pop("dtype", self.dtype)
amp = sampling_kwargs.pop("amp", self.amp)
if batch_size < 1:
batch_size = 1
model_ar = None
model_len = None
model_nar = None
@ -236,7 +245,7 @@ class TTS():
references = [ None for _ in range(samples) ]
# fill with english
if not languages:
languages = [ "en" for _ in range(samples) ]
languages = [ "auto" for _ in range(samples) ]
if not out_paths:
out_paths = [ None for _ in range(samples) ]
# use the audio language to phonemize the text
@ -245,6 +254,10 @@ class TTS():
# tensorfy inputs
for i in range( samples ):
# detect language
if languages[i] == "auto":
languages[i] = g2p.detect_language( texts[i] )
texts[i] = self.encode_text( texts[i], language=text_languages[i] )
references[i] = self.encode_audio( references[i], trim_length=input_prompt_length ) if references[i] else None
languages[i] = self.encode_lang( languages[i] )
@ -325,7 +338,7 @@ class TTS():
self,
text,
references,
language="en",
language="auto",
text_language=None,
task="tts",
out_path=None,
@ -339,6 +352,9 @@ class TTS():
dtype = sampling_kwargs.pop("dtype", self.dtype)
amp = sampling_kwargs.pop("amp", self.amp)
if language == "auto":
language = g2p.detect_language( text )
if not text_language:
text_language = language

View File

@ -122,7 +122,7 @@ def get_speakers():
return cfg.dataset.training
def get_languages():
return get_lang_symmap().keys()
return list(get_lang_symmap().keys()) + ["auto"]
#@gradio_wrapper(inputs=layout["dataset"]["inputs"].keys())
def load_sample( speaker ):
@ -265,6 +265,9 @@ def do_inference_tts( progress=gr.Progress(track_tqdm=True), *args, **kwargs ):
elif args.split_text_by == "none":
args.split_text_by = None
if args.text_language == "auto":
args.text_language = None
tts = init_tts()
gr.Info(f"Inferencing... (Modality: {tts.modality(args.modality.lower())})")
@ -447,8 +450,8 @@ with ui:
with gr.Row():
layout["inference_tts"]["inputs"]["cfg-strength"] = gr.Slider(value=1.0, minimum=0.0, maximum=14.0, step=0.05, label="CFG Strength", info="Classifier Free Guidance scale (AR needs 1, NAR-len needs 3).")
layout["inference_tts"]["inputs"]["cfg-rescale"] = gr.Slider(value=0.75, minimum=0.0, maximum=1.0, step=0.05, label="CFG Rescale (Phi)", info="Factor when rescaling for Classifier Free Guidance (0 to disable).")
layout["inference_tts"]["inputs"]["language"] = gr.Dropdown(choices=get_languages(), label="Language (Output)", value="en", info="Target language/accent to output.")
layout["inference_tts"]["inputs"]["text-language"] = gr.Dropdown(choices=get_languages(), label="Language (Text)", value="en", info="Language the input text is in.")
layout["inference_tts"]["inputs"]["language"] = gr.Dropdown(choices=get_languages(), label="Language (Output)", value="auto", info="Target language/accent to output.")
layout["inference_tts"]["inputs"]["text-language"] = gr.Dropdown(choices=get_languages(), label="Language (Text)", value="auto", info="Language the input text is in.")
with gr.Row():
layout["inference_tts"]["inputs"]["split-text-by"] = gr.Dropdown(choices=["sentences", "lines"], label="Text Delimiter", info="Splits the text into pieces.", value="sentences")
layout["inference_tts"]["inputs"]["context-history"] = gr.Slider(value=0, minimum=0, maximum=4, step=1, label="(Rolling) Context History", info="How many prior lines to serve as the context/prefix (0 to disable).")