What is VALL-E?
VALL-E describes how text-to-speech synthesis, when treated as a language problem, can easily be solved with a language model. The original paper uses a basic transformer as the underlying architecture to perform zero-shot text-to-speech synthesis, with a short audio prompt as the reference.
Why VALL-E?
At the time, state-of-the-art neural TTS solutions were sparse. TorToiSe took a similar approach of treating TTS as a language problem, but required a ton of additional cruft on top of its ensemble. When VALL-E's paper was released, its approach was simple yet effective, requiring (at the time) just an AR and a NAR model, and leaving EnCodec to handle the rest (feature extraction, encoding audio, decoding audio). Vocos then improves upon EnCodec's decoding to produce better quality audio.
Why this VALL-E?
Unlike the paper, this VALL-E aims to:
- be as lightweight as possible, only requiring one model to load and use (and EnCodec/Vocos as an audio encoder/decoder).
- Even the original VALL-E requires a separate AR and a NAR.
- keep training and finetuning (be it the base model or through LoRAs) accessible to anyone.
- Bark was needlessly complex in providing even additional voices to use.
- Current SoTA such as F5-TTS supports it, but seems to set a rather high bar for finetuning.
- provide decent zero-shot text-to-speech synthesis, both without requiring sampling adjustments and providing thorough sampler settings.
- provide additional, easy-to-use functionality that other solutions don't offer.
However, at this point in time, the implementation is rather divorced from VALL-E and its derivative papers, but the core principle is still followed.
Model Specifications
The reference model (ar+nar-llama-8/ar+nar-len-llama-8):
- boasts 220M parameters
- supports English, German, French, and Japanese
- support for Korean and Chinese (Mandarin?) soon™
- has several modalities of inferencing:
- the primary audio level (RVQ level 0) can be inferenced either autoregressively (AR) or non-autoregressively (NAR-len)
- pure-NAR can yield faster-than-realtime output
- supports predicting the duration of an input
- supports Speech-to-Text (although it's a second-class feature)
- additional tasks such as noise reduction, speech removal, editing, and voice conversion eventually™ (just need to train on it)
- trained on ? samples / ? hours of EnCodec-quantized audio at 24KHz
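To make "faster-than-realtime" concrete, here is a minimal sketch of how the generation speed could be measured. It relies on EnCodec's 24KHz model emitting 75 code frames per second; the function names are hypothetical and are not part of this repo.

```python
# EnCodec's 24KHz model uses a hop of 320 samples: 24000 / 320 = 75 frames/s.
ENCODEC_24KHZ_FRAME_RATE = 75


def audio_seconds(num_frames: int, frame_rate: int = ENCODEC_24KHZ_FRAME_RATE) -> float:
    """Duration of the audio that a sequence of code frames decodes to."""
    return num_frames / frame_rate


def realtime_speed(num_frames: int, elapsed_seconds: float,
                   frame_rate: int = ENCODEC_24KHZ_FRAME_RATE) -> float:
    """Seconds of audio produced per second of wall-clock inferencing time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds(num_frames, frame_rate) / elapsed_seconds
```

A result above 1.0 means the audio was generated faster than it plays back: 750 frames (10 seconds of audio) generated in 5 seconds gives 2.0, while the same frames generated in 20 seconds gives 0.5.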
To-Do
- train and release a serviceable model for finetuning against.
- train and release a good zero-shot model.
- for what it's worth it's decent enough for me to finally be happy with it.
- well-integrated training through the Web UI (without the kludge from ai-voice-cloning)
- explore alternative setups, like a NAR-only model or Descript-Audio-Codec
- the current experiment of an AR length-predictor + NAR for the rest seems to fall apart...
- Descript-Audio-Codec 44KHz has NAR issues, but this might be user error.
- explore better sampling techniques
- the AR doesn't need exotic sampling techniques, as they're bandaids for a bad AR.
- the NAR benefits from greedy sampling, and anything else just harms output quality.
- clean up the README, and document, document, document.
- extend to multiple languages (VALL-E X).
- reference model is trained against English, Japanese, French, and German.
- improve multi-lingual support
- improve cross-lingual support
- extend to additional tasks (SpeechX).
- stt (Speech-to-Text) seems to be working fine for the most part, but is very much a second-class feature.
- other tasks seem to require a ton of VRAM...
- SpeechX tasks might need to be reworked to fit well within the NAR-len context to make full use of masking (for example, for speech editing)
- possibly voice conversion through the NAR-len with clever demasking tricks (for example, the tokens that are masked are from the source voice)
- extend using VALL-E 2's features (grouped code modeling + repetition aware sampling)
- honestly, these don't seem to be worthwhile improvements, as inferencing is already rather fast, and RAS is just a fancy sampler.
- audio streaming
- this technically can work without any additional architecture changes, just clever tricks with sampling-then-decoding-to-audio.
- something similar to HiFiGAN (or the one for TorToiSe) trained on the last hidden states of the AR might also enable an alternate way for streaming.
- honestly, the NAR-len can be fast enough with short enough utterances to generate audio at >1x speeds
- speed up inferencing for the AR
- KV caching both yields broken output and quadratically slow output, unless I'm doing something grossly wrong.
- provide a pure NAR model that forgoes most of the inferencing slowdowns a regular AR+NAR model incurs.
- HF-ify the model
- write a weights converter
- implement a pure llama_HF implementation
- this might be easily possible by subjugating the tokenizer to handle all the embeddings / classifiers
- this will pave the way to use the model under an easy marriage of llama.cpp and encodec.cpp
- replace the phonemizer with something that doesn't depend on espeak
- train the model to handle text => phoneme (without a hit to the rest of the model)
- ...and phonemes => text
- allow raw text as input instead
- espeak is nice, but I can only really put my whole trust in it for phonemizing English.
- a small model trained to handle converting text to phonemes might work, but has its own problems (another model to carry around, only as accurate as the dataset it was trained against, requires training for each language, etc.).
- smarter/clever inferencing, such as:
- "rolling" context, where the last generated sentence is the prefix for the next sentence.
- for the AR, stop inferencing sequences in the batch that have already hit their stop token
- explore exotic features like:
- using a pure text vocab rather than IPA phonemes (as a transformer should be "smart" enough to map text tokens)
- mixing multiple speakers through summing input prompt embeddings
- I do not expect this to work, but you never know...
- objective metrics such as WER / SIM-O
- WER simply requires transcribing audio then computing word error rates through the transcriptions
- this does require subjugating an STT model though (like Whisper(X))
- SIM-O requires passing the raw waveform through a speaker-similarity model
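As a rough illustration of the two objective metrics above, the sketch below computes WER as a word-level edit distance and a SIM-O-style score as the cosine similarity between two speaker embeddings. Everything here is an assumption for illustration: the function names are hypothetical, the text normalization (lowercasing, dropping punctuation) is just one possible choice, and the transcripts and embeddings are expected to come from an STT model (like Whisper) and a speaker-similarity model that are not shown.

```python
import math
import re
from typing import List, Sequence


def _words(text: str) -> List[str]:
    # Assumed normalization: lowercase, keep only word characters and apostrophes.
    return re.findall(r"[\w']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref, hyp = _words(reference), _words(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (word missing from the hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in the hypothesis)
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity of two speaker embeddings (e.g. prompt vs. generated audio)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm
```

For example, `word_error_rate("the quick brown fox", "the quick fox")` counts one deleted word out of four reference words. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.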
"Postmortem"
For the most part, the model is complete. With the NAR-len being crammed on, I'm satisfied with the performance-to-quality tradeoff.
However, while this solution boasts being lightweight, there are some caveats for its given size:
- it's at capacity on what it can do without additional tasks to augment it further
- post-fixing it with additional layers glued on doesn't seem to offer very much improvement (12 => 16 layers)
- wrangling it is a bit of a chore, as some voices work fine under the AR but not the NAR-len, and vice-versa
- some voices outright refuse to work without LoRA training
- some sampler settings work on some voices, but others need some tweaking
- for short durations, it excels, but despite training on longer durations, stability is less guaranteed
- subjugating an existing LLM architecture is a bit of a pain, as I would love to make full use of LLaMA niceties
- hf-ifying it is possible, but it'd be a chore to set up the tokenizer properly
- it still seems like the phase of the moon matters with how it wants to cooperate
- in some eval tests it seems fine, while other times issues like word errors will crop up
- the NAR-len requires CFGs > 2-ish to cooperate (or a prefix)
- this isn't so much of an issue, but this can lead to user error, and CFG incurs an additional sampling step per step.
- guidance distillation would be nice, but distillation in general harms finetuning (assuming this just as likely harms it)
- rolling context/prefix does solve this
- VALL-E Continuous (prefixing with the input prompt) could also fix this, but technically makes it one-shot and not zero-shot
- multi-lingual support is a bit of an afterthought
- speakers of the supported non-English languages have the same confidence problem that some speakers have, only exacerbated
- there seems to be a regression with an increase in the word error rate, although it might only be inherent to the NAR-len
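For readers unfamiliar with why CFG costs an extra pass per step, here is a minimal sketch of the standard classifier-free guidance formulation applied to logits. It is not necessarily how this repo implements it: `apply_cfg`, `guided_logits`, and the `forward(conditioned)` callable are hypothetical names standing in for the real model call.

```python
from typing import Callable, List

Logits = List[float]


def apply_cfg(cond: Logits, null: Logits, cfg_strength: float) -> Logits:
    """Push the logits away from the unconditional prediction, towards the conditional one."""
    if len(cond) != len(null):
        raise ValueError("conditional and unconditional logits must be the same size")
    return [n + cfg_strength * (c - n) for c, n in zip(cond, null)]


def guided_logits(forward: Callable[[bool], Logits], cfg_strength: float) -> Logits:
    """`forward(conditioned)` is a stand-in for one model pass (assumed signature)."""
    cond = forward(True)   # pass with the text and audio prompt as conditioning
    null = forward(False)  # the extra pass CFG costs: same step, conditioning dropped
    return apply_cfg(cond, null, cfg_strength)
```

With a strength of 1 the conditional logits come back unchanged, and with 0 only the unconditional ones remain. Larger strengths (such as the > 2-ish noted above) exaggerate the difference between the two, which is where the doubled cost per step comes from.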
Notices and Citations
Unless otherwise credited/noted in this repo or within the designated Python file, this repository is licensed under AGPLv3.
- EnCodec is licensed under CC-BY-NC 4.0. If you use the code to generate audio quantization or perform decoding, it is important to adhere to the terms of their license.
- This implementation was originally based on enhuiz/vall-e, but has been heavily, heavily modified over time. Without it I would not have had a good basis to muck around and learn.
@article{wang2023neural,
title={Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
author={Wang, Chengyi and Chen, Sanyuan and Wu, Yu and Zhang, Ziqiang and Zhou, Long and Liu, Shujie and Chen, Zhuo and Liu, Yanqing and Wang, Huaming and Li, Jinyu and others},
journal={arXiv preprint arXiv:2301.02111},
year={2023}
}
@article{defossez2022highfi,
title={High Fidelity Neural Audio Compression},
author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
journal={arXiv preprint arXiv:2210.13438},
year={2022}
}