documentation under ./docs/

This commit is contained in:
mrq 2024-11-05 16:11:01 -06:00
parent bbc2de3713
commit 9901c4f8ca
13 changed files with 778 additions and 354 deletions

README.md

@@ -12,12 +12,8 @@ Besides a working PyTorch environment, the only hard requirement is [`espeak-ng`
- Linux users can consult their package managers on installing `espeak`/`espeak-ng`.
- Windows users are required to install [`espeak-ng`](https://github.com/espeak-ng/espeak-ng/releases/tag/1.51#Assets).
+ additionally, you may be required to set the `PHONEMIZER_ESPEAK_LIBRARY` environment variable to specify the path to `libespeak-ng.dll`.
+ Simply running `set PHONEMIZER_ESPEAK_LIBRARY="C:\Program Files\eSpeak NG\libespeak-ng.dll"` beforehand should fix this.
- In the future, an internal homebrew to replace this would be fantastic.
AMD systems with ROCm are *mostly* supported, but performance ***will*** vary.
- ROCm is simply too inconsistent with outputs.
## Install
Simply run `pip install git+https://git.ecker.tech/mrq/vall-e` or `pip install git+https://github.com/e-c-k-e-r/vall-e`.
@@ -32,355 +28,8 @@ A script to set up a proper environment and download the weights can be invoked w
When inferencing, either through the web UI or CLI, if no model is passed, the default model will download automatically instead, and should automatically update.
## Train
## Documentation
Training is very dependent on:
* the quality of your dataset.
* clean utterances and accurate transcriptions go a long way.
* a diverse dataset in prosody and speakers helps a ton.
* how much data you have.
* training from scratch requires upwards of 15K hours.
* training new languages from the base model simply requires maybe ~2K hours each.
* the bandwidth you quantized your audio to, as this affects how many tokens are processed per step.
* the underlying model architecture used.
* some model architectures behave better than others under a unified approach.
The provided documentation under [./docs/](./docs/) should provide thorough coverage over most, if not all, of this project.
### Try Me
To quickly test if a configuration works, you can run `python -m vall_e.models.ar_nar --yaml="./data/config.yaml"`; a small trainer will overfit a provided utterance.
### Leverage Your Own Dataset
If you already have a dataset you want, for example, your own large corpus or for finetuning, you can use your own dataset instead.
0. Set up a `venv` with `https://github.com/m-bain/whisperX/`.
+ At the moment only WhisperX is utilized. Using other variants like `faster-whisper` is an exercise left to the user at the moment.
+ It's recommended to use a dedicated virtualenv specifically for transcribing, as WhisperX will break a few dependencies.
+ The following command should work:
```
python3 -m venv venv-whisper
source ./venv-whisper/bin/activate
pip3 install torch torchvision torchaudio
pip3 install git+https://github.com/m-bain/whisperX/
```
1. Populate your source voices under `./voices/{group name}/{speaker name}/`.
2. Run `python3 -m vall_e.emb.transcribe`. This will generate a transcription with timestamps for your dataset.
+ If you're interested in using a different model, edit the script's `model_name` and `batch_size` variables.
3. Run `python3 -m vall_e.emb.process`. This will phonemize the transcriptions and quantize the audio.
+ If you're using a Descript-Audio-Codec based model, be sure to set the sample rate and audio backend accordingly.
4. Run `python3 -m vall_e.emb.similar`. This will calculate the top-k most similar utterances for each utterance for use with sampling.
+ Doing this will help the model follow the input prompt more closely, at the possible "cost" of the model not learning how to "infer" the target speaker AND prosody.
5. Copy `./data/config.yaml` to `./training/config.yaml`. Customize the training configuration and populate your `dataset.training` list with the values stored under `./training/dataset/list.json`.
+ Refer to `./vall_e/config.py` for additional configuration details.
### Dataset Formats
Two dataset formats are supported:
* the standard way:
- data is stored under `./training/data/{group}/{speaker}/{id}.{enc|dac}` as a NumPy file, where `enc` is for the EnCodec/Vocos backend, and `dac` for the Descript-Audio-Codec backend.
- it is *highly* recommended to generate metadata to speed up dataset pre-load with `python3 -m vall_e.data --yaml="./training/config.yaml" --action=metadata`
* using an HDF5 dataset:
- you can convert from the standard way with the following command: `python3 -m vall_e.data --yaml="./training/config.yaml"` (metadata for dataset pre-load is generated alongside HDF5 creation)
- this will shove everything into a single HDF5 file and store some metadata alongside (for now, the symbol map generated, and text/audio lengths)
- be sure to also define `use_hdf5` in your config YAML.
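A rough, non-authoritative sketch of the relevant block (key names per `Dataset` in `./vall_e/config.py`; the HDF5 filename and group/speaker entries are placeholders):
```
dataset:
  use_hdf5: True
  hdf5_name: data.h5
  training:
    - "LibriVox/*"
  validation:
    - "LibriVox/some_speaker"
```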
### Training
For single GPUs, simply run `python3 -m vall_e.train --yaml="./training/config.yaml"`.
For multiple GPUs, or exotic distributed training:
* with `deepspeed` backends, simply running `deepspeed --module vall_e.train --yaml="./training/config.yaml"` should handle the gory details.
* with `local` backends, simply run `torchrun --nnodes=1 --nproc-per-node={NUMOFGPUS} -m vall_e.train --yaml="./training/config.yaml"`
You can enter `save` to save the state at any time, or `quit` to save and quit training.
The `lr` command will also let you adjust the learning rate on the fly. For example: `lr 1.0e-3` will set the learning rate to `0.001`.
Some additional flags can be passed as well:
* `--eval`: only run the evaluation / validation pass, then exit afterwards.
* `--eval-random-text-prompts`: use random text prompts for the evaluation pass, rather than the provided text prompts in the dataset.
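For example, to run only the evaluation / validation pass against the configuration above (assuming these flags attach to the same training invocation):
```
python3 -m vall_e.train --yaml="./training/config.yaml" --eval
```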
### Finetuning
Finetuning can be done by training the full model, or using a LoRA.
Finetuning the full model is done the same way as training a model, but be sure to have the weights in the correct spot, as if you're loading them for inferencing.
For training a LoRA, add the following block to your `config.yaml`:
```
loras:
- name : "arbitrary name" # whatever you want
rank: 128 # dimensionality of the LoRA
alpha: 128 # scaling factor of the LoRA
training: True
```
And that's it. Training of the LoRA is done with the same command. Depending on the rank and alpha specified, the loss may be higher than it should, as the LoRA weights are initialized to appropriately random values. I found `rank` and `alpha` of 128 works fine.
To export your LoRA weights, run `python3 -m vall_e.export --lora --yaml="./training/config.yaml"`. You *should* be able to have the LoRA weights loaded from a training checkpoint automagically for inferencing, but export them just to be safe.
### Plotting Metrics
Included is a helper script to parse the training metrics. Simply invoke it with, for example: `python3 -m vall_e.plot --yaml="./training/config.yaml"`
You can specify what X and Y labels you want to plot against by passing `--xs tokens_processed --ys loss.nll stats.acc`
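Putting the two together, a full invocation might look like:
```
python3 -m vall_e.plot --yaml="./training/config.yaml" --xs tokens_processed --ys loss.nll stats.acc
```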
### Notices
#### Training Under Windows
As training under `deepspeed` and Windows is not (easily) supported, under your `config.yaml`, simply change `trainer.backend` to `local` to use the local training backend.
Creature comforts like `float16`, `amp`, and multi-GPU training *should* work under the `local` backend, but extensive testing still needs to be done to ensure it all functions.
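A minimal sketch of that change in the YAML (assuming `trainer` mirrors `cfg.trainer` as a top-level block):
```
trainer:
  backend: local
```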
#### Backend Architectures
As the core of VALL-E makes use of a language model, various LLM architectures can be supported and slotted in. Currently supported LLM architectures:
* `llama`: using HF transformers' LLaMA implementation for its attention-based transformer, boasting RoPE and other improvements.
+ I aim to utilize this for the foundational model, as I get to leverage a bunch of things tailored for LLaMA (and converting to them is rather easy).
* `mixtral`: using HF transformers' Mixtral implementation for its attention-based transformer, also utilizing its MoE implementation.
* `bitnet`: using [this](https://github.com/kyegomez/BitNet/) implementation of BitNet's transformer.
- Setting `cfg.optimizers.bitnet=True` will make use of BitNet's linear implementation.
* `transformer`: a basic attention-based transformer implementation, with attention heads + feed forwards.
* `retnet`: using [TorchScale's RetNet](https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/retnet.py) implementation, a retention-based approach can be used instead.
- Its implementation for MoE can also be utilized.
* `retnet-hf`: using [syncdoth/RetNet](https://github.com/syncdoth/RetNet) with a HuggingFace-compatible RetNet model
- has an inference penalty, and MoE is not implemented.
* `mamba`: using [state-spaces/mamba](https://github.com/state-spaces/mamba) (needs to mature)
- ***really hard*** to have a unified AR and NAR model
- the inference penalty makes it a really hard sell, despite the loss already being a low 3 after only a small number of samples processed
For audio backends:
* [`encodec`](https://github.com/facebookresearch/encodec): a tried-and-tested EnCodec to encode/decode audio.
* [`vocos`](https://huggingface.co/charactr/vocos-encodec-24khz): a higher quality EnCodec decoder.
- encoding audio will use the `encodec` backend automagically, as there's no EnCodec encoder under `vocos`
* [`descript-audio-codec`](https://github.com/descriptinc/descript-audio-codec): boasts better compression and quality, but has issues with model convergence.
- models at 24KHz + 8kbps will NOT converge in any manner.
- models at 44KHz + 8kbps seem to have a harder "language" to model, and the NAR side of the model suffers greatly.
`llama`-based models also support different attention backends:
* `torch.nn.functional.scaled_dot_product_attention`-based attention:
* `math`: torch's SDPA's `math` kernel
* `mem_efficient`: torch's SDPA's memory efficient (`xformers` adjacent) kernel
* `cudnn`: torch's SDPA's `cudnn` kernel
* `flash`: torch's SDPA's flash attention kernel
* internal implementations of external attention backends:
* `xformers`: [facebookresearch/xformers](https://github.com/facebookresearch/xformers/)'s memory efficient attention
* `flash_attn`: uses the available `flash_attn` package (including `flash_attn==1.0.9` through a funny wrapper)
* `flash_attn_v100`: uses [ZRayZzz/flash-attention-v100](https://github.com/ZRayZzz/flash-attention-v100/)'s Flash Attention for Volta (but doesn't work currently)
* `fused_attn`: uses an implementation using `triton` (tested on my 7900XTX and V100s), but seems to introduce errors when used to train after a while
* `default`: uses the naive path for the internal implementation (used for attention-debugging purposes)
* `transformers` Llama\*Attention implementations:
* `eager`: default `LlamaAttention`
* `sdpa`: integrated `LlamaSdpaAttention` attention model
* `flash_attention_2`: integrated `LlamaFlashAttention2` attention model
* `auto`: determine the best fit from the above
The wide support for various backends is solely while I try to figure out which is the "best" for a core foundation model.
##### ROCm Flash Attention
[ROCm/flash-attention](https://github.com/ROCm/flash-attention) currently does not support Navi3 cards (gfx11xx), so first-class support for Flash Attention is a bit of a mess on Navi3. Using the `howiejay/navi_support` branch can get you inference support, but not training support (due to an error being thrown during the backwards pass), by doing the following:
* edit `/opt/rocm/include/hip/amd_detail/amd_hip_bf16.h`:
```
#if defined(__HIPCC_RTC__)
#define __HOST_DEVICE__ __device__ static
#else
#include <climits>
#define __HOST_DEVICE__ __host__ __device__ static inline
#endif
```
* install with `pip install -U git+https://github.com/ROCm/flash-attention@howiejay/navi_support --no-build-isolation`
## Export
To export the models, run: `python -m vall_e.export --yaml=./training/config.yaml`.
This will export the latest checkpoints, for example, under `./training/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used, and training stats.
Despite being called `fp32.pth`, you can export it to a different precision type with `--dtype=float16|bfloat16|float32`.
You can also export to `safetensors` with `--format=sft`, and `fp32.sft` will be exported instead.
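For example, a sketch that combines the above flags to export half-precision `safetensors` weights (assuming the flags compose as described):
```
python -m vall_e.export --yaml=./training/config.yaml --dtype=float16 --format=sft
```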
## Synthesis
To synthesize speech: `python -m vall_e <text> <ref_path> <out_path> --yaml=<yaml_path>` (or `--model=<model_path>`)
Some additional flags you can pass are:
* `--language`: specifies the language for phonemizing the text, and helps guide inferencing when the model is trained against that language.
* `--task`: task to perform. Defaults to `tts`, but accepts `stt` for transcriptions.
* `--max-ar-steps`: maximum steps for inferencing through the AR model. Each second is 75 steps.
* `--device`: device to use (default: `cuda`, examples: `cuda:0`, `cuda:1`, `cpu`)
* `--ar-temp`: sampling temperature to use for the AR pass. During experimentation, `0.95` provides the most consistent output, but values close to it work fine.
* `--nar-temp`: sampling temperature to use for the NAR pass. During experimentation, the lower the value, the better. Set to `0` to enable greedy sampling.
* `--input-prompt-length`: the maximum duration the input prompt can be (~6 seconds is fine, longer durations lead to slower generations for "better" accuracy, as long as the model was trained against such input prompt durations)
And some experimental sampling flags you can use too (your mileage will ***definitely*** vary, but most of these are bandaids for a bad AR):
* `--input-prompt-prefix`: (AR only) treats the input prompt as the initial response prefix, but...
* the transcription of the prompt needs to be in the input text prompt.
* doesn't perform all that well (I believe the model needs to be trained a bit more on this, as with `tts-c`).
* `--min-ar-temp`: triggers the dynamic temperature pathway, adjusting the temperature based on the confidence of the best token. Acceptable values are between `[0.0, (n)ar-temp)`.
+ This simply uplifts the [original implementation](https://github.com/kalomaze/koboldcpp/blob/dynamic-temp/llama.cpp#L5132) to perform it.
+ **!**NOTE**!**: This does not seem to resolve any issues with setting too high/low of a temperature. The right values are yet to be found.
* `--top-p`: limits the sampling pool to the top tokens whose cumulative probability adds up to `P`.
* `--top-k`: limits the sampling pool to the top `K` values in the probability distribution.
* `--repetition-penalty`: modifies the probability of tokens if they have appeared before. In the context of audio generation, this is a very iffy parameter to use.
* `--repetition-penalty-decay`: decays the above penalty based on how far back in the past sequence the token appeared.
* `--length-penalty`: (AR only) modifies the probability of the stop token based on the current sequence length. This is ***very*** finicky due to the AR already being well correlated with the length.
* `--beam-width`: (AR only) specifies the number of branches to search through for beam sampling.
+ This is a very naive implementation that's effectively just greedy sampling across `B` spaces.
* `--mirostat-tau`: (AR only) the "surprise value" when performing mirostat sampling.
+ This simply uplifts the [original implementation](https://github.com/basusourya/mirostat/blob/master/mirostat.py) to perform it.
+ **!**NOTE**!**: This is incompatible with beam search sampling (for the meantime at least).
* `--mirostat-eta`: (AR only) the "learning rate" during mirostat sampling applied to the maximum surprise.
* `--dry-multiplier`: (AR only) performs DRY sampling, the scalar factor.
* `--dry-base`: (AR only) for DRY sampling, the base of the exponent factor.
* `--dry-allowed-length`: (AR only) for DRY sampling, the window to perform DRY sampling within.
* `--layer-skip` enables early-exit layer skipping if the model is confident enough (for compatible models)
* `--layer-skip-exit-layer`: maximum layer to use
* `--layer-skip-entropy-threshold`: the maximum entropy (confidence) the logits may have before the model exits early
* `--layer-skip-varentropy-threshold`: the maximum varentropy (confidence spread) the logits may have before the model exits early
* `--refine-on-stop`: (AR only) uses the last step's logits for the entire final output sequence, rather than the step-by-step iterative sequence.
+ This needs experimenting with to see if there's any downside.
+ to-do: compare the probability scores with the original output sequence, and pick the best one.
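Putting the basics together, a sketch of a typical invocation (the text, paths, and flag values here are placeholders, with the temperatures and prompt length picked per the notes above):
```
python -m vall_e "The quick brown fox jumps over the lazy dog." \
  ./voices/LibriVox/some_speaker/reference.wav ./output.wav \
  --yaml=./training/config.yaml \
  --ar-temp=0.95 --nar-temp=0 --input-prompt-length=6
```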
### Speech-to-Text
The `ar+nar-tts+stt-llama-8` model has received additional training for a speech-to-text task against EnCodec-encoded audio.
Currently, the model only transcribes back into the IPA phonemes it was trained against, as an additional model or external program is required to translate the IPA phonemes back into text.
* this does make a model that can phonemize text, and unphonemize text, more desirable in the future as a replacement for espeak (handling this as an additional task requires additional embeddings and output heads, and could possibly harm the model, as raw text is not a modality the model is trained on).
### Web UI
A Gradio-based web UI is accessible by running `python3 -m vall_e.webui`. You can, optionally, pass:
* `--yaml=./path/to/your/config.yaml`: will load the targeted YAML
* `--model=./path/to/your/model.sft`: will load the targeted model weights
* `--listen 0.0.0.0:7860`: will set the web UI to listen to all IPs at port 7860. Replace the IP and Port to your preference.
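For example, to serve the web UI on all interfaces with a given training configuration:
```
python3 -m vall_e.webui --yaml=./training/config.yaml --listen 0.0.0.0:7860
```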
### Emergent Behavior
The model can be prompted in creative ways to yield some interesting behaviors:
* prompting without an input audio prompt will have the model generate a random voice at the "cost" of some unintelligible utterance at the beginning of the output response (despite doing no promptless training).
* finetunes / LoRAs can benefit from this by allowing promptless synthesis, while still opting to provide an input audio prompt for guidance.
* prompting with an input text prompt being the transcription of the input audio prompt will have the response follow very closely to the input prompt (despite not doing input=output training).
* this should allow for easy transcription editing without much fuss.
#### Inference
Synthesizing speech is simple:
* `Input Prompt`: The guiding text prompt. Each new line will be its own generated audio to be stitched together at the end.
* `Audio Input`: The reference audio for the synthesis. Under Gradio, you can trim your clip accordingly, but leaving it as-is works fine.
- A properly trained model can inference without a prompt to generate a random voice (without even needing to generate a random prompt itself).
* `Output`: The resultant audio.
* `Inference`: Button to start generating the audio.
* `Basic Settings`: Basic sampler settings for most uses.
* `Sampler Settings`: Advanced sampler settings that are common for most text LLMs, but need experimentation.
All the additional knobs have a description that can be correlated to the above CLI flags.
Speech-To-Text phoneme transcriptions for models that support it can be done using the `Speech-to-Text` tab.
#### Dataset
This tab currently only features exploring a dataset already prepared and referenced in your `config.yaml`. You can select a registered voice, and have it randomly sample an utterance.
In the future, this *should* contain the necessary niceties to process raw audio into a dataset to train/finetune through, without needing to invoke the above commands to prepare the dataset.
#### Settings
So far, this only allows you to load a different model without needing to restart. The previous model should seamlessly unload, and the new one will load in place.
## To-Do
* [x] train and release a serviceable model for finetuning against.
* [x] train and release a ***good*** zero-shot model.
- for what it's worth it's decent enough for me to finally be happy with it.
* [ ] well-integrated training through the Web UI (without the kludge from ai-voice-cloning)
* [x] ~~explore alternative setups, like a NAR-only model or Descript-Audio-Codec~~
- the current experiment of an AR length-predictor + NAR for the rest seems to fall apart...
- Descript-Audio-Codec 44KHz has NAR issues, but this *might* be user error.
* [x] ~~explore better sampling techniques~~
- the AR doesn't *need* exotic sampling techniques, as they're bandaids for a bad AR.
- the NAR benefits from greedy sampling, and anything else just harms output quality.
* [ ] clean up the README, and document, document, document onto the wiki.
* [x] extend to multiple languages ([VALL-E X](https://arxiv.org/abs/2303.03926)).
- reference model is trained against English, Japanese, French, and German.
* [ ] extend to additional tasks ([SpeechX](https://arxiv.org/abs/2308.06873)).
- `stt` (Speech-to-Text) seems to be working fine for the most part.
- other tasks seem to require a ton of VRAM......
* [ ] extend using [VALL-E 2](https://arxiv.org/pdf/2406.05370)'s features (grouped code modeling + repetition aware sampling)
- desu these don't seem to be worthwhile improvements, as inferencing is already rather fast, and RAS is just a fancy sampler.
* [ ] audio streaming
- this *technically* can work without any additional architecture changes, just clever tricks with sampling-then-decoding-to-audio.
- something similar to HiFiGAN (or the one for TorToiSe) trained on the last hidden states of the AR *might* also enable an alternate way for streaming.
* [ ] speed up inferencing
- KV caching both yields broken output and quadratically slow output, unless I'm doing something grossly wrong.
- A pure HF model is the only way to fix this, but converting the model to one is a bit of a chore.
- Speculative sampling seems overkill for small models (and in reality seems like it's better to just train a larger model).
- Self-speculation through layer-skipping doesn't offer any tangible speedups, sadly.
* [ ] replace the phonemizer with something that doesn't depend on espeak
* [ ] train the model to handle text => phoneme (without a hit to the rest of the model)
* [ ] ...and phonemes => text
* [ ] allow raw text as input instead
- espeak is nice, but I can only really put my whole trust in phonemizing English.
- a small model trained to handle converting text to phonemes might work, but has its own problems (another model to carry around, only as accurate as the dataset it was trained against, requires training for each language, etc.).
* [ ] smarter/clever inferencing, such as:
* [ ] "rolling" context, where the last generated sentence is the prefix for the next sentence.
* [ ] explore exotic features like:
* using a pure text vocab rather than IPA phonemes (as a transformer should be "smart" enough to map text tokens)
* interleaving by using summed embedding tokens:
* for example, `<RVQ 0-7><RVQ 0>` => `<RVQ 0-7><RVQ 0-1>` => `<RVQ 0-7><RVQ 0-2>` (etc.)
* however, I imagine the sequences to train for this are *too* exotic.
* mixing multiple speakers through summing input prompt embeddings
* I do not expect this to work, but you never know...
## Caveats
Despite how lightweight it is in comparison to other TTS's I've meddled with, there are still some caveats, be it with the implementation or model weights:
* the audio embeddings have some quirks from keeping the AR's RVQ level 0 separate from the NAR's RVQ level 0 (sharing them caused some problems in testing)
* the trainer / dataloader assumes there are zero variations between a speaker's utterances, and thus it can only extract the basics of a speaker's features, rather than deeper features (like prosody, tone, etc.), when performing inferences.
+ ~~however, trying to work around this would require training under `tts-c` (VALL-E continuous) mode or modifying an input prompt enough to where its quantized representation differs enough from the output response the prompt derives from.~~
+ to remedy this, training benefits from calculating the most similar utterances for each utterance, and using that as the input prompt for training.
* the trainer's default RVQ level distribution prioritizes lower RVQ levels over higher RVQ levels, as the lower levels contribute to the final waveform more; however, this leaves some minor artifacting that rises in the higher RVQ levels due to inaccuracy issues.
+ summing the audio embeddings for later RVQ levels seems to help?
+ `model.experimental.p_rvq_levels: [0,0,0,0,0,0,0,1,2,3,4,5,6,7]` seems to help?
* speakers that aren't similar to an audiobook narrator voice have similarity issues, due to the majority of training using `path`-based dataloader sampling instead of `speaker`-based (or `group`-based) dataloader sampling.
+ although LoRAs help a ton for fixing results for a single voice.
+ a diverse dataset in prosody and speakers (such as a corpus sourced from dramatic media like video games) helps a ton, but issues remain for speakers not similar to any seen speakers.
## Notices and Citations
Unless otherwise credited/noted in this README or within the designated Python file, this repository is [licensed](LICENSE) under AGPLv3.
- [EnCodec](https://github.com/facebookresearch/encodec) is licensed under CC-BY-NC 4.0. If you use the code to generate audio quantization or perform decoding, it is important to adhere to the terms of their license.
- This implementation was originally based on [enhuiz/vall-e](https://github.com/enhuiz/vall-e), but has been heavily, heavily modified over time. Without it I would not have had a good basis to muck around and learn.
```bibtex
@article{wang2023neural,
title={Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
author={Wang, Chengyi and Chen, Sanyuan and Wu, Yu and Zhang, Ziqiang and Zhou, Long and Liu, Shujie and Chen, Zhuo and Liu, Yanqing and Wang, Huaming and Li, Jinyu and others},
journal={arXiv preprint arXiv:2301.02111},
year={2023}
}
```
```bibtex
@article{defossez2022highfi,
title={High Fidelity Neural Audio Compression},
author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
journal={arXiv preprint arXiv:2210.13438},
year={2022}
}
```
Markdown files should correspond directly to their respective file or folder under `./vall_e/`.

docs/README.md

@@ -0,0 +1,101 @@
# What is VALL-E?
[VALL-E](https://arxiv.org/abs/2301.02111) describes how treating text-to-speech synthesis as a language problem can easily be solved with a language model. The original paper utilizes a basic transformer as the underlying architecture to perform zero-shot text-to-speech synthesis using a short audio prompt as reference.
# Why VALL-E?
At the time, state-of-the-art neural-based TTS solutions were sparse. TorToiSe took a similar approach of treating TTS as a language problem, but required a ton of additional cruft on top of its ensemble. Thus, when VALL-E's paper released, it was simple yet effective: at the time it required just an AR and a NAR model, leaving EnCodec to handle the rest (feature extraction, encoding audio, decoding audio). Vocos then improves upon EnCodec's decoding to produce better quality audio.
# Why this VALL-E?
Unlike the paper, this VALL-E aims to:
* be as lightweight as possible, requiring only one model to load and use (plus EnCodec/Vocos).
+ Even the original VALL-E requires a separate AR and NAR model.
* keep training and finetuning (be it the base model or through LoRAs) accessible to anyone.
+ Bark was needlessly complex when it came to even providing additional voices to use.
+ Current SoTA such as F5-TTS supports finetuning, but seems to have a rather high barrier to entry for it.
* provide decent zero-shot text-to-speech synthesis, both without requiring sampling adjustments and providing thorough sampler settings.
## Caveats
Despite how lightweight it is in comparison to other TTS's I've meddled with, there are still some caveats, be it with the implementation or model weights:
* the audio embeddings have some quirks from keeping the AR's RVQ level 0 separate from the NAR's RVQ level 0 (sharing them caused some problems in testing)
* the trainer / dataloader assumes there are zero variations between a speaker's utterances, and thus it can only extract the basics of a speaker's features, rather than deeper features (like prosody, tone, etc.), when performing inferences.
+ ~~however, trying to work around this would require training under `tts-c` (VALL-E continuous) mode or modifying an input prompt enough to where its quantized representation differs enough from the output response the prompt derives from.~~
+ to remedy this, training benefits from calculating the most similar utterances for each utterance, and using that as the input prompt for training.
* the trainer's default RVQ level distribution prioritizes lower RVQ levels over higher RVQ levels, as the lower levels contribute to the final waveform more; however, this leaves some minor artifacting that rises in the higher RVQ levels due to inaccuracy issues.
+ summing the audio embeddings for later RVQ levels seems to help?
+ `model.experimental.p_rvq_levels: [0,0,0,0,0,0,0,1,2,3,4,5,6,7]` seems to help?
* speakers that aren't similar to an audiobook narrator voice have similarity issues, due to the majority of training using `path`-based dataloader sampling instead of `speaker`-based (or `group`-based) dataloader sampling.
+ although LoRAs help a ton for fixing results for a single voice.
+ a diverse dataset in prosody and speakers (such as a corpus sourced from dramatic media like video games) helps a ton, but issues remain for speakers not similar to any seen speakers.
## To-Do
* [x] train and release a serviceable model for finetuning against.
* [x] train and release a ***good*** zero-shot model.
- for what it's worth it's decent enough for me to finally be happy with it.
* [ ] well-integrated training through the Web UI (without the kludge from ai-voice-cloning)
* [x] ~~explore alternative setups, like a NAR-only model or Descript-Audio-Codec~~
- the current experiment of an AR length-predictor + NAR for the rest seems to fall apart...
- Descript-Audio-Codec 44KHz has NAR issues, but this *might* be user error.
* [x] ~~explore better sampling techniques~~
- the AR doesn't *need* exotic sampling techniques, as they're bandaids for a bad AR.
- the NAR benefits from greedy sampling, and anything else just harms output quality.
* [ ] clean up the README, and document, document, document onto the wiki.
* [x] extend to multiple languages ([VALL-E X](https://arxiv.org/abs/2303.03926)).
- reference model is trained against English, Japanese, French, and German.
* [ ] extend to additional tasks ([SpeechX](https://arxiv.org/abs/2308.06873)).
- `stt` (Speech-to-Text) seems to be working fine for the most part.
- other tasks seem to require a ton of VRAM......
* [ ] extend using [VALL-E 2](https://arxiv.org/pdf/2406.05370)'s features (grouped code modeling + repetition aware sampling)
- desu these don't seem to be worthwhile improvements, as inferencing is already rather fast, and RAS is just a fancy sampler.
* [ ] audio streaming
- this *technically* can work without any additional architecture changes, just clever tricks with sampling-then-decoding-to-audio.
- something similar to HiFiGAN (or the one for TorToiSe) trained on the last hidden states of the AR *might* also enable an alternate way for streaming.
* [ ] speed up inferencing
- KV caching both yields broken output and quadratically slow output, unless I'm doing something grossly wrong.
- A pure HF model is the only way to fix this, but converting the model to one is a bit of a chore.
- Speculative sampling seems overkill for small models (and in reality seems like it's better to just train a larger model).
- Self-speculation through layer-skipping doesn't offer any tangible speedups, sadly.
* [ ] replace the phonemizer with something that doesn't depend on espeak
* [ ] train the model to handle text => phoneme (without a hit to the rest of the model)
* [ ] ...and phonemes => text
* [ ] allow raw text as input instead
- espeak is nice, but I can only really put my whole trust in phonemizing English.
- a small model trained to handle converting text to phonemes might work, but has its own problems (another model to carry around, only as accurate as the dataset it was trained against, requires training for each language, etc.).
* [ ] smarter/clever inferencing, such as:
* [ ] "rolling" context, where the last generated sentence is the prefix for the next sentence.
* [ ] explore exotic features like:
* using a pure text vocab rather than IPA phonemes (as a transformer should be "smart" enough to map text tokens)
* interleaving by using summed embedding tokens:
* for example, `<RVQ 0-7><RVQ 0>` => `<RVQ 0-7><RVQ 0-1>` => `<RVQ 0-7><RVQ 0-2>` (etc.)
* however, I imagine the sequences to train for this are *too* exotic.
* mixing multiple speakers through summing input prompt embeddings
* I do not expect this to work, but you never know...
## Notices and Citations
Unless otherwise credited/noted in this repo or within the designated Python file, this repository is [licensed](LICENSE) under AGPLv3.
- [EnCodec](https://github.com/facebookresearch/encodec) is licensed under CC-BY-NC 4.0. If you use the code to generate audio quantization or perform decoding, it is important to adhere to the terms of their license.
- This implementation was originally based on [enhuiz/vall-e](https://github.com/enhuiz/vall-e), but has been heavily, heavily modified over time. Without it I would not have had a good basis to muck around and learn.
```bibtex
@article{wang2023neural,
title={Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
author={Wang, Chengyi and Chen, Sanyuan and Wu, Yu and Zhang, Ziqiang and Zhou, Long and Liu, Shujie and Chen, Zhuo and Liu, Yanqing and Wang, Huaming and Li, Jinyu and others},
journal={arXiv preprint arXiv:2301.02111},
year={2023}
}
```
```bibtex
@article{defossez2022highfi,
title={High Fidelity Neural Audio Compression},
author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
journal={arXiv preprint arXiv:2210.13438},
year={2022}
}
```

docs/data.md

@@ -0,0 +1,173 @@
# data.py
This script handles the meat of preparing the data to feed the model through the dataloader, and unfortunately accounts for quite a lot of this project's complexity.
Most of these settings live under `cfg.dataset`.
## Dataset
### Leverage Your Own Dataset
If you already have a dataset you want, for example, your own large corpus or for finetuning, you can use your own dataset instead.
0. Set up a `venv` with `https://github.com/m-bain/whisperX/`.
+ At the moment only WhisperX is utilized. Using other variants like `faster-whisper` is an exercise left to the user at the moment.
+ It's recommended to use a dedicated virtualenv specifically for transcribing, as WhisperX will break a few dependencies.
+ The following command should work:
```
python3 -m venv venv-whisper
source ./venv-whisper/bin/activate
pip3 install torch torchvision torchaudio
pip3 install git+https://github.com/m-bain/whisperX/
```
1. Populate your source voices under `./voices/{group name}/{speaker name}/`.
2. Run `python3 -m vall_e.emb.transcribe`. This will generate a transcription with timestamps for your dataset.
+ If you're interested in using a different model, edit the script's `model_name` and `batch_size` variables.
3. Run `python3 -m vall_e.emb.process`. This will phonemize the transcriptions and quantize the audio.
+ If you're using a Descript-Audio-Codec based model, be sure to set the sample rate and audio backend accordingly.
4. Run `python3 -m vall_e.emb.similar`. This will calculate the top-k most similar utterances for each utterance for use with sampling.
+ Doing this will help the model follow the input prompt more closely, at the possible "cost" of the model not learning how to "infer" the target speaker AND prosody.
5. Copy `./data/config.yaml` to `./training/config.yaml`. Customize the training configuration and populate your `dataset.training` list with the values stored under `./training/dataset/list.json`.
+ Refer to `./vall_e/config.py` for additional configuration details.
### Dataset Formats
Two dataset formats are supported:
* the standard way:
- data is stored under `./training/data/{group}/{speaker}/{id}.{enc|dac}` as a NumPy file, where `enc` is for the EnCodec/Vocos backend, and `dac` for the Descript-Audio-Codec backend.
- it is *highly* recommended to generate metadata to speed up dataset pre-load with `python3 -m vall_e.data --yaml="./training/config.yaml" --action=metadata`
* using an HDF5 dataset:
- you can convert from the standard way with the following command: `python3 -m vall_e.data --yaml="./training/config.yaml"` (metadata for dataset pre-load is generated alongside HDF5 creation)
- this will shove everything into a single HDF5 file and store some metadata alongside (for now, the symbol map generated, and text/audio lengths)
- be sure to also define `use_hdf5` in your config YAML.
## Dataloader
The dataloader handles some simple yet effective features, such as:
* culling samples within a requested duration range
* grouping samples based on:
* speakers (to keep utterances for a given speaker) and groups (to keep similar speakers within a group as defined in the dataset)
* durations, to keep VRAM usage and throughput consistent, if requested (as training requires keeping *all* samples of a batch the same token length)
* further partitioning samples per GPU
* shuffling then interleaving, per the dataloader sampler settings
* saving/loading sampler states to disk
* preparing a sample in a batch with adequate data for a given task, such as:
* picking an input prompt similar to the sampled output audio, if requested
* picking an input prompt from the same speaker as the sample, if the above is not requested
* preparing the input sequence for the given task (such as non-TTS tasks)
The initial list of paths is cached through `diskcache`, if `cfg.dataset.cache == True`. Be sure to delete the resultant `.cache` folder, as well as the `sampler.*` state dicts alongside checkpoints, if you plan to modify the dataloader settings between training sessions.
## Tasks
As this handles preparing the data fed into the model for training, this script needs to be aware of what tasks it should attend to, as mostly outlined under SpeechX.
This section may be covered elsewhere in the documentation, but coverage here should focus on the specifics of attending to the task, rather than what the task is.
* `tts`: basic and naive text-to-speech.
* requires a text transcription, input audio prompt, and the output audio response.
* `tts-c`: also noted as "VALL-E Continuous"
* this is what most other TTS solutions abide by (those that require a transcription of the input prompt)
* this *should* be more accurate, as it has the output adhere more strongly to the input through guidance, but it doesn't seem to be necessary (to train for or inference under).
* naively, this requires just the text transcription and output audio response, where part of the output audio response is trimmed to serve as the input audio prompt.
* non-naively, this requires two text transcriptions and two output audio responses (where one of them serves as the input audio prompt).
* `stt`: basic and naive speech-to-text.
* requires an input audio prompt and the output text transcription (as phonemes, unfortunately).
* `ns`: noise suppression.
* requires just a text transcription and an output audio response, where the input audio prompt is just the output + noise
* text transcription can optionally be removed to allow for training without text guidance.
* `sr`: speech removal.
* requires just a text transcription and an output audio response, where the input audio prompt is just the sampled utterance + noise, and the output is just the original noise.
* text transcription can optionally be removed to allow for training without text guidance.
* `tse`: target speech extraction.
* requires a text transcription, an input audio prompt of the sampled speaker, an utterance sampled from a different speaker, and the output audio response.
* the input prompt is appended with both the output audio and the utterance sampled from the different speaker overlaid on one another.
* `cse`: clean speech editing.
* an ideal world would have phoneme-level transcriptions, but I do not have very-accurate phoneme-level transcriptions.
* to make up for this, this requires multiple samples for the prefix, the original middle, the edited portion for the middle, and the postfix sample.
* the prefix and postfix *can* be randomly omitted, but keeping them in ensures better editing of speech within the middle.
* requires four full samples.
* `nse`: noisy speech editing.
* the above, but injects some noise throughout the sampled utterances.
A mystical `rvc` for emulating RVC speech-to-speech synthesis is possible, but requires a dataset to do so.
## `__main__`
This script can be called directly to perform dataloader-related tasks.
### `--action=metadata`
Invoking this will take processed samples (`.enc` for EnCodec, `.dac` for Descript-Audio-Codec) from `{YAML_PATH}/data/`, as per the YAML's `cfg.dataset.{training|validation|noise}` lists, and store helpful metadata under `{YAML_PATH}/metadata/`, to speed up dataloader preparations. Since dataloader preparations can cull based on audio durations, being able to look up a sample's duration speeds things up without needing to load the sample and read the file's metadata.
This metadata can be then used to store similar speaker indices.
### `--action=hdf5`
Invoking this will take processed samples (`.enc` for EnCodec, `.dac` for Descript-Audio-Codec) from `{YAML_PATH}/data/`, as per the YAML's `cfg.dataset.{training|validation|noise}` lists, and store them within a single `.h5` HDF5 file.
Additionally, this implicitly invokes `--action=metadata`, to create additional JSON metadata under `{YAML_PATH}/metadata/`, to speed up dataloader preparations.
### `--action=sample`
Invoking this will load the dataloader, sample it, and print out the batch's contents.
This serves primarily for debugging purposes during development, and should not be necessary for the end user.
### `--action=validate`
Invoking this will process the dataset to check for any phonemes missing from the tokenizer (as defined under `cfg.tokenizer`).
Any missing phonemes will be printed through `logger` to make mending the tokenizer dict easier.
This serves primarily for debugging purposes during development, and should not be necessary for the end user. However, additional languages may emit additional IPAs through `phonemizer`, so those training additional languages should take care to validate for missing phonemes before training, to avoid headaches.
## `cfg.dataset`
This entry in the config YAML handles knobs and features related to the dataloader. This is defined as `Dataset` in `./vall_e/config.py`
* `training`: list of entries to populate the training dataset with. Wildcards are accepted, such as `LibriVox/*` to easily load a speaker within a group, without needing to define them individually.
* `validation`: the above, but for the validation dataset.
* `noise`: the above, but for any noise that may be sampled during dataloader sampling. Text is not required for this dataset.
* `speaker_name_getter`: a lambda function to evaluate, to retrieve the speaker name from a given path string.
* `speaker_group_getter`: a lambda function to evaluate, to retrieve the speaker's associated group from a given path string.
* `speaker_languages`: Deprecated. This is a dict that maps language codes to a list of speaker groups, for when the language code was not stored alongside a sample's data.
* `use_hdf5`: use `{YAML_PATH}/{cfg.dataset.hdf5_name}` to sample data from, rather than individual files on disk.
* `hdf5_name`: filename (or path?) to the HDF5 dataset file to load, if the above is requested.
* `hdf5_flag`: flag to open the above HDF5 file under. By default this is `a` to write to, as it's necessary for HDF5 creation, but will automatically set to `r` under distributed settings.
* `use_metadata`: references generated metadata instead of loading samples individually to acquire metadata.
* `validate`: cull samples that do not fall within the requested `cfg.dataset.duration_range`.
* `workers`: number of worker processes to handle dataloading under PyTorch.
* `cache`: use diskcache when requested to not require subsequent processing. This handles *all* `diskcache` requests throughout the program if requested, but should only really be used under this script.
* `min_utterances`: the minimum number of utterances required to treat a speaker as valid.
* `duration_range`: a list of two values denoting the duration range a sample must fall within to be valid for the dataloader.
* `sample_type`: type of sampler to use. Currently accepts `path` (an epoch is all paths in the dataset, and each index maps to each sample) or `speaker` (an epoch is all speakers in the dataset, and each index maps to each speaker)
* `sample_order`: order to keep the dataloader sample. Currently accepts `interleaved` (tries to balance per speaker) and `duration` (orders by duration to keep throughput and VRAM usage consistent).
* `sample_shuffle`: shuffles the dataloader sampler.
* `sample_max_duration_batch`: the maximum total duration a batch can be. Values > 0 will enable batch sampling, where the dataloader sampler returns batches of batches.
* This only works under `sample_order=duration` and `sample_type=path`, and should raise an exception for any other configuration.
* `prompt_duration_range`: a list of two values to denote the range a sample's input prompt duration should fall within. This keeps the model trained for input prompt durations within this range, though a little extra sometimes works without training for it.
* `prompt_max_samples`: maximum number of utterances to sample for an input prompt to combine, if needed to fill the above duration window.
* `prompt_continuous_utterance_p`: probability for a sample's input prompt to instead be the output prompt, and prepare the sample under "continuous" mode.
* `prompt_similar_p`: probability to use a sample's most similar utterance as the input prompt, rather than randomly picking another utterance of the same speaker.
* This requires adequate metadata to be available to store the top-K similar indices.
* `prompt_similar_top_k`: use the top-k candidates for the above sampling.
* `prompt_similar_top_k_offset`: the above, but an offset (as in it will not use the top-K-offset most similar utterances).
* `prompt_inject_noise`: inject some noise in a sample's input prompt. *Will* harm dataloader throughput, as it requires re-encoding the audio.
* `resps_max_samples`: maximum utterances to use for the sample's input text and output response audio.
* `resps_append_p`: probability to append additional utterances to the sample.
* `resps_pad_silence_p`: probability to pad the output response audio with silence. Does *not* require re-encoding, unless requested through `reencode_on_concat`.
* `tasks_list`: list of task names a sample can be.
* Currently supports: `tts`, `stt`, `tts-c`, `ns`, `sr`, `tse`, `nse`, `cse`
* `reencode_on_concat`: if enabled, audio will be decoded to a raw waveform, concatted, then reencoded, instead of naively concatting EnCodec codes.
* This isn't necessary, as naively concatting only introduces trivial inaccuracies.
* `reencode_device`: device to load EnCodec within the dataloader.
* *technically* only `cpu` should be supported, as loading models in dataloaders causes problems?
* `noise_scale`: multiplier to the noise when applying noise. Lower numbers keep it quieter.
* `retokenize_text`: if the text/phoneme transcription is available in the metadata, use that to re-tokenize instead of relying on the stored tokens itself.
* This is helpful if you modify the tokenizer dict in post, but do not want to re-process the dataset to modify the tokenized phonemes.
* `_frames_per_second`: overrides the internal tokens-per-second-of-audio ratio. Should never require modifying.
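A non-authoritative sketch tying several of these knobs together (the values are illustrative placeholders, not recommendations; exact fields and defaults live in `Dataset` in `./vall_e/config.py`):
```
dataset:
  training: [ "LibriVox/*" ]
  validation: [ "LibriVox/some_speaker" ]
  workers: 8
  duration_range: [3.0, 12.0]
  sample_type: path
  sample_order: duration
  sample_shuffle: True
  prompt_duration_range: [3.0, 6.0]
  prompt_similar_p: 0.75
  tasks_list: [ "tts", "stt" ]
```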

docs/emb.md

@@ -0,0 +1,65 @@
# `emb/*`
This folder contains scripts to handle the text and audio data that goes in and out of the model, as well as preparing data for the dataset.
The `emb` name is a relic of the original implementation used.
## `g2p.py`
This script handles taking text of a given language, and phonemizing into IPAs.
* This is mainly an abstraction to `phonemizer`.
For Japanese, text is coerced through `pykakasi` into kana, then phonemized, as `phonemizer` does not like kanji.
By default, `espeak` is used as the backend, but other *backends* can be passed through `encode`.
Punctuation, stress markers, and stripping are enabled by default, but *can* be disabled.
To avoid memory leaking through `phonemizer`, backends and instances are cached for further reuse.
## `qnt.py`
This script handles taking audio waveforms and encoding them as code tokens to run through the model, and taking code tokens outputted from the model and decoding them into raw waveforms.
* This is mainly an abstraction to the underlying quantized audio models.
Additionally, audio manipulation helper functions like `trim` and `concat` are available.
The audio backend is dependent on the model used; `encodec` with a sample rate of `24khz` is the default backend.
* if requested, `vocos` is used as the decoding model, but EnCodec is still used to encode audio.
Audio does *not* need to be resampled and downmixed beforehand, as this is handled when it is fed to the `encode` functions.
### Audio Backends
For audio backends:
* [`encodec`](https://github.com/facebookresearch/encodec): a tried-and-tested EnCodec to encode/decode audio.
* [`vocos`](https://huggingface.co/charactr/vocos-encodec-24khz): a higher quality EnCodec decoder.
- encoding audio will use the `encodec` backend automagically, as there's no EnCodec encoder under `vocos`
* [`descript-audio-codec`](https://github.com/descriptinc/descript-audio-codec): boasts better compression and quality, but has issues with model convergence.
- models at 24KHz + 8kbps will NOT converge in any manner.
- models at 44KHz + 8kbps seem to have a harder "language" to model, and the NAR side of the model suffers greatly.
## `transcribe.py`
This script handles taking raw input audio, and outputting adequate metadata containing transcriptions of said audio through `whisperX`.
The process keeps the slices `whisperX` thinks are best, per the segments it outputs.
Refer to the `__main__`'s arguments for usage details.
## `process.py`
This script handles taking raw input audio and its transcribed metadata, and outputs encoded audio (NumPy) files containing encoded audio and associated metadata.
This process can utilize sliced segments within the transcription metadata, or use the entire file's audio instead for a given utterance.
Refer to the `__main__`'s arguments for usage details.
## `similar.py`
This script handles taking either raw input audio, or processed encoded audio, and determines the top-K similar utterances for each sample for a given speaker (or dataset).
When processing a dataset, this requires already having accompanying metadata generated through `vall_e.data --action=metadata --yaml=./your/training/config.yaml`.
Refer to the `__main__`'s arguments for usage details.

docs/engines.md

@@ -0,0 +1,28 @@
# `engines/*`
This folder contains the necessary abstractions for handling training of models through either a local (`base`) backend, or additional wrappers (like DeepSpeed, and in the future Accelerate and Lightning).
This architecture is partially lifted from the original implementation, but expanded for both my needs and modularity for other backends.
An `Engine` is just a wrapper that contains training metadata for the loaded module.
An `Engines` is a dict of `Engine`s, with extra functions for iterating through its contents, allowing for simultaneous loading and training of multiple engines over a shared dataloader iteration.
## `__init__.py`
This script handles the bulk of loading a model and wrapping the model with the requested engine type.
The checkpoint or weight path is automatically deduced, as well as pre-processing the state dict (if requested) before loading it.
* resizing modules from the weights to the requested configuration in the YAML is done here.
* replacing modules with optimized versions or LoRAs is applied here.
* the requested optimizer, and the parameters to freeze, are applied to the model here.
## `base.py`
The internal (`local`) implementation of orchestrating training. The basics are handled here, from automatic-mixed-precision, gradient accumulation, loss scaling, etc.
Functions for other backends are also defined here, such as the training step function.
## `deepspeed.py`
A backend relying on `deepspeed` for its orchestration, which offers additional features that can be defined under `cfg.trainer.deepspeed`.

docs/export.md

@@ -0,0 +1,9 @@
# `export.py`
To export the models, run: `python -m vall_e.export --yaml=./training/config.yaml`.
This will export the latest checkpoints, for example, under `./training/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used, and training stats.
Despite being called `fp32.pth`, you can export it to a different precision type with `--dtype=float16|bfloat16|float32`.
You can also export to `safetensors` with `--format=sft`, and `fp32.sft` will be exported instead.

docs/ext.md

@@ -0,0 +1,7 @@
# `ext/*`
This folder handles external model implementations, where the code is not easily offered as a package.
Currently, this just includes code for a RetNet, offered as a TorchScale-compatible implementation, or a HuggingFace-compatible implementation.
Comments and attributions are under its `__init__.py`.

docs/inferenece.md

@@ -0,0 +1,52 @@
# `inference.py`
This script handles the higher-level functions for inferencing the model for various tasks for the end user.
## Synthesis
To synthesize speech: `python -m vall_e <text> <ref_path> <out_path> --yaml=<yaml_path>` (or `--model=<model_path>`)
Some additional flags you can pass are:
* `--language`: specifies the language for phonemizing the text, and helps guide inferencing when the model is trained against that language.
* `--task`: task to perform. Defaults to `tts`, but accepts `stt` for transcriptions.
* `--max-ar-steps`: maximum steps for inferencing through the AR model. Each second is 75 steps.
* `--device`: device to use (default: `cuda`, examples: `cuda:0`, `cuda:1`, `cpu`)
* `--ar-temp`: sampling temperature to use for the AR pass. During experimentation, `0.95` provides the most consistent output, but values close to it work fine.
* `--nar-temp`: sampling temperature to use for the NAR pass. During experimentation, the lower the value, the better. Set to `0` to enable greedy sampling.
* `--input-prompt-length`: the maximum duration the input prompt can be (~6 seconds is fine, longer durations lead to slower generations for "better" accuracy, as long as the model was trained against such input prompt durations)
And some experimental sampling flags you can use too (your mileage will ***definitely*** vary, but most of these are bandaids for a bad AR):
* `--input-prompt-prefix`: (AR only) treats the input prompt as the initial response prefix, but...
* the transcription of the prompt needs to be in the input text prompt.
* doesn't perform all that well (I believe the model needs to be trained a bit more on this, as with `tts-c`).
* `--min-ar-temp`: triggers the dynamic temperature pathway, adjusting the temperature based on the confidence of the best token. Acceptable values are between `[0.0, (n)ar-temp)`.
+ This simply uplifts the [original implementation](https://github.com/kalomaze/koboldcpp/blob/dynamic-temp/llama.cpp#L5132) to perform it.
+ **!**NOTE**!**: This does not seem to resolve any issues with setting too high/low of a temperature. The right values are yet to be found.
* `--top-p`: limits the sampling pool to the top tokens whose cumulative probability adds up to `P`.
* `--top-k`: limits the sampling pool to the top `K` values in the probability distribution.
* `--repetition-penalty`: modifies the probability of tokens if they have appeared before. In the context of audio generation, this is a very iffy parameter to use.
* `--repetition-penalty-decay`: decays the above penalty based on how far back in the past sequence the token appeared.
* `--length-penalty`: (AR only) modifies the probability of the stop token based on the current sequence length. This is ***very*** finicky due to the AR already being well correlated with the length.
* `--beam-width`: (AR only) specifies the number of branches to search through for beam sampling.
+ This is a very naive implementation that's effectively just greedy sampling across `B` spaces.
* `--mirostat-tau`: (AR only) the "surprise value" when performing mirostat sampling.
+ This simply uplifts the [original implementation](https://github.com/basusourya/mirostat/blob/master/mirostat.py) to perform it.
+ **!**NOTE**!**: This is incompatible with beam search sampling (for the meantime at least).
* `--mirostat-eta`: (AR only) the "learning rate" during mirostat sampling applied to the maximum surprise.
* `--dry-multiplier`: (AR only) performs DRY sampling, the scalar factor.
* `--dry-base`: (AR only) for DRY sampling, the base of the exponent factor.
* `--dry-allowed-length`: (AR only) for DRY sampling, the window to perform DRY sampling within.
* `--layer-skip`: enables early-exit layer skipping if the model is confident enough (for compatible models)
* `--layer-skip-exit-layer`: maximum layer to use
* `--layer-skip-entropy-threshold`: the maximum entropy (confidence) the logits may have for the early exit to trigger
* `--layer-skip-varentropy-threshold`: the maximum varentropy (confidence spread) the logits may have for the early exit to trigger
* `--refine-on-stop`: (AR only) uses the last step's logits for the entire final output sequence, rather than the step-by-step iterative sequence.
+ This needs experimenting with to see if there's any downside.
+ to-do: compare the probability scores with the original output sequence, and pick the best one.
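For reference, the repetition penalty (with decay) and top-p filters mentioned above boil down to something like the following. This is an illustrative sketch rather than the code in `./vall_e/samplers.py`, and the default values are arbitrary.

```
import torch

def filter_logits(logits, previous, rep_penalty=1.1, rep_decay=0.0, top_p=0.9):
    """Illustrative per-step filtering for a [vocab]-sized logits tensor."""
    logits = logits.clone()

    # repetition penalty: dampen tokens seen earlier, decayed by their distance in the past
    for distance, token in enumerate(reversed(previous)):
        factor = rep_penalty * (1.0 - rep_decay) ** distance
        if factor <= 1.0:
            continue
        logits[token] = logits[token] / factor if logits[token] > 0 else logits[token] * factor

    # top-p (nucleus): keep the smallest set of tokens whose cumulative probability reaches top_p
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    to_remove = (cumulative - sorted_probs) > top_p
    logits[sorted_idx[to_remove]] = float("-inf")

    return logits
```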
### Speech-to-Text
The `ar+nar-tts+stt-llama-8` model has received additional training for a speech-to-text task against EnCodec-encoded audio.
Currently, the model only transcribes back into the IPA phonemes it was trained against, as an additional model or external program is required to translate the IPA phonemes back into text.
* this does make a model that can both phonemize and unphonemize text more desirable in the future as an espeak replacement (handling this as an additional task would require additional embeddings and output heads, and could possibly harm the model, as raw text is not a modality the model is trained on).

177
docs/models.md Normal file
View File

@ -0,0 +1,177 @@
# Model Notes
To be filled.
## Emergent Behavior
The model can be prompted in creative ways to yield some interesting behaviors:
* prompting without an input audio prompt will have the model generate a random voice at the "cost" of some unintelligible utterance at the beginning of the output response (despite doing no promptless training).
* finetunes / LoRAs can benefit from this by allowing promptless synthesis, while still being able to provide an input audio prompt for guidance.
* prompting with an input text prompt being the transcription of the input audio prompt will have the response follow very closely to the input prompt (despite not doing input=output training).
* this should allow for easy transcription editing without much fuss.
# `models/*`
This folder contains scripts relating to models and code for VALL-E use, from the wrapping model to the underlying arch.
## `models/lora.py`
This script implements Low-Rank Adaptation (LoRA), to allow for cheaper and easier finetuning of existing modules.
At the moment, two approaches are offered: replacing `nn.Linear` outright, or parameterizing an `nn.Linear`. The latter is used by default(?). A rough sketch of the parameterized approach follows.
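Under some assumptions about the approach (standard LoRA with `B` zero-initialized, which may differ from what this project actually does), the parameterized variant looks roughly like this:

```
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class LoRAParametrization(nn.Module):
    """Adds a trainable low-rank delta (B @ A, scaled by alpha/rank) on top of a frozen weight."""
    def __init__(self, out_features, in_features, rank=128, alpha=128):
        super().__init__()
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero-init => no change at start
        self.scale = alpha / rank

    def forward(self, weight):
        return weight + (self.lora_B @ self.lora_A) * self.scale

# usage: freeze the base weight and register the low-rank delta onto it
linear = nn.Linear(1024, 1024)
linear.weight.requires_grad_(False)
parametrize.register_parametrization(
    linear, "weight", LoRAParametrization(linear.out_features, linear.in_features)
)
```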
## `models/base.py`
This script implements the core underlying model for VALL-E. This handles:
* storing its settings and features, and initializing the right modules
* processing inputs into a proper input string
* orchestrating running text and audio through the respective embeddings
* generating the right padding, masking, and position IDs to feed the underlying arch (if requested)
* removing padding from the logits
* performing loss calculation, both as a whole or in individual pieces, both autoregressively and non-autoregressively
* sampling from the logits through the samplers provided through `./vall_e/samplers.py`, both autoregressively and non-autoregressively.
This script aims to implement everything as required per VALL-E agnostically, to allow for different implementations to contain little extra code.
## `models/ar_nar.py`
This script implements VALL-E as a unified autoregressive and non-autoregressive model, where RVQ level 0 is inferenced autoregressively and the remaining levels are inferenced non-autoregressively.
This is the default model; its use is governed through `cfg.model.capabilities = ["ar", "nar"]`.
For training, this model handles preparing the batch provided through the dataloader according to a randomly sampled targeted RVQ level.
For inferencing, this will dynamically inference depending on the arguments provided.
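As a rough illustration of the idea (not the actual batch-preparation code; the names and shapes here are made up for the example):

```
import random

def prepare_targets(codes, n_resp_levels=8):
    """codes: [timesteps, n_resp_levels] audio tokens for one utterance."""
    # one RVQ level is sampled as the training target for this batch
    level = random.randint(0, n_resp_levels - 1)
    if level == 0:
        # AR: predict level 0 causally, shifted by one step
        inputs, targets = codes[:-1, 0], codes[1:, 0]
    else:
        # NAR: condition on the levels beneath, predict the whole target level at once
        inputs, targets = codes[:, :level], codes[:, level]
    return level, inputs, targets
```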
## `models/ar.py`
This script implements VALL-E as a pure autoregressive (AR) model.
If `cfg.model.experimental.interleave=True`, this makes use of interleaving its audio codes, instead of inferencing per-codebook level. If not, this simply attends to RVQ level 0.
This model serves as an experiment that failed, and might be revisited in the future.
Use of this is governed through `cfg.model.capabilities = ["ar"]`
## `models/nar.py`
This script implements VALL-E as a mostly-pure non-autoregressive model, where it infers the duration autoregressively (if `"len" in cfg.model.capabilities`). If not, this simply attends to RVQ levels 1+.
This makes use of training an additional `len` task that can infer the duration of a requested input, as well as (maybe) using special tokens as the initial input for RVQ-level 0 (the level the AR attends to).
This model serves as an experiment that failed, and might be revisited in the future.
Use of this is governed through `cfg.model.capabilities = ["nar"]`
## `models/experimental.py`
This script implements VALL-E as a mostly-HuggingFace compatible model, where it handles processing tokens as a uniform sequence of IDs.
This mostly serves as an experiment to see what is required to do so, for possible future implementations requiring just `llama.cpp` and `encodec.cpp`, and to provide a pure HF-compatible implementation.
Use of this is governed through `cfg.model.experimental.hf = True`
## `models/arch/*`
This folder contains scripts, either written by myself or properly attributed, that provide or modify existing modules of a given model.
As the core of VALL-E makes use of a language model, various LLM architectures can be supported and slotted in. Currently supported LLM architectures:
* `llama`: using HF transformer's LLaMa implementation for its attention-based transformer, boasting RoPE and other improvements.
+ I aim to utilize this for the foundational model, as I get to leverage a bunch of things tailored for LLaMA (and converting to them is rather easy).
* `mixtral`: using HF transformer's Mixtral implementation for its attention-based transformer, also utilizing its MoE implementation.
* `bitnet`: using [this](https://github.com/kyegomez/BitNet/) implementation of BitNet's transformer.
- Setting `cfg.optimizers.bitnet=True` will make use of BitNet's linear implementation.
* `transformer`: a basic attention-based transformer implementation, with attention heads + feed forwards.
* `retnet`: using [TorchScale's RetNet](https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/retnet.py) implementation, a retention-based approach can be used instead.
- Its implementation for MoE can also be utilized.
* `retnet-hf`: using [syncdoth/RetNet](https://github.com/syncdoth/RetNet) with a HuggingFace-compatible RetNet model
- has an inference penalty, and MoE is not implemented.
* `mamba`: using [state-spaces/mamba](https://github.com/state-spaces/mamba) (needs to mature)
- ***really hard*** to have a unified AR and NAR model
- inference penalty makes it a really hard sell, despite the loss already being a low 3 after a short number of samples processed
The wide support for various backends exists solely while I try to figure out which is the "best" for a core foundation model.
### `models/arch/bitnet.py`
This script modifies modules of BitNet to play nicely with my existing code.
### `models/arch/llama.py`
This script modifies modules of LLaMA provided through `transformers`.
A bulk of it pertains to modifying `LlamaAttention` and detecting available attention mechanisms, allowing different attention backends to be used (a sketch of selecting among the torch SDPA kernels follows the list below):
* `torch.nn.functional.scaled_dot_product_attention`-based attention:
* `math`: torch's SDPA's `math` kernel
* `mem_efficient`: torch's SDPA's memory efficient (`xformers` adjacent) kernel
* `cudnn`: torch's SDPA's `cudnn` kernel
* `flash`: torch's SDPA's flash attention kernel
* internal implementations of external attention backends:
* `xformers`: [facebookresearch/xformers](https://github.com/facebookresearch/xformers/)'s memory efficient attention
* `flash_attn`: uses the available `flash_attn` package (including `flash_attn==1.0.9` through a funny wrapper)
* `flash_attn_v100`: uses [ZRayZzz/flash-attention-v100](https://github.com/ZRayZzz/flash-attention-v100/)'s Flash Attention for Volta (but doesn't work currently)
* `fused_attn`: uses an implementation using `triton` (tested on my 7900XTX and V100s), but seems to introduce errors when used to train after a while
* `default`: uses the naive path of the internal implementation (used for attention-debugging purposes)
* `transformers` Llama\*Attention implementations:
* `eager`: default `LlamaAttention`
* `sdpa`: integrated `LlamaSdpaAttention` attention model
* `flash_attention_2`: integrated `LlamaFlashAttention2` attention model
* `auto`: determine the best fit from the above
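For the torch SDPA kernels specifically, restricting attention to one of the kernels listed above amounts to something like the following sketch (not the actual code in `llama.py`; `SDPBackend.CUDNN_ATTENTION` also requires a fairly recent PyTorch):

```
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

BACKENDS = {
    "math": SDPBackend.MATH,
    "mem_efficient": SDPBackend.EFFICIENT_ATTENTION,
    "cudnn": SDPBackend.CUDNN_ATTENTION,
    "flash": SDPBackend.FLASH_ATTENTION,
}

def attention(q, k, v, mode="flash", is_causal=True):
    # restrict SDPA to the requested kernel; torch raises an error rather than
    # silently falling back if the kernel can't handle the inputs (dtype, head_dim, masks, ...)
    with sdpa_kernel(BACKENDS[mode]):
        return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```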
Modifications to `LlamaModel` are also provided to implement LayerSkip-aware training and a very naive self-speculative decoding.
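The entropy/varentropy gate that the `--layer-skip-*` flags refer to amounts to roughly the following (a sketch under the assumption that lower entropy means higher confidence, not the actual LayerSkip code):

```
import torch

def should_exit_early(logits, entropy_threshold=0.1, varentropy_threshold=0.1):
    """logits: [vocab] at the candidate exit layer for the current step."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)
    # varentropy: variance of the per-token surprisal around the entropy
    varentropy = (probs * (-log_probs - entropy) ** 2).sum(dim=-1)
    # exit only when the model is confident (low entropy) and consistently so (low varentropy)
    return entropy.item() < entropy_threshold and varentropy.item() < varentropy_threshold
```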
#### ROCm Flash Attention
[ROCm/flash-attention](https://github.com/ROCm/flash-attention) currently does not support Navi3 cards (gfx11xx), so first-class support for Flash Attention is a bit of a mess on Navi3. Using the `howiejay/navi_support` branch can get inference support (but not training support, due to an error being thrown during the backwards pass) by:
* edit `/opt/rocm/include/hip/amd_detail/amd_hip_bf16.h`:
```
#if defined(__HIPCC_RTC__)
#define __HOST_DEVICE__ __device__ static
#else
#include <climits>
#define __HOST_DEVICE__ __host__ __device__ static inline
#endif
```
* install with `pip install -U git+https://github.com/ROCm/flash-attention@howiejay/navi_support --no-build-isolation`
### `models/arch/mamba.py`
This script modifies modules of Mamba, to allow it to play nicely with my existing code.
If I remember right, it simply provides gradient checkpointing.
### `models/arch/mixtral.py`
Like `llama.py`, this provides modifications to Mixtral through `transformers`.
Primarily, this is to address a bug with batch sizes > 1, and to use a different attention mechanism.
* to-do: this is out of date from `llama.py`'s modified attention class.
### `models/arch/retnet.py`
This provides modification to RetNet, mostly to allow for gradient checkpointing.
### `models/arch/transformer.py`
This provides the transformer implementation from the original implementation.
### `models/arch/attention/*`
This folder contains specific attention mechanisms.
Currently, only `fused.py` is provided, which implements fused attention through Triton.
Attributions are noted at the top of the respective file(s).
### `models/arch/mamba_vasqu`
This folder contains an implementation of Mamba2 as a HuggingFace-compatible model, and not requiring Triton.
Attributions are noted at the top of the respective file(s).
### `models/arch/retnet_syncdoth`
This folder contains scripts to modify modules within a RetNet model.
Attributions are noted at the top of the respective file(s).

5
docs/plot.md Normal file
View File

@ -0,0 +1,5 @@
# `plot.py`
Included is a helper script to parse the training metrics. Simply invoke it with, for example: `python3 -m vall_e.plot --yaml="./training/config.yaml"`
You can specify which X and Y values you want to plot against by passing, for example, `--xs tokens_processed --ys loss.nll stats.acc`.

65
docs/train.md Normal file
View File

@ -0,0 +1,65 @@
# Training Notes
Training is very dependent on:
* the quality of your dataset.
* clean utterances and accurate transcriptions go a long way.
* a diverse dataset in prosody and speakers helps a ton.
* how much data you have.
* training from scratch requires upwards of 15K hours.
* training new languages from the base model simply requires maybe ~2K hours each.
* the bandwidth you quantized your audio to, as this affects how many tokens are processed per step.
* the underlying model architecture used.
* some models behave better than others for a unified approach, others do not.
For single GPUs, simply run `python3 -m vall_e.train --yaml="./training/config.yaml"`.
For multiple GPUs, or exotic distributed training:
* with `deepspeed` backends, simply running `deepspeed --module vall_e.train --yaml="./training/config.yaml"` should handle the gory details.
* with `local` backends, simply run `torchrun --nnodes=1 --nproc-per-node={NUMOFGPUS} -m vall_e.train --yaml="./training/config.yaml"`
You can enter `save` to save the state at any time, or `quit` to save and quit training.
The `lr` command will also let you adjust the learning rate on the fly. For example: `lr 1.0e-3` will set the learning rate to `0.001`.
Some additional flags can be passed as well:
* `--eval`: only run the evaluation / validation pass, then exit afterwards.
* `--eval-random-text-prompts`: use random text prompts for the evaluation pass, rather than the provided text prompts in the dataset.
## Try Me
To quickly test if a configuration works, you can run `python -m vall_e.models.ar_nar --yaml="./data/config.yaml"`; a small trainer will overfit a provided utterance.
## Finetuning
Finetuning can be done by training the full model, or using a LoRA.
Finetuning the full model is done the same way as training a model, but be sure to have the weights in the correct spot, as if you're loading them for inferencing.
For training a LoRA, add the following block to your `config.yaml`:
```
loras:
- name : "arbitrary name" # whatever you want
rank: 128 # dimensionality of the LoRA
alpha: 128 # scaling factor of the LoRA
training: True
```
And that's it. Training of the LoRA is done with the same command. Depending on the rank and alpha specified, the loss may be higher than it should be, as the LoRA weights are initialized to appropriately random values. I found a `rank` and `alpha` of 128 works fine.
To export your LoRA weights, run `python3 -m vall_e.export --lora --yaml="./training/config.yaml"`. You *should* be able to have the LoRA weights loaded from a training checkpoint automagically for inferencing, but export them just to be safe.
## Training Under Windows
As training under `deepspeed` and Windows is not (easily) supported, under your `config.yaml`, simply change `trainer.backend` to `local` to use the local training backend.
Creature comforts like `float16`, `amp`, and multi-GPU training *should* work under the `local` backend, but extensive testing still needs to be done to ensure it all functions.
# `train.py`
This script handles the VALL-E specific training code.
For the most part, this handles:
* feeding the model a batch from the dataloader
* performing evaluation / validation when requested
* unloading the `emb.qnt` model when it's not needed anymore

60
docs/utils.md Normal file
View File

@ -0,0 +1,60 @@
# `utils/*`
This folder contains helper utilities for either training or general functions of the program.
These scripts are to remain agnostic to any model, to allow for reuse for other applications.
## `utils/distributed.py`
This script contains the necessary code needed to utilize distributed training.
Attributions are noted at the top.
## `utils/io.py`
This script contains the necessary code for loading and storing state dicts, through pickles (`.pt`) or SafeTensors (`.sft`), and offers parity for each storage type.
Additionally, some JSON helper functions are provided here.
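The parity between the two storage types boils down to routing on the file extension, roughly like the sketch below (not the actual helpers; note that SafeTensors can only store tensors, so non-tensor metadata has to be handled separately):

```
import torch
from safetensors.torch import save_file, load_file

def save_state_dict(state_dict, path):
    if path.endswith(".sft"):
        # SafeTensors only accepts tensors; non-tensor entries would need separate handling
        save_file({ k: v for k, v in state_dict.items() if torch.is_tensor(v) }, path)
    else:
        torch.save(state_dict, path)

def load_state_dict(path, device="cpu"):
    if path.endswith(".sft"):
        return load_file(path, device=device)
    return torch.load(path, map_location=device)
```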
## `utils/pattern.py`
This script contains (unused) code related to formatting sequences of audio codes into different pattern types.
Attributions are noted at the top.
## `utils/sampler.py`
This script contains code to handle sampling from a list of indices.
* `PoolSampler` has a master list of indices "in the marble bag" that are sampled without replacement.
* `OrderedSampler` will output indices from 0 to `length`, in order.
* `BatchedOrderedSampler` does the above, but will output lists of indices instead.
* `RandomSampler` will output indices from 0 to `length`, randomly.
Each sampler can load and store a state dict.
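As a rough sketch of the first two (not the actual implementations), including the state handling:

```
import random

class PoolSampler:
    """Samples indices without replacement, refilling the "marble bag" once it empties."""
    def __init__(self, length):
        self.length = length
        self.pool = []

    def __next__(self):
        if not self.pool:
            self.pool = list(range(self.length))
            random.shuffle(self.pool)
        return self.pool.pop()

    def get_state(self):
        return { "pool": list(self.pool) }

    def set_state(self, state):
        self.pool = list(state["pool"])

class OrderedSampler:
    """Yields indices 0..length-1 in order, wrapping around."""
    def __init__(self, length):
        self.length = length
        self.position = 0

    def __next__(self):
        index = self.position % self.length
        self.position += 1
        return index

    def get_state(self):
        return { "position": self.position }

    def set_state(self, state):
        self.position = state["position"]
```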
## `utils/unsloth.py`
This script contains Unsloth, a VRAM-saving optimization that offloads the input tensors to the CPU for the backwards pass.
This is mostly unnecessary, as the inputs are rather small themselves, but it is offered nonetheless if needed through `cfg.optimizations.unsloth = True`.
Attributions are noted at the top.
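The core trick is roughly the following (a sketch of the general technique, not the vendored code): a checkpointed forward keeps only a CPU copy of its input, then moves it back and recomputes during the backwards pass.

```
import torch

class OffloadedCheckpoint(torch.autograd.Function):
    """Gradient checkpointing that parks the saved input on the CPU until backward."""
    @staticmethod
    def forward(ctx, module, hidden_states):
        ctx.module = module
        ctx.device = hidden_states.device
        # keep only a CPU copy of the input between the forward and backward passes
        ctx.save_for_backward(hidden_states.to("cpu", non_blocking=True))
        with torch.no_grad():
            return module(hidden_states)

    @staticmethod
    def backward(ctx, grad_output):
        (saved,) = ctx.saved_tensors
        hidden_states = saved.to(ctx.device).detach().requires_grad_(True)
        # recompute the forward pass with gradients enabled, then backprop through it
        with torch.enable_grad():
            output = ctx.module(hidden_states)
        torch.autograd.backward(output, grad_output)
        return None, hidden_states.grad
```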
## `utils/utils.py`
This script contains additional helper functions that do not require a dedicated file.
## `utils/train.py`
This script handles the necessary code for training, such as:
* iterating through a dataloader
* iterating through an `Engines` to train each underlying `Engine`
* printing training metrics
* invoking `save`, `eval`, `export` every X iterations
* handling stdin commands, such as `save`, `export`, `eval`, and `quit`
## `utils/wrapper.py`
This script contains optimizations and additional code that require injecting or replacing modules.
Most configurations are offered through `cfg.optimization`.

33
docs/webui.md Normal file
View File

@ -0,0 +1,33 @@
# `webui.py`
A Gradio-based web UI is accessible by running `python3 -m vall_e.webui`. You can, optionally, pass:
* `--yaml=./path/to/your/config.yaml`: will load the targeted YAML
* `--model=./path/to/your/model.sft`: will load the targeted model weights
* `--listen 0.0.0.0:7860`: will set the web UI to listen to all IPs at port 7860. Replace the IP and Port to your preference.
## Inference
Synthesizing speech is simple:
* `Input Prompt`: The guiding text prompt. Each new line will be its own generated audio to be stitched together at the end.
* `Audio Input`: The reference audio for the synthesis. Under Gradio, you can trim your clip accordingly, but leaving it as-is works fine.
- A properly trained model can inference without a prompt to generate a random voice (without even needing to generate a random prompt itself).
* `Output`: The resultant audio.
* `Inference`: Button to start generating the audio.
* `Basic Settings`: Basic sampler settings for most uses.
* `Sampler Settings`: Advanced sampler settings that are common for most text LLMs, but need experimentation.
All the additional knobs have a description that can be correlated to the inferencing CLI flags.
Speech-To-Text phoneme transcriptions for models that support it can be done using the `Speech-to-Text` tab.
## Dataset
This tab currently only features exploring a dataset already prepared and referenced in your `config.yaml`. You can select a registered voice, and have it randomly sample an utterance.
In the future, this *should* contain the necessary niceties to process raw audio into a dataset to train/finetune through, without needing to invoke the above commands to prepare the dataset.
## Settings
So far, this only allows you to load a different model without needing to restart. The previous model should seamlessly unload, and the new one will load in place.