documentation update

mrq 2024-08-04 00:14:49 -05:00
parent 2cb465018b
commit d19f93a2c0
4 changed files with 25 additions and 9 deletions

View File

@@ -27,7 +27,7 @@ I've tested this repo under Python versions `3.10.9`, `3.11.3`, and `3.12.3`.
My pre-trained weights can be acquired from [here](https://huggingface.co/ecker/vall-e).
A script to set up a proper environment and download the weights can be invoked with `./scripts/setup.sh`. This will automatically create a `venv`, and download the `ar+nar-llama-8` weights and config file to the right place.
## Train
@@ -155,10 +155,10 @@ For audio backends:
* `math`: torch's SDPA's `math` implementation
* `mem_efficient`: torch's SDPA's memory efficient (`xformers` adjacent) implementation
* `flash`: torch's SDPA's flash attention implementation (see the sketch after this list for how these three map onto torch's SDPA API)
* `xformers`: ~~[facebookresearch/xformers](https://github.com/facebookresearch/xformers/)'s memory efficient attention~~ Aliased to `mem_efficient`
* `sdpa`: integrated `LlamaSdpaAttention` attention model
* `flash_attention_2`: integrated `LlamaFlashAttention2` attention model
* `auto`: determine the best fit from the above
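For reference, the `math`, `mem_efficient`, and `flash` options correspond to PyTorch's own SDPA backends. Below is a minimal sketch of that underlying API, purely illustrative (plain PyTorch, not this repo's actual config plumbing; it assumes torch >= 2.3 and a CUDA device):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend  # torch >= 2.3

# dummy (batch, heads, seq_len, head_dim) tensors; fp16 on GPU so all three backends are eligible
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

with sdpa_kernel(SDPBackend.MATH):                 # `math`
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):  # `mem_efficient` (and the `xformers` alias)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):      # `flash`
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```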
The wide support for various backends is solely while I try and figure out which is the "best" for a core foundation model.
@@ -168,6 +168,10 @@ To export the models, run: `python -m vall_e.export --yaml=./training/config.yam
This will export the latest checkpoints, for example, under `./training/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used, and training stats.
Despite being called `fp32.pth`, you can export it to a different precision type with `--dtype=float16|bfloat16|float32`.
You can also export to `safetensors` with `--format=sft`, and `fp32.sft` will be exported instead.
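As a quick sanity check that an export loads on another system, a minimal sketch (paths are examples; point them at wherever your checkpoint was exported):

```python
import torch
from safetensors.torch import load_file

# default `.pth` export: a pickled dict that also carries metadata (symmap, training stats, ...),
# so newer torch versions need weights_only=False to unpickle it
ckpt = torch.load("./training/ckpt/ar+nar-retnet-8/fp32.pth", map_location="cpu", weights_only=False)

# `--format=sft` export: a plain tensor-only state dict
state_dict = load_file("./training/ckpt/ar+nar-retnet-8/fp32.sft")
print(f"loaded {len(state_dict)} tensors")
```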
## Synthesis
To synthesize speech: `python -m vall_e <text> <ref_path> <out_path> --yaml=<yaml_path>`
@@ -194,6 +198,9 @@ And some experimental sampling flags you can use too (your mileage will ***defin
+ This simply uplifts the [original implementation](https://github.com/basusourya/mirostat/blob/master/mirostat.py) to perform it.
+ **!**NOTE**!**: This is incompatible with beam search sampling (for the meantime at least).
* `--mirostat-eta`: (AR only) the "learning rate" during mirostat sampling applied to the maximum surprise.
* `--dry-multiplier`: (AR only) performs DRY sampling; this sets the scalar multiplier applied to the repetition penalty.
* `--dry-base`: (AR only) for DRY sampling, the base of the exponential penalty factor.
* `--dry-allowed-length`: (AR only) for DRY sampling, the window to perform DRY sampling within (a simplified sketch of the penalty follows this list).
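A simplified, illustrative sketch of the DRY repetition penalty these flags control (this shows the general idea behind DRY sampling, not necessarily this repo's exact implementation):

```python
import torch

def apply_dry(logits: torch.Tensor, seq: list[int],
              multiplier: float, base: float, allowed_length: int) -> torch.Tensor:
    """Penalize candidate tokens that would extend a sequence already seen earlier in `seq`."""
    n = len(seq)
    longest = {}  # candidate token id -> longest repeated context it would extend
    for i in range(1, n):
        # how far the context ending at i-1 matches the current tail (quadratic scan; real samplers are smarter)
        length = 0
        while length < i and seq[i - 1 - length] == seq[n - 1 - length]:
            length += 1
        if length >= allowed_length:
            tok = seq[i]  # the token that previously followed this repeated context
            longest[tok] = max(longest.get(tok, 0), length)
    for tok, length in longest.items():
        logits[tok] -= multiplier * base ** (length - allowed_length)  # penalty grows exponentially with overlap
    return logits
```

e.g. something like `logits = apply_dry(logits, generated_ids, dry_multiplier, dry_base, dry_allowed_length)` right before sampling the next AR token.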
### Web UI
@@ -208,6 +215,7 @@ Synthesizing speech is simple:
* `Input Prompt`: The guiding text prompt. Each new line will be its own generated audio to be stitched together at the end.
* `Audio Input`: The reference audio for the synthesis. Under Gradio, you can trim your clip accordingly, but leaving it as-is works fine.
- A properly trained model can inference without a prompt to generate a random voice (without even needing to generate a random prompt itself).
* `Output`: The resultant audio.
* `Inference`: Button to start generating the audio.
@@ -231,12 +239,15 @@ So far, this only allows you to load a different model without needing to restar
- the NAR benefits from greedy sampling, and anything else just harms output quality.
* [ ] clean up the README, and document, document, document onto the wiki.
* [ ] extend to ~~multiple languages ([VALL-E X](https://arxiv.org/abs/2303.03926)) and~~ additional tasks ([SpeechX](https://arxiv.org/abs/2308.06873)).
- this requires a good foundational model before extending it with transfer tasks, and a large corpus of the other language (I imagine it gets easier the more languages it's trained against).
* [ ] extend using [VALL-E 2](https://arxiv.org/pdf/2406.05370)'s features (grouped code modeling + repetition aware sampling)
- desu these don't seem to be worthwhile improvements, as inferencing is already rather fast, and RAS is just a fancy sampler.
* [ ] audio streaming
- this *technically* can work without any additional architecture changes, just clever tricks with sampling-then-decoding-to-audio.
- something similar to HiFiGAN (or the one for TorToiSe) trained on the last hidden states of the AR *might* also enable an alternate way for streaming.
* [ ] replace the phonemizer with something that doesn't depend on espeak
- espeak is nice, but I can only really trust it to phonemize English.
- a small model trained to convert text to phonemes might work, but has its own problems (another model to carry around, only as accurate as the dataset it was trained against, requires training for each language, etc.).
## Notices and Citations

View File

@@ -2,9 +2,9 @@
python3 -m venv venv
source ./venv/bin/activate
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 # or cu118 / cu124
pip3 install -e .
mkdir -p ./training/valle/ckpt/ar+nar-llama-8/
wget -P ./training/valle/ckpt/ar+nar-llama-8/ "https://huggingface.co/ecker/vall-e/resolve/main/models/ckpt/ar%2Bnar-llama-8/fp32.sft"
wget -P ./training/valle/ "https://huggingface.co/ecker/vall-e/resolve/main/models/config.llama.yaml"

View File

@@ -191,6 +191,8 @@ def load_engines(training=True):
		("classifiers.proj.0.bias" if model.config.experimental.split_classifiers else 'classifier.bias', model.config.audio_tokens + uses_stop_token ),
	]
	for k, tokens in keys:
		if k not in state:
			continue
		state[k] = ml.resize_weight( state[k], tokens )

	model.load_state_dict(state, strict=cfg.trainer.strict_loading)
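(`ml.resize_weight` itself isn't part of this diff. As a rough mental model, and purely an assumption about its behavior, resizing a classifier weight or bias to a new token count amounts to truncating or padding along the vocabulary dimension; the real helper may initialize new rows differently.)

```python
import torch

def resize_weight_sketch(weight: torch.Tensor, tokens: int) -> torch.Tensor:
    """Hypothetical stand-in for ml.resize_weight: grow or shrink dim 0 to `tokens` rows.

    Works for 2-D classifier weights and 1-D biases alike.
    """
    current = weight.shape[0]
    if tokens <= current:
        return weight[:tokens].clone()  # drop surplus rows
    pad = torch.zeros((tokens - current, *weight.shape[1:]), dtype=weight.dtype, device=weight.device)
    return torch.cat([weight, pad], dim=0)  # new rows (e.g. an added stop token) start zeroed here
```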

View File

@@ -70,10 +70,13 @@ def get_model_paths( paths=[Path("./training/"), Path("./models/")] ):
	return yamls

#@gradio_wrapper(inputs=layout["settings"]["inputs"].keys())
def load_model( yaml ):
	gr.Info(f"Loading: {yaml}")
	try:
		init_tts( yaml=Path(yaml), restart=True )
	except Exception as e:
		raise gr.Error(e)
	gr.Info(f"Loaded model")

def init_tts(yaml=None, restart=False):
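(For context, a hypothetical sketch of how a callback like `load_model` could be wired into a Gradio settings tab; the actual layout code isn't part of this diff and the component names below are made up:)

```python
import gradio as gr

from vall_e.webui import get_model_paths, load_model  # assumed import path

with gr.Blocks() as ui:
    yaml_choice = gr.Dropdown(label="Model YAML", choices=[str(p) for p in get_model_paths()])
    load_button = gr.Button("Load Model")
    # gr.Info() calls and the raised gr.Error() inside load_model surface as toasts in the browser
    load_button.click(load_model, inputs=[yaml_choice], outputs=[])

ui.launch()
```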