documentation update
commit d19f93a2c0 (parent 2cb465018b)

README.md | 19
@@ -27,7 +27,7 @@ I've tested this repo under Python versions `3.10.9`, `3.11.3`, and `3.12.3`.
 
 My pre-trained weights can be acquired from [here](https://huggingface.co/ecker/vall-e).
 
-A script to set up a proper environment and download the weights can be invoked with `./scripts/setup.sh`. This will automatically create a `venv`, and download the weights and config file to the right place.
+A script to set up a proper environment and download the weights can be invoked with `./scripts/setup.sh`. This will automatically create a `venv`, and download the `ar+nar-llama-8` weights and config file to the right place.
 
 ## Train

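For anyone following along with the updated instructions above, a minimal first-time setup could look like the following; the clone URL is an assumption, so substitute wherever you actually cloned the repo from:

```bash
# clone URL is assumed -- point this at your actual remote
git clone https://git.ecker.tech/mrq/vall-e
cd vall-e
./scripts/setup.sh          # creates ./venv and fetches the ar+nar-llama-8 weights + config
source ./venv/bin/activate  # work inside the venv the script created
```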
@@ -155,10 +155,10 @@ For audio backends:
 * `math`: torch's SDPA's `math` implementation
 * `mem_efficient`: torch's SDPA's memory efficient (`xformers` adjacent) implementation
 * `flash`: torch's SDPA's flash attention implementation
-* `xformers`: [facebookresearch/xformers](https://github.com/facebookresearch/xformers/)'s memory efficient attention
-* `auto`: determine the best fit from the above
+* `xformers`: ~~[facebookresearch/xformers](https://github.com/facebookresearch/xformers/)'s memory efficient attention~~ Aliased to `mem_efficient`
 * `sdpa`: integrated `LlamaSdpaAttention` attention model
 * `flash_attention_2`: integrated `LlamaFlashAttention2` attention model
+* `auto`: determine the best fit from the above
 
 The wide support for various backends is solely while I try and figure out which is the "best" for a core foundation model.

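As a rough illustration of where these backend names end up, a model config could pin one explicitly. The key layout below is hypothetical and only meant to show the idea; check the shipped `config.llama.yaml` for the real schema:

```yaml
# hypothetical excerpt -- key names are illustrative, not the actual schema
models:
  - name: "ar+nar-llama-8"
    attention: auto  # or: math, mem_efficient, flash, sdpa, flash_attention_2, xformers
```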
@@ -168,6 +168,10 @@ To export the models, run: `python -m vall_e.export --yaml=./training/config.yam
 
 This will export the latest checkpoints, for example, under `./training/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used, and training stats.
 
+Despite being called `fp32.pth`, you can export it to a different precision type with `--dtype=float16|bfloat16|float32`.
+
+You can also export to `safetensors` with `--format=sft`, and `fp32.sft` will be exported instead.
+
 ## Synthesis
 
 To synthesize speech: `python -m vall_e <text> <ref_path> <out_path> --yaml=<yaml_path>`

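Putting the newly documented export flags together, a couple of invocation sketches; the YAML path follows the setup script later in this commit, and the flag values are just illustrations:

```bash
# default export, producing fp32.pth alongside the checkpoint
python -m vall_e.export --yaml=./training/valle/config.llama.yaml

# export at half precision, as safetensors instead of a .pth
python -m vall_e.export --yaml=./training/valle/config.llama.yaml --dtype=float16 --format=sft
```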
@@ -194,6 +198,9 @@ And some experimental sampling flags you can use too (your mileage will ***defin
   + This simply uplifts the [original implementation](https://github.com/basusourya/mirostat/blob/master/mirostat.py) to perform it.
   + **!**NOTE**!**: This is incompatible with beam search sampling (for the meantime at least).
 * `--mirostat-eta`: (AR only) the "learning rate" during mirostat sampling applied to the maximum surprise.
+* `--dry-multiplier`: (AR only) performs DRY sampling, the scalar factor.
+* `--dry-base`: (AR only) for DRY sampling, the base of the exponent factor.
+* `--dry-allowed-length`: (AR only) for DRY sampling, the window to perform DRY sampling within.
 
 ### Web UI

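Folding the new DRY flags into the synthesis command shown above, an example invocation; the flag values here are arbitrary placeholders rather than recommendations:

```bash
python -m vall_e "Hello there." ./reference.wav ./output.wav \
  --yaml=./training/valle/config.llama.yaml \
  --dry-multiplier=0.8 --dry-base=1.75 --dry-allowed-length=2
```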
@@ -208,6 +215,7 @@ Synthesizing speech is simple:
 
 * `Input Prompt`: The guiding text prompt. Each new line will be its own generated audio to be stitched together at the end.
 * `Audio Input`: The reference audio for the synthesis. Under Gradio, you can trim your clip accordingly, but leaving it as-is works fine.
+  - A properly trained model can inference without a prompt to generate a random voice (without even needing to generate a random prompt itself).
 * `Output`: The resultant audio.
 * `Inference`: Button to start generating the audio.

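To reach this interface in the first place, the web UI appears to live in a Gradio module; the `vall_e.webui` module name is inferred from the Gradio code touched later in this commit, so treat it as an assumption:

```bash
python -m vall_e.webui
```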
@@ -231,12 +239,15 @@ So far, this only allows you to load a different model without needing to restar
   - the NAR benefits from greedy sampling, and anything else just harms output quality.
 * [ ] clean up the README, and document, document, document onto the wiki.
 * [ ] extend to ~~multiple languages ([VALL-E X](https://arxiv.org/abs/2303.03926)) and~~ additional tasks ([SpeechX](https://arxiv.org/abs/2308.06873)).
-  - this requires a good foundational model before extending it to transfer tasks onto.
+  - this requires a good foundational model before extending it to transfer tasks onto, and a large corpus of the other language (I imagine it gets easier the more languages it's trained against).
 * [ ] extend using [VALL-E 2](https://arxiv.org/pdf/2406.05370)'s features (grouped code modeling + repetition aware sampling)
   - desu these don't seem to be worthwhile improvements, as inferencing is already rather fast, and RAS is just a fancy sampler.
 * [ ] audio streaming
   - this *technically* can work without any additional architecture changes, just clever tricks with sampling-then-decoding-to-audio.
   - something similar to HiFiGAN (or the one for TorToiSe) trained on the last hidden states of the AR *might* also enable an alternate way for streaming.
+* [ ] replace the phonemizer with something that doesn't depend on espeak
+  - espeak is nice, but I can only really put my whole trust with phonemizing English.
+  - a small model trained to handle converting text to phonemes might work, but has its own problems (another model to carry around, as accurate as the dataset it was trained against, requires training for each language... etc).
 
 ## Notices and Citations

@@ -2,9 +2,9 @@
 
 python3 -m venv venv
 source ./venv/bin/activate
-#pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 # or cu118 / cu124
+pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 # or cu118 / cu124
-#pip3 install -e .
+pip3 install -e .
 
 mkdir -p ./training/valle/ckpt/ar+nar-llama-8/
-wget -P ./training/valle/ckpt/ar+nar-llama-8/ "https://huggingface.co/ecker/vall-e/resolve/main/models/ckpt/ar%2Bnar-llama-8/fp32.pth"
+wget -P ./training/valle/ckpt/ar+nar-llama-8/ "https://huggingface.co/ecker/vall-e/resolve/main/models/ckpt/ar%2Bnar-llama-8/fp32.sft"
 wget -P ./training/valle/ "https://huggingface.co/ecker/vall-e/resolve/main/models/config.llama.yaml"

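After the updated script runs, a quick sanity check that the renamed `fp32.sft` weights and the config landed where expected (paths taken directly from the script above):

```bash
ls -lh ./training/valle/ckpt/ar+nar-llama-8/fp32.sft \
       ./training/valle/config.llama.yaml
```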
@@ -191,6 +191,8 @@ def load_engines(training=True):
     ("classifiers.proj.0.bias" if model.config.experimental.split_classifiers else 'classifier.bias', model.config.audio_tokens + uses_stop_token ),
 ]
 for k, tokens in keys:
+    if k not in state:
+        continue
     state[k] = ml.resize_weight( state[k], tokens )
 
 model.load_state_dict(state, strict=cfg.trainer.strict_loading)

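The added guard simply skips any classifier/embedding key that isn't present in the checkpoint before resizing it. For intuition only, here is a minimal sketch of what a resize helper along the lines of `ml.resize_weight` could do; the real implementation and its init scheme may differ, so every detail below is an assumption:

```python
import torch

def resize_weight(weight: torch.Tensor, tokens: int) -> torch.Tensor:
    """Hypothetical stand-in for ml.resize_weight: grow/shrink dim 0 to `tokens` rows."""
    current = weight.shape[0]
    if tokens == current:
        return weight
    if tokens < current:
        # shrink: drop the trailing token rows
        return weight[:tokens].clone()
    # grow: append freshly initialized rows (init scheme here is a guess)
    pad_shape = (tokens - current, *weight.shape[1:])
    pad = torch.randn(pad_shape, dtype=weight.dtype, device=weight.device) * 0.02
    return torch.cat([weight, pad], dim=0)

# e.g. widening a classifier bias from 1024 to 1025 entries when a stop token is added
bias = torch.zeros(1024)
assert resize_weight(bias, 1025).shape[0] == 1025
```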
@@ -70,10 +70,13 @@ def get_model_paths( paths=[Path("./training/"), Path("./models/")] ):
 
     return yamls
 
-#
+#@gradio_wrapper(inputs=layout["settings"]["inputs"].keys())
 def load_model( yaml ):
     gr.Info(f"Loading: {yaml}")
-    init_tts( yaml=Path(yaml), restart=True )
+    try:
+        init_tts( yaml=Path(yaml), restart=True )
+    except Exception as e:
+        raise gr.Error(e)
     gr.Info(f"Loaded model")
 
 def init_tts(yaml=None, restart=False):