# VALL'E
An unofficial PyTorch implementation of VALL-E, based on the EnCodec tokenizer.
## Requirements

If your config YAML has the training backend set to `deepspeed`, you will need a GPU that DeepSpeed has developed and tested against, as well as a CUDA or ROCm compiler pre-installed, to install this package.
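As a quick way to verify those prerequisites, the hedged sketch below (not part of this repo) checks that PyTorch can see a GPU and that DeepSpeed is importable:

```python
# Optional sanity check, not part of this repo: confirm a GPU is visible to PyTorch
# and that DeepSpeed is importable before switching the training backend to `deepspeed`.
import torch

print("GPU visible to PyTorch:", torch.cuda.is_available())

try:
    import deepspeed
    print("DeepSpeed version:", deepspeed.__version__)
except ImportError:
    print("DeepSpeed not installed; it needs a CUDA or ROCm compiler present to build.")
```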
## Install

Simply run `pip install git+https://git.ecker.tech/mrq/vall-e`, or clone with `git clone --recurse-submodules https://git.ecker.tech/mrq/vall-e.git`.

I've tested this repo under Python versions `3.10.9` and `3.11.3`.
## Try Me

To quickly try it out, you can choose between the following modes:

- AR only: `python -m vall_e.models.ar yaml="./data/config.yaml"`
- NAR only: `python -m vall_e.models.nar yaml="./data/config.yaml"`
- AR+NAR: `python -m vall_e.models.base yaml="./data/config.yaml"`

Each model file has a barebones trainer and inference routine.
## Pre-Trained Model

My pre-trained weights can be acquired from [here](https://huggingface.co/ecker/vall-e).

For example:

```
git lfs clone --exclude "*.h5" https://huggingface.co/ecker/vall-e ./data/
python -m vall_e "The birch canoe slid on the smooth planks." "./path/to/an/utterance.wav" --out-path="./output.wav" yaml="./data/config.yaml"
```
## Train

Training is very dependent on:
- the quality of your dataset.
- how much data you have.
- the bandwidth you quantized your audio to (see the sketch after this list).
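The bandwidth matters because EnCodec's target bandwidth determines how many codebook levels each frame carries. As a hedged illustration using the upstream `encodec` package directly (the repo's own quantization pipeline is `python -m vall_e.emb.qnt`; the audio file name here is a placeholder):

```python
# Hedged illustration with the upstream EnCodec package; "utterance.wav" is a placeholder path.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps; higher bandwidth -> more codebook levels per frame

wav, sr = torchaudio.load("utterance.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)  # [batch, n_levels, n_frames]
print(codes.shape)
```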
## Notices

### Modifying `prom_levels`, `resp_levels`, or `tasks` For A Model

If you want to increase the `prom_levels` for a given model, or increase the `tasks` levels a model accepts, you will need to export your weights and set `train.load_state_dict` to `True` in your configuration YAML.
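To make the mechanics concrete, here is a generic PyTorch sketch (not this repo's trainer code; the module and sizes are made up for illustration) of why a plain resume fails once tensor shapes change, and why the exported weights have to be re-loaded as a state dict instead:

```python
# Generic PyTorch illustration, not this repo's code: when an embedding grows
# (e.g. more prompt levels), its weight shape changes, so only the tensors whose
# shapes still match can be copied over from the exported weights.
import torch.nn as nn

class Tiny(nn.Module):
    def __init__(self, prom_levels: int):
        super().__init__()
        self.prom_emb = nn.Embedding(prom_levels * 1024, 512)  # shape depends on prom_levels
        self.proj = nn.Linear(512, 512)                        # shape independent of prom_levels

old, new = Tiny(prom_levels=2), Tiny(prom_levels=4)
exported, target = old.state_dict(), new.state_dict()
compatible = {k: v for k, v in exported.items() if k in target and v.shape == target[k].shape}
target.update(compatible)
new.load_state_dict(target)
print(f"copied {len(compatible)} of {len(target)} tensors")  # the projection copies; the embedding does not
```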
### Pre-Processed Dataset

> **Note**: A pre-processed "libre" dataset is being prepared. It contains only data from the LibriTTS and LibriLight datasets (plus MUSAN for noise), with any non-libre datasets culled out.
### Leverage Your Own Dataset

> **Note**: It is highly recommended to utilize mrq/ai-voice-cloning with `--tts-backend="vall-e"` to handle transcription and dataset preparation.
- Put your data into a folder, e.g. `./data/custom`. Audio files should be named with the suffix `.wav` and text files with `.txt`. (A small sanity-check sketch follows this list.)
- Quantize the data: `python -m vall_e.emb.qnt ./data/custom`
- Generate phonemes based on the text: `python -m vall_e.emb.g2p ./data/custom`
- Customize your configuration and define the dataset by modifying `./data/config.yaml`. Refer to `./vall_e/config.py` for details. If you want to choose between different model presets, check `./vall_e/models/__init__.py`.
  If you're interested in creating an HDF5 copy of your dataset, simply invoke: `python -m vall_e.data --action='hdf5' yaml='./data/config.yaml'`
- Train the AR and NAR models using the following script: `python -m vall_e.train yaml=./data/config.yaml`

  You may quit your training at any time by just typing `quit` in your CLI. The latest checkpoint will be automatically saved.
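The sanity-check sketch mentioned in the first item above is an optional helper, not part of the repo: it only confirms that every `.wav` under your data folder has a matching `.txt` transcript before you quantize and phonemize.

```python
# Optional helper, not part of the repo: verify .wav/.txt pairing under ./data/custom.
from pathlib import Path

root = Path("./data/custom")
wavs = sorted(root.rglob("*.wav"))
missing = [w for w in wavs if not w.with_suffix(".txt").exists()]

print(f"{len(wavs)} audio files, {len(missing)} missing transcripts")
for w in missing:
    print("no transcript for", w)
```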
### Dataset Formats

Two dataset formats are supported:
- the standard way:
  - data is stored under `${speaker}/${id}.phn.txt` and `${speaker}/${id}.qnt.pt`
- using an HDF5 dataset:
  - you can convert from the standard way with the following command: `python3 -m vall_e.data yaml="./path/to/your/config.yaml"`
  - this will shove everything into a single HDF5 file and store some metadata alongside (for now, the symbol map generated, and text/audio lengths)
  - be sure to also define `use_hdf5` in your config YAML.
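As a hedged sketch of what the standard layout looks like on disk, this loads one utterance's phoneme transcript and quantized codes; the speaker/id names are hypothetical, and the exact contents of the `.qnt.pt` file are an assumption.

```python
# Hedged sketch; "speaker_a/utt_0001" stands in for ${speaker}/${id}, and the assumption
# is that .phn.txt is plain text and .qnt.pt is a torch-saved tensor.
from pathlib import Path
import torch

base = Path("./data/custom/speaker_a/utt_0001")
phonemes = (base.parent / f"{base.name}.phn.txt").read_text().split()
codes = torch.load(base.parent / f"{base.name}.qnt.pt", map_location="cpu")

print(len(phonemes), "phoneme tokens")
print("quantized codes:", tuple(codes.shape) if hasattr(codes, "shape") else type(codes))
```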
## Export

Both trained models can be exported, but this is only required when loading them on systems without DeepSpeed for inferencing (e.g. Windows systems). To export the models, run: `python -m vall_e.export yaml=./data/config.yaml`.

This will export the latest checkpoints under `./data/ckpt/ar-retnet-2/fp32.pth` and `./data/ckpt/nar-retnet-2/fp32.pth`, to be loaded on any system with PyTorch.
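A quick, hedged way to confirm the exported weights are loadable on a DeepSpeed-free, CPU-only machine (assuming the exported `fp32.pth` is an ordinary PyTorch checkpoint, which is not guaranteed here):

```python
# Hedged check, assuming the exported fp32.pth is a regular torch-saved object.
import torch

ckpt = torch.load("./data/ckpt/ar-retnet-2/fp32.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys())[:8])
else:
    print("loaded object of type:", type(ckpt))
```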
## Synthesis

To synthesize speech, invoke either (if you exported the models):

`python -m vall_e <text> <ref_path> <out_path> --ar-ckpt ./models/ar.pt --nar-ckpt ./models/nar.pt`

or:

`python -m vall_e <text> <ref_path> <out_path> yaml=<yaml_path>`

Some additional flags you can pass are:
- `--max-ar-steps`: maximum steps for inferencing through the AR model. Each second is 75 steps.
- `--ar-temp`: sampling temperature to use for the AR pass. During experimentation, `0.95` provides the most consistent output.
- `--nar-temp`: sampling temperature to use for the NAR pass. During experimentation, `0.2` provides the cleanest output.
- `--device`: device to use (default: `cuda`, examples: `cuda:0`, `cuda:1`, `cpu`)
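Since the AR decodes 75 steps per second of audio, a small helper (hypothetical, not a repo utility) can turn a target duration into a `--max-ar-steps` value:

```python
# Tiny helper (not a repo utility): 75 AR steps correspond to one second of audio.
def max_ar_steps(seconds: float, steps_per_second: int = 75) -> int:
    return int(seconds * steps_per_second)

print(max_ar_steps(6.0))  # 450 -> pass as --max-ar-steps to cap output at ~6 seconds
```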
## To-Do

- reduce load time for creating / preparing dataloaders.
- train and release a model.
- extend to multiple languages (VALL-E X) and train SpeechX features.
## Notice
- EnCodec is licensed under CC-BY-NC 4.0. If you use the code to generate audio quantization or perform decoding, it is important to adhere to the terms of their license.
Unless otherwise credited/noted, this repository is licensed under AGPLv3.
## Citations

```bibtex
@article{wang2023neural,
  title={Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
  author={Wang, Chengyi and Chen, Sanyuan and Wu, Yu and Zhang, Ziqiang and Zhou, Long and Liu, Shujie and Chen, Zhuo and Liu, Yanqing and Wang, Huaming and Li, Jinyu and others},
  journal={arXiv preprint arXiv:2301.02111},
  year={2023}
}

@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}
```