documentation update

mrq 2024-05-04 21:03:46 -05:00
parent 253441b750
commit 8aa1b2dabf
6 changed files with 57 additions and 36 deletions

@ -50,37 +50,55 @@ A script to setup a proper environment and train can be invoked with `./scripts/
### Leverage Your Own Dataset
> **Note** It is highly recommended to utilize [mrq/ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning) with `--tts-backend="vall-e"` to handle transcription and dataset preparations.
> **Note** Preparing a dataset is a bit messy.
1. Put your data into a folder, e.g. `./data/custom`. Audio files should be named with the suffix `.wav` and text files with `.txt`.
0. Set up a `venv` with `https://github.com/m-bain/whisperX/`.
+ At the moment, only WhisperX is utilized. Using other variants like `faster-whisper` is left as an exercise for the user.
+ It's recommended to use a dedicated virtualenv specifically for transcribing, as WhisperX will break a few dependencies.
+ The following command should work:
```
python3 -m venv venv-whisper
source ./venv-whisper/bin/activate
pip3 install torch torchvision torchaudio
pip3 install git+https://github.com/m-bain/whisperX/
```
2. Quantize the data: `python -m vall_e.emb.qnt ./data/custom`
1. Populate your source voices under `./voices/{group name}/{speaker name}/`.
3. Generate phonemes based on the text: `python -m vall_e.emb.g2p ./data/custom`
2. Run `python3 ./scripts/transcribe_dataset.py`. This will generate a transcription with timestamps for your dataset.
+ If you're interested in using a different model, edit the script's `model_name` and `batch_size` variables.
4. Customize your configuration and define the dataset by modifying `./data/config.yaml`. Refer to `./vall_e/config.py` for details. If you want to choose between different model presets, check `./vall_e/models/__init__.py`.
3. Run `python3 ./scripts/process_dataset.py`. This will phonemize the transcriptions and quantize the audio.
If you're interested in creating an HDF5 copy of your dataset, simply invoke: `python -m vall_e.data --action='hdf5' yaml='./data/config.yaml'`
5. Train the model using the following scripts: `python -m vall_e.train yaml=./data/config.yaml`
* If distributing your training (for example, multi-GPU), use `deepspeed --module vall_e.train yaml="./data/config.yaml"`
+ If you're not using the `deepspeed` backend, set `trainer.ddp = True` in the config YAML, then launch with `torchrun --nnodes=1 --nproc-per-node=4 -m vall_e.train yaml="./data/config.yaml"`
You may quit your training any time by just entering `quit` in your CLI. The latest checkpoint will be automatically saved.
4. Copy `./data/config.yaml` to `./training/config.yaml`. Customize the training configuration and populate your `dataset.training` list with the values stored under `./training/dataset_list.json` (see the sketch after this list).
+ Refer to `./vall_e/config.py` for additional configuration details.
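For step 4, here is a minimal sketch of turning `./training/dataset_list.json` into a `dataset.training` block you can paste into `./training/config.yaml`. It assumes the file is a flat JSON list of `{group}/{speaker}` strings (as written by `./scripts/process_dataset.py`); the exact YAML layout is an assumption, so double-check the result against `./vall_e/config.py`.
```
# hypothetical helper: emit a dataset.training block from dataset_list.json
import json

with open("./training/dataset_list.json", "r", encoding="utf-8") as f:
    speakers = json.load(f)

# the script may append one entry per utterance, so de-duplicate first
print("  training:")
for speaker in sorted(set(speakers)):
    print(f'    - "{speaker}"')
```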
### Dataset Formats
Two dataset formats are supported:
* the standard way:
- data is stored under `${speaker}/${id}.phn.txt` and `${speaker}/${id}.qnt.pt`
- for Encodec/Vocos audio backends, data is stored under `./training/data/{group}/{speaker}/{id}.phn.txt` and `./training/data/{group}/{speaker}/{id}.qnt.pt` (see the sanity-check sketch after this list)
- for Descript-Audio-Codec audio backend, data is stored under `./training/data/{group}/{speaker}/{id}.json` and `./training/data/{group}/{speaker}/{id}.dac`
* using an HDF5 dataset:
- you can convert from the standard way with the following command: `python3 -m vall_e.data yaml="./path/to/your/config.yaml"`
- you can convert from the standard way with the following command: `python3 -m vall_e.data yaml="./training/config.yaml"`
- this will shove everything into a single HDF5 file and store some metadata alongside (for now, the symbol map generated, and text/audio lengths)
- be sure to also define `use_hdf5` in your config YAML.
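A minimal sanity-check sketch for the standard layout, pairing each `.phn.txt` with its `.qnt.pt` counterpart under `./training/data/` (Encodec/Vocos naming; swap the extensions for `.json`/`.dac` if you're on the Descript-Audio-Codec backend):
```
# verify that every phonemized transcript has a matching quantized audio file
from pathlib import Path

root = Path("./training/data")
total, missing = 0, []
for phn in root.glob("*/*/*.phn.txt"):
    total += 1
    qnt = phn.with_name(phn.name.replace(".phn.txt", ".qnt.pt"))
    if not qnt.exists():
        missing.append(str(phn))

print(f"checked {total} utterances; {len(missing)} missing a .qnt.pt file")
```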
### Initializing Training
For single GPUs, simply run `python3 -m vall_e.train yaml="./training/config.yaml"`.
For multiple GPUs, or exotic distributed training:
* with `deepspeed` backends, simply running `deepspeed --module vall_e.train yaml="./training/config.yaml"` should handle the gory details.
* with `local` backends, simply run `torchrun --nnodes=1 --nproc-per-node={NUMOFGPUS} -m vall_e.train yaml="./training/config.yaml"`
You can enter `save` to save the state at any time, or `quit` to save and quit training.
The `lr` command will also let you adjust the learning rate on the fly. For example: `lr 1.0e-3` will set the learning rate to `0.001`.
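If you find yourself switching between these launchers a lot, the choice can be scripted. The wrapper below is a rough sketch (not part of the repo) that picks between the plain, `torchrun`, and `deepspeed` invocations documented above, based on GPU count and a hypothetical `--ddp` flag:
```
# hypothetical launcher wrapper around the documented training commands
import argparse
import subprocess

import torch

parser = argparse.ArgumentParser()
parser.add_argument("--yaml", default="./training/config.yaml")
parser.add_argument("--ddp", action="store_true", help="use torchrun (requires trainer.ddp=True in the YAML)")
args = parser.parse_args()

gpus = torch.cuda.device_count()
if gpus <= 1:
    cmd = ["python3", "-m", "vall_e.train", f"yaml={args.yaml}"]
elif args.ddp:
    cmd = ["torchrun", "--nnodes=1", f"--nproc-per-node={gpus}", "-m", "vall_e.train", f"yaml={args.yaml}"]
else:
    cmd = ["deepspeed", "--module", "vall_e.train", f"yaml={args.yaml}"]

subprocess.run(cmd, check=True)
```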
### Plotting Metrics
Included is a helper script to parse the training metrics. Simply invoke it with, for example: `python3 -m vall_e.plot yaml="./training/valle/config.yaml"`
Included is a helper script to parse the training metrics. Simply invoke it with, for example: `python3 -m vall_e.plot yaml="./training/config.yaml"`
You can specify what X and Y labels you want to plot against by passing `--xs tokens_processed --ys loss stats.acc`
@ -92,12 +110,6 @@ As training under `deepspeed` and Windows is not (easily) supported, under your
Keep in mind that creature comforts like distributed training or `float16` training cannot be verified as working at the moment with the local trainer.
#### Training on Low-VRAM Cards
During experimentation, I've found I can comfortably train on a 4070Ti (12GiB VRAM). However, VRAM use depends on your dataset; a mix of large and small utterances will cause VRAM usage to spike and can trigger OOM conditions during the backwards pass if you are not careful.
Additionally, under Windows, I managed to finetune the AR on my 2060 (6GiB VRAM) with a batch size of 8 (albeit with the card as a secondary GPU).
#### Training Caveats
Unfortunately, efforts to train a *good* foundational model seem entirely predicated on a good dataset. My dataset might be too fouled with:
@ -119,15 +131,17 @@ As the core of VALL-E makes use of a language model, various LLM architectures c
* `bitnet`: using [this](https://github.com/kyegomez/BitNet/) implementation of BitNet's transformer.
- Setting `bitsandbytes.bitnet=True` will make use of BitNet's linear implementation.
If you're training a true foundational model, consider which backend you want to use the most. `llama` backends can benefit from all the additional tech that comes with them, while exotic ones like `retnet` or `bitnet` can't at the moment, though they may offer experimental gains.
## Export
To export the models, run: `python -m vall_e.export yaml=./data/config.yaml`.
To export the models, run: `python -m vall_e.export yaml=./training/config.yaml`.
This will export the latest checkpoints, for example, under `./data/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used, and training stats.
This will export the latest checkpoints, for example, under `./training/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used, and training stats.
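To verify what an export actually contains, here is a minimal sketch that loads the checkpoint and lists its top-level entries. The exact key names for the weights, symmap, and training stats aren't documented here, so the sketch just enumerates whatever is present:
```
# peek inside an exported checkpoint without assuming its exact key layout
import torch

ckpt = torch.load("./training/ckpt/ar+nar-retnet-8/fp32.pth", map_location="cpu")
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        print(key, type(value).__name__)
else:
    print(type(ckpt).__name__)
```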
## Synthesis
To synthesize speech, invoke either (if you exported the models): `python -m vall_e <text> <ref_path> <out_path> --model-ckpt ./data/ckpt/ar+nar-retnet-8/fp32.pth` or `python -m vall_e <text> <ref_path> <out_path> yaml=<yaml_path>`
To synthesize speech, invoke either (if you exported the models): `python -m vall_e <text> <ref_path> <out_path> --model-ckpt ./training/ckpt/ar+nar-retnet-8/fp32.pth` or `python -m vall_e <text> <ref_path> <out_path> yaml=<yaml_path>`
Some additional flags you can pass are:
* `--language`: specifies the language for phonemizing the text, and helps guide inferencing when the model is trained against that language.
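Since both invocations are plain CLI calls, batching synthesis is just a loop. A minimal sketch using the `yaml=` form; the reference/output paths and the `en` language code are placeholders:
```
# batch a few synthesis calls through the documented CLI
import subprocess

prompts = [
    ("Hello world.", "./out/hello.wav"),
    ("This is a second utterance.", "./out/second.wav"),
]
for text, out_path in prompts:
    subprocess.run([
        "python3", "-m", "vall_e",
        text, "./voices/custom/speaker/reference.wav", out_path,
        "yaml=./training/config.yaml",
        "--language", "en",
    ], check=True)
```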

@ -6,8 +6,8 @@ import torchaudio
from tqdm.auto import tqdm
from pathlib import Path
input_dataset = "metadata"
output_dataset = "metadata-cleaned"
input_dataset = "training/metadata"
output_dataset = "training/metadata-cleaned"
def pad(num, zeroes):
    return str(num).zfill(zeroes+1)

@ -8,10 +8,10 @@ from pathlib import Path
from vall_e.emb.g2p import encode as valle_phonemize
from vall_e.emb.qnt import encode as valle_quantize, _replace_file_extension
# things that could be args
# to-do: use argparser
input_audio = "voices"
input_metadata = "metadata"
output_dataset = "training-24K"
input_metadata = "training/metadata"
output_dataset = "training/data"
device = "cuda"
slice = "auto"
@ -19,6 +19,7 @@ missing = {
"transcription": [],
"audio": []
}
dataset = []
def pad(num, zeroes):
    return str(num).zfill(zeroes+1)
@ -63,6 +64,8 @@ for dataset_name in sorted(os.listdir(f'./{input_audio}/')):
waveform, sample_rate = None, None
language = metadata[filename]["language"] if "language" in metadata[filename] else "english"
dataset.append(f'{dataset_name}/{speaker_id}')
if len(metadata[filename]["segments"]) == 0 or not use_slices:
    outpath = Path(f'./{output_dataset}/{dataset_name}/{speaker_id}/{fname}.{extension}')
    text = metadata[filename]["text"]
@ -148,4 +151,5 @@ for dataset_name in sorted(os.listdir(f'./{input_audio}/')):
print(f"Failed to quantize: {outpath}:", e)
continue
open("./missing.json", 'w', encoding='utf-8').write(json.dumps(missing))
open("./training/missing.json", 'w', encoding='utf-8').write(json.dumps(missing))
open("./training/dataset_list.json", 'w', encoding='utf-8').write(json.dumps(dataset))

@ -12,9 +12,9 @@ from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
input_metadata = "training-24K"
input_metadata = "training/data"
output_file = Path("./dataset.json")
output_file = Path("./training/tokenizer_training_data.json")
tokenizer_data = []
def pad(num, zeroes):
@ -54,4 +54,4 @@ tokenizer.post_processor = TemplateProcessing(
)
tokenizer.train_from_iterator(tokenizer_data, trainer=trainer)
tokenizer.save("./tokenizer.json")
tokenizer.save("./training/tokenizer.json")
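Once the tokenizer is written to `./training/tokenizer.json`, it can be loaded back with the `tokenizers` library for a quick smoke test; the sample string below is an arbitrary placeholder, not a real phoneme sequence from the dataset:
```
# smoke-test the trained BPE tokenizer
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("./training/tokenizer.json")
encoded = tokenizer.encode("h ɛ l oʊ")  # placeholder phoneme-like string
print(encoded.tokens)
print(encoded.ids)
print(tokenizer.decode(encoded.ids))
```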

@ -7,14 +7,14 @@ import whisperx
from tqdm.auto import tqdm
from pathlib import Path
# should be args
# to-do: use argparser
batch_size = 16
device = "cuda"
dtype = "float16"
model_name = "large-v3"
input_audio = "voices"
output_dataset = "metadata"
output_dataset = "training/metadata"
skip_existing = True
diarize = False
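For reference, the transcription this script performs follows whisperX's documented transcribe-then-align flow. A pared-down, single-file sketch of that pattern (the audio path is a placeholder, and this mirrors whisperX's usage rather than the script verbatim):
```
# single-file version of the whisperX transcribe + align flow
import whisperx

batch_size, device, dtype, model_name = 16, "cuda", "float16", "large-v3"

model = whisperx.load_model(model_name, device, compute_type=dtype)
audio = whisperx.load_audio("./voices/custom/speaker/sample.wav")  # placeholder path
result = model.transcribe(audio, batch_size=batch_size)

# align to get word/segment timestamps, which the dataset preparation relies on
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```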

@ -447,6 +447,7 @@ class Engines(dict[str, Engine]):
if not cfg.trainer.check_for_oom:
    engine.backward(loss)
else:
    # to-do: properly handle when one GPU throws an OOM because it just halts
    try:
        engine.backward(loss)
    except RuntimeError as e:
@ -460,8 +461,10 @@ class Engines(dict[str, Engine]):
if world_size() > 1:
    all_reduce(n_ooms)
if n_ooms.item() > 0:
    self.save_checkpoint()
    raise RuntimeError("Out of memory during backwards pass!")
engine.step()
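The `all_reduce(n_ooms)` above is what makes the OOM handling cooperative: every rank learns that some rank failed its backwards pass, so they all checkpoint and bail together instead of one rank halting while the others wait at the next collective (the remaining gap is noted in the to-do comment). Below is a generic sketch of the same pattern in plain PyTorch terms, with a hypothetical `save_checkpoint` callable standing in for the Engine machinery:
```
# generic sketch of the cooperative OOM-handling pattern shown above
import torch
import torch.distributed as dist

def backward_with_oom_guard(loss, save_checkpoint):
    oom = torch.zeros(1, device=loss.device)
    try:
        loss.backward()
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        oom += 1
        torch.cuda.empty_cache()

    # every rank must agree on whether anyone hit an OOM,
    # otherwise the surviving ranks deadlock at the next collective
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(oom)

    if oom.item() > 0:
        save_checkpoint()
        raise RuntimeError("Out of memory during backwards pass!")
```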