documentation update

parent 253441b750
commit 8aa1b2dabf

README.md
@@ -50,37 +50,55 @@ A script to setup a proper environment and train can be invoked with `./scripts/

 ### Leverage Your Own Dataset

-> **Note** It is highly recommended to utilize [mrq/ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning) with `--tts-backend="vall-e"` to handle transcription and dataset preparations.
+> **Note** Preparing a dataset is a bit messy.

-1. Put your data into a folder, e.g. `./data/custom`. Audio files should be named with the suffix `.wav` and text files with `.txt`.
+0. Set up a `venv` with `https://github.com/m-bain/whisperX/`.
+  + At the moment only WhisperX is utilized. Using other variants like `faster-whisper` is an exercise left to the user at the moment.
+  + It's recommended to use a dedicated virtualenv specifically for transcribing, as WhisperX will break a few dependencies.
+  + The following commands should work:
+  ```
+  python3 -m venv venv-whisper
+  source ./venv-whisper/bin/activate
+  pip3 install torch torchvision torchaudio
+  pip3 install git+https://github.com/m-bain/whisperX/
+  ```

-2. Quantize the data: `python -m vall_e.emb.qnt ./data/custom`
+1. Populate your source voices under `./voices/{group name}/{speaker name}/`.

-3. Generate phonemes based on the text: `python -m vall_e.emb.g2p ./data/custom`
+2. Run `python3 ./scripts/transcribe_dataset.py`. This will generate a transcription with timestamps for your dataset.
+  + If you're interested in using a different model, edit the script's `model_name` and `batch_size` variables.

-4. Customize your configuration and define the dataset by modifying `./data/config.yaml`. Refer to `./vall_e/config.py` for details. If you want to choose between different model presets, check `./vall_e/models/__init__.py`.
+3. Run `python3 ./scripts/process_dataset.py`. This will phonemize the transcriptions and quantize the audio.

-If you're interested in creating an HDF5 copy of your dataset, simply invoke: `python -m vall_e.data --action='hdf5' yaml='./data/config.yaml'`
+4. Copy `./data/config.yaml` to `./training/config.yaml`. Customize the training configuration and populate your `dataset.training` list with the values stored under `./training/dataset_list.json`.
+  + Refer to `./vall_e/config.py` for additional configuration details.

-5. Train the model using the following scripts: `python -m vall_e.train yaml=./data/config.yaml`
-  * If distributing your training (for example, multi-GPU), use `deepspeed --module vall_e.train yaml="./data/config.yaml"`
-  + if you're not using the `deepspeed` backend, set `trainer.ddp = True` in the config YAML, then launch with `torchrun --nnodes=1 --nproc-per-node=4 -m vall_e.train yaml="./data/config.yaml"`

-You may quit your training any time by just entering `quit` in your CLI. The latest checkpoint will be automatically saved.

 ### Dataset Formats

 Two dataset formats are supported:
 * the standard way:
-  - data is stored under `${speaker}/${id}.phn.txt` and `${speaker}/${id}.qnt.pt`
+  - for Encodec/Vocos audio backends, data is stored under `./training/data/{group}/{speaker}/{id}.phn.txt` and `./training/data/{group}/{speaker}/{id}.qnt.pt`
+  - for the Descript-Audio-Codec audio backend, data is stored under `./training/data/{group}/{speaker}/{id}.json` and `./training/data/{group}/{speaker}/{id}.dac`
 * using an HDF5 dataset:
-  - you can convert from the standard way with the following command: `python3 -m vall_e.data yaml="./path/to/your/config.yaml"`
+  - you can convert from the standard way with the following command: `python3 -m vall_e.data yaml="./training/config.yaml"`
   - this will shove everything into a single HDF5 file and store some metadata alongside (for now, the symbol map generated, and text/audio lengths)
   - be sure to also define `use_hdf5` in your config YAML.

+### Initializing Training
+
+For single GPUs, simply run `python3 -m vall_e.train yaml="./training/config.yaml"`.
+
+For multiple GPUs, or exotic distributed training:
+* with `deepspeed` backends, simply running `deepspeed --module vall_e.train yaml="./training/config.yaml"` should handle the gory details.
+* with `local` backends, simply run `torchrun --nnodes=1 --nproc-per-node={NUMOFGPUS} -m vall_e.train yaml="./training/config.yaml"`
+
+You can enter `save` to save the state at any time, or `quit` to save and quit training.
+
+The `lr` command will also let you adjust the learning rate on the fly. For example: `lr 1.0e-3` will set the learning rate to `0.001`.

 ### Plotting Metrics

-Included is a helper script to parse the training metrics. Simply invoke it with, for example: `python3 -m vall_e.plot yaml="./training/valle/config.yaml"`
+Included is a helper script to parse the training metrics. Simply invoke it with, for example: `python3 -m vall_e.plot yaml="./training/config.yaml"`

 You can specify what X and Y labels you want to plot against by passing `--xs tokens_processed --ys loss stats.acc`
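As a side note on step 4 above: `./training/dataset_list.json` (written by `process_dataset.py`, see the script diff further down) holds plain `{group}/{speaker}` strings, and the `dataset.training` list in `./training/config.yaml` is populated from them. Below is a minimal sketch of a helper that does that copy automatically; it is not part of this commit, it assumes PyYAML is installed, and it assumes a top-level `dataset.training` key matches what `./vall_e/config.py` actually expects, so double-check before relying on it.

```
# Hypothetical helper, not part of the repo: copy the "{group}/{speaker}" entries
# emitted by ./scripts/process_dataset.py into the dataset.training list of the
# training config. Assumes PyYAML; verify the schema against ./vall_e/config.py.
import json
import yaml

with open("./training/dataset_list.json", "r", encoding="utf-8") as f:
    dataset_list = json.load(f)  # e.g. ["some-group/some-speaker", ...]

with open("./training/config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

config.setdefault("dataset", {})["training"] = sorted(set(dataset_list))

with open("./training/config.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```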
@@ -92,12 +110,6 @@ As training under `deepspeed` and Windows is not (easily) supported, under your

 Keep in mind that creature comforts like distributed training or `float16` training cannot be verified as working at the moment with the local trainer.

-#### Training on Low-VRAM Cards
-
-During experimentation, I've found I can comfortably train on a 4070Ti (12GiB VRAM). However, VRAM use is predicated on your dataset; a mix of large and small utterances will cause VRAM usage to spike and can trigger OOM conditions during the backwards pass if you are not careful.
-
-Additionally, under Windows, I managed to finetune the AR on my 2060 (6GiB VRAM) with a batch size of 8 (although, with the card as a secondary GPU).
-
 #### Training Caveats

 Unfortunately, efforts to train a *good* foundational model seem entirely predicated on a good dataset. My dataset might be too fouled with:
@@ -119,15 +131,17 @@ As the core of VALL-E makes use of a language model, various LLM architectures c

 * `bitnet`: using [this](https://github.com/kyegomez/BitNet/) implementation of BitNet's transformer.
   - Setting `bitsandbytes.bitnet=True` will make use of BitNet's linear implementation.

+If you're training a true foundational model, consider which backend you want to use the most. `llama` backends can benefit from all the additional tech around them, while exotic ones like `retnet` or `bitnet` can't at the moment, but may leverage experimental gains.
+
 ## Export

-To export the models, run: `python -m vall_e.export yaml=./data/config.yaml`.
+To export the models, run: `python -m vall_e.export yaml=./training/config.yaml`.

-This will export the latest checkpoints, for example, under `./data/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used, and training stats.
+This will export the latest checkpoints, for example, under `./training/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used and training stats.

 ## Synthesis

-To synthesize speech, invoke either (if exported the models): `python -m vall_e <text> <ref_path> <out_path> --model-ckpt ./data/ckpt/ar+nar-retnet-8/fp32.pth` or `python -m vall_e <text> <ref_path> <out_path> yaml=<yaml_path>`
+To synthesize speech, invoke either (if you exported the models): `python -m vall_e <text> <ref_path> <out_path> --model-ckpt ./training/ckpt/ar+nar-retnet-8/fp32.pth` or `python -m vall_e <text> <ref_path> <out_path> yaml=<yaml_path>`

 Some additional flags you can pass are:
 * `--language`: specifies the language for phonemizing the text, and helps guide inferencing when the model is trained against that language.
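For what it's worth, a quick way to sanity-check an exported checkpoint like the one referenced above is to load it with plain PyTorch and list what the exporter wrote into it. The path below is just the README's example, and the actual key names depend on the exporter; this is a rough sketch, not part of the repo.

```
# Rough sketch: inspect an exported checkpoint without importing the rest of vall_e.
# The README states the export bundles metadata (symmap, training stats) alongside
# the weights; listing the top-level keys shows what is actually there.
import torch

state = torch.load("./training/ckpt/ar+nar-retnet-8/fp32.pth", map_location="cpu")
if isinstance(state, dict):
    for key, value in state.items():
        print(key, type(value).__name__)
```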
@@ -6,8 +6,8 @@ import torchaudio
 from tqdm.auto import tqdm
 from pathlib import Path

-input_dataset = "metadata"
-output_dataset = "metadata-cleaned"
+input_dataset = "training/metadata"
+output_dataset = "training/metadata-cleaned"

 def pad(num, zeroes):
     return str(num).zfill(zeroes+1)
@@ -8,10 +8,10 @@ from pathlib import Path
 from vall_e.emb.g2p import encode as valle_phonemize
 from vall_e.emb.qnt import encode as valle_quantize, _replace_file_extension

-# things that could be args
+# to-do: use argparser
 input_audio = "voices"
-input_metadata = "metadata"
-output_dataset = "training-24K"
+input_metadata = "training/metadata"
+output_dataset = "training/data"
 device = "cuda"

 slice = "auto"
@@ -19,6 +19,7 @@ missing = {
     "transcription": [],
     "audio": []
 }
+dataset = []

 def pad(num, zeroes):
     return str(num).zfill(zeroes+1)
@@ -63,6 +64,8 @@ for dataset_name in sorted(os.listdir(f'./{input_audio}/')):
         waveform, sample_rate = None, None
         language = metadata[filename]["language"] if "language" in metadata[filename] else "english"

+        dataset.append(f'{dataset_name}/{speaker_id}')
+
         if len(metadata[filename]["segments"]) == 0 or not use_slices:
             outpath = Path(f'./{output_dataset}/{dataset_name}/{speaker_id}/{fname}.{extension}')
             text = metadata[filename]["text"]
@@ -148,4 +151,5 @@ for dataset_name in sorted(os.listdir(f'./{input_audio}/')):
             print(f"Failed to quantize: {outpath}:", e)
             continue

-open("./missing.json", 'w', encoding='utf-8').write(json.dumps(missing))
+open("./training/missing.json", 'w', encoding='utf-8').write(json.dumps(missing))
+open("./training/dataset_list.json", 'w', encoding='utf-8').write(json.dumps(dataset))
@@ -12,9 +12,9 @@ from tokenizers.trainers import BpeTrainer
 from tokenizers.pre_tokenizers import Whitespace
 from tokenizers.processors import TemplateProcessing

-input_metadata = "training-24K"
+input_metadata = "training/data"

-output_file = Path("./dataset.json")
+output_file = Path("./training/tokenizer_training_data.json")
 tokenizer_data = []

 def pad(num, zeroes):
@@ -54,4 +54,4 @@ tokenizer.post_processor = TemplateProcessing(
 )

 tokenizer.train_from_iterator(tokenizer_data, trainer=trainer)
-tokenizer.save("./tokenizer.json")
+tokenizer.save("./training/tokenizer.json")
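For readers unfamiliar with the Hugging Face `tokenizers` library used in the hunks above, here is a minimal standalone sketch of the same train-and-save flow. The corpus and special-token names are placeholders for illustration, not the ones the repo's script actually registers, and the real script additionally sets a `TemplateProcessing` post-processor as shown in the diff.

```
# Minimal sketch of the tokenizers train-and-save flow; placeholders only.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ["placeholder phoneme line one", "placeholder phoneme line two"]

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["<unk>", "<bos>", "<eos>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

tokenizer.save("./training/tokenizer.json")
```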
@@ -7,14 +7,14 @@ import whisperx
 from tqdm.auto import tqdm
 from pathlib import Path

-# should be args
+# to-do: use argparser
 batch_size = 16
 device = "cuda"
 dtype = "float16"
 model_name = "large-v3"

 input_audio = "voices"
-output_dataset = "metadata"
+output_dataset = "training/metadata"

 skip_existing = True
 diarize = False
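The variables in this hunk (`model_name`, `batch_size`, `dtype`, `device`) feed the WhisperX calls that do the actual transcription; step 2 of the README above says to edit them to switch models. A stripped-down sketch of that kind of call follows, based on WhisperX's documented API rather than the repo's script itself, with a hypothetical input path.

```
# Stripped-down sketch of a WhisperX transcription pass; not the repo's
# transcribe_dataset.py, just the documented whisperx API its settings feed.
import whisperx

batch_size = 16          # same knobs the script exposes at the top of the file
device = "cuda"
dtype = "float16"
model_name = "large-v3"

model = whisperx.load_model(model_name, device, compute_type=dtype)

audio = whisperx.load_audio("./voices/some-group/some-speaker/example.wav")  # hypothetical path
result = model.transcribe(audio, batch_size=batch_size)

# Segment timestamps come from the alignment step.
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```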
@@ -447,6 +447,7 @@ class Engines(dict[str, Engine]):
            if not cfg.trainer.check_for_oom:
                engine.backward(loss)
            else:
+               # to-do: properly handle when one GPU throws an OOM because it just halts
                try:
                    engine.backward(loss)
                except RuntimeError as e:
@@ -460,9 +461,11 @@ class Engines(dict[str, Engine]):
            if world_size() > 1:
                all_reduce(n_ooms)

            if n_ooms.item() > 0:
                self.save_checkpoint()
-               raise RuntimeError("Out of memory during backwards pass!")
+
+               raise RuntimeError("Out of memory during backwards pass!")

            engine.step()
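To spell out the pattern these two hunks touch: each rank counts its own OOMs, the counts are summed across ranks with an all-reduce, and the intent is that every rank then checkpoints and stops together rather than leaving collective ops hanging (the new to-do comment notes this still isn't fully solved when a rank simply halts). A simplified sketch of that pattern with plain `torch.distributed`, not the actual `Engines` implementation:

```
# Simplified sketch of the cross-rank OOM handling pattern; not the Engines class itself.
import torch
import torch.distributed as dist

def backward_with_oom_guard(engine, loss, device="cuda"):
    n_ooms = torch.zeros((), device=device)
    try:
        engine.backward(loss)
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        n_ooms += 1

    # Sum OOM counts across ranks so every rank agrees on whether to stop.
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(n_ooms)

    if n_ooms.item() > 0:
        # The real code saves a checkpoint here before raising.
        raise RuntimeError("Out of memory during backwards pass!")
```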