diff --git a/README.md b/README.md
index 57174ac..eeaed13 100755
--- a/README.md
+++ b/README.md
@@ -50,37 +50,55 @@ A script to setup a proper environment and train can be invoked with `./scripts/
 ### Leverage Your Own Dataset
 
-> **Note** It is highly recommended to utilize [mrq/ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning) with `--tts-backend="vall-e"` to handle transcription and dataset preparations.
+> **Note** Preparing a dataset is a bit messy.
 
-1. Put your data into a folder, e.g. `./data/custom`. Audio files should be named with the suffix `.wav` and text files with `.txt`.
+0. Set up a `venv` with `https://github.com/m-bain/whisperX/`.
+  + At the moment, only WhisperX is utilized. Using other variants like `faster-whisper` is left as an exercise for the user.
+  + It's recommended to use a dedicated virtualenv specifically for transcribing, as WhisperX will break a few dependencies.
+  + The following command should work:
+  ```
+  python3 -m venv venv-whisper
+  source ./venv-whisper/bin/activate
+  pip3 install torch torchvision torchaudio
+  pip3 install git+https://github.com/m-bain/whisperX/
+  ```
 
-2. Quantize the data: `python -m vall_e.emb.qnt ./data/custom`
+1. Populate your source voices under `./voices/{group name}/{speaker name}/`.
 
-3. Generate phonemes based on the text: `python -m vall_e.emb.g2p ./data/custom`
+2. Run `python3 ./scripts/transcribe_dataset.py`. This will generate a transcription with timestamps for your dataset.
+  + If you're interested in using a different model, edit the script's `model_name` and `batch_size` variables.
 
-4. Customize your configuration and define the dataset by modifying `./data/config.yaml`. Refer to `./vall_e/config.py` for details. If you want to choose between different model presets, check `./vall_e/models/__init__.py`.
+3. Run `python3 ./scripts/process_dataset.py`. This will phonemize the transcriptions and quantize the audio.
 
-If you're interested in creating an HDF5 copy of your dataset, simply invoke: `python -m vall_e.data --action='hdf5' yaml='./data/config.yaml'`
-
-5. Train the model using the following scripts: `python -m vall_e.train yaml=./data/config.yaml`
-* If distributing your training (for example, multi-GPU), use `deepspeed --module vall_e.train yaml="./data/config.yaml"`
-  + if you're not using the `deepspeed` backend, set `trainer.ddp = True` in the config YAML, then launch with `torchrun --nnodes=1 --nproc-per-node=4 -m vall_e.train yaml="./data/config.yaml"`
-
-You may quit your training any time by just entering `quit` in your CLI. The latest checkpoint will be automatically saved.
+4. Copy `./data/config.yaml` to `./training/config.yaml`. Customize the training configuration and populate your `dataset.training` list with the values stored under `./training/dataset_list.json`.
+  + Refer to `./vall_e/config.py` for additional configuration details.
 
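As a companion to step 4 above, here is a minimal sketch of populating the config's `dataset.training` list from `./training/dataset_list.json`. It assumes PyYAML is available and that the YAML simply holds a `dataset.training` list of `{group}/{speaker}` strings; check `./vall_e/config.py` for the actual schema before relying on it:

```python
# Hypothetical helper: merge ./training/dataset_list.json into ./training/config.yaml.
# The dataset.training layout is an assumption; verify it against ./vall_e/config.py.
import json
from pathlib import Path

import yaml  # PyYAML

config_path = Path("./training/config.yaml")
dataset_list_path = Path("./training/dataset_list.json")

config = yaml.safe_load(config_path.read_text(encoding="utf-8")) or {}
entries = json.loads(dataset_list_path.read_text(encoding="utf-8"))

# process_dataset.py appends one "{group}/{speaker}" entry per processed file,
# so de-duplicate while preserving order.
unique_entries = list(dict.fromkeys(entries))

config.setdefault("dataset", {})["training"] = unique_entries
config_path.write_text(yaml.safe_dump(config, sort_keys=False), encoding="utf-8")
print(f"Wrote {len(unique_entries)} training entries to {config_path}")
```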
 ### Dataset Formats
 
 Two dataset formats are supported:
 * the standard way:
-  - data is stored under `${speaker}/${id}.phn.txt` and `${speaker}/${id}.qnt.pt`
+  - for Encodec/Vocos audio backends, data is stored under `./training/data/{group}/{speaker}/{id}.phn.txt` and `./training/data/{group}/{speaker}/{id}.qnt.pt`
+  - for the Descript-Audio-Codec audio backend, data is stored under `./training/data/{group}/{speaker}/{id}.json` and `./training/data/{group}/{speaker}/{id}.dac`
 * using an HDF5 dataset:
-  - you can convert from the standard way with the following command: `python3 -m vall_e.data yaml="./path/to/your/config.yaml"`
+  - you can convert from the standard way with the following command: `python3 -m vall_e.data yaml="./training/config.yaml"`
   - this will shove everything into a single HDF5 file and store some metadata alongside (for now, the symbol map generated, and text/audio lengths)
   - be sure to also define `use_hdf5` in your config YAML.
 
+### Initializing Training
+
+For a single GPU, simply run `python3 -m vall_e.train yaml="./training/config.yaml"`.
+
+For multiple GPUs, or exotic distributed training:
+* with `deepspeed` backends, simply running `deepspeed --module vall_e.train yaml="./training/config.yaml"` should handle the gory details.
+* with `local` backends, simply run `torchrun --nnodes=1 --nproc-per-node={NUMOFGPUS} -m vall_e.train yaml="./training/config.yaml"`.
+
+You can enter `save` to save the state at any time, or `quit` to save and quit training.
+
+The `lr` command will also let you adjust the learning rate on the fly. For example: `lr 1.0e-3` will set the learning rate to `0.001`.
+
 ### Plotting Metrics
 
-Included is a helper script to parse the training metrics. Simply invoke it with, for example: `python3 -m vall_e.plot yaml="./training/valle/config.yaml"`
+Included is a helper script to parse the training metrics. Simply invoke it with, for example: `python3 -m vall_e.plot yaml="./training/config.yaml"`
 
 You can specify what X and Y labels you want to plot against by passing `--xs tokens_processed --ys loss stats.acc`
 
@@ -92,12 +110,6 @@ As training under `deepspeed` and Windows is not (easily) supported, under your
 
 Keep in mind that creature comforts like distributed training or `float16` training cannot be verified as working at the moment with the local trainer.
 
-#### Training on Low-VRAM Cards
-
-During experimentation, I've found I can comfortably train on a 4070Ti (12GiB VRAM). Howver, VRAM use is predicated on your dataset; a mix of large and small utterances will cause VRAM usage to spike and can trigger OOM conditions during the backwards pass if you are not careful.
-
-Additionally, under Windows, I managed to finetune the AR on my 2060 (6GiB VRAM) with a batch size of 8 (although, with the card as a secondary GPU).
-
 #### Training Caveats
 
 Unfortunately, efforts to train a *good* foundational model seems entirely predicated on a good dataset. My dataset might be too fouled with:
@@ -119,15 +131,17 @@ As the core of VALL-E makes use of a language model, various LLM architectures c
 * `bitnet`: using [this](https://github.com/kyegomez/BitNet/) implementation of BitNet's transformer.
   - Setting `bitsandbytes.bitnet=True` will make use of BitNet's linear implementation.
 
+If you're training a true foundational model, carefully consider which backend you want to commit to. `llama` backends can benefit from all the additional tooling built around them, while exotic ones like `retnet` or `bitnet` can't at the moment, though they may offer experimental gains.
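Tying back to the HDF5 dataset format above: a quick way to sanity-check the converted file is to walk it with `h5py`. This is only a sketch; the output filename and internal key layout are assumptions, since the README only states that everything lands in a single HDF5 file with some metadata alongside:

```python
# Hypothetical inspection of the HDF5 dataset produced by `python3 -m vall_e.data`.
# The path and key layout are assumptions; adjust them to whatever your config writes.
import h5py

def describe(name, obj):
    # Print each dataset's shape/dtype to confirm text and audio entries made it in.
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape} dtype={obj.dtype}")

with h5py.File("./training/dataset.h5", "r") as hf:
    # Top-level attributes may carry the stored metadata (symbol map, text/audio lengths).
    for key, value in hf.attrs.items():
        print(f"attr {key}: {value}")
    hf.visititems(describe)
```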
+
 ## Export
 
-To export the models, run: `python -m vall_e.export yaml=./data/config.yaml`.
+To export the models, run: `python -m vall_e.export yaml=./training/config.yaml`.
 
-This will export the latest checkpoints, for example, under `./data/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used, and training stats.
+This will export the latest checkpoints, for example, under `./training/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used, and training stats.
 
 ## Synthesis
 
-To synthesize speech, invoke either (if exported the models): `python -m vall_e --model-ckpt ./data/ckpt/ar+nar-retnet-8/fp32.pth` or `python -m vall_e yaml=`
+To synthesize speech, invoke either (if you have exported the models): `python -m vall_e --model-ckpt ./training/ckpt/ar+nar-retnet-8/fp32.pth` or `python -m vall_e yaml=`
 
 Some additional flags you can pass are:
 * `--language`: specifies the language for phonemizing the text, and helps guide inferencing when the model is trained against that language.
diff --git a/scripts/cleanup_dataset.py b/scripts/cleanup_dataset.py
index 083de32..04eca87 100644
--- a/scripts/cleanup_dataset.py
+++ b/scripts/cleanup_dataset.py
@@ -6,8 +6,8 @@ import torchaudio
 from tqdm.auto import tqdm
 from pathlib import Path
 
-input_dataset = "metadata"
-output_dataset = "metadata-cleaned"
+input_dataset = "training/metadata"
+output_dataset = "training/metadata-cleaned"
 
 def pad(num, zeroes):
 	return str(num).zfill(zeroes+1)
diff --git a/scripts/process_dataset.py b/scripts/process_dataset.py
index 0c628a3..1ae4f50 100644
--- a/scripts/process_dataset.py
+++ b/scripts/process_dataset.py
@@ -8,10 +8,10 @@ from pathlib import Path
 from vall_e.emb.g2p import encode as valle_phonemize
 from vall_e.emb.qnt import encode as valle_quantize, _replace_file_extension
 
-# things that could be args
+# to-do: use argparser
 input_audio = "voices"
-input_metadata = "metadata"
-output_dataset = "training-24K"
+input_metadata = "training/metadata"
+output_dataset = "training/data"
 device = "cuda"
 
 slice = "auto"
@@ -19,6 +19,7 @@ missing = {
 	"transcription": [],
 	"audio": []
 }
+dataset = []
 
 def pad(num, zeroes):
 	return str(num).zfill(zeroes+1)
@@ -63,6 +64,8 @@ for dataset_name in sorted(os.listdir(f'./{input_audio}/')):
 		waveform, sample_rate = None, None
 		language = metadata[filename]["language"] if "language" in metadata[filename] else "english"
 
+		dataset.append(f'{dataset_name}/{speaker_id}')
+
 		if len(metadata[filename]["segments"]) == 0 or not use_slices:
 			outpath = Path(f'./{output_dataset}/{dataset_name}/{speaker_id}/{fname}.{extension}')
 			text = metadata[filename]["text"]
@@ -148,4 +151,5 @@ for dataset_name in sorted(os.listdir(f'./{input_audio}/')):
 				print(f"Failed to quantize: {outpath}:", e)
 				continue
 
-open("./missing.json", 'w', encoding='utf-8').write(json.dumps(missing))
+open("./training/missing.json", 'w', encoding='utf-8').write(json.dumps(missing))
+open("./training/dataset_list.json", 'w', encoding='utf-8').write(json.dumps(dataset))
\ No newline at end of file
diff --git a/scripts/train_tokenizer.py b/scripts/train_tokenizer.py
index 6b4058a..cc22430 100644
--- a/scripts/train_tokenizer.py
+++ b/scripts/train_tokenizer.py
@@ -12,9 +12,9 @@ from tokenizers.trainers import BpeTrainer
 from tokenizers.pre_tokenizers import Whitespace
 from tokenizers.processors import TemplateProcessing
 
-input_metadata = "training-24K"
+input_metadata = "training/data"
 
-output_file = Path("./dataset.json")
+output_file = Path("./training/tokenizer_training_data.json")
 
 tokenizer_data = []
 
 def pad(num, zeroes):
 	return str(num).zfill(zeroes+1)
@@ -54,4 +54,4 @@ tokenizer.post_processor = TemplateProcessing(
 )
 
 tokenizer.train_from_iterator(tokenizer_data, trainer=trainer)
-tokenizer.save("./tokenizer.json")
\ No newline at end of file
+tokenizer.save("./training/tokenizer.json")
\ No newline at end of file
diff --git a/scripts/transcribe_dataset.py b/scripts/transcribe_dataset.py
index 3814a3c..657e527 100644
--- a/scripts/transcribe_dataset.py
+++ b/scripts/transcribe_dataset.py
@@ -7,14 +7,14 @@ import whisperx
 from tqdm.auto import tqdm
 from pathlib import Path
 
-# should be args
+# to-do: use argparser
 batch_size = 16
 device = "cuda"
 dtype = "float16"
 model_name = "large-v3"
 
 input_audio = "voices"
-output_dataset = "metadata"
+output_dataset = "training/metadata"
 
 skip_existing = True
 diarize = False
diff --git a/vall_e/engines/base.py b/vall_e/engines/base.py
index fa2f86d..eeca527 100755
--- a/vall_e/engines/base.py
+++ b/vall_e/engines/base.py
@@ -447,6 +447,7 @@ class Engines(dict[str, Engine]):
 			if not cfg.trainer.check_for_oom:
 				engine.backward(loss)
 			else:
+				# to-do: properly handle when one GPU throws an OOM because it just halts
 				try:
 					engine.backward(loss)
 				except RuntimeError as e:
@@ -460,9 +461,11 @@ class Engines(dict[str, Engine]):
 			if world_size() > 1:
 				all_reduce(n_ooms)
+
 			if n_ooms.item() > 0:
 				self.save_checkpoint()
-				raise RuntimeError("Out of memory during backwards pass!")
+
+				raise RuntimeError("Out of memory during backwards pass!")
 
 			engine.step()
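The `vall_e/engines/base.py` hunks above only annotate the OOM path with a to-do; for reference, a minimal standalone sketch of the pattern being patched (catch the OOM during the backwards pass, share the count across ranks, then checkpoint and bail) could look like the following. The function and `save_checkpoint` callback are placeholders, not the actual `vall_e` engine API:

```python
# Sketch of the OOM-guard pattern from the Engines hunk above; names are placeholders.
import torch
import torch.distributed as dist

def backward_with_oom_guard(loss: torch.Tensor, save_checkpoint) -> None:
    n_ooms = torch.zeros((), dtype=torch.int32, device=loss.device)
    try:
        loss.backward()
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        n_ooms += 1

    # Every rank must reach this collective, even the ones that did not OOM;
    # otherwise the healthy ranks would block forever inside all_reduce.
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(n_ooms)

    if n_ooms.item() > 0:
        # Checkpoint before aborting, mirroring the behavior in the hunk above.
        save_checkpoint()
        raise RuntimeError("Out of memory during backwards pass!")
```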