adjustments

mrq 2023-08-02 22:01:49 +00:00
parent bf8cedc9dd
commit d88e43800b
2 changed files with 20 additions and 35 deletions

View File

@@ -10,9 +10,9 @@ An unofficial PyTorch implementation of [VALL-E](https://valle-demo.github.io/),
> **Note** This README won't get much love until I truly nail out a quasi-decent model.
-* **Note** Distributed training seems broken? I'm not really sure how to test it, as my two 6800XTs have been redistributed for now, and the last time I tried using them for this, things weren't good.
+> **Note** Distributed training seems broken? I'm not really sure how to test it, as my two 6800XTs have been redistributed for now, and the last time I tried using them for this, things weren't good.
-* **Note** You can follow along with my pseudo-blog in an issue [here](https://git.ecker.tech/mrq/ai-voice-cloning/issues/152). I currently have a dataset clocking in at 3400+ trimmed hours.
+> **Note** You can follow along with my pseudo-blog in an issue [here](https://git.ecker.tech/mrq/ai-voice-cloning/issues/152). I currently have a dataset clocking in at 3400+ trimmed hours.
### Requirements
@@ -31,6 +31,7 @@ git clone --recurse-submodules https://git.ecker.tech/mrq/vall-e.git
```
Note that the code is only tested under `Python 3.10.9`.
+* `fairseq`, a pseudo-dependency for `torchscale`, is not compatible with `Python 3.11`.
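Since the code is only tested under `Python 3.10.9` and `fairseq` breaks on 3.11, a minimal environment sketch might look like the following (the clone command is taken from the install section above; the editable `pip install -e .` step is an assumption and not part of this diff):
```
python3.10 -m venv venv              # pin to 3.10; fairseq does not support 3.11
source venv/bin/activate

git clone --recurse-submodules https://git.ecker.tech/mrq/vall-e.git
cd vall-e
pip install -e .                     # hypothetical install step, not shown in this diff
```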
### Train
@@ -39,20 +40,6 @@ Training is very dependent on:
* how much data you have.
* the bandwidth you quantized your audio to.
-#### Quick Preparations
-##### Prepared Dataset
-Under `./scripts/download_libritts-small.sh` is a script that will quickly set up an already-prepared dataset to train against. This leverages a repo I've published to HuggingFace that contains everything processed, straight from the method below.
-##### Prepare It Yourself
-Under `./scripts/prepare_libri.sh` is a small script to quickly set up a dataset based on LibriSpeech-Finetuning. It'll handle everything from downloading, to extracting, to preparing, to quantizing and phonemizing.
-Afterwards, simply use `./config/libri/config.yaml` as your target YAML.
-However, you'll only train against a small subset of the data with the default settings, due to the configured maximum phoneme length. Increasing this will not only drastically increase VRAM usage, but also reduce iteration rates. It's recommended to further process your files by slicing them down (for example, through [mrq/ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning)).
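For the script-based route above, the end-to-end invocation is just the scripts plus the usual training entry point; a sketch using only the paths named above:
```
bash ./scripts/download_libritts-small.sh   # or: bash ./scripts/prepare_libri.sh
python -m vall_e.train yaml=./config/libri/config.yaml
```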
#### Leverage Your Own
1. Put your data into a folder, e.g. `./data/custom`. Audio files should be named with the suffix `.wav` and text files with `.normalized.txt`.
@@ -66,15 +53,15 @@ python -m vall_e.emb.qnt ./data/custom
3. Generate phonemes based on the text:
```
-python -m vall_e.emb.g2p data/custom
+python -m vall_e.emb.g2p ./data/custom
```
-4. Customize your configuration by creating `./config/custom.yml`. Refer to the example configs in `./config/libri-quarter.yaml` and `./vall_e/config.py` for details. If you want to choose between different model presets, check `./vall_e/models/__init__.py`.
+4. Customize your configuration by modifying `./data/config.yml`. Refer to `./vall_e/config.py` for details. If you want to choose between different model presets, check `./vall_e/models/__init__.py`.
5. Train the AR and NAR models using the following scripts:
```
-python -m vall_e.train yaml=config/custom/config.yml
+python -m vall_e.train yaml=./data/config.yml
```
You may quit your training any time by just typing `quit` in your CLI. The latest checkpoint will be automatically saved.
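Putting the updated steps together, the post-change workflow looks roughly like this (the `utt_0001` filenames are purely illustrative; the commands and paths come from the steps above):
```
# 1. Pair up audio and transcripts under ./data/custom
#    ./data/custom/utt_0001.wav
#    ./data/custom/utt_0001.normalized.txt

# 2. Quantize the audio
python -m vall_e.emb.qnt ./data/custom

# 3. Generate phonemes from the text
python -m vall_e.emb.g2p ./data/custom

# 4. Adjust ./data/config.yml (see ./vall_e/config.py for the available fields)

# 5. Train the AR and NAR models; type `quit` to stop and save a checkpoint
python -m vall_e.train yaml=./data/config.yml
```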

View File

@@ -15,8 +15,8 @@ dataset:
workers: 8
cache: True
-phones_range: [4, 192]
-duration_range: [1.0, 10.0]
+phones_range: [4, 256]
+duration_range: [1.0, 12.0]
random_utterance: 1.0
max_prompts: 3
@@ -25,24 +25,20 @@ dataset:
models:
_models:
- name: "ar"
size: "full"
size: "quarter"
resp_levels: 1
-use_retnet: True
-full_retnet: True
-use_torchscale: True
+arch_type: "retnet"
- name: "nar"
size: "full"
size: "quarter"
resp_levels: 1
-use_retnet: True
-full_retnet: True
-use_torchscale: True
+arch_type: "retnet"
prom_levels: 2
hyperparameters:
-batch_size: 16
-gradient_accumulation_steps: 8
+batch_size: 32
+gradient_accumulation_steps: 4
gradient_clipping: 100
optimizer: Adamw
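The batch-size and accumulation changes cancel out, so the effective batch stays at 128; a quick sanity check (nothing repo-specific):
```
echo $((16 * 8)) $((32 * 4))   # prints "128 128": the effective batch size is unchanged
```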
@@ -68,11 +64,11 @@ hyperparameters:
# decay_mom_rate: 0.0
evaluation:
-batch_size: 64
+batch_size: 32
frequency: 250
-size: 64
+size: 32
-steps: 500
+steps: 300
temperature: 1.0
trainer:
@@ -96,4 +92,6 @@ trainer:
weight_dtype: bfloat16
zero_optimization_level: 2
-use_compression_training: True
+use_compression_training: True
+use_vocos: False