adjustments

mrq 2023-08-02 22:01:49 +00:00
parent bf8cedc9dd
commit d88e43800b
2 changed files with 20 additions and 35 deletions

View File

@@ -10,9 +10,9 @@ An unofficial PyTorch implementation of [VALL-E](https://valle-demo.github.io/),
> **Note** This README won't get much love until I truly nail out a quasi-decent model.
-* **Note** Distributed training seems broken? I'm not really sure how to test it, as my two 6800XTs have been redistributed for now, and the last time I tried using them for this, things weren't good.
+> **Note** Distributed training seems broken? I'm not really sure how to test it, as my two 6800XTs have been redistributed for now, and the last time I tried using them for this, things weren't good.
-* **Note** You can follow along with my pseudo-blog in an issue [here](https://git.ecker.tech/mrq/ai-voice-cloning/issues/152). I currently have a dataset clocking in at 3400+ trimmed hours.
+> **Note** You can follow along with my pseudo-blog in an issue [here](https://git.ecker.tech/mrq/ai-voice-cloning/issues/152). I currently have a dataset clocking in at 3400+ trimmed hours.
### Requirements
@@ -31,6 +31,7 @@ git clone --recurse-submodules https://git.ecker.tech/mrq/vall-e.git
```
Note that the code is only tested under `Python 3.10.9`.
+* `fairseq`, a pseudo-dependency for `torchscale`, is not compatible with `Python 3.11`.
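Since the code is only tested under `Python 3.10.9` and `fairseq` breaks on 3.11, a minimal environment sketch might look like the following (the clone command is taken from the install section above; the editable `pip install -e .` step is an assumption and not part of this diff):
```
python3.10 -m venv venv              # pin to 3.10; fairseq does not support 3.11
source venv/bin/activate

git clone --recurse-submodules https://git.ecker.tech/mrq/vall-e.git
cd vall-e
pip install -e .                     # hypothetical install step, not shown in this diff
```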
### Train
@@ -39,20 +40,6 @@ Training is very dependent on:
* how much data you have.
* the bandwidth you quantized your audio to.
-#### Quick Preparations
-##### Prepared Dataset
-Under `./scripts/download_libritts-small.sh` is a script that will quickly set up an already-prepared dataset to train against. This leverages a repo I've published to HuggingFace that contains everything processed, straight from the method below.
-##### Prepare It Yourself
-Under `./scripts/prepare_libri.sh` is a small script to quickly set up a dataset based on LibriSpeech-Finetuning. It'll handle everything from downloading, to extracting, to preparing, to quantizing and phonemizing.
-Afterwards, simply use `./config/libri/config.yaml` as your target YAML.
-However, you'll only train against a small subset of the data with the default settings, due to the configured maximum phoneme length. Increasing this will not only drastically increase VRAM usage, but also reduce iteration rates. It's recommended to further process your files by slicing them down (for example, through [mrq/ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning)).
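For the script-based route above, the end-to-end invocation is just the scripts plus the usual training entry point; a sketch using only the paths named above:
```
bash ./scripts/download_libritts-small.sh   # or: bash ./scripts/prepare_libri.sh
python -m vall_e.train yaml=./config/libri/config.yaml
```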
#### Leverage Your Own
1. Put your data into a folder, e.g. `./data/custom`. Audio files should be named with the suffix `.wav` and text files with `.normalized.txt`.
@@ -66,15 +53,15 @@ python -m vall_e.emb.qnt ./data/custom
3. Generate phonemes based on the text:
```
-python -m vall_e.emb.g2p data/custom
+python -m vall_e.emb.g2p ./data/custom
```
-4. Customize your configuration by creating `./config/custom.yml`. Refer to the example configs in `./config/libri-quarter.yaml` and `./vall_e/config.py` for details. If you want to choose between different model presets, check `./vall_e/models/__init__.py`.
+4. Customize your configuration by modifying `./data/config.yml`. Refer to `./vall_e/config.py` for details. If you want to choose between different model presets, check `./vall_e/models/__init__.py`.
5. Train the AR and NAR models using the following scripts:
```
-python -m vall_e.train yaml=config/custom/config.yml
+python -m vall_e.train yaml=./data/config.yml
```
You may quit your training any time by just typing `quit` in your CLI. The latest checkpoint will be automatically saved.
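Putting the updated steps together, the post-change workflow looks roughly like this (the `utt_0001` filenames are purely illustrative; the commands and paths come from the steps above):
```
# 1. Pair up audio and transcripts under ./data/custom
#    ./data/custom/utt_0001.wav
#    ./data/custom/utt_0001.normalized.txt

# 2. Quantize the audio
python -m vall_e.emb.qnt ./data/custom

# 3. Generate phonemes from the text
python -m vall_e.emb.g2p ./data/custom

# 4. Adjust ./data/config.yml (see ./vall_e/config.py for the available fields)

# 5. Train the AR and NAR models; type `quit` to stop and save a checkpoint
python -m vall_e.train yaml=./data/config.yml
```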

View File

@@ -15,8 +15,8 @@ dataset:
workers: 8
cache: True
-phones_range: [4, 192]
-duration_range: [1.0, 10.0]
+phones_range: [4, 256]
+duration_range: [1.0, 12.0]
random_utterance: 1.0
max_prompts: 3
@@ -25,24 +25,20 @@ dataset:
models:
_models:
- name: "ar"
size: "full"
size: "quarter"
resp_levels: 1
-use_retnet: True
-full_retnet: True
-use_torchscale: True
+arch_type: "retnet"
- name: "nar"
size: "full"
size: "quarter"
resp_levels: 1
-use_retnet: True
-full_retnet: True
-use_torchscale: True
+arch_type: "retnet"
prom_levels: 2
hyperparameters:
-batch_size: 16
-gradient_accumulation_steps: 8
+batch_size: 32
+gradient_accumulation_steps: 4
gradient_clipping: 100
optimizer: Adamw
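The batch-size and accumulation changes cancel out, so the effective batch stays at 128; a quick sanity check (nothing repo-specific):
```
echo $((16 * 8)) $((32 * 4))   # prints "128 128": the effective batch size is unchanged
```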
@@ -68,11 +64,11 @@ hyperparameters:
# decay_mom_rate: 0.0
evaluation:
-batch_size: 64
+batch_size: 32
frequency: 250
-size: 64
+size: 32
-steps: 500
+steps: 300
temperature: 1.0
trainer:
@@ -96,4 +92,6 @@ trainer:
weight_dtype: bfloat16
zero_optimization_level: 2
-use_compression_training: True
+use_compression_training: True
+use_vocos: False