
config.py

This script handles everything related to storing configuration information, as well as:

  • loading the data.h5 file
  • loading the phoneme tokenizer

Thorough documentation pertaining to each field should already be noted alongside each line, or in the provided YAML.

BaseConfig

This serves as an agnostic base class that can be reused across additional projects.

Aside from accessing properties, the end user should not be required to interact with this.
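
As a rough sketch, such a base class might look like the following; the `from_yaml` classmethod and field names here are illustrative, not the script's actual API:

```python
from dataclasses import dataclass
from pathlib import Path

import yaml

# a minimal sketch of an agnostic base config class; names are hypothetical
@dataclass
class BaseConfig:
    yaml_path: Path | None = None  # where the config was loaded from, if anywhere

    @classmethod
    def from_yaml(cls, path: Path) -> "BaseConfig":
        # hydrate the dataclass from the plain dict parsed out of the YAML
        with open(path, "r", encoding="utf-8") as f:
            data = yaml.safe_load(f) or {}
        return cls(yaml_path=path, **data)
```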

Config

This serves as the class inheriting from BaseConfig, and contains instances of the following classes within it.

Additional global states can be found here, such as:

  • device: which device to load the model to
  • experimental: a debug flag
    • for the end user, this gates off experimental sampler settings in the web UI.
  • tokenizer: the tokenizer type to use
    • this is only really used for the ar+nar-retnet-8, as it used a naive tokenizer and vocab.
  • tokenizer_path: the path to the tokenizer's vocab to use
    • this should be left alone for the end user.
  • audio_backend: which audio backend to use.
    • supported options are encodec, vocos, and dac.
    • the end user should not touch this, as it not only depends on the model used, but also governs which audio codec processed audio is stored under for the dataset.
  • weights_format: the default weights format to save and load state dicts to
    • the end user shouldn't worry about this, as SafeTensors are primarily used, but the program can easily handle any pickled dicts if requested.
  • weights_name: the name (without the extension) to load the weights from directly. Defaults to fp32.
    • the end user shouldn't worry about this, but it makes regression testing much easier without needing to juggle renaming files.

On initialization, this class then validates its member variables to ensure they're instances of the below classes, rather than dicts.

  • Backwards compatibility validation may be performed during this step as well.
  • The tokenizer and HDF5 dataset (if requested) are instantiated and initialized here too.
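
A toy example of that dict-to-dataclass coercion, assuming dataclasses (the actual field set and validation logic are more involved):

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    training: list = field(default_factory=list)

@dataclass
class Config:
    dataset: "dict | Dataset" = field(default_factory=dict)

    def __post_init__(self):
        # coerce raw dicts (as parsed from the YAML) into their dataclass counterparts
        if isinstance(self.dataset, dict):
            self.dataset = Dataset(**self.dataset)

cfg = Config(dataset={"training": ["./data/train"]})
assert isinstance(cfg.dataset, Dataset)
```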

Dataset

This class contains configuration options pertaining to the dataset and dataloader for the program, as documented under /docs/data.md.

This is mostly agnostic, but VALL-E specific options can easily be gutted.

Model

This class contains configuration options pertaining to a loaded model, both model specifications and model-specific runtime options (such as the attention mechanism).

This can be stored alongside a state dict to allow for loading stored weights directly without need for a config YAML.

This is mostly agnostic, but VALL-E specific options can easily be gutted.
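
For example, the mechanism might look like this sketch; the checkpoint keys and stand-in model are hypothetical:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)    # stand-in for the actual model
model_config = {"dim": 4}  # stand-in for the Model config serialized as a dict

# store the config dict right alongside the weights (keys are illustrative)
torch.save({"module": model.state_dict(), "config": model_config}, "model.pth")

# later: no YAML needed, the config rides along with the checkpoint
state = torch.load("model.pth", map_location="cpu")
model = nn.Linear(state["config"]["dim"], state["config"]["dim"])
model.load_state_dict(state["module"])
```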

ModelExperimentalSettings

This class contains experimental knobs and dials that modify model, training, or inferencing behavior, and that offer zero guarantees.

The end user should not mess with these unless they know what they're doing, as output will greatly vary.

LoRA

Similar to Model, this stores settings pertaining to the LoRA(s) to load for training or inferencing.

Like Model, these settings can also be stored alongside a LoRA's state dict to be loaded directly without need for a config YAML.

Hyperparameters

This class defines the hyperparameters to use during training.

For the most part, when using prodigyopt, the only dials to care about are batch_size and gradient_accumulation_step.
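
For instance, a bare-bones training loop under prodigyopt might look like this sketch; the model, data, and step count are stand-ins:

```python
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(16, 1)                   # stand-in model
optimizer = Prodigy(model.parameters(), lr=1.0)  # Prodigy adapts its own step size

batch_size, gradient_accumulation_steps = 8, 4   # the two knobs that matter here
for step in range(32):
    x = torch.randn(batch_size, 16)              # synthetic batch
    loss = model(x).pow(2).mean()
    (loss / gradient_accumulation_steps).backward()  # average over micro-batches
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```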

Evaluation

This class governs the behavior during the evaluation / validation pass during training.

If cfg.evaluation.size > 0, then the evaluation / validation passes are triggered every cfg.evaluation.frequency iteration steps.
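
In rough pseudo-Python, the gating amounts to something like this (the helper is hypothetical; only the config fields are real):

```python
# illustrative training-loop gating; the actual trainer wiring differs
if cfg.evaluation.size > 0 and iteration % cfg.evaluation.frequency == 0:
    run_evaluation("subtrain")  # hypothetical helper; samples the training dataset
    run_evaluation("val")       # and the validation dataset, respectively
```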

During evaluation:

  • for the subtrain evaluation pass, the training dataset is directly sampled through indices, rather than the iterator, to avoid having to duplicate the dataset.
    • in the future, the samples during this pass should sample around the training dataloader's current position.
  • for the val validation pass, the validation dataset is sampled through the dataloader's iterator.
    • currently, the validation dataloader's sampler is not stored.

A total of cfg.evaluation.size samples are inferenced in no more than cfg.evaluation.batch_size-sized batches (no more than, because batched samplers may return different sized batches).

The resulting audio is then stored within the current log directory (./{YAML_PATH}/logs/{START_TIME}/{CURRENT_ITERATION}/), storing the input audio prompt, the resulting output, and the target output.

The resultant waveform is compared against the target waveform using AuraLoss's MelSTFTLoss, and the loss is logged.

  • To-do: replace this with a better method.
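
The comparison itself is roughly the following, assuming a 24kHz sample rate and (batch, channel, samples) waveform tensors:

```python
import torch
import auraloss

# illustrative; the real sample rate and shapes depend on the audio backend
mel_stft_loss = auraloss.freq.MelSTFTLoss(24_000)
output = torch.randn(1, 1, 24_000)  # (batch, channel, samples)
target = torch.randn(1, 1, 24_000)
loss = mel_stft_loss(output, target)  # scalar loss that gets logged
```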

The inference settings used for the evaluation / validation pass can be defined under cfg.evaluation.kwargs, where each entry should mirror the CLI arguments for inferencing.

Trainer

This class governs the trainer's behavior during training, such as:

  • which checkpoint to save to and load from
  • what to do when loading the state dict or checkpoint
  • when to save (or export), every X iterations
  • whether to catch thrown OOM exceptions, and what to do when one is caught (see the sketch after this list)
  • which Engine backend to use
  • what data type to load the model under for training, and whether to use mixed precision
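
As an example of the OOM handling, a training step guarded like the sketch below would skip the batch and recover (the model and optimizer are stand-ins):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batch = torch.randn(8, 16)

try:
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
except torch.cuda.OutOfMemoryError:
    # drop the partial gradients and release cached allocations, then carry on
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.empty_cache()
```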

DeepSpeed

This class handles the config dict that is passed to DeepSpeed for initialization.

This is where DeepSpeed-specific features are configured, like "compression training" (which, for the purposes of VALL-E, is superfluous) and ZeRO (which, for the purposes of VALL-E, is only really needed when training on very low VRAM).

The dict can be overridden under cfg.trainer.deepspeed.config, to explicitly provide options.
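
A hedged sketch of that flow; the default keys shown are illustrative, not the program's actual generated config:

```python
import deepspeed
import torch

model = torch.nn.Linear(16, 16)  # stand-in model

# generated defaults (illustrative), with user overrides merged over them,
# mirroring what cfg.trainer.deepspeed.config provides
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
}
ds_config.update({"bf16": {"enabled": True}})  # stand-in for the override dict

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```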

Inference

This class handles inferencing behavior, such as:

  • which Engine backend to use
  • what data type to load the model under for inferencing, and whether to use mixed precision

Optimizations

This class handles enabling requested optimization techniques and frameworks, such as:

  • BitsAndBytes
  • DAdaptation
  • BitNet
  • Nvidia's TransformerEngine's FP8
  • Unsloth input tensor offloading

as well as modifying how those optimization techniques and frameworks are applied, by either replacing the original modules within the model, or by injecting the optimized versions over the original classes.

  • In other words, replace will not override the original classes under torch, while inject is a more invasive method.
  • For all intents and purposes, use replace.
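
A loose sketch of the distinction, using bitsandbytes' 8-bit linear as the optimized stand-in (the program's actual implementation differs):

```python
import torch
import bitsandbytes as bnb

def replace_linears(module: torch.nn.Module):
    # "replace": walk the model and swap nn.Linear instances for the optimized
    # equivalent, leaving torch's own classes untouched
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            setattr(module, name, bnb.nn.Linear8bitLt(
                child.in_features, child.out_features,
                bias=child.bias is not None,
            ))
        else:
            replace_linears(child)

# "inject" would instead monkeypatch the class globally, which is far more invasive:
# torch.nn.Linear = bnb.nn.Linear8bitLt
```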

Additionally, an experimental method of offloading the model between different devices can be done through model_offloading.

  • However, this feature needs validation, as this was partially tested forever ago.

NaiveTokenizer

This is a simple class that handles tokenizing in my original, naive way. The ar+nar-retnet-8 uses this form of tokenizing, which mainly does some funny string manipulation to handle token merges.

The reference model ar+nar-llama-8 could use this, but given how reliant it is on the remaining tokens in the vocab being merges, it would require better merging logic.
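
For illustration only, naive greedy-merge tokenizing amounts to something like this (the vocab here is made up; the real one lives with the model):

```python
# toy vocab; the real one maps phonemes, with merged tokens at the end
vocab = {"<s>": 0, "</s>": 1, " ": 2, "a": 3, "b": 4, "ab": 5}
merges = sorted((t for t in vocab if len(t) > 1), key=len, reverse=True)

def encode(text: str) -> list[int]:
    tokens, i = [], 0
    while i < len(text):
        # prefer the longest merged token that matches at this position
        match = next((m for m in merges if text.startswith(m, i)), text[i])
        tokens.append(vocab[match])
        i += len(match)
    return tokens

print(encode("ab a b"))  # [5, 2, 3, 2, 4]
```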