more notes

mrq 2024-11-06 13:51:28 -06:00
parent bfc5e1d723
commit bcabde3454
7 changed files with 358 additions and 22 deletions

View File

@ -15,6 +15,7 @@ Unlike the paper, this VALL-E aims to:
+ Bark was needlessly complex in providing even additional voices to use.
+ Current SoTA such as F5-TTS supports it, but seems to have a rather high ceiling to finetune it.
* provide decent zero-shot text-to-speech synthesis, both without requiring sampling adjustments and providing thorough sampler settings.
* provide additional, easy to use functionality, that other solutions don't offer.
## Caveats

View File

@ -4,6 +4,129 @@ This script handles everything related to storing configuration information, as
* loading the `data.h5` file
* loading the phoneme tokenizer
Most documentation should already be noted alongside each line, or in the [provided YAML](/data/config.yaml).
Thorough documentation pertaining to each field should already be noted alongside each line, or in the [provided YAML](/data/config.yaml).
To be filled.
## `BaseConfig`
This serves as an agnostic base class that can be reused across additional projects.
Aside from accessing properties, the end user should not be required to interact with this.
## `Config`
This class inherits from `BaseConfig`, and contains instances of the following classes within it.
Additional global-states can be found here, such as:
* `device`: which device to load the model to
* `experimental`: a debug flag
* for the end user, this gates off experimental sampler settings in the web UI.
* `tokenizer`: the tokenizer type to use
* this only really is used for the `ar+nar-retnet-8`, as it used a naive tokenizer and vocab.
* `tokenizer_path`: the path to the tokenizer's vocab to use
* this should be left alone for the end user.
* `audio_backend`: which audio backend to use.
* supported options are `encodec`, `vocos`, and `dac`.
* the end user should not touch this, as this not only depends on the model used, but also governs what audio codec to store processed audio under for the dataset.
* `weights_format`: the default weights format to save and load state dicts to
* the end user shouldn't worry about this, as SafeTensors are primarily used, but the program can easily handle any pickled dicts if requested.
On initialization, this class then validates its member variables to ensure they're instances of the below classes, rather than dicts.
* Backwards compatibility validation may be performed during this step as well.
* The tokenizer and HDF5 dataset (if requested) are instantiated and initialized here too.
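The dict-to-class validation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `workers` field and the trimmed-down `Dataset` / `Config` classes are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Dataset:
    # hypothetical field, only here to give the class something to hold
    workers: int = 8


@dataclass
class Config:
    device: str = "cuda"
    dataset: Union[Dataset, dict] = field(default_factory=Dataset)

    def __post_init__(self):
        # A YAML loader hands back plain dicts, so promote them to typed instances.
        if isinstance(self.dataset, dict):
            self.dataset = Dataset(**self.dataset)


cfg = Config(dataset={"workers": 2})
print(type(cfg.dataset).__name__, cfg.dataset.workers)
```

Doing the conversion in one place means the rest of the program can rely on attribute access rather than checking for dicts everywhere.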
## `Dataset`
This class contains configuration options pertaining to the dataset and dataloader for the program, as documented under [/docs/data.md](/docs/data.md).
This is *mostly* agnostic, but VALL-E specific options can easily be gutted.
## `Model`
This class contains configuration options pertaining to a loaded model, both model specifications and model-specific runtime options (such as the attention mechanism).
This can be stored alongside a state dict to allow for loading stored weights directly without need for a config YAML.
This is *mostly* agnostic, but VALL-E specific options can easily be gutted.
### `ModelExperimentalSettings`
This class contains experimental knobs and dials that modify model, training, or inferencing behavior, and that offer zero guarantees.
The end user should *not* mess with these unless they know what they're doing, as output will greatly vary.
## `LoRA`
Similar to `Model`, this stores settings pertaining to the LoRA(s) to load for training or inferencing.
Like `Model`, these settings can also be stored alongside a LoRA's state dict to be loaded directly without need for a config YAML.
## `Hyperparameters`
This class defines the hyperparameters to use during training.
For the most part, when using `prodigyopt`, the only dials to care about are `batch_size` and `gradient_accumulation_steps`.
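As a quick worked example of how those two dials interact, using the default values from the `Hyperparameters` class (the per-device framing is my assumption):

```python
batch_size = 8                    # samples per training batch
gradient_accumulation_steps = 32  # batches accumulated before an optimizer update

# Samples that contribute to a single optimizer update, per device.
effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)
```

Lowering `batch_size` to fit in VRAM while raising `gradient_accumulation_steps` by the same factor keeps this product constant.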
## `Evaluation`
This class governs the behavior during the evaluation / validation pass during training.
If `cfg.evaluation.size > 0`, then the evaluation / validation passes are triggered every `cfg.evaluation.frequency` iteration steps.
During evaluation, a separate copy of the training dataset will be sampled and the inputs will be inferenced to generate an output, while during validation, the validation dataset is sampled from instead.
A total of `cfg.evaluation.size` samples are inferenced in no more than `cfg.evaluation.batch_size`-sized batches (no more than, because batched samplers may return different sized batches).
The resulting audio is then stored within the current log directory (`./{YAML_PATH}/logs/{START_TIME}/{CURRENT_ITERATION}/`), storing the input audio prompt, the resulting output, and the target output.
The resultant waveform is compared against the target waveform using AuraLoss's `MelSTFTLoss` to measure similarity, and the loss is logged.
* To-do: replace this with a better method.
The inference settings used for the evaluation / validation pass can be defined under `cfg.evaluation.kwargs`, where each entry should mirror the CLI arguments for inferencing.
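The schedule and batching described above can be sketched as below. Both helper names are hypothetical, and the exact step at which the real trainer fires is an assumption here.

```python
def should_evaluate(step: int, size: int, frequency: int) -> bool:
    # Evaluation is disabled when size is 0; otherwise run every `frequency` steps.
    return size > 0 and frequency > 0 and step % frequency == 0


def batch_indices(size: int, batch_size: int) -> list[list[int]]:
    # Split `size` samples into batches of at most `batch_size` entries each.
    return [
        list(range(start, min(start + batch_size, size)))
        for start in range(0, size, batch_size)
    ]


print(should_evaluate(step=500, size=8, frequency=250))
print(batch_indices(size=5, batch_size=2))
```

The last batch may be smaller than `batch_size`, which is the "no more than" caveat noted above.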
## `Trainer`
This class governs the trainer's behavior during training, from:
* which checkpoint to save and load from
* when loading the state dict or checkpoint
* when to save (or export) every X iterations
* what to do when an OOM error is caught, if it should catch those thrown exceptions
* which `Engine` backend to use
* what data type to load the model for training under, and whether to use mixed precision
### `DeepSpeed`
This class handles the config dict that is passed to DeepSpeed for initialization.
This covers DeepSpeed-specific features like "compression training" (which for the purpose of VALL-E is superfluous) and use of ZeRO (which for the purpose of VALL-E is only really needed if you're on very low VRAM for training).
The dict can be overridden under `cfg.trainer.deepspeed.config`, to explicitly provide options.
## `Inference`
This class handles inferencing behavior, such as:
* which `Engine` backend to use
* what data type to load the model for inferencing under, and whether to use mixed precision
## `Optimizations`
This class handles enabling requested optimization techniques and frameworks, such as:
* BitsAndBytes
* DAdaptation
* BitNet
* Nvidia's TPE's FP8
* Unsloth input tensor offloading
as well as modifying how optimization techniques and frameworks are applied, either by replacing the original module within the model, or by injecting the optimized version over the original.
* In other words, `replace` will not override the original classes under torch, while `inject` is a more invasive method.
* For all intents and purposes, use `replace`.
Additionally, an experimental method of offloading the model between different devices can be done through `model_offloading`.
* However, this feature needs validation, as this was partially tested forever ago.
---
## `NaiveTokenizer`
This is a simple class that handles tokenizing in my original, naive way. The `ar+nar-retnet-8` uses this form of tokenizing, which mainly does some funny string manipulation to handle token merges.
The reference model `ar+nar-llama-8` *could* use this, but given how reliant it is on the remaining tokens in the vocab being merges, it requires better merging logic.

View File

@ -17,6 +17,20 @@ By default, punctuation, stress markers, and stripping are enabled by default, b
To avoid memory leaking through `phonemizer`, backends and instances are cached for further reuse.
### Text Tokens
Despite being an audio LM, the model still needs some form of text as the input prompt.
While it's possible to naively use raw text, it's much more beneficial to opt for tokenizing IPAs instead, as they're (mostly) not tied to the language itself.
For the meantime, this project depends heavily on `phonemizer` to process normal text into IPAs.
In the future, a separate model that handles converting text into phonemes is preferred, but:
* this requires an extensive vocab *per language*.
* this either requires an additional model to lug around and have trained, or repurposing the existing model to perform such a task.
+ The latter option does open the way to taking normal text as inputs itself, as the model should be aware enough about mapping text to IPAs.
+ This *technically* can be done, as it just requires a separate input embedding + output head per language, but training without hindering the model would be a chore.
## `qnt.py`
This script handles taking audio waveforms and encoding them as code tokens to run through the model, and taking code tokens outputted from the model and decoding them into raw waveforms.
@ -40,11 +54,27 @@ For audio backends:
- models at 24KHz + 8kbps will NOT converge in any manner.
- models at 44KHz + 8kbps seem harder to model its "language", and the NAR side of the model suffers greatly.
#### Descript-Audio-Codec
Descript-Audio-Codec was thoroughly tested for promising much, much cleaner output audio, as this model encodes/decodes at 44.1KHz, rather than EnCodec's 24KHz.
However, due to the nature of the codec, simply throwing it at an attention-based transformer proves to be painful, as a unified AR+NAR model *heavily* suffers from noisy output in the NAR.
Ironically, testing through mal-encoded audio (feeding 24KHz audio without upsampling to 44.1KHz) proved to have "cleaner" but bad utterances.
I'm uncertain on how to remedy this, as my options are:
* train under a RetNet, if an attention-based transformer is simply the problem
* train an AR, and train a NAR, if the codec itself is at fault
* use an SSM like Mamba, if transformers entirely cannot model the codec
* train a separate model that simply converts from EnCodec to DAC
## `transcribe.py`
This script handles taking raw input audio, and outputting adequate metadata containing transcriptions of said audio through `whisperX`.
The process maintains slices `whisperX` thinks its best per the segments outputted.
The process maintains the slices `whisperX` thinks are best per the segments outputted, alongside the deduced language (if not specified).
One limiting factor is that transcription transcribes into normal text, rather than the IPA phonemes the model was trained against. Some flavors *may* exist, but I have yet to test them extensively (if I did ever find one).
Refer to the `__main__`'s arguments for usage details.
@ -59,7 +89,15 @@ Refer to the `__main__`'s arguments for usage details.
## `similar.py`
This script handles taking either raw input audio, or processed encoded audio, and determines the top-K similar utterances for each sample for a given speaker (or dataset).
* For raw input audio, the MFCCs (Mel-frequency cepstral coefficients) are extracted as features from the waveform, and the cosine similarities are compared against every other utterance for a given speaker.
* This works *fine*, as this is adequately accurate and does not require a model to already exist.
* For the encoded audio, the audio codes are passed through the model's embedding, summed to one "token", and the cosine similarities are compared to score the top-K similar speakers.
* By default, the output response embedding is used, and each RVQ level is summed together to leave one sequence.
* In theory this should be better as the model may have its own features per RVQ code+level, but still requires a model to already be trained.
* The original encoding model's embeddings can also be used instead, or the last hidden states passed through the model, but this seems overkill.
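The ranking step shared by both modes can be sketched as below. This is a plain-Python illustration under the assumption that each utterance has already been reduced to one feature vector (MFCC-derived or embedding-derived); the function names are hypothetical and not the script's API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(features: dict[str, list[float]], k: int) -> dict[str, list[str]]:
    # For every utterance, rank all *other* utterances by cosine similarity.
    result = {}
    for name, vector in features.items():
        scored = [
            (cosine_similarity(vector, other), other_name)
            for other_name, other in features.items()
            if other_name != name
        ]
        scored.sort(reverse=True)
        result[name] = [other_name for _, other_name in scored[:k]]
    return result


features = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(top_k_similar(features, k=1))
```

This compares every pair, so the cost grows quadratically with the number of utterances per speaker.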
When processing a dataset, this requires already having accompanying metadata generated through `vall_e.data --action=metadata --yaml=./your/training/config.yaml`.
Be *very* careful if you opt to output unsegmented and segmented utterances, as the sliced version may end up amongst the top-K similar candidates.
Refer to the `__main__`'s arguments for usage details.

View File

@ -1,6 +1,173 @@
# Model Notes
To be filled.
The underlying model is a robust transformer, where:
* inputs are passed through an embedding
* the embedded inputs are then passed through each layer of the transformer (or other model type)
* the last hidden states are then passed through the output head / classifier / projection, resulting in logit probabilities to sample from.
The inputs are sequenced in a way that the given task requires automatically, and the outputs are handled as per the class that extends the base model.
While the original paper called for a separate AR model and a NAR model, you can actually train a unified model for effectively free, as the internal states of the two should overlap quite a lot.
## The AR (Autoregressive) Model
The AR is responsible for generating the first RVQ level of the audio codes for a given output. References to "outputs from the AR" refer to this level, as it contributes to the final waveform the most.
* The benefit of autoregressively decoding for this code is that it offers better output while also "encoding" the duration within the sequence itself, as the stop token will depend on the length of the sequence.
* The downside is that it does take most of the compute time to iterate through the sequence one step at a time.
Autoregressive training is performed by having each token predict the next token in the sequence. This is done by appending a special stop token to the input targets, then shifting the output logits over one compared to the input targets (this shift can be more than one to decode more than one token).
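The stop token and shift can be sketched on a bare list of codes. This is a simplification: the stop-token id is an assumed value, and the real sequence also carries the text and prompt ahead of the response.

```python
STOP = 1024  # assumed stop-token id, one past a 1024-entry codebook


def ar_training_pair(codes: list[int], shift: int = 1) -> tuple[list[int], list[int]]:
    # Append the stop token, then align position t with the token at t + shift.
    targets = codes + [STOP]
    inputs = targets[:-shift]
    labels = targets[shift:]
    return inputs, labels


inputs, labels = ar_training_pair([5, 9, 2])
print(inputs)
print(labels)
```

Each position in `inputs` is scored against the entry at the same index in `labels`, so the final position learns to emit the stop token.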
One way to work around the time cost is to instead decode more than one token at a time.
* In theory, for a unified AR+NAR model, this *should* be an easy task, as the model can already decode tokens in parallel.
* In reality, this isn't the case. Specifying a `cfg.model.experimental.causal_size > 1` will have the output sound *fine* every Nth timestep, as the following tokens aren't predictable enough.
+ *However*, this may simply be a sampling problem, as this experiment was done with outdated ideas on how to sample the AR, and should be worth revisiting.
* VALL-E 2's paper proposes merging code sequences together into one embedded token for a speedup, but their solution seems rather complex to warrant a fundamental retrain.
I personally feel that autoregressive encoding offers a specific-yet-hard-to-quantify expressive quality that the NAR (and pure NAR solutions) does not offer, but further testing is required to substantiate the claim.
## The NAR (Non-autoregressive) Model
The NAR is responsible for generating the remaining RVQ levels of the audio codes for a given output. References to the "outputs from the NAR" refer to the underlying "levels" for a given waveform, as each further level contributes to the final waveform less significantly than the previous.
As decoding is done non-autoregressively, the model can process tokens "in place" and have them attended to one another in the past and future, thus speeding up output and allowing for "more accurate" outputs.
Non-autoregressive training is performed by having the input tokens from the previous RVQ level predict the next level's token in place. The output logits are in the same position, and do not require further modifications as required for the AR.
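In contrast to the AR, the NAR pairing needs no shift. A minimal sketch, assuming the codes are stored as one list per RVQ level (the helper name is hypothetical):

```python
def nar_training_pair(
    codes_by_level: list[list[int]], target_level: int
) -> tuple[list[int], list[int]]:
    # The previous level's tokens are the input; the target level's tokens
    # are the labels, aligned position by position.
    if not 1 <= target_level < len(codes_by_level):
        raise ValueError("target_level must be one of the NAR levels (>= 1)")
    return codes_by_level[target_level - 1], codes_by_level[target_level]


levels = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(nar_training_pair(levels, target_level=2))
```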
However, having a pure NAR is challenging, as you need to both explicitly provide the duration and provide a "good enough" starting sequence of tokens for the initial sequence.
* The former problem is easily "solved" by training a `len` inferencing task, where the given input predicts the requested duration for a given utterance autoregressively.
* The latter however proves to be a bit of a challenge, as this could be anything from random noise to a unique token.
* Testing showed that it's easy to predict the duration, but decoding the first RVQ level accurately proves to be a chore.
* Initially, output seemed chaotic and unreliable, but further experiments showed the model will "work" for a brief moment before going silent.
One problem exhibited from a NAR is producing artifacts ("crust") in the final waveform. I believe this is a confidence problem where the wrong token is inferred.
* Unfortunately, one solution is to simply train a separate NAR, as this should help bolster the model's NAR capabilities without the AR influencing things, as I imagine being able to both causally and parallel-ly decode tokens harms things.
* This is backed by the used `cfg.model.experimental.rvq_levels_p` distribution affecting the model's AR capabilities, as increasing the NAR's share in training causes the AR to perform *less*.
* However, this may be simply wrong, but checkpoints that used such distributions felt lobotomized.
## Embeddings
The "magic" of subjugating a transformer for audio use lies within the ensemble of the embeddings. This is necessary as each piece of a sequence is fundamentally different, but a HF-compatible model can get away with treating each sequence as separate ranges within a total token sequence.
### Text Embeddings
The input text phonemes (or output for STT) are passed through an embedding head (`text`), similar to how a normal text LLM would. Nothing fancy is required, as it's very straightforward.
Technically, due to how the audio embeddings are implemented, it's possible to offer "language specific" embeddings, rather than one unified IPA-based embedding + a language embedding (`lang`).
* Such an implementation *could* in fact inference from normal text rather than IPA phonemes.
#### Language Embedding
This embedding provides the requested language for the model to be aware of.
This *mostly* isn't necessary, but VALL-E X's paper mentions needing a token for the language itself, and other solutions like XTTS2 provide a language token as well.
In practice, this seems to help govern the accent and general mannerisms associated with that language. For example, prompting French or German with the language set to `en` will give the typical foreigner speech of someone trying to speak a language they don't know.
* Consequently, since this does tie to accents more, ***extreme*** attention is to be paid to the dialects being trained against, instead of naively grouping, say, all of Spanish into one language code.
This embedding probably helps the model with being able to perform cross-lingual outputs, but I did not do any experimentations on a model without this, as the reference `ar+nar-llama-8` was trained with this from the beginning (and maybe the `ar+nar-retnet-8` experiment).
#### Tone Embedding
This embedding *should* provide information on the tone for the model to output the utterance in.
Should, since I do not actually make use of this anywhere, and the model is not trained against any tones. I would need to annotate my dataset based on tones *and* pick which tones I do want.
This should most definitely help the model identify tone strongly even without needing to annotate for it, but it does an adequate job already with maintaining tone from a given input prompt.
### Audio Embeddings
However, due to the nature of the encoded audio, embedding the audio tokens requires the dark arts, as we use audio both as an input prompt (`prom`) for guidance, and as an output response (`resp`).
As EnCodec encodes audio across eight codebooks (and DAC's 44KHz audio under nine codebooks), our audio is encoded under a 2D space, rather than a simple 1D space like text. Because of this, we require embeddings for *every* codebook level, effectively giving eight embedding heads for audio.
* Technically, this can be stored within a unified embedding head, but each layer is offset by 1024 (the number of tokens).
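The offset scheme for a unified embedding head can be sketched as below, assuming a 1024-entry codebook per level as stated above. The helper name is hypothetical.

```python
CODEBOOK_SIZE = 1024  # tokens per RVQ level


def to_unified_ids(codes_by_level: list[list[int]]) -> list[list[int]]:
    # Shift each level's codes into its own range of one shared embedding table.
    return [
        [level * CODEBOOK_SIZE + code for code in codes]
        for level, codes in enumerate(codes_by_level)
    ]


print(to_unified_ids([[0, 5], [0, 5]]))
```

With eight levels, the shared table would need `8 * 1024` rows so that identical code values on different levels never collide.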
For the `prom` embedding, we can simply use each embedding for each layer. Each embedding level maps to its respective RVQ level.
However, the `resp` requires some extra care, as the model needs to both causally (AR) and parallel-ly (NAR) decode tokens.
* The first embedding level pertains to RVQ level 0 for the AR.
* The remaining embedding levels map to RVQ level 0 + n for the NAR.
* In other words, embedding level 1 => RVQ level 0, embedding level 2 => RVQ level 1, etc...
* I believe this is because the model needs to "know" whether to predict the next token in the sequence, or the token in the same position of the next RVQ level.
* Unfortunately, providing a token for the current/target RVQ level within the input sequence doesn't seem to help? I don't remember if I experimented with this or not, but testing of a "sane" `resp` embedding proved to be unfruitful.
The `prom` and `resp` are split since, in theory, it helps the model know better what audio to source from, and what audio is part of the output sequence. In theory.
* I have yet to conduct tests with interchanging the `prom` and `resp`, and the model definitely exhibits being able to map from the `prom` directly, and being able to inference from the `prom` being prefixed in the `resp`.
Finally, the model *may* then sum each embedding level back down to one sequence, as defined under `cfg.model.experimental.audio_embedding_sums`.
* The resultant sum is not normalized by the length.
* It's not a requirement, as the model can still function only "seeing" the required RVQ level.
* However, it *may* help to have the model being able to "see" prior levels, as one RVQ level might depend on the prior level.
* This is mostly dependent on the underlying audio model being used, which would depend on how each residual is defined.
* A model not trained with summing embeddings can enable it without much impact, but a model trained on summing embeddings cannot go in the other way without further training.
* It *could* be beneficial to train a model under mixed modes, but requires experimentation.
* The reference model was trained originally without summing, then trained with summing.
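A minimal sketch of the summing itself, using nested lists in place of tensors. Which levels get included is my assumption here (every level up to and including the one being fed in), and the helper name is hypothetical.

```python
def sum_embedding_levels(
    embedded_levels: list[list[list[float]]], up_to_level: int
) -> list[list[float]]:
    # embedded_levels[l][t] is the embedding vector for timestep t at RVQ level l.
    selected = embedded_levels[: up_to_level + 1]
    timesteps = len(selected[0])
    dim = len(selected[0][0])
    # Plain element-wise sum across levels, deliberately not normalized.
    return [
        [sum(level[t][d] for level in selected) for d in range(dim)]
        for t in range(timesteps)
    ]


embedded = [
    [[1.0, 2.0], [3.0, 4.0]],      # level 0: two timesteps, two dimensions
    [[10.0, 20.0], [30.0, 40.0]],  # level 1
]
print(sum_embedding_levels(embedded, up_to_level=1))
```

Passing `up_to_level=0` returns the first level unchanged, which corresponds to the model only "seeing" the required level.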
Additionally, it's *technically* possible to instead use the embeddings from the core model used to encode the audio, but in theory this may exclude specific features the model has encoded within the embeddings.
### Tasks
The base model handles processing inputs into token sequences, per the requested task assigned to each input in a batch.
Most sequences follow a `<text><RVQ level><language><prompt><output>` sequence, but some tasks will receive the prompt as a list of tensors, instead.
The nitty gritty of how each task is implemented is documented under [./docs/data.md](/docs/data.md).
#### Text-to-Speech
The primary zero-shot text-to-speech synthesis `tts` task takes in a requested text transcript, a piece of reference audio, and then outputs the response audio of the utterance saying the prompted transcript.
The model primarily functions in a zero-shot setting, where it does not need a guiding prefix, but few-shotting is possible through manual intervention.
* I believe the original VALL-E paper refers to this more as `VALL-E Continuous`, while some other TTS solutions follow this method by transcribing the input audio prompt as well.
Additional tasks are implemented in this project, but ***are yet to be trained for*** in the reference model (as some tasks require additional compute-cost).
##### Noise Suppression
This task `ns` aims to suppress or remove noise from the input audio.
In practice, this task is already implemented by providing the input audio to denoise, and having the input transcription be the transcription of the input audio. The output isn't 1:1 exact in terms of prosody and delivery, but it's close.
I imagine training for this task will better help the model understand what is noise and what isn't, and more strongly map utterances from the input audio prompt to use in the output, delivering better prompt adherence.
* This also might help serve in helping the model identify effects applied to an utterance, and being able to maintain it in normal `tts` tasks, such as reverb or the audio quality itself (the "acoustic environment").
##### Speech Removal
This task `sr` aims to remove speech from a given audio, effectively serving as the reverse of denoising.
As stated above, this should help the model better identify what is noise and what isn't.
##### Target Speech Extraction
This task `tse` aims to "extract" an utterance from audio containing other speakers, effectively diarizing an utterance.
I imagine training for this task will better help the model "target" the prompted input audio and adhere to it, but this task moreso appears to be solely for the task itself, rather than help the model itself.
##### Clean Speech Editing
This task `cse` aims to modify a portion of a given utterance, effectively editing it.
I imagine training for this task *might* help the model better map the input prompt utterances to the output response, but I don't expect the effects to be strong enough; it definitely is a task that is for the task itself.
###### Noisy Speech Editing
This task `nse` is effectively the same as `cse`, but under a noisy condition.
#### Length Prediction
The length predictor `len` task is required for a pure NAR model.
This task will naively output a zero, then the length in base-10, followed by a stop token.
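The output format can be sketched as below. The stop-token id is an assumption (the first id after the ten digit tokens), and both helper names are hypothetical.

```python
STOP = 10  # assumed stop-token id, following the digit tokens 0-9


def encode_length(length: int) -> list[int]:
    # A leading zero, then one token per base-10 digit, then the stop token.
    return [0] + [int(digit) for digit in str(length)] + [STOP]


def decode_length(tokens: list[int]) -> int:
    digits = []
    for token in tokens[1:]:  # skip the leading zero
        if token == STOP:
            break
        digits.append(str(token))
    return int("".join(digits)) if digits else 0


tokens = encode_length(375)
print(tokens, decode_length(tokens))
```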
#### Speech-to-Text
The speech-to-text `stt` task transcribes a given piece of audio, by taking an input encoded audio, and outputting the text transcription.
However, due to the model being trained on phonemes, the resultant output is the phonemes themselves.
The primary benefit of this task is to provide a fast way to directly transcribe audio into the phonemes used to annotate the dataset itself, but at the moment the reference model isn't accurate enough to rely on this.
* The other problem is it's very hard to validate this, as the output isn't in English, and requires processing through the model again to verify the transcription.
This task will follow a reverse sequence of `<audio><language><RVQ level><output>`.
## Emergent Behavior
@ -33,16 +200,6 @@ This script implements the core underlying model for VALL-E. This handle:
This script aims to implement everything as required per VALL-E agnostically, to allow for different implementations to contain little extra code.
### Tasks
The base model handles processing inputs into token sequences, per the requested task assigned to each input in a batch.
Most sequences follow a `<text><RVQ level><language><prompt><output>` sequence, but some tasks will receive the prompt as a list of tensors, instead.
The length predictor `len` task will naively output the length in base-10 followed by a stop token.
Speech-To-Text will follow a reverse sequence of `<audio><language><RVQ level><output>`.
## `models/ar_nar.py`
This script implements VALL-E as a unified autoregressive and non-autoregressive model, where RVQ-level 0 is inferenced autoregressively, and the remaining levels are inferenced non-autoregressively.

View File

@ -25,6 +25,11 @@ Some additional flags can be passed as well:
* `--eval`: only run the evaluation / validation pass, then exit afterwards.
* `--eval-random-text-prompts`: use random text prompts for the evaluation pass, rather than the provided text prompts in the dataset.
A training paradigm that works for me is:
* setting the dataloader to sort by duration, then training one epoch, so the model starts with small utterances then trains to larger ones.
* some additional training using a shuffled dataloader, as the model will be fixated towards whatever duration range it was trained under.
* additional training for sampling per speaker, to better help diversify how well it can perform for a range of speakers, rather than just speaking itself
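The first two stages of that paradigm boil down to how sample indices are ordered. A rough sketch, independent of the project's actual dataloader options (the function names are hypothetical):

```python
import random


def duration_sorted_order(durations: list[float]) -> list[int]:
    # Stage one: shortest utterances first, so the model ramps up to longer ones.
    return sorted(range(len(durations)), key=lambda index: durations[index])


def shuffled_order(durations: list[float], seed: int = 0) -> list[int]:
    # Stage two: a shuffled pass, so the model is not fixated on one duration range.
    order = list(range(len(durations)))
    random.Random(seed).shuffle(order)
    return order


durations = [3.2, 1.0, 7.5]
print(duration_sorted_order(durations))
print(shuffled_order(durations))
```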
## Try Me
To quickly test if a configuration works, you can run `python -m vall_e.models.ar_nar --yaml="./data/config.yaml"`; a small trainer will overfit a provided utterance.

View File

@ -430,7 +430,7 @@ class LoRA:
class Hyperparameters:
    batch_size: int = 8 # number of samples per training batch
    gradient_accumulation_steps: int = 32 # number of steps to accumulate gradients before updating
    gradient_clipping: int | float = 10 # largest size a gradient norm can be
    gradient_clipping: int | float = 1.0 # largest size a gradient norm can be
    optimizer: str = "Adamw" # optimizer to use, should be 'Prodigyopt' now
    optimizer_params: dict = field(default_factory=lambda: {}) # to pass through deepspeed config

View File

@ -22,14 +22,8 @@ import gradio as gr
from pathlib import Path
from .inference import TTS, cfg
from .train import train
from .utils import get_devices, setup_logging, timer
from .utils.io import json_read, json_stringify
from .emb.qnt import decode_to_wave
from .data import get_lang_symmap, get_random_prompt
from .models.arch import AVAILABLE_ATTENTIONS
# agony with HF's ZeroGPU spaces
try:
    import spaces
@ -39,6 +33,24 @@ except Exception as e:
    USING_SPACES = False
    def spaces_zerogpu_decorator(func):
        return func
# more agony, because gradio will not stay launched if directly called from the package, for who knows why
# this allows me to directly copy this file rather than constantly edit it on the HF space repo
if USING_SPACES:
    from vall_e.inference import TTS, cfg
    from vall_e.train import train
    from vall_e.utils import get_devices, setup_logging, timer
    from vall_e.utils.io import json_read, json_stringify
    from vall_e.emb.qnt import decode_to_wave
    from vall_e.data import get_lang_symmap, get_random_prompt
    from vall_e.models.arch import AVAILABLE_ATTENTIONS
else:
    from .inference import TTS, cfg
    from .train import train
    from .utils import get_devices, setup_logging, timer
    from .utils.io import json_read, json_stringify
    from .emb.qnt import decode_to_wave
    from .data import get_lang_symmap, get_random_prompt
    from .models.arch import AVAILABLE_ATTENTIONS
is_windows = sys.platform.startswith("win")