VALL-E

VALL-E is composed of two transformer-based language models designed to synthesize speech from an input text prompt and an acoustic prompt.

Text Prompt

This is a text transcription run through phonemizer to convert English text into IPA phonemes. The resulting phonemes are mapped through a phoneme symmap into indices, which are then converted into tensors.
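
A minimal sketch of that pipeline is below; the exact phonemizer options and the symmap construction are illustrative assumptions, not the project's actual code.

```python
import torch
from phonemizer import phonemize

text = "Hello world."
# Convert English text into a string of IPA phonemes.
ipa = phonemize(text, language="en-us", backend="espeak", strip=True)

# Hypothetical phoneme symmap: each unique phoneme symbol maps to an index.
symmap = {symbol: index for index, symbol in enumerate(sorted(set(ipa)))}

# Map phonemes to indices, then to a tensor the model can consume.
phoneme_ids = torch.tensor([symmap[symbol] for symbol in ipa], dtype=torch.long)
```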

Acoustic Prompt

This is a reference clip of the target speaker, usually around 3 seconds long. It is referred to as the "acoustic prompt" because the model is able to recreate the entire acoustics of an input, unlike other models that rely on representing a speaker's features through latents.

During inference, the reference clip is quantized through EnCodec and the output is used directly as the tokens for the input prompt.
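
A rough sketch of that quantization step using the encodec package is below; the filename and the 6 kbps bandwidth setting are stand-ins, not the project's actual configuration.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 RVQ-bin levels

wav, sr = torchaudio.load("ref.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))

# Shape [batch, n_levels, n_frames]; these codes are used directly as the prompt tokens.
codes = torch.cat([code for code, _ in frames], dim=-1)
```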

During training, the acoustic prompt is composed of random utterances sampled from the same target speaker and concatenated until 3 seconds of audio has been accumulated, then trimmed down to 3 seconds (plus or minus some offset, so the model isn't strictly trained on the same sequence lengths). Originally, a random crop was applied to shuffle things up, but in the evaluation / validation output this introduced harsh beginnings, as EnCodec sequences are not all that robust to random slices. Instead, the combined acoustic prompt is sliced from 0 to 3 seconds (or as otherwise specified in the training configuration).
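
A rough sketch of how such a training-time prompt could be assembled is below; the function name, jitter range, and interface are illustrative assumptions rather than the project's actual dataloader code.

```python
import random
import torch

def build_acoustic_prompt(utterance_codes, frames_per_second=75, seconds=3):
    """utterance_codes: list of EnCodec code tensors [n_levels, n_frames] from one speaker."""
    target = seconds * frames_per_second
    # Jitter the target length so the model isn't trained on one fixed prompt length.
    target += random.randint(-frames_per_second // 2, frames_per_second // 2)

    pieces, total = [], 0
    while total < target:
        piece = random.choice(utterance_codes)
        pieces.append(piece)
        total += piece.shape[-1]

    combined = torch.cat(pieces, dim=-1)
    # Slice from 0 rather than at a random offset, avoiding the harsh
    # beginnings that random crops introduced.
    return combined[..., :target]
```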

The original paper mentions that the AR was trained with (relatively) long input prompts of up to 30 seconds, while the NAR used a consistent 3 seconds.

AR

The first model of the two, the autoregressive model, recursively samples the next token until a stop token is reached, populating the first RVQ-bin level of the final output.

The AR also has the benefit of determining the duration of the output audio for a given text and acoustic prompt, unlike other solutions (like Meta's flow-matching-based Voicebox) that may require a separate model to predict the duration.

Because sampling is causal, one token per forward pass, the process is quite time-consuming.
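
A highly simplified sketch of that decoding loop is below; the model call signature is hypothetical, and real implementations also handle temperature, batching, and so on.

```python
import torch

def ar_decode(model, text_ids, prompt_codes, stop_token, max_steps=75 * 30):
    """Generate the first RVQ-bin level, one token per forward pass."""
    response = []
    for _ in range(max_steps):
        logits = model(text_ids, prompt_codes, torch.tensor(response, dtype=torch.long))
        token = torch.distributions.Categorical(logits=logits[-1]).sample().item()
        if token == stop_token:
            break
        response.append(token)
    return torch.tensor(response, dtype=torch.long)
```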

NAR

The second model of the pair, the non-autoregressive model, handles the remaining RVQ-bin levels of the output by predicting each subsequent level for every token of the response in parallel, until all remaining levels are tended to.

Unlike the AR, the process is near-instant, as it only requires a small handful of forward passes, one for each remaining RVQ-bin level.
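
A simplified sketch of that loop is below: one forward pass per remaining RVQ-bin level, each predicting every token of that level at once. The model interface and the greedy argmax are illustrative assumptions.

```python
import torch

def nar_decode(model, text_ids, prompt_codes, first_level, n_levels=8):
    levels = [first_level]  # level 0 comes from the AR
    for level in range(1, n_levels):
        stacked = torch.stack(levels)            # [level, n_frames]
        logits = model(text_ids, prompt_codes, stacked, level=level)
        levels.append(logits.argmax(dim=-1))     # pick the whole level in parallel
    return torch.stack(levels)                   # [n_levels, n_frames]
```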

Transformer Model

Originally, the paper made use of a typical attention-based transformer model consisting of attention heads + feed-forwards.

This implementation pivoted to a retention-based transformer approach (RetNet) for all-around gains: less VRAM required, faster inference speeds, better context windows, and faster training.

The input tensors are run through dedicated embeddings that sum the embeddings specific to each RVQ-bin level being targeted. The resulting sequence of embeddings is then passed through each layer of the model: layer normalization (the original implementation uses adaptive layer normalization per RVQ-bin level for the NAR; the RetNet does not), the transformer head (attention or retention), layer normalization again, and then the feed-forward + activation.

Finally, after passing through the model's layers, the tensors are run through a classifier, and the resulting logits are sampled to produce the next EnCodec token.
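
A schematic of that per-layer flow is sketched below in a pre-norm style; the dimensions, the plain attention head (in place of retention), and the normalization details are assumptions for illustration, not the implementation's exact block.

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)  # adaptive LN per RVQ-bin level in the original NAR
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # or a retention head
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)
        x = x + a
        x = x + self.ff(self.norm2(x))
        return x

# After the final block, a classifier maps hidden states to EnCodec token logits
# (hypothetical dimensions: 1024 codes plus a stop token).
classifier = nn.Linear(1024, 1024 + 1)
```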

A neat detail is that, because the models are transformer-based, even a text prompt alone can generate an entire acoustic prompt of its own, followed by the response output.

EnCodec

VALL-E relies entirely on EnCodec not only as the audio representation (the alternative being mel-spectrograms), but also as the model of a speaker's "features", eliminating the need for a separate model that encodes a speaker into latent features of said target speaker (TorToiSe and Bark use such a model). One second of real audio corresponds to 75 tokens per RVQ-bin level.

EnCodec neurally encodes audio with residual vector quantization (RVQ), where each quantizer level adds detail to the final 24kHz waveform. Bandwidth can be reduced by keeping only 2 RVQ-bin levels, at the cost of audio quality. Vocos is a supplementary decoder for EnCodec that recovers much more detail from EnCodec code sequences, making even 2 RVQ-bin levels viable.
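
For a sense of scale, the quick arithmetic below works out the token counts from EnCodec's 24 kHz sample rate and its hop size of 320 samples; the 3-second prompt length is taken from the sections above.

```python
# Token-rate arithmetic for EnCodec at 24 kHz.
sample_rate, hop = 24_000, 320
frames_per_second = sample_rate // hop         # 75 codes per second, per RVQ-bin level
codes_for_3s_prompt = 3 * frames_per_second    # 225 codes per level for a 3-second prompt
# At the 1.5 kbps setting only 2 RVQ-bin levels are kept; the 6 kbps setting keeps 8.
print(frames_per_second, codes_for_3s_prompt)  # 75 225
```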