VALL-E is composed of two transformer-based language models designed to synthesize speech from a text prompt and an acoustic prompt.
## Text Prompt
This is simply a text transcription run through [`phonemizer`](https://github.com/bootphon/phonemizer/) to convert English text into IPA phonemes. The resulting phonemes are mapped through a phoneme symmap into indices, which are then converted into tensors.
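As a minimal sketch of that pipeline (assuming `phonemizer` with an espeak-ng backend installed; the symmap here is a toy stand-in for the implementation's fixed mapping):

```python
# Minimal sketch: text -> IPA phonemes -> symmap indices -> tensor.
# Requires the `phonemizer` package and an espeak-ng backend on the system;
# the symmap below is a hypothetical stand-in for the real, fixed mapping.
import torch
from phonemizer import phonemize

text = "The quick brown fox jumps over the lazy dog."
ipa = phonemize(text, language="en-us", backend="espeak", strip=True)

# Toy symmap built from whatever symbols appear; the actual mapping is fixed
# and shared between training and inference.
symmap = {symbol: index for index, symbol in enumerate(sorted(set(ipa)))}
indices = torch.tensor([symmap[symbol] for symbol in ipa], dtype=torch.long)
print(ipa)
print(indices.shape)
```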
## Acoustic Prompt
This is a reference clip of the target speaker, usually around three seconds long. It is referred to as the "acoustic prompt" because the model is able to recreate *the entire* acoustics of the input, unlike other models that rely on representing a speaker's features through latents.

During inference, the reference clip is quantized through EnCodec, and the output is used directly as the tokens of the input prompt.
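As a rough sketch of that quantization step using the reference `encodec` package (the file name and target bandwidth are placeholders):

```python
# Sketch: quantize a reference clip into EnCodec codes for the acoustic prompt.
# Assumes the `encodec` and `torchaudio` packages; "reference.wav" is a placeholder.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 RVQ levels for the 24 kHz model

wav, sr = torchaudio.load("reference.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))                 # list of (codes, scale) frames
codes = torch.cat([codes for codes, _ in frames], dim=-1)   # [batch, n_q, time]
# `codes` (75 tokens per second per RVQ level) is used directly as the prompt tokens.
```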
During training, the prompt is composed of random utterances sampled from the same target speaker until 3 seconds' worth has been accumulated, then trimmed down to 3 seconds (plus or minus some offset, so the model isn't strictly trained on the same sequence lengths). Originally, a random crop was applied to shuffle things up, but this introduced harsh beginnings in the evaluation / validation output, as EnCodec sequences are not all that robust to random slices. Instead, the combined acoustic prompts are sliced from 0 to 3 seconds (or whatever is specified in the training configuration).
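A hypothetical sketch of that sampling and trimming, treating each utterance as an EnCodec code tensor of shape `[n_q, time]` at 75 frames per second (the helper and jitter value are illustrative, not the actual implementation):

```python
# Hypothetical sketch: build a ~3 second acoustic prompt from a speaker's utterances.
# Each utterance is an EnCodec code tensor of shape [n_q, time]; 75 frames = 1 second.
import random
import torch

def build_acoustic_prompt(utterances, seconds=3, frames_per_second=75, jitter=8):
    target = seconds * frames_per_second
    pieces, total = [], 0
    while total < target:
        codes = random.choice(utterances)
        pieces.append(codes)
        total += codes.shape[-1]
    prompt = torch.cat(pieces, dim=-1)
    # Slice from 0 rather than taking a random crop, with a small length jitter
    # so training doesn't always see the exact same sequence length.
    length = min(prompt.shape[-1], target + random.randint(-jitter, jitter))
    return prompt[..., :length]
```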
The original paper mentions that the AR was trained with (relatively) long input prompts of up to 30 seconds, while the NAR used a consistent 3 seconds.
## AR
The first model of the two, the autoregressive model, handles recursively sampling the next token until a stop token is reached, populating the first RVQ-bin level of the final output.
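A hypothetical sketch of that loop (the `ar_step` callable and the stop-token value stand in for the actual model interface):

```python
# Hypothetical sketch of the AR decoding loop for the first RVQ-bin level.
# `ar_step` stands in for a forward pass that returns logits over the next token.
import torch

STOP_TOKEN = 1024  # EnCodec codes span 0..1023; a stop token sits outside that range (assumption)

def ar_decode(ar_step, text_ids, prompt_codes, max_len=75 * 30):
    response = []
    for _ in range(max_len):
        logits = ar_step(text_ids, prompt_codes, torch.tensor(response, dtype=torch.long))
        token = torch.multinomial(torch.softmax(logits[-1], dim=-1), 1).item()
        if token == STOP_TOKEN:
            break
        response.append(token)
    return torch.tensor(response, dtype=torch.long)  # first RVQ-bin level of the output
```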
The AR also has the advantage of determining the duration of the output audio on its own for a given input text and acoustic prompt, unlike other solutions (like Meta's flow-matching-based Voicebox) that may require a separate model specifically to predict the duration.

Due to being causal, the process is quite time-consuming.
## NAR
The second model of the pair, the non-autoregressive model, handles the remaining RVQ-bin levels of the output by predicting the next level for every token of the response in parallel, until all remaining levels are tended to.
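A hypothetical sketch of that process (the `nar_step` callable and the number of levels are stand-ins):

```python
# Hypothetical sketch of NAR decoding: one forward pass per remaining RVQ-bin level,
# predicting that level for every token of the response in parallel.
import torch

def nar_decode(nar_step, text_ids, prompt_codes, first_level, n_levels=8):
    # `first_level` is the [time] sequence produced by the AR.
    levels = [first_level]
    for level in range(1, n_levels):
        stacked = torch.stack(levels, dim=0)                       # [levels so far, time]
        logits = nar_step(text_ids, prompt_codes, stacked, level)  # [time, vocab]
        levels.append(logits.argmax(dim=-1))                       # pick every position at once
    return torch.stack(levels, dim=0)                              # [n_levels, time] EnCodec codes
```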
Unlike the AR, the process is near-instant, as it only requires a small handful of forward passes, one for each remaining RVQ-bin level.
## Transformer Model
Originally, the paper made use of a typical attention-based transformer model consisting of attention heads + feed-forward networks.

This implementation pivoted to a retention-based transformer approach (RetNet) for all-around gains: less VRAM required, faster inference speeds, better context windows, and faster training.

The input tensors are run through designated embeddings that sum up embeddings specific to each RVQ-bin level being targeted. The resulting sequence of embeddings is then passed, through each layer of the model, through layer normalization (the original implementation uses adaptive layer normalization per RVQ-bin level for the NAR; the RetNet does not), the transformer head (attention or retention), layer normalization again, and then the feed-forward + activation.
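An illustrative sketch of the summed per-level embeddings and one such pre-norm layer (dimensions, vocabulary size, and the attention head are placeholders, not the actual implementation):

```python
# Illustrative sketch (not the actual implementation) of per-level summed embeddings
# and a single pre-norm transformer-style layer; dimensions are placeholders.
import torch
import torch.nn as nn

N_LEVELS, VOCAB, D_MODEL = 8, 1025, 1024  # 1024 EnCodec codes + a stop token (assumption)

class LevelSummedEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.tables = nn.ModuleList([nn.Embedding(VOCAB, D_MODEL) for _ in range(N_LEVELS)])

    def forward(self, codes):  # codes: [levels_so_far, time]
        # Sum one embedding per RVQ-bin level into a single [time, d_model] sequence.
        return sum(self.tables[l](codes[l]) for l in range(codes.shape[0]))

class Layer(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(D_MODEL), nn.LayerNorm(D_MODEL)
        self.head = nn.MultiheadAttention(D_MODEL, num_heads=16, batch_first=True)  # or retention
        self.ffn = nn.Sequential(
            nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(), nn.Linear(4 * D_MODEL, D_MODEL)
        )

    def forward(self, x):  # x: [batch, time, d_model]
        # norm -> head -> norm -> feed-forward + activation, with residual connections.
        h = self.norm1(x)
        x = x + self.head(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))
```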
Finally, after passing through the model's layers, the tensors are run through a classifier, where the logits are sampled for the next EnCodec token.

A neat detail is that, because the models are transformer-based, even just the text prompt can generate an entire acoustic prompt of its own, and then the response output.
## EnCodec
VALL-E relies entirely on EnCodec, not only to serve as the representation of audio (the alternative being mel-spectrograms), but also to model the "features" of a speaker, eliminating the need for another model that encodes a speaker into latent features of said target speaker (TorToiSe and Bark use one). A second of real audio corresponds to 75 tokens per RVQ-bin level.
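The figure follows from EnCodec's 24 kHz sample rate and overall 320-sample hop; a quick sanity check:

```python
# Quick arithmetic behind the 75 tokens/second figure for the 24 kHz EnCodec model.
sample_rate = 24_000  # Hz
hop_length = 320      # overall encoder downsampling factor of the 24 kHz model
frames_per_second = sample_rate // hop_length  # 75
n_q = 8               # RVQ-bin levels at 6 kbps

seconds = 3
print(frames_per_second)                  # 75 tokens per second per level
print(seconds * frames_per_second)        # 225 tokens per level for a 3 second prompt
print(seconds * frames_per_second * n_q)  # 1800 tokens total across 8 levels
```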
EnCodec neurally encodes audio with residual vector quantization (RVQ), where each level adds detail to the final 24 kHz waveform. Bandwidth can be reduced by keeping only 2 RVQ-bin levels, at the cost of audio quality. [Vocos](https://github.com/charactr-platform/vocos/) is a supplement to EnCodec that gets much more detail out of EnCodec code sequences, making even 2 RVQ-bin levels viable.
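A sketch of decoding EnCodec codes with Vocos in place of EnCodec's own decoder, following the usage shown in the Vocos README (the random codes and bandwidth index are placeholders):

```python
# Sketch: decode EnCodec code sequences with Vocos instead of EnCodec's decoder.
# Assumes the `vocos` package; the random codes stand in for real model output.
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

codes = torch.randint(0, 1024, (8, 225))  # [n_q, time] EnCodec codes (placeholder)
features = vocos.codes_to_features(codes)
bandwidth_id = torch.tensor([2])          # index into [1.5, 3, 6, 12] kbps -> 6 kbps
waveform = vocos.decode(features, bandwidth_id=bandwidth_id)  # audio at 24 kHz
```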