VALL-E is composed of two transformer-based language models designed to synthesize speech from a text prompt and an acoustic prompt.
## Text Prompt
This is simply a text transcription run through [`phonemizer`](https://github.com/bootphon/phonemizer/) to convert English text into IPA phonemes. The resulting phonemes are mapped through a phoneme symmap into indices, which are then converted into tensors.
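As a minimal sketch of that pipeline (assuming `phonemizer` with an espeak-ng backend installed; the symmap here is a toy stand-in for the implementation's fixed mapping):

```python
# Minimal sketch: text -> IPA phonemes -> symmap indices -> tensor.
# Requires the `phonemizer` package and an espeak-ng backend on the system;
# the symmap below is a hypothetical stand-in for the real, fixed mapping.
import torch
from phonemizer import phonemize

text = "The quick brown fox jumps over the lazy dog."
ipa = phonemize(text, language="en-us", backend="espeak", strip=True)

# Toy symmap built from whatever symbols appear; the actual mapping is fixed
# and shared between training and inference.
symmap = {symbol: index for index, symbol in enumerate(sorted(set(ipa)))}
indices = torch.tensor([symmap[symbol] for symbol in ipa], dtype=torch.long)
print(ipa)
print(indices.shape)
```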
## Acoustic Prompt
This is a reference clip of the target speaker, usually around three seconds long. It is referred to as the "acoustic prompt" because the model is able to recreate *the entire* acoustics of the input, unlike other models that rely on representing a speaker's features through latents.

During inference, the reference clip is quantized through EnCodec, and the output is used directly as the tokens of the input prompt.
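As a rough sketch of that quantization step using the reference `encodec` package (the file name and target bandwidth are placeholders):

```python
# Sketch: quantize a reference clip into EnCodec codes for the acoustic prompt.
# Assumes the `encodec` and `torchaudio` packages; "reference.wav" is a placeholder.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 RVQ levels for the 24 kHz model

wav, sr = torchaudio.load("reference.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))                 # list of (codes, scale) frames
codes = torch.cat([codes for codes, _ in frames], dim=-1)   # [batch, n_q, time]
# `codes` (75 tokens per second per RVQ level) is used directly as the prompt tokens.
```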
During training, the prompt is composed of random utterances sampled from the same target speaker until 3 seconds' worth has been accumulated, then trimmed down to 3 seconds (plus or minus some offset, so the model isn't strictly trained on the same sequence lengths). Originally, a random crop was applied to shuffle things up, but this introduced harsh beginnings in the evaluation / validation output, as EnCodec sequences are not all that robust to random slices. Instead, the combined acoustic prompts are sliced from 0 to 3 seconds (or whatever is specified in the training configuration).
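A hypothetical sketch of that sampling and trimming, treating each utterance as an EnCodec code tensor of shape `[n_q, time]` at 75 frames per second (the helper and jitter value are illustrative, not the actual implementation):

```python
# Hypothetical sketch: build a ~3 second acoustic prompt from a speaker's utterances.
# Each utterance is an EnCodec code tensor of shape [n_q, time]; 75 frames = 1 second.
import random
import torch

def build_acoustic_prompt(utterances, seconds=3, frames_per_second=75, jitter=8):
    target = seconds * frames_per_second
    pieces, total = [], 0
    while total < target:
        codes = random.choice(utterances)
        pieces.append(codes)
        total += codes.shape[-1]
    prompt = torch.cat(pieces, dim=-1)
    # Slice from 0 rather than taking a random crop, with a small length jitter
    # so training doesn't always see the exact same sequence length.
    length = min(prompt.shape[-1], target + random.randint(-jitter, jitter))
    return prompt[..., :length]
```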
The original paper mentions that the AR was trained with (relatively) long input prompts of up to 30 seconds, while the NAR used a consistent 3 seconds.
## AR
The first model of the two, the autoregressive model, handles recursively sampling the next token until a stop token is reached, populating the first RVQ-bin level of the final output.
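A hypothetical sketch of that loop (the `ar_step` callable and the stop-token value stand in for the actual model interface):

```python
# Hypothetical sketch of the AR decoding loop for the first RVQ-bin level.
# `ar_step` stands in for a forward pass that returns logits over the next token.
import torch

STOP_TOKEN = 1024  # EnCodec codes span 0..1023; a stop token sits outside that range (assumption)

def ar_decode(ar_step, text_ids, prompt_codes, max_len=75 * 30):
    response = []
    for _ in range(max_len):
        logits = ar_step(text_ids, prompt_codes, torch.tensor(response, dtype=torch.long))
        token = torch.multinomial(torch.softmax(logits[-1], dim=-1), 1).item()
        if token == STOP_TOKEN:
            break
        response.append(token)
    return torch.tensor(response, dtype=torch.long)  # first RVQ-bin level of the output
```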
The AR also has the advantage of determining the duration of the output audio on its own for a given input text and acoustic prompt, unlike other solutions (like Meta's flow-matching-based Voicebox) that may require a separate model specifically to predict the duration.

Due to being causal, the process is quite time-consuming.
## NAR
The second model of the pair, the non-autoregressive model, handles the remaining RVQ-bin levels of the output by predicting the next level for every token of the response in parallel, until all remaining levels are tended to.
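A hypothetical sketch of that process (the `nar_step` callable and the number of levels are stand-ins):

```python
# Hypothetical sketch of NAR decoding: one forward pass per remaining RVQ-bin level,
# predicting that level for every token of the response in parallel.
import torch

def nar_decode(nar_step, text_ids, prompt_codes, first_level, n_levels=8):
    # `first_level` is the [time] sequence produced by the AR.
    levels = [first_level]
    for level in range(1, n_levels):
        stacked = torch.stack(levels, dim=0)                       # [levels so far, time]
        logits = nar_step(text_ids, prompt_codes, stacked, level)  # [time, vocab]
        levels.append(logits.argmax(dim=-1))                       # pick every position at once
    return torch.stack(levels, dim=0)                              # [n_levels, time] EnCodec codes
```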
Unlike the AR, the process is near-instant, as it only requires a small handful of forward passes, one for each remaining RVQ-bin level.
## Transformer Model
Originally, the paper made use of a typical attention-based transformer model consisting of attention heads + feed-forward networks.

This implementation pivoted to a retention-based transformer approach (RetNet) for all-around gains: less VRAM required, faster inference speeds, better context windows, and faster training.

The input tensors are run through designated embeddings that sum up embeddings specific to each RVQ-bin level being targeted. The resulting sequence of embeddings is then passed, through each layer of the model, through layer normalization (the original implementation uses adaptive layer normalization per RVQ-bin level for the NAR; the RetNet does not), the transformer head (attention or retention), layer normalization again, and then the feed-forward + activation.
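An illustrative sketch of the summed per-level embeddings and one such pre-norm layer (dimensions, vocabulary size, and the attention head are placeholders, not the actual implementation):

```python
# Illustrative sketch (not the actual implementation) of per-level summed embeddings
# and a single pre-norm transformer-style layer; dimensions are placeholders.
import torch
import torch.nn as nn

N_LEVELS, VOCAB, D_MODEL = 8, 1025, 1024  # 1024 EnCodec codes + a stop token (assumption)

class LevelSummedEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.tables = nn.ModuleList([nn.Embedding(VOCAB, D_MODEL) for _ in range(N_LEVELS)])

    def forward(self, codes):  # codes: [levels_so_far, time]
        # Sum one embedding per RVQ-bin level into a single [time, d_model] sequence.
        return sum(self.tables[l](codes[l]) for l in range(codes.shape[0]))

class Layer(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(D_MODEL), nn.LayerNorm(D_MODEL)
        self.head = nn.MultiheadAttention(D_MODEL, num_heads=16, batch_first=True)  # or retention
        self.ffn = nn.Sequential(
            nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(), nn.Linear(4 * D_MODEL, D_MODEL)
        )

    def forward(self, x):  # x: [batch, time, d_model]
        # norm -> head -> norm -> feed-forward + activation, with residual connections.
        h = self.norm1(x)
        x = x + self.head(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))
```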
Finally, after passing through the model's layers, the tensors are run through a classifier, where the logits are sampled for the next EnCodec token.

A neat detail is that, because the models are transformer-based, even just the text prompt can generate an entire acoustic prompt of its own, and then the response output.
## EnCodec
VALL-E relies entirely on EnCodec, not only to serve as the representation of audio (the alternative being mel-spectrograms), but also to model the "features" of a speaker, eliminating the need for another model that encodes a speaker into latent features of said target speaker (TorToiSe and Bark use one). A second of real audio corresponds to 75 tokens per RVQ-bin level.
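The figure follows from EnCodec's 24 kHz sample rate and overall 320-sample hop; a quick sanity check:

```python
# Quick arithmetic behind the 75 tokens/second figure for the 24 kHz EnCodec model.
sample_rate = 24_000  # Hz
hop_length = 320      # overall encoder downsampling factor of the 24 kHz model
frames_per_second = sample_rate // hop_length  # 75
n_q = 8               # RVQ-bin levels at 6 kbps

seconds = 3
print(frames_per_second)                  # 75 tokens per second per level
print(seconds * frames_per_second)        # 225 tokens per level for a 3 second prompt
print(seconds * frames_per_second * n_q)  # 1800 tokens total across 8 levels
```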
EnCodec neurally encodes audio with residual vector quantization (RVQ), where each level adds detail to the final 24 kHz waveform. Bandwidth can be reduced by keeping only 2 RVQ-bin levels, at the cost of audio quality. [Vocos](https://github.com/charactr-platform/vocos/) is a supplement to EnCodec that gets much more detail out of EnCodec code sequences, making even 2 RVQ-bin levels viable.
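A sketch of decoding EnCodec codes with Vocos in place of EnCodec's own decoder, following the usage shown in the Vocos README (the random codes and bandwidth index are placeholders):

```python
# Sketch: decode EnCodec code sequences with Vocos instead of EnCodec's decoder.
# Assumes the `vocos` package; the random codes stand in for real model output.
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

codes = torch.randint(0, 1024, (8, 225))  # [n_q, time] EnCodec codes (placeholder)
features = vocos.codes_to_features(codes)
bandwidth_id = torch.tensor([2])          # index into [1.5, 3, 6, 12] kbps -> 6 kbps
waveform = vocos.decode(features, bandwidth_id=bandwidth_id)  # audio at 24 kHz
```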