documentation update while I wait for more audio (between 4 and 8 seconds per utterance) to quantize for nvidia/audio-codec-44khz (I was foolish to think I could get something serviceable with just 4 seconds max per utterance)

This commit is contained in:
mrq 2025-02-15 17:42:06 -06:00
parent 13c3a08853
commit 0dc49ef4d5
4 changed files with 22 additions and 27 deletions


@@ -27,6 +27,8 @@ This VALL-E is still actively being iterated upon without any actual proper stan
There are far better TTS solutions out there, such as [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct) and [F5-TTS](https://github.com/SWivid/F5-TTS). They're both easy to use and offer amazing results.
In the future, a 44KHz model will be released if training goes well for it.
## Model Specifications
The reference model (`ar+nar-llama-8`/`ar+nar-len-llama-8`):
@@ -46,6 +48,7 @@ The reference model (`ar+nar-llama-8`/`ar+nar-len-llama-8`):
* [x] train and release a serviceable model for finetuning against.
* [x] train and release a ***good*** zero-shot model.
- for what it's worth it's decent enough for me to finally be happy with it.
* [ ] train a serviceable model for 44KHz audio (instead of 24KHz)
* [ ] well-integrated training through the Web UI (without the kludge from ai-voice-cloning)
* [x] clean up the README, and document, document, document.
* [x] extend to multiple languages ([VALL-E X](https://arxiv.org/abs/2303.03926)).
@@ -64,23 +67,25 @@ The reference model (`ar+nar-llama-8`/`ar+nar-len-llama-8`):
* [ ] speed up inferencing for the AR
- KV caching yields both broken output and quadratically slow inference, unless I'm doing something grossly wrong.
* [x] provide a pure NAR model that forgoes most of the inferencing slowdowns a regular AR+NAR model incurs.
* [ ] HF-ify the model
* [x] HF-ify the model
* [x] write a weights converter
* [ ] implement a pure llama_HF implementation
- this might be easily possible by subjugating the tokenizer to handle all the embeddings / classifiers
- this will pave the way to use the model under an easy marriage of `llama.cpp` and `encodec.cpp`
* [x] implement a pure llama_HF implementation
* provided under `./vall_e/models/base.py`'s `__main__`
* [ ] replace the phonemizer with something that doesn't depend on espeak
* [ ] train the model to handle text => phoneme (without a hit to the rest of the model)
* [ ] ...and phonemes => text
* [ ] using a pure text vocab rather than IPA phonemes (as a transformer should be "smart" enough to map text tokens)
* these features are predicated on the model being trained for it
* [ ] smarter/clever inferencing, such as:
* [x] inference *all* codebooks in one pass, rather than each level being its own discrete pass.
* these features are predicated on the model being trained for it
* [x] "rolling" context, where the last generated sentence is the prefix for the next sentence.
* [ ] for the AR, stop inferencing sequences in the batch that have already hit their stop token
* [x] objective metrics such as WER / SIM-O (a sketch follows this list)
* [x] WER simply requires transcribing the audio, then computing word error rates from the transcriptions
* [x] SIM-O requires passing the raw waveform through a speaker-similarity model
* [ ] valle.cpp through llama.cpp + encodec.cpp
* the latter is easy, the former is not.
* [x] valle.cpp through llama.cpp + encodec.cpp
* extend to decode with vocos.cpp, instead, for a quality improvement
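
For reference, a rough sketch of the two objective metrics above, assuming the transcription (from whatever ASR model is used) and the speaker embeddings (from whatever speaker-similarity model is used) are already computed; the helper names here are made up:

```python
# hypothetical helpers for the WER / SIM-O metrics described above
import torch
import torch.nn.functional as F
from jiwer import wer  # third-party word-error-rate library

def word_error_rate(target_text: str, transcription: str) -> float:
	# WER: transcribe the generated audio with an ASR model, then compare
	# the transcription against the target text
	return wer(target_text, transcription)

def sim_o(generated_embedding: torch.Tensor, reference_embedding: torch.Tensor) -> float:
	# SIM-O: cosine similarity between the speaker embedding of the generated
	# waveform and that of the original speaker's reference waveform
	return F.cosine_similarity(generated_embedding, reference_embedding, dim=-1).item()
```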
## "Postmortem"


@@ -66,35 +66,19 @@ For audio backends:
Descript-Audio-Codec was thoroughly tested, as it promises much, much cleaner output audio by encoding/decoding at 44.1KHz rather than EnCodec's 24KHz.
However, due to the nature of the codec, simply throwing it at an attention-based transformer proves to be painful, as the model *heavily* suffers from noisy output in the higher half of the RVQ levels.
* the solution may be to simply encode / decode with *all* RVQ levels in one pass.
Ironically, testing with erroneously encoded audio (feeding 24KHz audio without upsampling to 44.1KHz) produced "cleaner", yet still bad, utterances.
I'm uncertain how to remedy this, as my options are:
* train under a RetNet, if an attention-based transformer is simply the problem (it's not)
* train an AR, and train a NAR, if the codec itself is at fault (it's probably something inherent to the codec)
* use an SSM like Mamba, if transformers entirely cannot model the codec (Mamba is too much of a thorn to use)
* train a separate model that simply converts from EnCodec to DAC (requires another model to juggle, but does not require retraining the main model)
* train *all* NAR levels as independent masking sequences similar to the `NAR-len` (complicated)
* if this works, then it means that there's little to no mappable relation between DAC's RVQ levels
Other literature does mention the difficulty of modeling audio with DAC as the codec.
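
For what it's worth, the "erroneously encoded audio" mentioned above amounts to skipping the resample step; a minimal sketch (using torchaudio, with a hypothetical input file and a stand-in `codec.encode`) of feeding a 44.1KHz codec audio at its native rate:

```python
# resample to the codec's native rate *before* encoding, rather than
# feeding 24KHz audio straight into a 44.1KHz codec
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical input file
target_rate = 44_100  # descript-audio-codec (and nvidia/audio-codec-44khz) operate at 44.1KHz

if sample_rate != target_rate:
	waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=target_rate)

# `codec.encode(...)` stands in for whichever codec backend is configured;
# the only point is that the input has to match the codec's native rate
# codes = codec.encode(waveform, target_rate)
```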
#### `nvidia/audio-codec-44khz`
This novel codec promises more than DAC, without the difficulty of modeling with it.
NVIDIA's NeMo audio codec doesn't necessarily have a concrete name, but is simply referred to as `nemo` in the code. The included code under `./emb/codecs/nemo.py` is mostly copied (with attribution) from the reference implementation with additional tweaks. In the future, it would be beneficial to decouple it from NeMo's framework and its dependencies.
However, because this codec relies on FSQ (Finite Scalar Quantization) rather than RVQ (Residual Vector Quantization), each level of the codebook governs a specific band of the mel spectrum, where each level for RVQ governs additive levels to the final audio. Because of this, the original approach of inferencing the strongest detail, then each level predicts the next, isn't a good fit for FSQ-based codecs.
However, because this codec relies on FSQ (Finite Scalar Quantization) rather than RVQ (Residual Vector Quantization), each level of the codebook governs a specific band of the mel spectrum, instead of each RVQ level governing an additive refinement to the final audio. Because of this, the original approach of inferencing the strongest level first, then having each subsequent level predict the next, weaker level of detail, is theoretically not a good fit for FSQ-based codecs.
Proposed architectures may include:
* independent NAR-demasking for *all* levels, rather than FSQ level 0.
* little additional code is required, as existing NAR-demasking training/inference code can be repurposed for additional levels.
* this also has the best backwards compat with vall_e.cpp, as no extra model code is required.
* parallel decoding for *all* levels in one pass, rather than separate passes for each level.
* some extra code would be required for orchestrating the additional decoding heads in parallel.
* the decoding heads may simply be a single `nn.Linear` classifier, or additional transformer blocks.
* the former yields bad results even when overfitting, while the latter (without an output projection head) does allow for overfitting.
The current approach is instead to encode / decode all FSQ levels within each pass. This approach seems promising, as it does not seem to exhibit the problem `descript-audio-codec` had, where higher levels fail to train sufficiently.
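
As a rough illustration of "all levels in one pass" (not the actual implementation; the dimensions and names below are made up): the per-level embeddings are summed into a single token on the way in, and every level is predicted from the same hidden state on the way out.

```python
import torch
import torch.nn as nn

class AllLevelsInOnePass(nn.Module):
	"""Hypothetical sketch: sum per-level embeddings going in, predict every level going out."""
	def __init__(self, n_levels=8, codebook_size=1000, d_model=1024):
		super().__init__()
		self.embs = nn.ModuleList(nn.Embedding(codebook_size, d_model) for _ in range(n_levels))
		self.heads = nn.ModuleList(nn.Linear(d_model, codebook_size) for _ in range(n_levels))

	def embed(self, codes: torch.Tensor) -> torch.Tensor:
		# codes: [batch, seq, n_levels] -> one summed embedding per timestep
		return sum(emb(codes[..., i]) for i, emb in enumerate(self.embs))

	def logits(self, hidden: torch.Tensor) -> torch.Tensor:
		# hidden: [batch, seq, d_model] from the transformer
		# -> [batch, seq, n_levels, codebook_size]: one classifier per level, same pass
		return torch.stack([head(hidden) for head in self.heads], dim=-2)
```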
## `transcribe.py`


@@ -37,6 +37,12 @@ Traditional samplers for text-gen models can apply to the AR (especially rep/len
Compared to non-autoregressive decoding, I personally feel that autoregressive decoding offers a specific-yet-hard-to-quantify expressive quality that the NAR (and pure NAR solutions) does not.
### Pure AR
Technically, with `cfg.model.version >= 7`, a model can be purely AR, as that version of the model encodes and decodes all codebooks of audio in a single pass.
Inferencing code is not available at the moment for this modality, but will be available in the future.
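
Conceptually (and only conceptually, since that code doesn't exist yet), each AR step under this modality emits one code per codebook, i.e. a full frame of audio per step; a hypothetical greedy step would look like:

```python
import torch

def sample_frame(logits: torch.Tensor) -> torch.Tensor:
	# logits: [n_codebooks, codebook_size] for the current AR step
	# greedy pick for illustration; real sampling would apply temperature, top-k, etc.
	return logits.argmax(dim=-1)  # [n_codebooks]: one code per codebook, i.e. one frame
```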
## The NAR (Non-autoregressive) Model
The NAR is responsible for generating the remaining RVQ levels of the audio codes for a given output. References to the "outputs from the NAR" refer to the underlying "levels" for a given waveform, as each further level contributes less significantly to the final waveform than the previous.
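
As a rough illustration of why the later levels matter less (the shapes and names below are illustrative, not the codec's actual API): the decoder reconstructs from the *sum* of per-level codebook entries, so each successive level only adds a finer residual correction on top of the previous ones.

```python
import torch

def reconstruct_latent(codes: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
	# codes: [n_levels, seq] of code indices; codebooks: per-level [codebook_size, dim] tables
	z = torch.zeros(codes.shape[1], codebooks[0].shape[1])
	for level, codebook in enumerate(codebooks):
		z = z + codebook[codes[level]]  # each level adds a finer residual on top of the previous
	return z  # the codec's decoder then turns this latent back into a waveform
```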


@@ -111,8 +111,8 @@ def process(
if "text" not in metadata:
continue
speaker_id = metadata["speaker"]
outpath = Path(f'./{output_dataset}/{group_name}/{speaker_group}/{speaker_id}/{fname}.{extension}')
os.makedirs(f'./{output_dataset}/{group_name}/{speaker_group}/{speaker_id}/', exist_ok=True)
outpath = Path(f'./{output_dataset}/{group_name}/{speaker_id}/{fname}.{extension}').with_suffix(audio_extension)
os.makedirs(f'./{output_dataset}/{group_name}/{speaker_id}/', exist_ok=True)
if _replace_file_extension(outpath, audio_extension).exists():
continue