
vall_e.cpp

This is a C++ implementation of VALL-E inference that makes use of llama.cpp and encodec.cpp.

Model weights can either:

  • be found at ecker/vall-e@gguf
  • be converted with vall_e.export --yaml=./model_path/config.yaml --hf, followed by python3 /path/to/your/llama.cpp/convert_hf_to_gguf.py ./model_path/hf/

Build

Populate ./include/ with the ggml, llama.cpp, and encodec.cpp headers.

Populate ./lib/ with the compiled libraries of llama.cpp, encodec.cpp, and espeak-ng (if not already in your LD_LIBRARY_PATH).

Run make.

Required Modifications

encodec.cpp requires updating its bundled GGML copy to the latest version, which in turn needs a few changed lines to get the CPU backend working (as done in my fork).

The only modification llama.cpp might need is to ensure that a non-causal attention mask is used; everything else can be hacked together with clever tricks.
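
To make the causal versus non-causal distinction concrete, here is a minimal, library-free sketch of the two mask shapes. The helper names `build_attention_mask` and `count_visible` are hypothetical and are not part of llama.cpp or this project. llama.cpp's public header does expose `llama_set_causal_attn(ctx, false)`, but whether that call alone is enough for this use case is an assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a llama.cpp API): builds an n x n mask where
// mask[i][j] is true when query position i may attend to key position j.
std::vector<std::vector<bool>> build_attention_mask(int n, bool causal) {
    assert(n >= 0);
    std::vector<std::vector<bool>> mask(n, std::vector<bool>(n, false));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            // Causal: a position sees itself and earlier positions only.
            // Non-causal: every position sees every other position.
            mask[i][j] = causal ? (j <= i) : true;
        }
    }
    return mask;
}

// Counts how many (query, key) pairs the mask allows.
int count_visible(const std::vector<std::vector<bool>>& mask) {
    int visible = 0;
    for (const auto& row : mask) {
        for (bool allowed : row) {
            if (allowed) ++visible;
        }
    }
    return visible;
}
```

For a sequence of 3 tokens, the causal mask allows 6 pairs (the lower triangle including the diagonal) and the non-causal mask allows all 9. Non-autoregressive decoding needs the second shape, since every output position has to see the whole sequence at once.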

To-Do

  • converted model to GGUF
    • convert it without modifying any of the existing code, as the tokenizer requires some care
  • basic framework
    • load the quantized model
    • orchestrate the required embeddings
    • juggle the output head / classifier properly
  • phonemize text
    • with the help of espeak-ng
  • tokenize phonemes
    • tokenize with llama_tokenize instead of a homebrewed method because the tokenizer is being a huge thorn
  • load audio from disk
  • encode audio
  • sum embeddings for the prom and prior resps
  • working AR output
    • AR sampling
  • working NAR-len output
    • NAR-len sampling
  • working NAR output
    • NAR sampling
  • decode audio to disk
  • a functional CLI
  • actually make it work
  • clean up to make the code usable elsewhere
  • configured to allow for being used as a lib
    • (I do need to validate this in my engine project, but that's in MSYS2)
  • feature parity with the PyTorch version
    • vocos
    • additional tasks
      • stt
      • ns / sr
      • samplers
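
The "sum embeddings for the prom and prior resps" item can be pictured with a small, self-contained sketch. It assumes the audio codes are laid out as `[level][frame]` and that each codebook level has its own embedding table. The `Matrix` alias and the `sum_embeddings` function are hypothetical names for illustration only, not the layout or API that vall_e.cpp necessarily uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix: [rows][dim]. Hypothetical alias for illustration.
using Matrix = std::vector<std::vector<float>>;

// codes[level][t]  : token id at codebook level `level` for frame t
// tables[level]    : embedding table for that level, shaped [vocab][dim]
// Returns one summed embedding per frame, shaped [n_frames][dim].
Matrix sum_embeddings(const std::vector<std::vector<int>>& codes,
                      const std::vector<Matrix>& tables) {
    assert(!codes.empty() && codes.size() <= tables.size());
    assert(!tables[0].empty());
    const std::size_t n_frames = codes[0].size();
    const std::size_t dim = tables[0][0].size();

    Matrix out(n_frames, std::vector<float>(dim, 0.0f));
    for (std::size_t level = 0; level < codes.size(); ++level) {
        assert(codes[level].size() == n_frames);
        for (std::size_t t = 0; t < n_frames; ++t) {
            const int id = codes[level][t];
            assert(id >= 0 && static_cast<std::size_t>(id) < tables[level].size());
            const std::vector<float>& row = tables[level][id];
            assert(row.size() == dim);
            // Accumulate this level's embedding into the frame's input vector.
            for (std::size_t d = 0; d < dim; ++d) {
                out[t][d] += row[d];
            }
        }
    }
    return out;
}
```

Passing only the first k levels of `codes` yields the summed input for the levels decoded so far, which is how the "prior resps" would be fed back in under these assumptions.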