2024-12-21 19:59:56 -06:00

vall_e.cpp

This is an implementation that makes use of llama.cpp and encodec.cpp.

At the moment it's very barebones, as I'm still wrestling with llama.cpp's API while trying to avoid modifying its code.

Build

Populate ./include/ with the llama.cpp and encodec.cpp headers.

Populate ./libs/ with the compiled libraries of llama.cpp and encodec.cpp.

  • encodec.cpp requires updating ggml to the latest version and doing a quick hack to make it work on the CPU backend.
  • llama.cpp currently requires no hacks, but it would be very nice to hack in a way to retrieve a model's tok_embd.

Run make.

To-Do

  • convert the model to GGUF
    • convert it without modifying any of the existing code, as the tokenizer requires some care
  • basic framework
    • load the quantized model
    • orchestrate the required embeddings
    • juggle the output head / classifier properly
  • phonemize text
    • with the help of espeak-ng
  • tokenize phonemes
    • the tokenizer is a huge thorn in my side when it comes to actual sequences
  • load audio from disk
  • encode audio
  • sum embeddings for the prom and prior resps
  • AR sampling
  • NAR-len demasking sampling
  • NAR sampling
  • decode audio to disk
  • a functional CLI
  • actually make it work
    • naively stitching the model together doesn't seem to be good enough, since the output is wrong; it most likely needs training with a glued-together classifier
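
As a rough illustration of the "sum embeddings for the prom and prior resps" item above, here is a minimal sketch of what that step could look like. It assumes each codebook level has its own embedding table and that the per-level embeddings are simply added together for each frame. The `embedding_table_t` struct and the `sum_embeddings` function are hypothetical names made up for this example; they are not part of llama.cpp, encodec.cpp, or this repo's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one embedding table per codebook level,
// stored row-major as [n_tokens][n_embd].
struct embedding_table_t {
    std::size_t n_embd;
    std::vector<float> weights;

    // Pointer to the embedding row for a given token ID.
    const float* row(std::size_t token) const {
        return weights.data() + token * n_embd;
    }
};

// For every frame, sums the embeddings of that frame's code at each level.
// codes[level][frame] holds the token ID; every level is assumed to have
// the same number of frames. Returns a row-major [n_frames][n_embd] buffer.
std::vector<float> sum_embeddings(
    const std::vector<embedding_table_t>& tables,
    const std::vector<std::vector<int32_t>>& codes
) {
    if (codes.empty()) return {};
    assert(tables.size() >= codes.size());

    const std::size_t n_frames = codes[0].size();
    const std::size_t n_embd = tables[0].n_embd;
    std::vector<float> out(n_frames * n_embd, 0.0f);

    for (std::size_t level = 0; level < codes.size(); ++level) {
        assert(codes[level].size() == n_frames);
        assert(tables[level].n_embd == n_embd);
        for (std::size_t frame = 0; frame < n_frames; ++frame) {
            const float* src = tables[level].row(
                static_cast<std::size_t>(codes[level][frame]));
            for (std::size_t i = 0; i < n_embd; ++i) {
                out[frame * n_embd + i] += src[i];
            }
        }
    }
    return out;
}
```

The resulting buffer would then be what gets fed to the model as input embeddings in place of token IDs; how the tables are actually pulled out of the GGUF file is left out here, since that depends on the llama.cpp side.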