vall_e.cpp

This is an implementation of VALL-E that makes use of llama.cpp and encodec.cpp.

At the moment it's very barebones, as I wrestle with llama.cpp's API without having to modify its code.

Build

Populate ./include/ with the llama.cpp and encodec.cpp headers.

Populate ./libs/ with the compiled libraries of llama.cpp and encodec.cpp.
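
For reference, a layout like the following is what the build expects; the exact header and library file names are assumptions, so adjust them to whatever your llama.cpp and encodec.cpp builds actually produce:

```
include/
  llama.h
  ggml.h
  encodec.h
libs/
  libllama.so    (or .a / .dylib)
  libencodec.a
```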

  • encodec.cpp requires updating its ggml to the latest version, plus a quick hack to make it work on the CPU backend.
  • llama.cpp currently requires no hacks, but (see the sketch after this list):
    • it would be very nice to retrieve a model's tok_embd through the API.
    • it would be very nice to specify only a slice of the output head through the API.
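
For illustration, the two wished-for additions could look something like this; these signatures are hypothetical and do not exist in llama.cpp's API today:

```cpp
// hypothetical additions to llama.h -- neither of these exists yet

// return the model's token-embeddings tensor (tok_embd) directly
struct ggml_tensor * llama_model_get_tok_embd(const struct llama_model * model);

// restrict the output head so logits are computed only for
// vocabulary rows in [start, end), instead of the full projection
void llama_set_output_head_slice(struct llama_context * ctx, int32_t start, int32_t end);
```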

Run make.

To-Do

  • converted model to GGUF
    • convert it without modifying any of the existing code, as the tokenizer requires some care
  • basic framework
    • load the quantized model (see the loading sketch below)
    • orchestrate the required embeddings
    • juggle the output head / classifier properly
  • phonemize text
    • with the help of espeak-ng (see the espeak-ng sketch below)
  • tokenize phonemes
    • the tokenizer is a huge thorn for actual sequences
  • load audio from disk
  • encode audio (see the encodec.cpp sketch below)
  • sum embeddings for the prom and prior resps
  • AR sampling (see the sampling sketch below)
  • NAR-len demasking sampling
  • NAR sampling
  • decode audio to disk
  • a functional CLI
  • actually make it work
    • it seems naively stitching the model together isn't good enough, since the output is wrong; it most likely needs to be trained with the glued-together classifier
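
Sketches

A minimal sketch of the model-loading step, written against the llama.cpp C API as of late 2024. The model path, context size, and GPU-layer count are placeholders:

```cpp
#include "llama.h"

#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0; // CPU backend only for now

    // "vall_e.gguf" is a placeholder path to the converted model
    llama_model * model = llama_load_model_from_file("vall_e.gguf", mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048; // arbitrary context size for this sketch

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) {
        fprintf(stderr, "failed to create context\n");
        return 1;
    }

    // ... embeddings, sampling, and decoding go here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```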
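A sketch of phonemizing text through espeak-ng's C API. The voice name and the IPA phoneme-mode bit are assumptions; check speak_lib.h for the exact flag meanings:

```cpp
#include <espeak-ng/speak_lib.h>

#include <cstdio>
#include <string>

// phonemize a UTF-8 string into IPA with espeak-ng
static std::string phonemize(const std::string & text) {
    const int phoneme_mode = 0x02; // bit 1: IPA output (per speak_lib.h)

    std::string result;
    const void * ptr = text.c_str();
    // espeak_TextToPhonemes processes one clause at a time and
    // advances ptr, setting it to NULL once the input is consumed
    while (ptr != nullptr) {
        const char * phonemes = espeak_TextToPhonemes(&ptr, espeakCHARS_UTF8, phoneme_mode);
        if (phonemes) {
            result += phonemes;
        }
    }
    return result;
}

int main() {
    // no audio playback is needed, we only want the phonemes
    if (espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, nullptr, 0) < 0) {
        fprintf(stderr, "failed to initialize espeak-ng\n");
        return 1;
    }
    espeak_SetVoiceByName("en"); // assumed voice name

    printf("%s\n", phonemize("Hello world.").c_str());

    espeak_Terminate();
    return 0;
}
```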
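A sketch of the encode/decode round trip with encodec.cpp. The function names follow its encodec.h, but the exact signatures vary between versions, so treat this as an outline rather than gospel:

```cpp
#include "encodec.h"

#include <cstdio>
#include <vector>

int main() {
    const int n_threads = 4;

    // offset 0, no GPU layers -- CPU backend only
    struct encodec_context * ectx = encodec_load_model("encodec.bin", 0, 0);
    if (!ectx) {
        fprintf(stderr, "failed to load the encodec model\n");
        return 1;
    }

    // placeholder input: one second of silence at 24 kHz mono
    std::vector<float> pcm(24000, 0.0f);

    // encode PCM samples into discrete codes
    if (!encodec_compress_audio(ectx, pcm.data(), (int) pcm.size(), n_threads)) {
        fprintf(stderr, "encode failed\n");
        return 1;
    }
    int32_t * codes = encodec_get_codes(ectx);
    int n_codes     = encodec_get_codes_size(ectx);

    // decode the codes back into PCM samples
    if (!encodec_decompress_audio(ectx, codes, n_codes, n_threads)) {
        fprintf(stderr, "decode failed\n");
        return 1;
    }
    float * audio = encodec_get_audio(ectx);
    int n_audio   = encodec_get_audio_size(ectx);
    printf("decoded %d samples\n", n_audio);

    encodec_free(ectx);
    return 0;
}
```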
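A sketch of a greedy AR sampling loop over llama.cpp's raw batch API. The stop token is a placeholder, and real VALL-E sampling also has to juggle the summed embeddings and the output-head slice described above:

```cpp
#include "llama.h"

#include <vector>

// feed a (non-empty) prompt, then greedily sample until the stop token or n_max
static std::vector<llama_token> sample_ar(llama_context * ctx, const llama_model * model,
        const std::vector<llama_token> & prompt, int n_max, llama_token stop_token) {
    const int n_vocab = llama_n_vocab(model);

    llama_batch batch = llama_batch_init((int) prompt.size(), 0, 1);

    // prompt pass: request logits only for the last position
    batch.n_tokens = (int) prompt.size();
    for (int i = 0; i < batch.n_tokens; ++i) {
        batch.token[i]     = prompt[i];
        batch.pos[i]       = i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = (i == batch.n_tokens - 1);
    }

    std::vector<llama_token> out;
    int n_cur = batch.n_tokens;

    while ((int) out.size() < n_max) {
        if (llama_decode(ctx, batch) != 0) {
            break; // decode failed
        }

        // greedy argmax over the logits of the last evaluated position
        const float * logits = llama_get_logits_ith(ctx, batch.n_tokens - 1);
        llama_token best = 0;
        for (llama_token t = 1; t < n_vocab; ++t) {
            if (logits[t] > logits[best]) {
                best = t;
            }
        }
        if (best == stop_token) {
            break;
        }
        out.push_back(best);

        // feed the sampled token back in as a single-token batch
        batch.n_tokens     = 1;
        batch.token[0]     = best;
        batch.pos[0]       = n_cur++;
        batch.n_seq_id[0]  = 1;
        batch.seq_id[0][0] = 0;
        batch.logits[0]    = true;
    }

    llama_batch_free(batch);
    return out;
}
```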