vall-e/vall_e.cpp
2024-12-23 20:36:40 -06:00

vall_e.cpp

This is an implementation of VALL-E inference in C++ that makes use of llama.cpp and encodec.cpp.

At the moment it's very barebones, as I'm trying to wrestle with llama.cpp's API without needing to modify its code.

Build

Populate ./include/ with the llama.cpp and encodec.cpp headers.

Populate ./libs/ with the compiled libraries of llama.cpp and encodec.cpp.

Run make.

Required Modifications

encodec.cpp requires updating its bundled GGML copy to the latest version, which in turn requires a few changed lines to get the CPU backend working.

llama.cpp: the only modification that might be needed is one that ensures a non-causal attention mask is used; everything else necessary can be hacked together with clever tricks.

To-Do

  • converted model to GGUF
    • convert it without modifying any of the existing code, as the tokenizer requires some care
    • actually convert the model properly, as the embeddings differ from the real model
  • basic framework
    • load the quantized model
    • orchestrate the required embeddings
    • juggle the output head / classifier properly
  • phonemize text
    • with the help of espeak-ng
  • tokenize phonemes
    • the tokenizer is proving to be a huge thorn when run on actual sequences
  • load audio from disk
  • encode audio
  • sum embeddings for the prom and prior resps
  • working AR output
    • AR sampling
  • working NAR-len output
    • NAR-len sampling
  • working NAR output
    • NAR sampling
  • decode audio to disk
  • a functional CLI
  • actually make it work