vall_e.cpp

This is an implementation of VALL-E inference that makes use of llama.cpp and encodec.cpp.

At the moment it's very barebones, as I'm still wrestling with llama.cpp's API while trying not to modify its code.

Build

Populate ./include/ with the llama.cpp and encodec.cpp headers.

Populate ./libs/ with the compiled libraries of llama.cpp and encodec.cpp.

Run make.

Required Modifications

encodec.cpp requires updating its GGML copy to the latest version, which in turn requires a few extra lines to get the CPU backend working. llama.cpp might not require any modifications, but for now the following are needed:

  • llm.build_vall_e can mostly copy llm.build_llama, but with:
    • KQ_mask = build_inp_KQ_mask( lctx.cparams.causal_attn )
    • a unified output head (pain)
      • OR adjusting the model.output to the correct classifier head (better option)
      • OR slicing that tensor with the right range (ggml_view_2d confuses me)
      • both options also require *const_cast<uint32_t*>(&ctx->model.hparams.n_vocab) = output->ne[1]; because the size of the logits is tied to n_vocab
  • commenting out GGML_ABORT("input/output layer tensor %s used with a layer number", tn.str().c_str()); because grabbing the embeddings/classifiers requires using bid to trick it into thinking they are part of a layer
  • some helper functions to retrieve the embeddings tensor from the model
  • some helper functions to set the target classifier head
  • some fix for GGML_ASSERT(mask->ne[0] == a->ne[0]) when using a non-causal attention mask (or I can test on the model that had a causal NAR...)
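
Since slicing the output tensor comes down to offset and stride arithmetic, here is a minimal sketch of that arithmetic in plain C++, with no dependency on ggml. The matrix_view struct and the slice_rows and at helpers are hypothetical names made up for illustration. The sketch assumes a contiguous f32 matrix laid out the way ggml does it, with ne0 as the row length and ne1 as the number of rows (so a classifier head is a range of rows). With ggml itself, the same four numbers would be passed to ggml_view_2d(ctx, a, ne0, ne1, nb1, offset).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 2D f32 tensor using ggml-style naming:
// ne0 is the fastest-moving dimension (row length), ne1 is the row count.
struct matrix_view {
    const float * data; // first element of the view
    int64_t ne0;        // elements per row
    int64_t ne1;        // number of rows
    size_t  nb1;        // stride between consecutive rows, in bytes
};

// Select rows [row_start, row_end) without copying any data.
// The row stride is unchanged; only the starting byte offset and row count differ.
matrix_view slice_rows(const matrix_view & src, int64_t row_start, int64_t row_end) {
    assert(0 <= row_start && row_start <= row_end && row_end <= src.ne1);
    const size_t offset = static_cast<size_t>(row_start) * src.nb1; // in bytes
    const char * base = reinterpret_cast<const char *>(src.data) + offset;
    return { reinterpret_cast<const float *>(base), src.ne0, row_end - row_start, src.nb1 };
}

// Read one element of a view, stepping rows by the byte stride.
float at(const matrix_view & v, int64_t row, int64_t col) {
    assert(0 <= row && row < v.ne1 && 0 <= col && col < v.ne0);
    const char * p = reinterpret_cast<const char *>(v.data) + static_cast<size_t>(row) * v.nb1;
    return reinterpret_cast<const float *>(p)[col];
}
```

For a unified head holding several classifiers back to back, slice_rows(full, start, end) would pick out one classifier's rows, and the resulting ne1 is the value that n_vocab has to match.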

To-Do

  • converted model to GGUF
    • convert it without modifying any of the existing code, as the tokenizer requires some care
  • basic framework
    • load the quantized model
    • orchestrate the required embeddings
    • juggle the output head / classifier properly
  • phonemize text
    • with the help of espeak-ng
  • tokenize phonemes
    • the tokenizer is being a huge thorn in my side on actual sequences
  • load audio from disk
  • encode audio
  • sum embeddings for the prom and prior resps
  • working AR output
    • AR sampling
    • currently need a model that didn't regress with the AR:0:0 output
  • working NAR-len output
    • NAR-len sampling
    • currently cannot run inference with non-causal attention
  • working NAR output
    • NAR sampling
    • currently cannot run inference with non-causal attention
  • decode audio to disk
  • a functional CLI
  • actually make it work
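
The "sum embeddings for the prom and prior resps" item can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual implementation. It assumes one embedding table per codebook level, stored row-major, and that each audio frame's input embedding is the plain sum of its per-level lookups. The embedding_table struct and sum_embeddings function are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: one embedding table per codebook level,
// stored row-major as [n_tokens][n_embd].
struct embedding_table {
    std::vector<float> weights;
    size_t n_embd;
};

// codes[level][t] is the audio code at codebook level `level` for frame `t`.
// Returns a flattened [n_frames][n_embd] buffer in which every frame is the
// sum of its per-level embedding rows.
std::vector<float> sum_embeddings(const std::vector<embedding_table> & tables,
                                  const std::vector<std::vector<int32_t>> & codes) {
    assert(!codes.empty() && codes.size() <= tables.size());
    const size_t n_frames = codes[0].size();
    const size_t n_embd   = tables[0].n_embd;
    std::vector<float> out(n_frames * n_embd, 0.0f);
    for (size_t level = 0; level < codes.size(); ++level) {
        const embedding_table & table = tables[level];
        assert(table.n_embd == n_embd && codes[level].size() == n_frames);
        for (size_t t = 0; t < n_frames; ++t) {
            assert(codes[level][t] >= 0);
            const size_t row = static_cast<size_t>(codes[level][t]);
            assert((row + 1) * n_embd <= table.weights.size());
            // Accumulate this level's embedding row into the frame's slot.
            for (size_t i = 0; i < n_embd; ++i) {
                out[t * n_embd + i] += table.weights[row * n_embd + i];
            }
        }
    }
    return out;
}
```

Passing fewer code levels than there are tables sums only the levels that are present, which is the shape needed when only some prior resps levels have been generated so far.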