# vall_e.cpp
This is an implementation that makes use of llama.cpp and encodec.cpp.
Model weights can:
- be found at `ecker/vall-e@gguf`
- be converted with `vall_e.export --yaml=./model_path/config.yaml --hf`, then running `python3 /path/to/your/llama.cpp/convert_hf_to_gguf ./model_path/hf/`
## Build

1. Populate `./include/` with the `ggml`, `llama.cpp`, and `encodec.cpp` headers.
2. Populate `./lib/` with the compiled libraries of `llama.cpp`, `encodec.cpp`, and `espeak-ng` (if not already in your `LD_LIBRARY_PATH`).
3. Run `make`.
### Required Modifications

- `encodec.cpp` requires updating its copy of GGML to the latest version, which needs a few extra lines to get the CPU backend working (per my fork).
- The only modification `llama.cpp` may need is ensuring that a non-causal attention mask is used; everything else can be hacked together with clever tricks.
- initially written against commit `9ba399dfa7f115effc63d48e6860a94c9faa31b2`, updated to commit `7a84777f42a9b3ba47db5d20b7662f8ddf92f652`
## To-Do

- converted model to GGUF
  - convert it without modifying any of the existing code, as the tokenizer requires some care
- basic framework
  - load the quantized model
  - orchestrate the required embeddings
  - juggle the output head / classifier properly
- phonemize text
  - with the help of espeak-ng
- tokenize phonemes
  - tokenize with `llama_tokenize` instead of a homebrewed method because the tokenizer is being a huge thorn
- load audio from disk
- encode audio
- sum embeddings for the `prom` and prior `resp`s
- working `AR` output / `AR` sampling
- working `NAR-len` output / `NAR-len` sampling
  - proper scoring
- working `NAR` output / `NAR` sampling
- decode audio to disk
- a functional CLI
- actually make it work
- clean up to make the code usable elsewhere
- configured to allow for being used as a lib
  - (I do need to validate this in my engine project, but that's in MSYS2)
- feature parity with the PyTorch version
  - vocos
  - additional tasks (`stt`, `ns`/`sr`, samplers)