vall_e.cpp
This is an implementation that makes use of llama.cpp and encodec.cpp.
Model weights can be:
- found at ecker/vall-e@gguf
- converted with vall_e.export --yaml=./model_path/config.yaml --hf, then running python3 /path/to/your/llama.cpp/convert_hf_to_gguf ./model_path/hf/
Build
- Populate ./include/ with the ggml, llama.cpp, and encodec.cpp headers.
- Populate ./lib/ with the compiled libraries of llama.cpp, encodec.cpp, and espeak-ng (if not already in your LD_LIBRARY_PATH).
- Run make.
Required Modifications
- encodec.cpp requires updating its GGML copy to the latest version, which in turn requires a few extra lines to get the CPU backend working (per my fork).
- llama.cpp: the only modification that might be needed is ensuring that a non-causal attention mask is used; everything else necessary can be hacked together with clever tricks.
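For readers unfamiliar with the distinction, the following is a minimal sketch of what a causal versus a non-causal attention mask allows. It is not code from llama.cpp or vall_e.cpp: build_attention_mask, can_attend, and count_visible are hypothetical helpers written only for illustration. llama.cpp does expose llama_set_causal_attn in llama.h for toggling this on a context; whether vall_e.cpp relies on that call is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not part of llama.cpp or vall_e.cpp.
// Returns an n x n row-major mask where mask[i * n + j] is true when
// query position i is allowed to attend to key position j.
std::vector<bool> build_attention_mask(std::size_t n, bool causal) {
    std::vector<bool> mask(n * n, false);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // Causal: only the current and earlier positions are visible.
            // Non-causal: every position is visible.
            mask[i * n + j] = causal ? (j <= i) : true;
        }
    }
    return mask;
}

// Looks up whether query position i may attend to key position j.
bool can_attend(const std::vector<bool>& mask, std::size_t n,
                std::size_t i, std::size_t j) {
    assert(i < n && j < n);
    return mask[i * n + j];
}

// Counts how many (query, key) pairs the mask allows.
std::size_t count_visible(const std::vector<bool>& mask) {
    std::size_t visible = 0;
    for (bool allowed : mask) {
        if (allowed) ++visible;
    }
    return visible;
}
```

With a causal mask, position 0 cannot see position 3; with a non-causal mask it can, which is what a pass that fills in all positions at once would need.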
To-Do
- converted model to GGUF
- convert it without modifying any of the existing code, as the tokenizer requires some care
- basic framework
- load the quantized model
- orchestrate the required embeddings
- juggle the output head / classifier properly
- phonemize text
- with the help of espeak-ng
- tokenize phonemes
- tokenize with llama_tokenize instead of a homebrewed method because the tokenizer is being a huge thorn
- load audio from disk
- encode audio
- sum embeddings for the prom and prior resps
- working AR output
- AR sampling
- working NAR-len output
- NAR-len sampling
- working NAR output
- NAR sampling
- decode audio to disk
- a functional CLI
- actually make it work
- clean up to make the code usable elsewhere
- configured to allow for being used as a lib
- (I do need to validate this in my engine project, but that's in MSYS2)
- feature parity with the PyTorch version
- vocos
- additional tasks
- stt
- ns / sr
- samplers
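
The "sum embeddings for the prom and prior resps" item can be sketched roughly as below. This is an illustrative sketch, not the code in this repository. It assumes that the sum runs over the codebook levels of the encoded audio and that each level has its own embedding table, neither of which this README spells out. Vec, EmbeddingTable, and sum_level_embeddings are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
// table[code] is the embedding vector for one codebook level (assumed layout).
using EmbeddingTable = std::vector<Vec>;

// Sums, per frame, the embeddings of the codes from every codebook level.
// codes[level][t] is the code at frame t for that level;
// tables[level] is the embedding table for that level.
std::vector<Vec> sum_level_embeddings(
    const std::vector<std::vector<int>>& codes,
    const std::vector<EmbeddingTable>& tables,
    std::size_t n_embd)
{
    assert(codes.size() == tables.size());
    const std::size_t n_frames = codes.empty() ? 0 : codes[0].size();
    std::vector<Vec> out(n_frames, Vec(n_embd, 0.0f));
    for (std::size_t level = 0; level < codes.size(); ++level) {
        assert(codes[level].size() == n_frames);
        for (std::size_t t = 0; t < n_frames; ++t) {
            // .at() throws on an out-of-range code instead of reading garbage.
            const Vec& e = tables[level].at(static_cast<std::size_t>(codes[level][t]));
            assert(e.size() == n_embd);
            for (std::size_t d = 0; d < n_embd; ++d) {
                out[t][d] += e[d];
            }
        }
    }
    return out;
}
```

The result has one summed vector per audio frame, which is the shape a model expects when it consumes a single embedding per input position.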