# vall_e.cpp
This is an implementation that makes use of `llama.cpp` and `encodec.cpp`. At the moment it's very barebones, as I try to wrestle with `llama.cpp`'s API without needing to modify its code.
## Build
- Populate `./include/` with the `llama.cpp` and `encodec.cpp` headers.
- Populate `./libs/` with the compiled libraries of `llama.cpp` and `encodec.cpp`.
- Run `make`.
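As a quick smoke test that the headers and libraries are wired up, something like the following should compile and link. The header names are assumptions based on each project's stock public header, so adjust them to your checkout; `llama_print_system_info()` is part of llama.cpp's public API.

```cpp
// build_check.cpp -- smoke test that headers in ./include/ resolve and the
// libraries in ./libs/ link. Header names assume each project's stock
// public header; adjust to your checkout.
#include <cstdio>

#include "llama.h"
#include "encodec.h"

int main() {
    // llama_print_system_info() is part of llama.cpp's public API; if this
    // prints, compilation and linking against libllama succeeded.
    printf("%s\n", llama_print_system_info());
    return 0;
}
```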
## Required Modifications
`encodec.cpp` requires updating its GGML copy to the latest version, which takes a few extra lines to get the CPU backend working. `llama.cpp` might not require any modifications, but implementing `LLM_ARCH_VALL_E` requires some surgery.
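For a sense of the surgery involved: llama.cpp keeps internal architecture tables (an `llm_arch` enum plus name and tensor-name mappings) that every supported architecture has to be registered in. A rough, illustrative-only sketch of the first step; exact file locations and table names shift between llama.cpp revisions, so treat this as the shape of the change rather than an exact patch:

```cpp
// Illustrative only: registering a new architecture inside llama.cpp's
// internal tables (llama-arch.h/.cpp in recent trees). Names and layout
// vary across llama.cpp revisions.
#include <map>

enum llm_arch {
    LLM_ARCH_LLAMA,
    // ... existing architectures ...
    LLM_ARCH_VALL_E,   // new entry for the VALL-E graph
    LLM_ARCH_UNKNOWN,
};

static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    { LLM_ARCH_LLAMA,  "llama"  },
    // ... existing architectures ...
    { LLM_ARCH_VALL_E, "vall-e" }, // must match the GGUF architecture string
};
```

Beyond the enum, the per-architecture tensor-name table and the graph construction switch also need matching cases; that is where most of the surgery lives.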
## To-Do
- converted model to GGUF
  - convert it without modifying any of the existing code, as the tokenizer requires some care
- basic framework
  - load the quantized model
  - orchestrate the required embeddings
  - juggle the output head / classifier properly
- phonemize text
  - with the help of espeak-ng (see the phonemize sketch after this list)
- tokenize phonemes
  - the tokenizer is being a huge thorn on actual sequences
- load audio from disk
- encode audio
- sum embeddings for the `prom` and prior `resp`s (see the embedding-sum sketch below)
- working `AR` output
  - `AR` sampling (see the sampling sketch below)
  - currently need a model that didn't regress with the `AR:0:0` output
- working `NAR-len` output
  - `NAR-len` sampling (see the demasking sketch below)
  - need to assert that a non-causal mask is used
- working `NAR` output
  - `NAR` sampling
  - need to assert that a non-causal mask is used
- decode audio to disk
- a functional CLI
- actually make it work
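A note on the phonemize step: espeak-ng exposes a C API that returns phoneme strings directly, which is presumably what gets tokenized downstream. A minimal sketch against espeak-ng's public `speak_lib.h`; the IPA flag value and the clause-by-clause loop follow that header's documentation, so verify them against your installed version:

```cpp
// phonemize.cpp -- sketch of text -> IPA phonemes via espeak-ng's C API.
#include <cstdio>
#include <string>
#include <espeak-ng/speak_lib.h>

// Convert text to an IPA phoneme string; returns "" on init failure.
std::string phonemize(const std::string & text) {
    // No audio output needed -- we only want the phonemizer.
    if (espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, nullptr, 0) < 0) {
        return "";
    }
    espeak_SetVoiceByName("en-us");

    std::string phonemes;
    const void * ptr = text.c_str();
    // espeak_TextToPhonemes translates one clause per call and advances ptr,
    // setting it to nullptr once the whole input has been consumed.
    while (ptr != nullptr) {
        // phonememode 0x02: bit 1 set -> IPA (UTF-8) instead of ascii names.
        const char * out = espeak_TextToPhonemes(&ptr, espeakCHARS_UTF8, 0x02);
        if (out != nullptr) phonemes += out;
    }
    espeak_Terminate();
    return phonemes;
}

int main() {
    printf("%s\n", phonemize("Hello world.").c_str());
    return 0;
}
```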
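For the embedding-sum item: each `prom`/`resp` token carries one code per RVQ level, and the per-level embeddings get summed into a single vector per position before entering the transformer. A plain-C++ sketch of that reduction, with the flat table layout and argument shapes as assumptions rather than the actual model's:

```cpp
#include <vector>

// emb_tables[l] is a flat (n_vocab * n_dim) embedding table for RVQ level l;
// codes[l][p] is the level-l code at sequence position p.
std::vector<float> sum_embeddings(
        const std::vector<std::vector<float>> & emb_tables,
        const std::vector<std::vector<int>>   & codes,
        int n_dim) {
    const size_t n_levels = codes.size();
    const size_t n_pos    = codes[0].size();
    std::vector<float> out(n_pos * n_dim, 0.0f);
    for (size_t l = 0; l < n_levels; ++l) {
        for (size_t p = 0; p < n_pos; ++p) {
            const float * row = &emb_tables[l][codes[l][p] * n_dim];
            // Accumulate this level's embedding into the position's vector.
            for (int d = 0; d < n_dim; ++d) {
                out[p * n_dim + d] += row[d];
            }
        }
    }
    return out;
}
```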
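For `AR` sampling, the core operation is temperature sampling over the final position's logits. A self-contained sketch of that operation (deliberately not llama.cpp's sampler API, which has changed across versions):

```cpp
#include <algorithm>
#include <cmath>
#include <iterator>
#include <random>
#include <vector>

int sample_token(const std::vector<float> & logits, float temperature, std::mt19937 & rng) {
    // Greedy decode when temperature is (near) zero.
    if (temperature < 1e-5f) {
        return (int) std::distance(logits.begin(),
                                   std::max_element(logits.begin(), logits.end()));
    }
    // Temperature-scaled softmax, stabilized by subtracting the max logit.
    const float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_l) / temperature);
        sum += probs[i];
    }
    for (float & p : probs) p /= sum;
    // Draw a token index proportionally to the probabilities.
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```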
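For the `NAR-len` items, the usual scheme in masked generative decoders (which, as I understand it, the `NAR-len` mode follows) is iterative demasking: start fully masked, keep the most confident predictions each step, and re-mask the rest on a schedule. A sketch of that loop; `forward` is a caller-supplied stand-in for one non-causal model pass, not this repo's API:

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <numeric>
#include <vector>

struct Pred { int token; float prob; };

std::vector<int> demask(
        int n_pos, int n_steps, int mask_token,
        const std::function<std::vector<Pred>(const std::vector<int> &)> & forward) {
    std::vector<int> tokens(n_pos, mask_token);
    for (int step = 0; step < n_steps; ++step) {
        // One full-sequence pass; this MUST use a non-causal attention mask.
        const std::vector<Pred> preds = forward(tokens);
        // Cosine schedule: the masked fraction shrinks to zero by the last step.
        const float mask_ratio = std::cos(1.5707963f * float(step + 1) / float(n_steps));
        const int n_keep = n_pos - (int)(mask_ratio * n_pos);
        // Rank positions by model confidence; unmask the n_keep most confident.
        // This simplified variant re-selects the kept set from scratch each step.
        std::vector<int> order(n_pos);
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return preds[a].prob > preds[b].prob; });
        for (int i = 0; i < n_pos; ++i) {
            tokens[order[i]] = (i < n_keep) ? preds[order[i]].token : mask_token;
        }
    }
    return tokens;
}
```

The non-causal mask matters here: every pass has to attend over the full sequence, which is exactly what the assertion in the list above needs to check.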