# vall_e.cpp
This is an implementation that makes use of `llama.cpp` and `encodec.cpp`.

At the moment it is very much a work in progress.

Model weights can be found at `ecker/vall-e@gguf`.
## Build

- Populate `./include/` with the `ggml`, `llama.cpp`, and `encodec.cpp` headers.
- Populate `./libs/` with the compiled libraries of `llama.cpp`, `encodec.cpp`, and `espeak-ng`.
- Run `make`.
## Required Modifications

- `encodec.cpp` requires updating its GGML copy to the latest version, which takes a few extra lines to get the CPU backend working (per my fork).
- `llama.cpp`'s only possible required modification is ensuring that a non-causal attention mask is used; everything else necessary can be hacked together with clever tricks. A minimal sketch of toggling the mask follows.
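The sketch below assumes a reasonably recent `llama.cpp`, where `llama_set_causal_attn` is part of the public API; model and context setup are elided.

```cpp
// Minimal sketch: switch between a causal mask (AR pass) and a
// non-causal mask (NAR / NAR-len passes) before decoding a batch.
#include "llama.h"
#include <cstdio>

static void decode_pass(llama_context * ctx, llama_batch & batch, bool causal) {
    llama_set_causal_attn(ctx, causal); // causal for AR, non-causal for NAR
    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "llama_decode failed\n");
    }
}
```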
## To-Do

- converted the model to GGUF
  - convert it without modifying any of the existing code, as the tokenizer requires some care
- basic framework
  - load the quantized model
  - orchestrate the required embeddings
  - juggle the output head / classifier properly
- phonemize text
  - with the help of `espeak-ng` (see the phonemization sketch after this list)
- tokenize phonemes
  - tokenize with `llama_tokenize` instead of a homebrewed method, because the tokenizer is being a huge thorn (see the tokenization sketch after this list)
- load audio from disk
- encode audio (see the `encodec.cpp` round-trip sketch after this list)
- sum embeddings for the `prom` and prior `resp`s (the arithmetic is illustrated after this list)
- working `AR` output
  - `AR` sampling
- working `NAR-len` output
  - `NAR-len` sampling
- working `NAR` output
  - `NAR` sampling
- decode audio to disk
- a functional CLI
- actually make it work
- clean up to make the code usable elsewhere
- feature parity with the PyTorch version
  - vocos
  - additional tasks (`stt`, `ns`, `sr`, samplers)
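For the phonemization step, `espeak-ng`'s C API can do the text-to-phoneme conversion directly. A hedged sketch, assuming the `speak_lib.h` header and IPA output via the `0x02` phoneme-mode flag (worth double-checking against the installed espeak-ng version):

```cpp
#include <espeak-ng/speak_lib.h>
#include <cstdio>
#include <string>

static std::string phonemize(const std::string & text, const char * voice = "en") {
    // AUDIO_OUTPUT_SYNCHRONOUS: no playback, we only want phonemes
    if (espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, nullptr, 0) < 0) {
        fprintf(stderr, "espeak_Initialize failed\n");
        return "";
    }
    espeak_SetVoiceByName(voice);

    std::string out;
    const void * ptr = text.c_str();
    // espeak_TextToPhonemes consumes one clause per call and advances ptr;
    // phonememode 0x02 requests IPA output (assumption: verify the flag)
    while (ptr != nullptr) {
        const char * chunk = espeak_TextToPhonemes(&ptr, espeakCHARS_UTF8, 0x02);
        if (chunk != nullptr) {
            out += chunk;
        }
    }
    return out;
}
```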
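Tokenizing the phoneme string via `llama_tokenize` follows the usual two-call pattern, where a negative return value reports the required buffer size. A sketch assuming the `llama_model`-based signature of recent `llama.cpp` revisions; newer ones move this onto `llama_vocab`, so adjust to the pinned version:

```cpp
#include "llama.h"
#include <string>
#include <vector>

static std::vector<llama_token> tokenize_phonemes(const llama_model * model,
                                                  const std::string & phonemes) {
    // generous first guess; llama_tokenize reports the real size if too small
    std::vector<llama_token> tokens(phonemes.size() + 2);
    int n = llama_tokenize(model, phonemes.c_str(), (int32_t) phonemes.size(),
                           tokens.data(), (int32_t) tokens.size(),
                           /*add_special*/ true, /*parse_special*/ true);
    if (n < 0) { // buffer was too small: -n is the required token count
        tokens.resize((size_t) -n);
        n = llama_tokenize(model, phonemes.c_str(), (int32_t) phonemes.size(),
                           tokens.data(), (int32_t) tokens.size(), true, true);
    }
    tokens.resize(n > 0 ? (size_t) n : 0);
    return tokens;
}
```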
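The embedding-summing item refers to embedding each RVQ level of the `prom`/`resp` codes separately and summing the rows into a single input embedding per timestep. A standalone illustration of just that arithmetic; all names here are hypothetical, and the real code feeds the result to `llama.cpp` as an embedding input instead:

```cpp
#include <cstddef>
#include <vector>

// codes: [level][timestep] token ids; embds: per-level flat embedding
// tables of shape (n_vocab * n_embd). Returns one summed embedding per
// timestep, flattened to (n_steps * n_embd).
static std::vector<float> sum_embeddings(
        const std::vector<std::vector<int>> & codes,
        const std::vector<std::vector<float>> & embds,
        size_t n_embd) {
    const size_t n_levels = codes.size();
    const size_t n_steps  = codes.empty() ? 0 : codes[0].size();
    std::vector<float> out(n_steps * n_embd, 0.0f);
    for (size_t l = 0; l < n_levels; ++l) {
        for (size_t t = 0; t < n_steps; ++t) {
            const float * row = embds[l].data() + (size_t) codes[l][t] * n_embd;
            for (size_t i = 0; i < n_embd; ++i) {
                out[t * n_embd + i] += row[i];
            }
        }
    }
    return out;
}
```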
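Encoding and decoding audio go through `encodec.cpp`. A sketch of the round trip, using function names from `encodec.cpp`'s public header as I understand them (`encodec_load_model`, `encodec_compress_audio`, `encodec_decompress_audio`, and friends); verify them against the pinned revision:

```cpp
#include "encodec.h"
#include <cstdio>
#include <vector>

static bool roundtrip(const char * model_path, const std::vector<float> & pcm,
                      int n_threads) {
    struct encodec_context * ectx =
        encodec_load_model(model_path, /*offset*/ 0, /*n_gpu_layers*/ 0);
    if (!ectx) return false;

    // encode: raw PCM -> EnCodec residual codebook tokens
    if (!encodec_compress_audio(ectx, pcm.data(), (int) pcm.size(), n_threads)) {
        return false;
    }
    const int32_t * codes = encodec_get_codes(ectx);
    const int n_codes     = encodec_get_codes_size(ectx);

    // decode: tokens -> raw PCM (the same path the NAR output takes)
    if (!encodec_decompress_audio(ectx, codes, n_codes, n_threads)) {
        return false;
    }
    const int n_samples = encodec_get_audio_size(ectx);
    printf("decoded %d samples from %d codes\n", n_samples, n_codes);

    encodec_free(ectx);
    return true;
}
```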