# vall_e.cpp
This is a C++ implementation of VALL-E that makes use of [llama.cpp](https://github.com/ggerganov/llama.cpp/) and [encodec.cpp](https://github.com/PABannier/encodec.cpp).

At the moment it's ***very*** much a work in progress.

Model weights can be found at [`ecker/vall-e@gguf`](https://huggingface.co/ecker/vall-e/tree/gguf).

## Build

Populate `./include/` with the `ggml`, `llama.cpp`, and `encodec.cpp` headers.

Populate `./libs/` with the compiled libraries of `llama.cpp`, `encodec.cpp`, and `espeak-ng`.

Run `make`.

### Required Modifications

[`encodec.cpp`](https://github.com/PABannier/encodec.cpp) needs its bundled GGML copy updated to the latest version, which in turn needs a few extra lines to get the CPU backend working (see my [fork](https://github.com/e-c-k-e-r/encodec.cpp)).

The only modification [`llama.cpp`](https://github.com/ggerganov/llama.cpp) might need is one that ensures a non-causal attention mask is used; everything else can be hacked together with clever tricks.

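To make the causal versus non-causal distinction concrete, here is a minimal, library-free sketch. `build_attention_mask` and `count_visible` are hypothetical helpers written for illustration only; they are not part of `llama.cpp`, and this is not how this project wires the mask in. `llama.h` does declare `llama_set_causal_attn`, but whether that call alone is sufficient here is not something this README states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a llama.cpp function): builds an
// n_tokens x n_tokens visibility mask where mask[i][j] == true means
// query position i may attend to key position j.
std::vector<std::vector<bool>> build_attention_mask(std::size_t n_tokens, bool causal) {
    std::vector<std::vector<bool>> mask(n_tokens, std::vector<bool>(n_tokens, false));
    for (std::size_t i = 0; i < n_tokens; ++i) {
        for (std::size_t j = 0; j < n_tokens; ++j) {
            // Causal: only the current and earlier positions are visible.
            // Non-causal: every position is visible from every position.
            mask[i][j] = causal ? (j <= i) : true;
        }
    }
    return mask;
}

// Counts the visible (query, key) pairs in a square mask.
std::size_t count_visible(const std::vector<std::vector<bool>>& mask) {
    std::size_t n = 0;
    for (const auto& row : mask) {
        assert(row.size() == mask.size()); // the mask must be square
        for (bool visible : row) {
            if (visible) ++n;
        }
    }
    return n;
}
```

With four tokens, the causal mask exposes only the lower triangle (10 pairs), while the non-causal mask exposes all 16 pairs, so every position can see the whole sequence at once.
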
## To-Do
* [x] converted model to GGUF
  * [ ] convert it without modifying any of the existing code, as the tokenizer requires some care
* [x] basic framework
  * [x] load the quantized model
  * [x] orchestrate the required embeddings
  * [x] juggle the output head / classifier properly
* [x] phonemize text
  * with the help of espeak-ng
* [x] tokenize phonemes
  * tokenize with `llama_tokenize` instead of a homebrewed method because the tokenizer is being a huge thorn
* [x] load audio from disk
* [x] encode audio
* [x] sum embeddings for the `prom` and prior `resp`s
* [x] working `AR` output
  * [x] `AR` sampling
* [x] working `NAR-len` output
  * [x] `NAR-len` sampling
* [x] working `NAR` output
  * [x] `NAR` sampling
* [x] decode audio to disk
* [x] a functional CLI
* [x] actually make it work
* [x] clean up to make the code usable elsewhere
* [ ] feature parity with the PyTorch version
  * [ ] vocos
  * [ ] additional tasks (`stt`, `ns`, `sr`, samplers)
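
The "sum embeddings for the `prom` and prior `resp`s" item can be pictured with the sketch below. It is one plausible reading, not the project's actual code: it assumes one embedding table per codebook level and a `codes[level][frame]` layout, and every name in it (`Embedding`, `EmbeddingTable`, `sum_level_embeddings`) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Embedding = std::vector<float>;
// Assumed layout: one lookup table per codebook level, table[token_id] -> vector.
using EmbeddingTable = std::vector<Embedding>;

// Hypothetical helper: for every frame, look up the token of each codebook
// level in that level's table and sum the vectors element-wise.
// codes[level][frame] holds the token id at that level and frame.
std::vector<Embedding> sum_level_embeddings(
    const std::vector<EmbeddingTable>& tables,
    const std::vector<std::vector<int>>& codes,
    std::size_t n_embd)
{
    assert(!codes.empty() && codes.size() <= tables.size());
    const std::size_t n_frames = codes[0].size();

    // One zero-initialized output vector per frame.
    std::vector<Embedding> out(n_frames, Embedding(n_embd, 0.0f));

    for (std::size_t level = 0; level < codes.size(); ++level) {
        assert(codes[level].size() == n_frames); // all levels cover the same frames
        for (std::size_t t = 0; t < n_frames; ++t) {
            // .at() throws std::out_of_range on an invalid token id.
            const Embedding& e = tables[level].at(static_cast<std::size_t>(codes[level][t]));
            assert(e.size() == n_embd);
            for (std::size_t d = 0; d < n_embd; ++d) {
                out[t][d] += e[d];
            }
        }
    }
    return out;
}
```

Passing only the first few levels of `codes` gives the sum over the "prior" levels, which is presumably what a later pass would consume; the real model may scale or offset the levels differently.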