# vall_e.cpp
This is an implementation of VALL-E in C++ that makes use of [llama.cpp](https://github.com/ggerganov/llama.cpp/) and [encodec.cpp](https://github.com/PABannier/encodec.cpp).

Model weights can:
* be found at [`ecker/vall-e@gguf`](https://huggingface.co/ecker/vall-e/tree/gguf)
* be converted with `vall_e.export --yaml=./model_path/config.yaml --hf`, followed by running `python3 /path/to/your/llama.cpp/convert_hf_to_gguf.py ./model_path/hf/`

## Build

Populate `./include/` with the `ggml`, `llama.cpp`, and `encodec.cpp` headers.

Populate `./lib/` with the compiled libraries of `llama.cpp`, `encodec.cpp`, and `espeak-ng` (if not already in your `LD_LIBRARY_PATH`).

Run `make`.
### Required Modifications

[`encodec.cpp`](https://github.com/PABannier/encodec.cpp) requires updating its GGML copy to the latest version, which in turn needs a few extra lines to get the CPU backend working (per my [fork](https://github.com/e-c-k-e-r/encodec.cpp)).

For [`llama.cpp`](https://github.com/ggerganov/llama.cpp), the only modification that might be needed is ensuring a non-causal attention mask is used; everything necessary can be hacked together with clever tricks.
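
As a rough illustration of that "clever tricks" route, a non-causal mask can be requested through llama.cpp's public API instead of patching the library. This is only a sketch under that assumption, not the code this repo ships: the model path is a placeholder, and the exact entry points have drifted a little between llama.cpp releases (`llama_set_causal_attn` is the relevant call in recent ones).

```cpp
#include "llama.h"

// Sketch: load the GGUF-converted model and toggle the attention mask.
// The AR pass wants the usual causal mask, while the NAR passes want a
// full (non-causal) one, so the flag can simply be flipped per pass.
int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("./vall_e.gguf", mparams); // placeholder path
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) return 1;

    llama_set_causal_attn(ctx, true);  // AR level
    // ... llama_decode() + sampling for the AR level ...

    llama_set_causal_attn(ctx, false); // NAR / NAR-len levels
    // ... llama_decode() + sampling for the remaining levels ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```
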
## To-Do
* [x] converted model to GGUF
  * [x] convert it without modifying any of the existing code, as the tokenizer requires some care
* [x] basic framework
  * [x] load the quantized model
  * [x] orchestrate the required embeddings
  * [x] juggle the output head / classifier properly
* [x] phonemize text
  * with the help of espeak-ng
* [x] tokenize phonemes
  * tokenize with `llama_tokenize` instead of a homebrewed method because the tokenizer is being a huge thorn (see the sketch after this list)
* [x] load audio from disk
* [x] encode audio
* [x] sum embeddings for the `prom` and prior `resp`s
* [x] working `AR` output
  * [x] `AR` sampling
* [x] working `NAR-len` output
  * [x] `NAR-len` sampling
* [x] working `NAR` output
  * [x] `NAR` sampling
* [x] decode audio to disk
* [x] a functional CLI
* [x] actually make it work
* [x] clean up to make the code usable elsewhere
* [x] configured to allow for being used as a lib
  * (I do need to validate this in my engine project, but that's in MSYS2)
* [ ] feature parity with the PyTorch version
  * [ ] vocos
  * [ ] additional tasks
    * [ ] `stt`
    * [x] `ns` / `sr`
  * [ ] samplers

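
For reference, here is a minimal sketch of the phonemize → tokenize flow noted in the list above, using espeak-ng's `espeak_TextToPhonemes` and llama.cpp's `llama_tokenize`. The voice name, buffer sizing, and phoneme-mode flag are placeholders rather than what this repo actually uses, and the `llama_tokenize` signature differs between llama.cpp versions (newer ones take a `llama_vocab *`).

```cpp
#include <string>
#include <vector>

#include <espeak-ng/speak_lib.h>
#include "llama.h"

// Sketch: phonemize text with espeak-ng, then tokenize the phoneme string
// with llama.cpp's tokenizer instead of a homebrewed method.
std::vector<llama_token> phonemize_and_tokenize(const llama_model * model, const std::string & text) {
    // initialize espeak-ng without audio playback; nullptr = default data path
    espeak_Initialize(AUDIO_OUTPUT_RETRIEVAL, 0, nullptr, 0);
    espeak_SetVoiceByName("en"); // placeholder voice

    const void * ptr = text.c_str();
    // phonememode 0x02 requests IPA output; see speak_lib.h for the exact flags
    // (processes one clause per call; loop over ptr for longer text)
    const char * phonemes = espeak_TextToPhonemes(&ptr, espeakCHARS_UTF8, 0x02);
    const std::string phn = phonemes ? phonemes : "";

    // signature shown is the late-2024 one; a negative return means the buffer was too small
    std::vector<llama_token> tokens(phn.size() + 8);
    const int32_t n = llama_tokenize(model, phn.c_str(), (int32_t) phn.size(),
                                     tokens.data(), (int32_t) tokens.size(),
                                     /*add_special*/ true, /*parse_special*/ true);
    tokens.resize(n > 0 ? n : 0);
    return tokens;
}
```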