# vall_e.cpp

This is an implementation of VALL-E inference that makes use of [llama.cpp](https://github.com/ggerganov/llama.cpp/) and [encodec.cpp](https://github.com/PABannier/encodec.cpp).

At the moment it's ***very*** barebones as I try to wrestle with `llama.cpp`'s API without needing to modify its code.

## Build

Populate `./include/` with the `llama.cpp` and `encodec.cpp` headers.

Populate `./libs/` with the compiled libraries of `llama.cpp` and `encodec.cpp`.

Run `make`.


### Required Modifications

`encodec.cpp` requires updating its GGML copy to the latest version, which in turn needs a few extra lines to get the CPU backend working.

`llama.cpp` *might* not require any modifications, but:
* `llm.build_vall_e` can mostly copy `llm.build_llama`, but with:
	* `KQ_mask = build_inp_KQ_mask( lctx.cparams.causal_attn )`
	* a unified output head (pain)
		* OR adjusting the `model.output` to the correct classifier head
		* OR slicing that tensor with the right range (`ggml_view_2d` confuses me)
		* both also require `*const_cast<uint32_t*>(&ctx->model.hparams.n_vocab) = output->ne[1];` because the logits are tied to `n_vocab`
* commenting out `GGML_ABORT("input/output layer tensor %s used with a layer number", tn.str().c_str());` because grabbing embeddings/classifiers requires using `bid` to trick it into thinking they're part of a layer
* some helper functions to retrieve the embeddings tensor from the model
* some helper functions to set the target classifier head
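
As a rough illustration of the "slice that tensor with the right range" option above, here is a minimal sketch in plain C++ with no `ggml` or `llama.cpp` dependency. It assumes the unified output head is a row-major matrix with one row of `n_embd` weights per token, and that each classifier head owns a contiguous range of rows. `head_range`, `head_view`, `slice_head` and `project` are hypothetical names made up for this example, not part of either library. With `ggml`, the equivalent would presumably be a `ggml_view_2d` whose byte offset is the starting row times the row stride, but that mapping is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical: the rows [start, end) of the unified output matrix that
// belong to one classifier head.
struct head_range {
    size_t start; // first row of this head
    size_t end;   // one past the last row of this head
};

// Hypothetical: a non-owning view over a contiguous block of rows.
struct head_view {
    const float* data; // first element of the first row in the view
    size_t n_embd;     // elements per row
    size_t n_vocab;    // number of rows (the head's own vocabulary size)
};

// Returns a view over rows [range.start, range.end) of a row-major matrix.
// No data is copied; only the starting pointer and the row count change.
head_view slice_head(const std::vector<float>& output, size_t n_embd, head_range range) {
    assert(range.start <= range.end);
    assert(range.end * n_embd <= output.size());
    return { output.data() + range.start * n_embd, n_embd, range.end - range.start };
}

// Computes one logit per row of the view: logits[row] = dot(row, hidden).
// The logits vector is sized by the view, not by the full matrix, which is
// the same reason `n_vocab` has to be overridden in the list above.
std::vector<float> project(const head_view& head, const std::vector<float>& hidden) {
    assert(hidden.size() == head.n_embd);
    std::vector<float> logits(head.n_vocab, 0.0f);
    for (size_t row = 0; row < head.n_vocab; ++row) {
        const float* w = head.data + row * head.n_embd;
        for (size_t i = 0; i < head.n_embd; ++i) {
            logits[row] += w[i] * hidden[i];
        }
    }
    return logits;
}
```

Calling `slice_head(output, n_embd, {start, end})` and passing the result to `project` yields `end - start` logits, so whatever consumes them has to agree on that count rather than on the size of the full matrix.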

## To-Do

* [x] converted model to GGUF
	* [ ] convert it without modifying any of the existing code, as the tokenizer requires some care
* [x] basic framework
	* [x] load the quantized model
	* [x] orchestrate the required embeddings
	* [x] juggle the output head / classifier properly
* [ ] phonemize text
	* with the help of espeak-ng
* [ ] tokenize phonemes
	* the tokenizer is being a huge thorn on actual sequences
* [x] load audio from disk
* [x] encode audio
* [x] sum embeddings for the `prom` and prior `resp`s
* [x] `AR` sampling
* [ ] `NAR-len` demasking sampling
* [x] `NAR` sampling
* [x] decode audio to disk
* [ ] a functional CLI
* [ ] actually make it work
	* it seems naively stitching the model together isn't good enough, since the output is wrong; it most likely needs training with a glued-together classifier
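
For the "sum embeddings for the `prom` and prior `resp`s" item above, the idea can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: each codebook level has its own embedding table, codes are stored as `codes[level][frame]`, and the input for a frame is the element-wise sum of the embedding rows from the first `n_levels` levels. `embedding_table` and `sum_embeddings` are hypothetical names, and how the real model picks which levels to sum is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical embedding table for one codebook level:
// row-major, one row of n_embd floats per code.
struct embedding_table {
    size_t n_embd;
    std::vector<float> weights;

    // Pointer to the embedding row of `token`.
    const float* row(size_t token) const {
        assert((token + 1) * n_embd <= weights.size());
        return weights.data() + token * n_embd;
    }
};

// codes[level][t] is the code at codebook `level` for frame `t`.
// Returns a row-major [n_frames x n_embd] buffer where each frame is the
// sum of that frame's embeddings over levels 0 .. n_levels - 1.
std::vector<float> sum_embeddings(const std::vector<embedding_table>& tables,
                                  const std::vector<std::vector<int32_t>>& codes,
                                  size_t n_levels) {
    assert(n_levels > 0 && n_levels <= tables.size() && n_levels <= codes.size());
    const size_t n_embd = tables[0].n_embd;
    const size_t n_frames = codes[0].size();

    std::vector<float> out(n_frames * n_embd, 0.0f);
    for (size_t level = 0; level < n_levels; ++level) {
        assert(tables[level].n_embd == n_embd);
        assert(codes[level].size() == n_frames);
        for (size_t t = 0; t < n_frames; ++t) {
            const float* e = tables[level].row(static_cast<size_t>(codes[level][t]));
            for (size_t i = 0; i < n_embd; ++i) {
                out[t * n_embd + i] += e[i];
            }
        }
    }
    return out;
}
```

Passing `n_levels = 1` gives the plain first-level embeddings; passing a larger value folds the later levels into the same per-frame vector.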