# TorToiSe TTS

An unofficial PyTorch re-implementation of [TorToiSe TTS](https://github.com/neonbjb/tortoise-tts/tree/98a891e66e7a1f11a830f31bd1ce06cc1f6a88af).

Almost all of the documentation and usage is carried over from my [VALL-E](https://github.com/e-c-k-e-r/vall-e) implementation. Documentation specific to this implementation is lacking, since I whipped it up over the course of two days using knowledge I hadn't touched in a year.

## Requirements

A working PyTorch environment.
+ `python3 -m venv venv && source ./venv/bin/activate` is sufficient.

## Install

Simply run `pip install git+https://git.ecker.tech/mrq/tortoise-tts@new` or `pip install git+https://github.com/e-c-k-e-r/tortoise-tts`.

## Usage

### Inferencing

To inference using the default settings: `python3 -m tortoise_tts --yaml="./data/config.yaml" "Read verse out loud for pleasure." "./path/to/a.wav"`

To inference using the included Web UI: `python3 -m tortoise_tts.webui --yaml="./data/config.yaml"`
+ Pass `--listen 0.0.0.0:7860` if you're accessing the Web UI from outside of `localhost` (or pass the host machine's local IP instead).
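
If you want to drive the command-line entry point from a script, the invocation above can be assembled as an argument list. This is only a sketch: `build_inference_command` is a hypothetical helper, not part of the package, and the parameter is named `wav_path` because this README does not spell out what role the `.wav` argument plays.

```python
import shlex


def build_inference_command(text, wav_path, yaml_path="./data/config.yaml", python="python3"):
    """Build the argv list for the CLI call shown in this README.

    Hypothetical helper: it only mirrors the documented command line
    and does not import or call the package itself.
    """
    return [
        python, "-m", "tortoise_tts",
        f"--yaml={yaml_path}",
        text,
        wav_path,
    ]


cmd = build_inference_command(
    "Read verse out loud for pleasure.",
    "./path/to/a.wav",
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains spaces or quotes.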

### Training / Finetuning

To set up training, copy the reference YAML from `./data/config.yaml` to any training directory of your choice (for example, `./training/` or `./training/lora-finetune/`).
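
The copy step can also be scripted. The sketch below uses only the standard library; `prepare_training_dir` is a hypothetical helper name, and keeping the copy's filename as `config.yaml` is an assumption made for illustration.

```python
import shutil
from pathlib import Path


def prepare_training_dir(reference_yaml, training_dir):
    """Copy the reference YAML into training_dir and return the new path.

    Hypothetical helper: creates the directory if needed and refuses
    to overwrite a config that is already there.
    """
    training_dir = Path(training_dir)
    training_dir.mkdir(parents=True, exist_ok=True)
    target = training_dir / "config.yaml"  # assumed filename for the copy
    if target.exists():
        raise FileExistsError(f"{target} already exists; refusing to overwrite it")
    shutil.copyfile(reference_yaml, target)
    return target


# Example, run from a checkout that contains ./data/config.yaml:
# prepare_training_dir("./data/config.yaml", "./training/lora-finetune/")
```

Refusing to overwrite keeps an edited training config from being clobbered by a second run.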

A pre-processed dataset is required. Refer to [the VALL-E implementation](https://github.com/e-c-k-e-r/vall-e#leverage-your-own-dataset) for more details.

To start the trainer, run `python3 -m tortoise_tts.train --yaml="./path/to/your/training/config.yaml"`.
+ While training, type `save` to save at any time, `quit` to save and quit, or `eval` to run evaluation / validation of the model.

To train a LoRA, uncomment the `loras` block in your training YAML.

To load an existing finetuned model, create a folder with the following structure and load its accompanying YAML:
```
./some/arbitrary/path/:
    ckpt:
        autoregressive:
            fp32.pth # finetuned weights
    config.yaml
```

For LoRAs, replace the above `fp32.pth` with `lora.pth`.
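
Staging such a folder can be scripted as well. This is a minimal sketch, assuming `config.yaml` sits at the top of the folder next to `ckpt/` and the weights go under `ckpt/autoregressive/`; `stage_finetuned_model` is a hypothetical helper, not something the package provides.

```python
import shutil
from pathlib import Path


def stage_finetuned_model(weights, config_yaml, root, lora=False):
    """Copy weights and YAML into the assumed folder layout.

    Hypothetical helper. Assumed layout:
        root/ckpt/autoregressive/fp32.pth (or lora.pth for LoRAs)
        root/config.yaml
    """
    root = Path(root)
    model_dir = root / "ckpt" / "autoregressive"
    model_dir.mkdir(parents=True, exist_ok=True)

    # LoRA weights are stored as lora.pth, full finetunes as fp32.pth.
    weights_target = model_dir / ("lora.pth" if lora else "fp32.pth")
    shutil.copyfile(weights, weights_target)

    config_target = root / "config.yaml"
    shutil.copyfile(config_yaml, config_target)
    return weights_target, config_target
```

The returned `config.yaml` path is the one to pass to `--yaml` when loading the model.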

## To-Do

- [X] Reimplement original inferencing through TorToiSe (as done with `api.py`)
  - [ ] Reimplement candidate selection with the CLVP
  - [ ] Reimplement redaction with the Wav2Vec2
- [X] Implement training support (without DLAS)
  - [X] Feature parity with the VALL-E training setup with preparing a dataset ahead of time
- [X] Automagic offloading to CPU for unused models (for training and inferencing)
- [X] Automagic handling of the original weights into compatible weights
- [ ] Reimplement added features from my original fork:
  - [ ] "Better" conditioning latents calculating
  - [x] Use of KV-cache for the AR
  - [x] Re-enable DDIM sampler
- [ ] Extend the original inference routine with additional features:
  - [ ] non-float32 / mixed precision for the entire stack
    - [x] BitsAndBytes support
      - Provided Linears technically aren't used because GPT2 uses Conv1D instead...
  - [x] LoRAs
  - [x] Web UI
    - [ ] Feature parity with [ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning)
      - Although I feel a lot of its features are the wrong way to go about it.
  - [ ] Additional samplers for the autoregressive model (such as mirostat / dynamic temperature)
  - [ ] Additional samplers for the diffusion model (beyond the already included DDIM)
  - [X] BigVGAN in place of the original vocoder
    - [X] HiFiGAN integration as well
  - [ ] XFormers / flash_attention_2 for the autoregressive model
    - Beyond HF's internal implementation of handling alternative attention
    - Both the AR and diffusion models also do their own attention...
- [ ] Saner way of loading finetuned models / LoRAs
- [ ] Some vector embedding store to find the "best" utterance to pick
- [ ] Documentation

## Why?

To correct the mess I made when I originally forked TorToiSe TTS with a bunch of slopcode, and the nightmare that ai-voice-cloning turned out to be.

Additional features can be added to the program through a framework of my own that I'm very familiar with.

## License

Unless otherwise credited/noted in this README or within the designated Python file, this repository is [licensed](LICENSE) under AGPLv3.