more fixes for local engine backend

This commit is contained in:
mrq 2024-12-12 14:38:42 -06:00
parent 6b237ae5e3
commit f41251f648
5 changed files with 24 additions and 13 deletions

View File

@ -4,7 +4,7 @@
# VALL'E
An unofficial PyTorch implementation of [VALL-E](https://vall-e-demo.ecker.tech/), utilizing the [EnCodec](https://github.com/facebookresearch/encodec) encoder/decoder.
An unofficial PyTorch implementation of [VALL-E](https://vall-e-demo.ecker.tech/) (last updated: `2024.12.11`), utilizing the [EnCodec](https://github.com/facebookresearch/encodec) encoder/decoder.
A demo is available on HuggingFace [here](https://huggingface.co/spaces/ecker/vall-e).

View File

@ -10,14 +10,22 @@ At the time, state-of-the-art neural-based TTS solutions were sparing. TorToiSe
Unlike the paper, this VALL-E aims to:
* be lightweight as possible, only requiring one model to load and use (and EnCodec/Vocos as an audio encoder/decoder).
+ Even the original VALL-E requires a separate AR and a NAR.
+ Even the original VALL-E requires two separate models (one for the coarse codes, and one for the fine codes; see the sketch after this list).
* keep training and finetuning (be it the base model or through LoRAs) accessible to anyone.
+ Bark made even providing additional voices to use needlessly complex.
+ Current SoTA such as F5-TTS supports finetuning, but seems to have a rather high barrier to doing so.
* provide decent zero-shot text-to-speech synthesis, both without requiring sampling adjustments and while still providing thorough sampler settings.
* provide additional, easy-to-use functionality that other solutions don't offer.
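For illustration, a minimal sketch of that coarse/fine split, with made-up shapes (EnCodec's actual level count and frame rate depend on the configured bandwidth):

```python
# illustrative only: EnCodec encodes audio into a grid of residual codes,
# and the original VALL-E dedicates one model to each region of that grid
import torch

n_levels, frames = 8, 75                      # e.g. 8 RVQ levels, ~1s of audio
codes = torch.randint(0, 1024, (n_levels, frames))

coarse = codes[:1]   # level 0: predicted autoregressively (the "AR")
fine   = codes[1:]   # levels 1..7: predicted non-autoregressively (the "NAR")
```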
However, at this point in time, the implementation is rather divorced from VALL-E and its derivative papers, but the core principle is still followed.
However, at this point in time, the implementation is *very* divorced from VALL-E and its derivative papers, but the core principle is still followed.
# Why *not* this VALL-E?
This VALL-E is still actively being iterated upon without any proper standards or procedures.
* While I try to maintain interop with previous versions, I can't guarantee it (for example, support for `ar+nar-retnet-8` was dropped due to shifting focus).
* I am *very* stubborn with/against some approaches, paradigms, and methodologies.
There are far better TTS solutions out there, such as [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct) and [F5-TTS](https://github.com/SWivid/F5-TTS). They're both easy to use and offer amazing results.
## Model Specifications
@ -82,10 +90,9 @@ The reference model (`ar+nar-llama-8`/`ar+nar-len-llama-8`):
* using a pure text vocab rather than IPA phonemes (as a transformer should be "smart" enough to map text tokens)
* mixing multiple speakers through summing input prompt embeddings
* I do not expect this to work, but you never know...
* [ ] objective metrics such as WER / SIM-O
* [ ] WER simply requires transcribing audio then computing word error rates through the transcriptions
* this does require subjugating an STT model though (like Whisper(X))
* [ ] SIM-O requires passing the raw waveform through a speaker-similarity model
* [x] objective metrics such as WER / SIM-O (see the sketch after this list)
* [x] WER simply requires transcribing the audio, then computing the word error rate from the transcriptions
* [x] SIM-O requires passing the raw waveform through a speaker-similarity model
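A rough sketch of both metrics, assuming `openai-whisper`, `jiwer`, and `resemblyzer` as stand-ins; the models this repo actually uses aren't shown in this diff:

```python
# a minimal sketch of both objective metrics; the library choices here are
# assumptions, not necessarily what this repo uses
import whisper                                        # pip install openai-whisper
from jiwer import wer                                 # pip install jiwer
from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer

def compute_wer(reference_text: str, generated_wav: str) -> float:
	# transcribe the synthesized audio, then score it against the input text
	transcription = whisper.load_model("base").transcribe(generated_wav)["text"]
	return wer(reference_text, transcription)

def compute_sim_o(generated_wav: str, target_wav: str) -> float:
	# embed both raw waveforms with a speaker encoder; the embeddings are
	# L2-normalized, so a dot product is their cosine similarity
	encoder = VoiceEncoder()
	a = encoder.embed_utterance(preprocess_wav(generated_wav))
	b = encoder.embed_utterance(preprocess_wav(target_wav))
	return float(a @ b)
```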
## "Postmortem"
@ -109,7 +116,7 @@ However, while this solution boasts being lightweight, there are some caveats fo
* VALL-E Continuous (prefixing with the input prompt) could also fix this, but technically makes it one-shot and not zero-shot
* multi-lingual support is a bit of an afterthought
* supported non-English languages exhibit the same confidence problem for some speakers, but exacerbated
* there seems to be a regression with an increase in the word error rate, although it might only be inherent to the `NAR-len`
* there's a regression in the `ar+nar-len-llama-8` model with a decrease in speaker similarity.
## Notices and Citations

View File

@ -279,7 +279,7 @@ def main():
	# pull from provided samples
	samples_dirs = {
		"librispeech": args.demo_dir / "librispeech",
		#"librispeech": args.demo_dir / "librispeech",
	}
	if (args.demo_dir / args.dataset_dir_name).exists():

View File

@ -67,10 +67,11 @@ class Engine():
		self.lr_scheduler = kwargs['lr_scheduler'] if 'lr_scheduler' in kwargs else None
		stats = kwargs.pop("stats", {})
		self.global_steps = stats.pop("global_step", 0)
		self.micro_steps = stats.pop("micro_step", 0)
		self.global_samples = stats.pop("global_samples", 0)
		self.tokens_processed = stats.pop("tokens_processed", 0)
		if stats is not None:
			self.global_steps = stats.pop("global_step", 0)
			self.micro_steps = stats.pop("micro_step", 0)
			self.global_samples = stats.pop("global_samples", 0)
			self.tokens_processed = stats.pop("tokens_processed", 0)
		self._frozen_params = set()
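The guard matters because `kwargs.pop("stats", {})` only falls back to `{}` when the key is absent; an explicit `stats=None` slips through, and the unguarded `.pop()` calls would raise `AttributeError`. A standalone sketch of the behavior (illustrative names, not the actual `Engine` constructor):

```python
# standalone repro of the behavior the guard restores; not the repo's code
def restore_counters(**kwargs):
	stats = kwargs.pop("stats", {})      # returns None if stats=None was passed
	global_step = 0
	if stats is not None:                # the new guard
		global_step = stats.pop("global_step", 0)
	return global_step

assert restore_counters() == 0
assert restore_counters(stats=None) == 0           # previously: AttributeError
assert restore_counters(stats={"global_step": 5}) == 5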

View File

@ -106,6 +106,9 @@ def _make_infinite_epochs(dl):
	total = dl.dataset.batches()
	manual_update = False
	if total == 0:
		raise Exception("Empty dataset")
	while True:
		if dl.dataset.index() == 0:
			_logger.info("New epoch starts.")
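This makes an empty dataset fail fast instead of letting the infinite-epoch generator spin without ever yielding a batch. A toy model of the guard (stand-in names, not the repo's actual dataset interface):

```python
# toy model of _make_infinite_epochs' fail-fast guard; stand-in names only
def make_infinite_epochs(batches):
	if len(batches) == 0:
		raise Exception("Empty dataset")  # same guard as the hunk above
	while True:
		yield from batches                # otherwise, cycle forever

epochs = make_infinite_epochs(["batch_0", "batch_1"])
print([next(epochs) for _ in range(5)])
# ['batch_0', 'batch_1', 'batch_0', 'batch_1', 'batch_0']
```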