From 9b0d2ccbe185fd1cbb608abc5e0695c5ed463ed0 Mon Sep 17 00:00:00 2001
From: mrq
Date: Thu, 26 Dec 2024 21:42:17 -0600
Subject: [PATCH] =?UTF-8?q?=C2=A0?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .gitignore                    |   4 +-
 docs/models.md                |  31 +-
 vall_e.cpp/Makefile           |  26 +-
 vall_e.cpp/README.md          |  17 +-
 vall_e.cpp/vall_e-impl.h      |  93 ++++++
 vall_e.cpp/vall_e.cpp         | 134 +++++---
 vall_e.cpp/vall_e.h           | 213 +++++++------
 vall_e/data.py                |  28 +-
 vall_e/export.py              | 148 +--------
 vall_e/models/__init__.py     |  50 +--
 vall_e/models/base.py         |  13 +-
 vall_e/models/experimental.py | 574 ----------------------------------
 vall_e/webui.py               |   2 +-
 13 files changed, 396 insertions(+), 937 deletions(-)
 create mode 100644 vall_e.cpp/vall_e-impl.h
 delete mode 100644 vall_e/models/experimental.py

diff --git a/.gitignore b/.gitignore
index b6b1a4a..d6c3ac7 100755
--- a/.gitignore
+++ b/.gitignore
@@ -10,6 +10,6 @@ __pycache__
 /.nltk
 /vall_e.cpp/data
 /vall_e.cpp/include
-/vall_e.cpp/libs
+/vall_e.cpp/lib
 /vall_e.cpp/*.o
-/vall_e.cpp/vall_e
\ No newline at end of file
+/vall_e.cpp/vall_e
diff --git a/docs/models.md b/docs/models.md
index 416d054..79ca223 100644
--- a/docs/models.md
+++ b/docs/models.md
@@ -121,6 +121,9 @@ With attention-based transformers, most embeddings can serve as a token itself a
 
 Other solutions such as TorToiSe makes use of additional embeddings/classifiers for each portion of the sequence as well.
 
+Other solutions rely on conditioning latents or extracted features as the input. This *technically* isn't necessary, since portions of the model seem to be allocated as an encoder anyway (from the embeddings to some arbitrary depth) and as a decoder (from some arbitrary depth to the output heads).
+* This might also mean it makes more sense to increase the model's size in-post by injecting new layers in the middle, outside these pseudo-encoder/decoder layers, where it won't make any difference.
+
 ### Classifiers
 
 Classifiers are the final output head / projection layer that processes the last hidden states of a model into a probability distribution for each token.
@@ -152,7 +155,7 @@ In reality, this seems to help govern the accent / general mannerisms associated
 * Consequently, since this does tie to accents more, ***extreme*** attention is to be paid to the dialects being trained against, instead of naively grouping, say, all of Spanish to one language code.
   * unfortunately, this does mean that audio annotated as English is dialect/accent-agnostic, per the dataset.
 
-This embedding probably helps the model with being able to perform cross-lingual outputs, but I did not do any experimentations on a model without this, as the reference `ar+nar-llama-8` was trained with this from the beginning with the small Japanese in my dataset anyhow (and maybe the `ar+nar-retnet-8` experiment).
+Some checkpoints of the model need this for cross-lingual output, but the current checkpoints don't seem to rely on it: due to a careless oversight, the attention heads derive the language/accent from the phoneme sequences themselves rather than from the language token.
 
 #### Tone Embedding
 
@@ -162,6 +165,8 @@ Should, since I do not actually make use of this anywhere, and the model is not
 
 This should most definitely help the model identify tone strongly even without needing to annotate for it, but it does an adequate job already with maintaining tone from a given input prompt.
 
+I imagine that, like language/accent, this gets derived from the phoneme sequence itself rather than from a guidance token.
+
 ### Audio Embeddings
 
 However, due to the nature of the encoded audio, embedding the audio tokens requires the dark arts, as we use audio both as an input prompt (`prom`) for guidance, and as an output response (`resp`).
@@ -230,12 +235,16 @@ In practice, this task is already implemented by providing the input audio to de
 I imagine training for this task will better help the model understand what is noise and what isn't, and can better strongly-er map utterances from the input audio prompt to use in the output, delivering better prompt adherance.
 * This also might help serve in helping the model identify effects applied to an utterance, and being able to maintain it in normal `tts` tasks, such as reverb or the audio quality itself (the "acoustic environment").
 
+This task can be briefly trained in-post for decent results.
+
 ##### Speech Removal
 
 This task `sr` aims to remove speech from a given audio, effectively serving as the reverse of denoising.
 
 As state above, this should help the model better identify what is noise and what isn't.
 
+This task can be briefly trained in-post for decent results.
+
 ##### Target Speech Extraction
 
 This task `tse` aims to "extract" an utterance from audio containing other speakers, effective diarizing an utterance.
@@ -258,6 +267,8 @@ The length predictor `len` task is required for a pure NAR model.
 
 This task will naively output a zero, then the length in base-10, followed by a stop token.
 
+This works because the model can already derive the length of a sequence when autoregressively decoding, through the probability of emitting a stop token.
+
 #### Speech-to-Text
 
 The speech-To-text `stt` task transcribes a given piece of audio, by taking an input encoded audio, and outputting the text transcription.
@@ -274,11 +285,13 @@ This task will follow a reverse sequence of `
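As an editorial aside on the `len` task documented above: the patch states it emits a zero, the length as base-10 digits, then a stop token. The sketch below shows how such a digit sequence could be folded back into an integer length. It is only an illustration, not the repository's implementation; `STOP_TOKEN` and `decode_len_output` are hypothetical names, and the token-id mapping is an assumption.

```python
# Minimal sketch of decoding a `len`-style output (hypothetical names; not the
# repository's actual API). The task naively emits a zero, the length as
# base-10 digit tokens, then a stop token.

STOP_TOKEN = 10  # assumption: digits map to token ids 0-9, stop gets its own id


def decode_len_output(tokens: list[int]) -> int:
    """Fold the emitted base-10 digit tokens into an integer length."""
    length = 0
    for token in tokens:
        if token == STOP_TOKEN:
            break
        # The naive leading zero contributes nothing to the base-10 fold.
        length = length * 10 + token
    return length


# Example: a sequence of [0, 4, 7, 3, stop] decodes to a length of 473 frames.
print(decode_len_output([0, 4, 7, 3, STOP_TOKEN]))  # 473
```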