exposed additional task (ns, sr, vc) (vc is experimental)
This commit is contained in:
parent
53230efd74
commit
59bf6b8b33
|
@ -31,14 +31,14 @@ There are far better TTS solutions out there, such as [MaskGCT](https://github.c
|
|||
|
||||
The reference model (`ar+nar-llama-8`/`ar+nar-len-llama-8`):
|
||||
* boasts 220M parameters
|
||||
* supports English, German, French, and Japanese
|
||||
* support for Korean and Chinese (Mandarin?) soon™
|
||||
* supports English, German, French, Japanese, Korean, and Chinese (Mandarin?)
|
||||
* has several modalities of inferencing:
|
||||
* the primary audio level (RVQ level 0) can be inferenced either autoregressively (`AR`) or non-autoregressively (`NAR-len`)
|
||||
* pure-NAR can yield faster-than-realtime output
|
||||
* supports predicting the duration of an input
|
||||
* supports Speech-to-Text (although it's a second-class feature)
|
||||
* additional tasks such as noise reduction, speech removal, editing, and voice conversion eventually™ (just need to train on it)
|
||||
* supports additional tasks such as speech removal, noise reduction, and voice conversion.
|
||||
* additional tasks such as speaker extraction and speech editing eventually™ (just need to train on it)
|
||||
* trained on `?` samples / `?` hours of EnCodec-quantized audio at 24KHz
|
||||
|
||||
## To-Do
|
||||
|
@ -79,6 +79,8 @@ The reference model (`ar+nar-llama-8`/`ar+nar-len-llama-8`):
|
|||
* [x] objective metrics such as WER / SIM-O
|
||||
* [x] WER simply requires transcribing audio then computing word error rates through the transcriptions
|
||||
* [x] SIM-O requires passing the raw waveform through a speaker-similarity model
|
||||
* [ ] valle.cpp through llama.cpp + encodec.cpp
|
||||
* the latter is easy, the former is not.
|
||||
|
||||
## "Postmortem"
|
||||
|
||||
|
@ -87,12 +89,11 @@ For the most part, the model is complete. With the `NAR-len` being crammed on, I
|
|||
However, while this solution boasts being lightweight, there are some caveats for its given size
|
||||
* it's at capacity on what it *can* do without additional tasks to augment it further
|
||||
* post-fixing it with additional layers glued on doesn't seem to offer very much improvement (12 => 16 layers)
|
||||
* wrangling it is a bit of a chore, as some voices work fine under the `AR` but not the `NAR-len`, and vice-versa
|
||||
* some voices outright refuse to work without LoRA training
|
||||
* some sampler settings work on some voices, but others need some tweaking
|
||||
* the only bet is to feed it more data and see how it fares, since the model is still grossly undertrained compared to the 50K+ hour behemoths.
|
||||
* subjugating an existing LLM architecture is a bit of a pain, as I would *love* to make full use of LLaMA niceties
|
||||
* `hf`-ifying it is possible, but it'd be a chore to set up the tokenizer properly
|
||||
* multi-lingual support is a bit of an afterthought
|
||||
* `hf`-ifying it is possible, but due to the nature of summed audio embeddings and split classifiers, it's not as plug-and-play as I would like for inferencing.
|
||||
* speaker similarity is rather mediocre for unseen speakers; the model isn't as robust at mapping unseen speakers to its latent space as it is for seen speakers.
|
||||
* despite being rather robust, some vocal stutters make their way in.
|
||||
|
||||
## Notices and Citations
|
||||
|
||||
|
|
12
docs/data.md
|
@ -23,17 +23,6 @@ These durations were reported from the training script directly.
|
|||
|
||||
If you already have a dataset you want, for example, your own large corpus or for finetuning, you can use your own dataset instead.
|
||||
|
||||
0. Set up a `venv` with `https://github.com/m-bain/whisperX/`.
|
||||
+ At the moment only WhisperX is utilized. Using other variants like `faster-whisper` is an exercise left to the user at the moment.
|
||||
+ It's recommended to use a dedicated virtualenv specifically for transcribing, as WhisperX will break a few dependencies.
|
||||
+ The following command should work:
|
||||
```
|
||||
python3 -m venv venv-whisper
|
||||
source ./venv-whisper/bin/activate
|
||||
pip3 install torch torchvision torchaudio
|
||||
pip3 install git+https://github.com/m-bain/whisperX/
|
||||
```
|
||||
|
||||
1. Populate your source voices under `./voices/{group name}/{speaker name}/`.
|
||||
|
||||
2. Run `python3 -m vall_e.emb.transcribe`. This will generate a transcription with timestamps for your dataset.
|
||||
|
@ -114,6 +103,7 @@ This section may be covered elsewhere in the documentation, but coverage here sh
|
|||
* the above, but injects some noise throughout the sampled utterances.
|
||||
|
||||
A mystical `vc` for performing voice conversion is possible, but it requires either a dataset to do so or abusing an emergent property.
|
||||
* This emergent property is mostly abused through the NAR-len's demasking routine.
|
||||
|
||||
## `__main__`
|
||||
|
||||
|
|
32
docs/emb.md
|
@ -63,40 +63,28 @@ For audio backends:
|
|||
|
||||
Descript-Audio-Codec was thoroughly tested for promising much, much cleaner output audio, as this model encodes/decodes at 44.1KHz, rather than EnCodec's 24KHz.
|
||||
|
||||
However, due to the nature of the codec, simply throwing it at an attention-based transformer proves to be painful, as a unified AR+NAR model *heavily* suffers from noisy output in the NAR.
|
||||
However, due to the nature of the codec, simply throwing it at an attention-based transformer proves to be painful, as the model *heavily* suffers from noisy output in the higher half of the RVQ levels.
|
||||
|
||||
Ironically, testing through erroneously encoded audio (feeding 24KHz audio without upsampling to 44.1KHz) proved to have "cleaner" but bad utterances.
|
||||
|
||||
I'm uncertain on how to remedy this, as my options are:
|
||||
* train under a RetNet, if an attention-based transformer is simply the problem
|
||||
* train an AR, and train a NAR, if the codec itself is at fault
|
||||
* use an SSM like Mamba, if transformers entirely cannot model the codec
|
||||
* train a separate model that simply converts from EnCodec to DAC
|
||||
* train *all* NAR levels as independent masking sequences.
|
||||
* train under a RetNet, if an attention-based transformer is simply the problem (it's not)
|
||||
* train an AR, and train a NAR, if the codec itself is at fault (it's probably something inherent to the codec)
|
||||
* use an SSM like Mamba, if transformers entirely cannot model the codec (Mamba is too much of a thorn to use)
|
||||
* train a separate model that simply converts from EnCodec to DAC (requires another model to juggle, but does not require training a new model)
|
||||
* train *all* NAR levels as independent masking sequences similar to the `NAR-len` (complicated)
|
||||
* if this works, then it means that there's little to no mappable relation between DAC's RVQ levels
|
||||
|
||||
## `transcribe.py`
|
||||
|
||||
This script primarily handles taking raw input audio, and outputting adequate metadata containing transcriptions of said audio through `whisperX`.
|
||||
This script primarily handles taking raw input audio, and outputting adequate metadata containing transcriptions of said audio through `whisper`.
|
||||
|
||||
The process keeps the slices `whisperX` thinks are best per the outputted segments, alongside the deduced language (if not specified).
|
||||
By default, `openai/whisper-large-v3` is used through HuggingFace's `pipeline` and everything is handled automatically. The process keeps the slices `whisper` thinks are best per the outputted segments, alongside the deduced language (if not specified).
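As a rough illustration of the default path described above, the sketch below calls the `transformers` ASR `pipeline` and flattens its timestamped chunks into segments. The function names `transcribe` and `segments_from_result` are hypothetical, and the output layout is an assumption for illustration; it is not the metadata format `transcribe.py` actually writes.

```python
def transcribe(path, model_name="openai/whisper-large-v3", language=None):
    """Transcribe one audio file with timestamps (hypothetical helper)."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_name, chunk_length_s=30)
    # Only force a language when one was requested; otherwise Whisper deduces it.
    generate_kwargs = {"language": language} if language else {}
    return asr(path, return_timestamps=True, generate_kwargs=generate_kwargs)


def segments_from_result(result):
    """Flatten the pipeline's `chunks` into start/end/text segments."""
    segments = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        segments.append({"start": start, "end": end, "text": chunk["text"].strip()})
    return segments
```

Note that the end timestamp of the final chunk may come back as `None`, so downstream slicing should account for that.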
|
||||
|
||||
One limiting factor is that transcription transcribes into normal text, rather than the IPA phonemes the model was trained against. Some flavors *may* exist, but I have yet to test them extensively (if I did ever find one).
|
||||
|
||||
Refer to the `__main__`'s arguments for usage details.
|
||||
|
||||
### Metrics
|
||||
|
||||
This script also handles calculating `WER` simply by transcribing the given audio file (and reference, if requested), then comparing the word error rate.
|
||||
|
||||
This process *heavily* relies on text normalization, which currently is lacking, but transcribing the reference should keep things "normalized" per the transcriber.
|
||||
|
||||
### ROCm
|
||||
|
||||
Because life is pain, ROCm requires additional steps to ensure that `whisperX` works. A special fork of `CTranslate2` is required, but simply following [these](https://github.com/arlo-phoenix/CTranslate2-rocm/blob/rocm/README_ROCM.md) steps should fix things.
|
||||
|
||||
In the future, I would love to replace WhisperX with something simple.
|
||||
|
||||
## `process.py`
|
||||
|
||||
This script handles taking raw input audio and its transcribed metadata, and outputs encoded audio (NumPy) files containing encoded audio and associated metadata.
|
||||
|
@ -120,7 +108,3 @@ When processing a dataset, this requires already having accompanying metadata ge
|
|||
Be *very* careful if you opt to output unsegmented and segmented utterances, as the sliced version may end up amongst the top-K similar candidates.
|
||||
|
||||
Refer to the `__main__`'s arguments for usage details.
|
||||
|
||||
### Metrics
|
||||
|
||||
This script also handles calculating `SIM-O` per [keonlee9420/evaluate-zero-shot-tts](https://github.com/keonlee9420/evaluate-zero-shot-tts/blob/master/src/evaluate_zero_shot_tts/utils/speaker_verification/verification.py), by making use of a model to create an embedding of a speaker, then computing cosine similarities on those embeddings.
|
|
@ -14,7 +14,7 @@ This script handles the bulk of loading a model and wrapping the model with the
|
|||
|
||||
The checkpoint or weight path is automatically deduced, as well as pre-processing the state dict (if requested) before loading it.
|
||||
* resizing modules from the weights to the requested configuration in the YAML is done here.
|
||||
* replacing modules with optimized versions or LoRAs is done here.
|
||||
* replacing modules with quantized versions or LoRAs is done here.
|
||||
* the requested optimizer, and params to freeze, for a model are applied here.
|
||||
|
||||
## `base.py`
|
||||
|
|
|
@ -1,13 +1,21 @@
|
|||
# `metrics.py`
|
||||
|
||||
This file provides helper functions for computing objective metrics, such as word-error rate (WER), character-error rate (CER), and speaker similarity (SIM-O).
|
||||
This file provides helper functions for computing objective metrics, such as word-error rate (WER), character-error rate (CER), phoneme-error rate (PER), and speaker similarity (SIM-O).
|
||||
|
||||
## WER / CER
|
||||
|
||||
Word-error rate (WER) is simply computed by transcribing the requested input, and comparing its transcription against the target transcription.
|
||||
* The transcription is cleaned up and normalized to account for inconsistencies between `openai/whisper-large-v3` transcriptions, with the nuances of English in mind.
|
||||
* Languages without spaces between words (Chinese, Japanese) should not rely on this, and instead rely on the CER.
|
||||
|
||||
Because of issues with normalization (and not having a robust normalization stack), both transcriptions are then phonemized, then the resultant phonemes are used for error rate calculations.
|
||||
Character-error rate (CER) does the same thing as WER, but on a character basis rather than a word basis.
|
||||
|
||||
Phoneme-error rate (PER) does the same thing as CER, but on the phonemized transcription instead. As this is a speech model, this metric is more correct than the prior metrics, but this isn't a universal metric for comparison, as most models don't report this.
|
||||
|
||||
All rates are un-normalized because I think that's the right way to go about it? Papers aren't clear that they do this, but the error rates are even more unusually low without this.
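A minimal sketch of the error-rate family described above, assuming the conventional definition (edit distance divided by the reference length). The function names are hypothetical and this is not the code in `metrics.py`; it skips text normalization and phonemization entirely, and PER would simply be the same computation over phoneme sequences.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edits needed to turn `hyp` into `ref`, relative to the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word-error rate: compares whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character-error rate: compares individual characters."""
    return error_rate(list(reference), list(hypothesis))
```

For languages without spaces between words, `wer` degenerates to comparing whole sentences, which is why the CER is the better fit there.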
|
||||
|
||||
## SIM-O
|
||||
|
||||
Speaker similarity (SIM-O) is computed by obtaining the embedding of each speaker (the output audio and the input prompt), and computing the cosine similarity between those two embeddings.
|
||||
|
||||
These embeddings are obtained through a finetune of WavLM-large geared towards speaker verification.
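The comparison step can be sketched in a few lines once the two embeddings exist. The function name `cosine_similarity` is hypothetical, and plain Python lists stand in for the embedding vectors here; obtaining the embeddings from the WavLM-large finetune is out of scope for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

A value near 1 means the output audio and the input prompt map to nearly the same direction in the speaker-embedding space.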
|
|
@ -87,6 +87,7 @@ The NAR-len model keeps things simple by:
|
|||
* it could be in any base, but it's simple to just treat each token ID as a digit, then cast the string to an int.
|
||||
* this also doesn't strictly need to rely on an AR sequence to predict.
|
||||
* some checkpoints of the model seem to adhere well to outputting silence at the end if the requested duration exceeds the actual duration.
|
||||
* this seems to only happen for models that erroneously causally attend to tokens in the `NAR-len`.
|
||||
* inferencing is a simple loop that takes the best k masked-off tokens per step, and remasks the remaining.
|
||||
|
||||
Because the model already leverages the magic of attention to derive phoneme-alignment, such annotations are still not required (but they probably help with a naive sampler).
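The demasking loop mentioned above can be sketched as follows. This is a toy under stated assumptions: `score_fn` is a hypothetical stand-in for the model that returns a `(token, confidence)` pair per position, and the linear unmasking schedule is chosen for simplicity rather than taken from the actual implementation.

```python
import math

MASK = -1  # hypothetical sentinel for a masked-off position


def demask(length, score_fn, steps=4):
    """Iteratively fill a fully masked sequence, most confident tokens first."""
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        predictions = score_fn(tokens)
        # Spread the remaining masked positions evenly over the remaining steps.
        k = math.ceil(len(masked) / (steps - step))
        # Keep the k most confident predictions; everything else stays masked.
        best = sorted(masked, key=lambda i: predictions[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = predictions[i][0]
    return tokens
```

Leaving the lower-confidence positions masked is equivalent to remasking them, and the final step always unmasks whatever is left.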
|
||||
|
|
|
@ -10,12 +10,25 @@ A Gradio-based web UI is accessible by running `python3 -m vall_e.webui`. You ca
|
|||
|
||||
Synthesizing speech is simple:
|
||||
|
||||
* `Input Prompt`: The guiding text prompt. Each new line will be its own generated audio to be stitched together at the end.
|
||||
* `Text`:
|
||||
* `Input Prompt`: The guiding text prompt. Each segment will be its own generated audio to be stitched together at the end.
|
||||
* `Audio`:
|
||||
* `Audio Input`: The transcription of the audio will be inserted into the `Text/Input Prompt` box.
|
||||
* For the `vc` task, this will serve as the guidance reference audio as well.
|
||||
|
||||
* `Audio Input`: The reference audio for the synthesis. Under Gradio, you can trim your clip accordingly, but leaving it as-is works fine.
|
||||
- A properly trained model can inference without a prompt to generate a random voice (without even needing to generate a random prompt itself).
|
||||
* `Output`: The resultant audio.
|
||||
* `Inference`: Button to start generating the audio.
|
||||
* `Basic Settings`: Basic sampler settings for most uses.
|
||||
* `Max Steps`: Number of demasking steps to perform for RVQ level 0. For the `NAR-len` modality.
|
||||
* `Max Duration`: Maximum duration the output audio will be.
|
||||
* `Input Prompt Repeat/Trim Length`: The audio prompt will be this duration length, as it will either be trimmed down or repeated (although repeating might cause more harm).
|
||||
* `Language (Text)`: The language of the input text for phonemizing.
|
||||
* `Language (Output)`: The target language for the output audio. Some checkpoints of the model might ignore this due to how it was trained, unfortunately. Some models might steer the output accent.
|
||||
* `Task`: The task to perform (in order): Text-To-Speech, Speech Removal, Noise Reduction, Voice Conversion.
|
||||
* `Text Delimiter`: How to split the `Text/Input Prompt`. `sentences` splits by sentence, while `lines` splits by new line.
|
||||
* `(Rolling) Context History`: Paired with the above, the previous N utterances will serve as the prefix to extend the generation on, allowing for consistency and stability across pieces.
|
||||
* `Sampler Settings`: Advanced sampler settings that are common for most text LLMs, but need experimentation.
|
||||
* `Experimental Settings`: Settings used for testing. `cfg.experimental=True` enables this tab.
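To illustrate how `Text Delimiter` and `(Rolling) Context History` interact, here is a small text-only sketch. The helper names and the sentence-splitting regex are assumptions for illustration; the web UI's actual splitting logic and the way prior utterances are fed back as a prefix may differ.

```python
import re


def split_text(text, split_by="sentences"):
    """Split the input prompt into utterances by sentence or by line."""
    if split_by == "lines":
        parts = text.split("\n")
    else:
        # Naive sentence split: break on whitespace following ., ! or ?
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def with_context(segments, history=0):
    """Pair each segment with up to `history` preceding segments as its prefix."""
    paired = []
    for i, segment in enumerate(segments):
        prefix = segments[max(0, i - history):i] if history > 0 else []
        paired.append((prefix, segment))
    return paired
```

With a history of 0 every utterance is generated on its own; larger values trade speed for consistency across pieces.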
|
||||
|
||||
|
|
|
@ -361,8 +361,14 @@ class TTS():
|
|||
use_lora = sampling_kwargs.pop("use_lora", None)
|
||||
dtype = sampling_kwargs.pop("dtype", self.dtype)
|
||||
amp = sampling_kwargs.pop("amp", self.amp)
|
||||
duration_padding = sampling_kwargs.pop("duration_padding", 1.05)
|
||||
|
||||
voice_convert = sampling_kwargs.pop("voice_convert", None)
|
||||
# explicitly require this
|
||||
if task != "vc":
    voice_convert = None
elif voice_convert is None:
    raise Exception("Voice conversion requested, but no reference clip provided.")
|
||||
|
||||
# transcribe from audio to voice convert from
|
||||
if voice_convert is not None and not text:
|
||||
|
@ -425,6 +431,20 @@ class TTS():
|
|||
|
||||
auto_lang = not language or language == "auto"
|
||||
auto_text_lang = not text_language or text_language == "auto"
|
||||
|
||||
vc_utterance = self.encode_audio( voice_convert, trim_length=0 ) if voice_convert else None
|
||||
prom = self.encode_audio( references, trim_length=input_prompt_length ) if references else None
|
||||
lang = self.encode_lang( language )
|
||||
|
||||
# wrap the prompt with the task token for noise suppression / speech removal
if task in ["ns", "sr"]:
    prom = [
        task,
        prom
    ]
|
||||
|
||||
prom = to_device(prom, device=self.device, dtype=torch.int16)
|
||||
lang = to_device(lang, device=self.device, dtype=torch.uint8)
|
||||
|
||||
for line in lines:
|
||||
if out_path is None:
|
||||
output_dir = Path("./data/results/")
|
||||
|
@ -440,14 +460,8 @@ class TTS():
|
|||
if auto_text_lang:
|
||||
text_language = deduced_language
|
||||
|
||||
vc_utterance = self.encode_audio( voice_convert, trim_length=0 ) if voice_convert else None
|
||||
prom = self.encode_audio( references, trim_length=input_prompt_length ) if references else None
|
||||
phns = self.encode_text( line, language=text_language )
|
||||
lang = self.encode_lang( language )
|
||||
|
||||
prom = to_device(prom, device=self.device, dtype=torch.int16)
|
||||
phns = to_device(phns, device=self.device, dtype=torch.uint8 if len(self.symmap) < 256 else torch.int16)
|
||||
lang = to_device(lang, device=self.device, dtype=torch.uint8)
|
||||
|
||||
with torch.autocast(self.device, dtype=dtype, enabled=amp):
|
||||
input_kwargs = dict(
|
||||
|
@ -458,8 +472,12 @@ class TTS():
|
|||
use_lora=use_lora,
|
||||
)
|
||||
if model_len is not None:
|
||||
# extra kwargs
|
||||
duration_padding = sampling_kwargs.pop("duration_padding", 1.05)
|
||||
# skip calculating len_list if possible
|
||||
if task in ["ns", "sr"]:
    len_list = [ prom[1].shape[0] ]
elif vc_utterance is not None:
    len_list = [ vc_utterance.shape[0] ]
else:
    len_list = model_len( **input_kwargs, task_list=["len"], **{"max_duration": 5} ) # "max_duration" is max tokens
|
||||
|
||||
# add an additional X seconds
|
||||
|
|
|
@ -535,11 +535,11 @@ class Base(nn.Module):
|
|||
else:
|
||||
self.proms_emb = AudioEmbedding(
|
||||
[n_audio_tokens] * self.n_resp_levels, d_model,
|
||||
sums=audio_embedding_sums,
|
||||
sums=audio_embedding_sums == "prom" or audio_embedding_sums == True,
|
||||
)
|
||||
self.resps_emb = AudioEmbedding(
|
||||
l_tokens, d_model,
|
||||
sums=audio_embedding_sums,
|
||||
sums=audio_embedding_sums == "resp" or audio_embedding_sums == True,
|
||||
l_names=resp_l_names,
|
||||
)
|
||||
|
||||
|
|
|
@ -127,6 +127,9 @@ def get_speakers():
|
|||
def get_languages():
|
||||
return list(get_lang_symmap().keys()) + ["auto"]
|
||||
|
||||
def get_tasks():
|
||||
return ["tts", "sr", "ns", "vc"]
|
||||
|
||||
#@gradio_wrapper(inputs=layout["dataset"]["inputs"].keys())
|
||||
def load_sample( speaker ):
|
||||
metadata_path = cfg.metadata_dir / f'{speaker}.json'
|
||||
|
@ -208,7 +211,7 @@ def do_inference_tts( progress=gr.Progress(track_tqdm=True), *args, **kwargs ):
|
|||
parser = argparse.ArgumentParser(allow_abbrev=False, add_help=False)
|
||||
# I'm very sure I can procedurally generate this list
|
||||
parser.add_argument("--text", type=str, default=kwargs["text"])
|
||||
parser.add_argument("--task", type=str, default="tts")
|
||||
parser.add_argument("--task", type=str, default=kwargs["task"])
|
||||
parser.add_argument("--modality", type=str, default=kwargs["modality"])
|
||||
parser.add_argument("--references", type=str, default=kwargs["reference"])
|
||||
parser.add_argument("--voice-convert", type=str, default=kwargs["voice-convert"])
|
||||
|
@ -336,7 +339,7 @@ def do_inference_stt( progress=gr.Progress(track_tqdm=True), *args, **kwargs ):
|
|||
|
||||
parser = argparse.ArgumentParser(allow_abbrev=False, add_help=False)
|
||||
# I'm very sure I can procedurally generate this list
|
||||
parser.add_argument("--task", type=str, default="tts")
|
||||
parser.add_argument("--task", type=str, default="stt")
|
||||
parser.add_argument("--references", type=str, default=kwargs["reference"])
|
||||
parser.add_argument("--max-duration", type=int, default=0)
|
||||
parser.add_argument("--language", type=str, default=kwargs["language"])
|
||||
|
@ -460,6 +463,7 @@ with ui:
|
|||
with gr.Row():
|
||||
layout["inference_tts"]["inputs"]["text-language"] = gr.Dropdown(choices=get_languages(), label="Language (Text)", value="auto", info="Language the input text is in.")
|
||||
layout["inference_tts"]["inputs"]["language"] = gr.Dropdown(choices=get_languages(), label="Language (Output)", value="auto", info="Target language/accent to output.")
|
||||
layout["inference_tts"]["inputs"]["task"] = gr.Dropdown(choices=get_tasks(), label="Task", value="tts", info="")
|
||||
with gr.Row():
|
||||
layout["inference_tts"]["inputs"]["split-text-by"] = gr.Dropdown(choices=["sentences", "lines"], label="Text Delimiter", info="How to split the text into utterances.", value="sentences")
|
||||
layout["inference_tts"]["inputs"]["context-history"] = gr.Slider(value=0, minimum=0, maximum=4, step=1, label="(Rolling) Context History", info="How many prior lines to serve as the context/prefix (0 to disable).")
|
||||
|
|