exposed additional tasks (ns, sr, vc) (vc is experimental)

2024-12-20 11:15:29 -06:00 · 59bf6b8b33
commit 59bf6b8b33
parent 53230efd74
10 changed files with 83 additions and 64 deletions
--- a/docs/README.md
+++ b/docs/README.md
@ -31,14 +31,14 @@ There are far better TTS solutions out there, such as [MaskGCT](https://github.c

 The reference model (`ar+nar-llama-8`/`ar+nar-len-llama-8`):
 * boasts 220M parameters
-* supports English, German, French, and Japanese
-  * support for Korean and Chinese (Mandarin?) soon™
+* supports English, German, French, Japanese, Korean, and Chinese (Mandarin?)
 * has several modalities of inferencing:
  * the primary audio level (RVQ level 0) can be inferenced both autoregressively (`AR`) or non-autoregressively (`NAR-len`)
    * pure-NAR can yield faster-than-realtime output
  * supports predicting the duration of an input
  * supports Speech-to-Text (although it's a second-class feature)
-  * additional tasks such as noise reduction, speech removal, editing, and voice conversion eventually™ (just need to train on it)
+  * supports additional tasks such as speech removal, noise reduction, and voice conversion.
+    * additional tasks such as speaker extraction and speech editing eventually™ (just need to train on it)
 * trained on `?` samples / `?` hours of EnCodec-quantized audio at 24KHz

 ## To-Do
@ -79,6 +79,8 @@ The reference model (`ar+nar-llama-8`/`ar+nar-len-llama-8`):
 * [x] objective metrics such as WER / SIM-O
  * [x] WER simply requires transcribing audio then computing word error rates through the transcriptions
  * [x] SIM-O requires passing the raw waveform through a speaker-similarity model
+* [ ] valle.cpp through llama.cpp + encodec.cpp
+  * the latter is easy, the former is not.

 ## "Postmortem"

@ -87,12 +89,11 @@ For the most part, the model is complete. With the `NAR-len` being crammed on, I
 However, while this solution boasts being lightweight, there are some caveats for its given size
 * its at capacity on what it *can* do without additional tasks to augment it further
  * post-fixing it with additional layers glued on doesn't seem to offer very much improvement (12 => 16 layers)
-* wrangling it is a bit of a chore, as some voices work fine under the `AR` but not the `NAR-len`, and vice-versa
-  * some voices outright refuse to work without LoRA training
-  * some sampler settings works on some voices, but others need some tweaking
+  * the only bet is to feed it more data and see how it fares, since the model is still grossly undertrained compared to the 50K+ hour behemoths.
 * subjugating an existing LLM architecture is a bit of a pain, as I would *love* to make full use of LLaMA niceties
-  * `hf`-ifying it is possible, but it'd be a chore to set up the tokenizer properly
-* multi-lingual support is a bit of an afterthought
+  * `hf`-ifying it is possible, but due to the nature of summed audio embeddings and split classifiers, it's not as plug-and-play as I would like for inferencing.
+* speaker similarity is rather mediocre for unseen speakers, as the model isn't as robust at mapping them to its latent space as it is for seen speakers.
+* despite the model being rather robust, some vocal stutters make their way in.

 ## Notices and Citations

--- a/docs/data.md
+++ b/docs/data.md
@ -23,17 +23,6 @@ These durations were reported from the training script directly.

 If you already have a dataset you want, for example, your own large corpus or for finetuning, you can use your own dataset instead.

-0. Set up a `venv` with `https://github.com/m-bain/whisperX/`.
-  + At the moment only WhisperX is utilized. Using other variants like `faster-whisper` is an exercise left to the user at the moment.
-  + It's recommended to use a dedicated virtualenv specifically for transcribing, as WhisperX will break a few dependencies.
-  + The following command should work:
-  ```
-  python3 -m venv venv-whisper
-  source ./venv-whisper/bin/activate
-  pip3 install torch torchvision torchaudio
-  pip3 install git+https://github.com/m-bain/whisperX/
-  ```
-
 1. Populate your source voices under `./voices/{group name}/{speaker name}/`.

 2. Run `python3 -m vall_e.emb.transcribe`. This will generate a transcription with timestamps for your dataset.
@ -114,6 +103,7 @@ This section may be covered elsewhere in the documentation, but coverage here sh
 	* the above, but injects some noise throughout the sampled utterances.

 A mystical `vc` for performing voice conversion is possible, but either requires a dataset to do so, or abusing an emergent property.
+* This emergent property is mostly abused through the NAR-len's demasking routine.

 ## `__main__`

--- a/docs/emb.md
+++ b/docs/emb.md
@ -63,40 +63,28 @@ For audio backends:

 Descript-Audio-Codec was thoroughly tested for promising much, much cleaner output audio, as this model encodes/decodes at 44.1KHz, rather than EnCodec's 24KHz.

-However, due to the nature of the codec, simply throwing it at an attention-based transformer proves to be painful, as a unified AR+NAR model *heavily* suffers from noisy output in the NAR.
+However, due to the nature of the codec, simply throwing it at an attention-based transformer proves to be painful, as the model *heavily* suffers from noisy output in the higher half of the RVQ levels.

 Ironically, testing through erroneously encoded audio (feeding 24KHz audio without upsampling to 44.1KHz) proved to have "cleaner" but bad utterances.

 I'm uncertain on how to remedy this, as my options are:
-* train under a RetNet, if an attention-based transformer is simply the problem
-* train an AR, and train a NAR, if the codec itself is at fault
-* use an SSM like Mamba, if transformers entirely cannot model the codec
-* train a separate model that simply converts from EnCodec to DAC
-* train *all* NAR levels as independent masking sequences.
+* train under a RetNet, if an attention-based transformer is simply the problem (it's not)
+* train an AR, and train a NAR, if the codec itself is at fault (it's probably something inherent to the codec)
+* use an SSM like Mamba, if transformers entirely cannot model the codec (Mamba is too much of a thorn to use)
+* train a separate model that simply converts from EnCodec to DAC (requires another model to juggle, but does not require training a new model)
+* train *all* NAR levels as independent masking sequences similar to the `NAR-len` (complicated)
  * if this works, then it means that there's little to no mappable relation between DAC's RVQ levels

 ## `transcribe.py`

-This script primarily handles taking raw input audio, and outputting adequate metadata containing transcriptions of said audio through `whisperX`.
+This script primarily handles taking raw input audio, and outputting adequate metadata containing transcriptions of said audio through `whisper`.

-The process maintains slices `whisperX` thinks its best per the segments outputted, alongside the deduced language (if not specified).
+By default, `openai/whisper-large-v3` is used through HuggingFace's `pipeline` and everything is handled automatically. The process maintains the slices `whisper` thinks are best per the segments outputted, alongside the deduced language (if not specified).

 One limiting factor is that transcription transcribes into normal text, rather than the IPA phonemes the model was trained against. Some flavors *may* exist, but I have yet to test them extensively (if I did ever find one).

 Refer to the `__main__`'s arguments for usage details.

-### Metrics
-
-This script also handles calculating `WER` simply by transcribing the given audio file (and reference, if requested), then comparing the word error rate.
-
-This process *heavily* relies on text normalization, which currently is lacking, but transcribing the reference should keep things "normalized" per the transcriber.
-
-### ROCm
-
-Because life is pain, ROCm requires additional steps to ensure that `whisperX` works. A special fork of `CTranslate2` is required, but simplying following [these](https://github.com/arlo-phoenix/CTranslate2-rocm/blob/rocm/README_ROCM.md) steps should fix things.
-
-In the future, I would love to replace WhisperX for something simple.
-
 ## `process.py`

 This script handles taking raw input audio and its transcribed metadata, and outputs encoded audio (NumPy) files containing encoded audio and associated metadata.
@ -120,7 +108,3 @@ When processing a dataset, this requires already having accompanying metadata ge
 Be *very* careful if you opt to output unsegmented and segmented utterances, as the sliced version may end up amongst the top-K similar candidates.

 Refer to the `__main__`'s arguments for usage details.
-
-### Metrics
-
-This script also handles calculating `SIM-O` per [keonlee9420/evaluate-zero-shot-tts](https://github.com/keonlee9420/evaluate-zero-shot-tts/blob/master/src/evaluate_zero_shot_tts/utils/speaker_verification/verification.py), by making use of a model to create an embedding of a speaker, then computing cosine similarities on those embeddings.
--- a/docs/engines.md
+++ b/docs/engines.md
@ -14,7 +14,7 @@ This script handles the bulk of loading a model and wrapping the model with the

 The checkpoint or weight path is automatically deduced, as well as pre-processing the state dict (if requested) before loading it.
 * resizing modules from the weights to the requested configuration in the YAML is done here.
-* replacing modules with optimized versions or LoRAs are applied here.
+* replacing modules with quantized versions, or applying LoRAs, is done here.
 * the requested optimizer, and params to freeze, for a model is applied here.

 ## `base.py`
--- a/docs/metrics.md
+++ b/docs/metrics.md
@ -1,13 +1,21 @@
 # `metrics.py`

-This file provides helper functions for computing objective metrics, such as word-error rate (WER), character-error rate (CER), and speaker similarity (SIM-O).
+This file provides helper functions for computing objective metrics, such as word-error rate (WER), character-error rate (CER), phoneme-error rate (PER), and speaker similarity (SIM-O).

 ## WER / CER

 Word-error rate (WER) is simply computed by transcribing the requested input, and comparing its transcription against the target transcription.
+* The transcription is cleaned up and normalized to account for inconsistencies in the transcriptions from `openai/whisper-large-v3` and for the nuances of English.
+* Languages without spaces between words (Chinese, Japanese) should not rely on this, and instead rely on the CER.

-Because of issues with normalization (and not having a robust normalization stack), both transcriptions are then phonemized, then the resultant phonemes are used for error rate calculations.
+Character-error rate (CER) does the same thing as WER, but on a character basis rather than a word basis.

+Phoneme-error rate (PER) does the same thing as CER, but on the phonemized transcription instead. As this is a speech model, this metric is more meaningful than the prior metrics, but it isn't a universal metric for comparison, as most models don't report it.

+All rates are un-normalized because I think that's the right way to go about it? Papers aren't clear that they do this, but the error rates are even more unusually low without this.

 ## SIM-O
+
+Speaker similarity (SIM-O) is computed by obtaining the embedding of each speaker (the output audio and the input prompt), and computing the cosine similarity between those two embeddings.
+
+These embeddings are obtained through a finetune of WavLM-large geared towards speaker verification.
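
As a rough illustration of the metrics described in `docs/metrics.md` above, here is a minimal sketch, not the code in `metrics.py`. It assumes the conventional definition of an error rate (edit distance divided by the reference length) and treats SIM-O as the cosine similarity between two embedding vectors. The helper names (`edit_distance`, `error_rate`, `cosine_similarity`) are hypothetical, and plain lists of floats stand in for the embeddings a speaker-verification model would produce.

```python
from math import sqrt

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words, characters, or phonemes)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance divided by the reference length (an assumed, conventional definition)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

reference = "the quick brown fox"
hypothesis = "the quick brown box"
wer = error_rate(reference.split(), hypothesis.split())  # word basis
cer = error_rate(list(reference), list(hypothesis))      # character basis
```

The same `error_rate` helper would apply to PER by passing phoneme sequences instead of words or characters.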
--- a/docs/models.md
+++ b/docs/models.md
@ -87,6 +87,7 @@ The NAR-len model keeps things simple by:
    * it could be in any base, but it's simple to just treat each token ID as a digit, then cast the string to an int.
    * this could literally also not be relying on an AR sequence to predict.
  * some checkpoints of the model seems to adhere well to outputting silence at the end if the requested duration exceeds the actual duration.
+    * this seems to only happen for models that erroneously causally attend to tokens in the `NAR-len`.
 * inferencing is a simple loop that simply takes the best masked-off k tokens per step, and remasks the remaining.

 Because the model already leverages the magic of attention to derive phoneme-alignment, such annotations are still not required (but they probably help with a naive sampler).
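
The demasking loop described for the `NAR-len` in `docs/models.md` above can be sketched roughly as follows. This is a toy illustration under assumptions, not the model's sampler: `predict` is a hypothetical stand-in for the model that returns a `(token, confidence)` pair for every position, and the linear unmasking schedule is an assumption made for the example.

```python
import random

MASK = None  # placeholder for a masked-off position

def demask(length, predict, steps):
    """Iteratively keep the most confident masked predictions and leave the rest masked."""
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        # how many positions should still be masked after this step (assumed linear schedule)
        remaining = length * (steps - step - 1) // steps
        k = max(1, len(masked) - remaining)
        guesses = predict(tokens)
        # take the k masked positions with the highest confidence
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = guesses[i][0]
    return tokens

_rng = random.Random(0)

def toy_predict(tokens):
    """Hypothetical model stand-in: a random token and confidence per position."""
    return [(_rng.randrange(1024), _rng.random()) for _ in tokens]

out = demask(length=16, predict=toy_predict, steps=4)
```

With 16 positions and 4 steps, this schedule fills 4 positions per step; the final step always fills whatever is still masked.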
--- a/docs/webui.md
+++ b/docs/webui.md
@ -10,12 +10,25 @@ A Gradio-based web UI is accessible by running `python3 -m vall_e.webui`. You ca

 Synthesizing speech is simple:

-* `Input Prompt`: The guiding text prompt. Each new line will be its own generated audio to be stitched together at the end.
+* `Text`:
+  * `Input Prompt`: The guiding text prompt. Each segment will be its own generated audio to be stitched together at the end.
+* `Audio`:
+  * `Audio Input`: The transcription of the audio will be inserted into the `Text/Input Prompt` box.
+    * For the `vc` task, this will serve as the guidance reference audio as well.
+
 * `Audio Input`: The reference audio for the synthesis. Under Gradio, you can trim your clip accordingly, but leaving it as-is works fine.
  - A properly trained model can inference without a prompt to generate a random voice (without even needing to generate a random prompt itself).
 * `Output`: The resultant audio.
 * `Inference`: Button to start generating the audio.
 * `Basic Settings`: Basic sampler settings for most uses.
+  * `Max Steps`: Number of demasking steps to perform for RVQ level 0. For the `NAR-len` modality.
+  * `Max Duration`: Maximum duration the output audio will be.
+  * `Input Prompt Repeat/Trim Length`: The audio prompt will be trimmed down or repeated to this duration (although repeating might cause more harm than good).
+  * `Language (Text)`: The language of the input text for phonemizing.
+  * `Language (Output)`: The target language for the output audio. Some checkpoints of the model might ignore this due to how it was trained, unfortunately. Some models might steer the output accent.
+  * `Task`: The task to perform (in order): Text-To-Speech, Speech Removal, Noise Reduction, Voice Conversion.
+  * `Text Delimiter`: How to split the `Text/Input Prompt`. `sentences` will split by sentence, while `lines` will split on new lines.
+  * `(Rolling) Context History`: Paired with the above, the previous N utterances will serve as the prefix to extend the generation on, allowing for consistency and stability across pieces.
 * `Sampler Settings`: Advanced sampler settings that are common for most text LLMs, but needs experimentation.
 * `Experimental Settings`: Settings used for testing. `cfg.experimental=True` enables this tab.

--- a/vall_e/inference.py
+++ b/vall_e/inference.py
@ -361,8 +361,14 @@ class TTS():
 		use_lora = sampling_kwargs.pop("use_lora", None)
 		dtype = sampling_kwargs.pop("dtype", self.dtype)
 		amp = sampling_kwargs.pop("amp", self.amp)
+		duration_padding = sampling_kwargs.pop("duration_padding", 1.05)

 		voice_convert = sampling_kwargs.pop("voice_convert", None)
+		# explicitly require this
+		if task != "vc":
+			voice_convert = None
+		elif voice_convert is None:
+			raise Exception("Voice conversion requested, but no reference clip provided.")

 		# transcribe from audio to voice convert from
 		if voice_convert is not None and not text:
@ -425,6 +431,20 @@ class TTS():

 		auto_lang = not language or language == "auto"
 		auto_text_lang = not text_language or text_language == "auto"
+		
+		vc_utterance = self.encode_audio( voice_convert, trim_length=0 ) if voice_convert else None
+		prom = self.encode_audio( references, trim_length=input_prompt_length ) if references else None
+		lang = self.encode_lang( language )
+		
+		if task in ["ns", "sr"]:
+			prom = [
+				task,
+				prom
+			]
+		
+		prom = to_device(prom, device=self.device, dtype=torch.int16)
+		lang = to_device(lang, device=self.device, dtype=torch.uint8)
+		
 		for line in lines:
 			if out_path is None:
 				output_dir = Path("./data/results/")
@ -440,14 +460,8 @@ class TTS():
 			if auto_text_lang:
 				text_language = deduced_language

-			vc_utterance = self.encode_audio( voice_convert, trim_length=0 ) if voice_convert else None
-			prom = self.encode_audio( references, trim_length=input_prompt_length ) if references else None
 			phns = self.encode_text( line, language=text_language )
-			lang = self.encode_lang( language )
-
-			prom = to_device(prom, device=self.device, dtype=torch.int16)
 			phns = to_device(phns, device=self.device, dtype=torch.uint8 if len(self.symmap) < 256 else torch.int16)
-			lang = to_device(lang, device=self.device, dtype=torch.uint8)

 			with torch.autocast(self.device, dtype=dtype, enabled=amp):
 				input_kwargs = dict(
@ -458,8 +472,12 @@ class TTS():
 					use_lora=use_lora,
 				)
 				if model_len is not None:
-					# extra kwargs
-					duration_padding = sampling_kwargs.pop("duration_padding", 1.05)
+					# skip calculating len_list if possible
+					if task in ["ns", "sr"]:
+						len_list = [ prom[1].shape[0] ]
+					elif vc_utterance is not None:
+						len_list = [ vc_utterance.shape[0] ]
+					else:					
 						len_list = model_len( **input_kwargs, task_list=["len"], **{"max_duration": 5} ) # "max_duration" is max tokens

 						# add an additional X seconds
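
The task handling in the `vall_e/inference.py` hunks above can be summarized with a small standalone sketch. It is not the `TTS` class itself: `resolve_voice_convert` and `pick_length` are hypothetical helpers, lengths are plain integers rather than tensor shapes, and `predict_length` stands in for the `len` task call. It assumes the intent is a membership test against the two task names `ns` and `sr`.

```python
def resolve_voice_convert(task, voice_convert):
    """Only keep the voice-conversion clip for the `vc` task, and require it there."""
    if task != "vc":
        return None
    if voice_convert is None:
        raise ValueError("Voice conversion requested, but no reference clip provided.")
    return voice_convert

def pick_length(task, prom_length, vc_length, predict_length):
    """Skip the duration prediction when the output length is already known."""
    # Membership is tested against two separate strings; a single string
    # such as "ns, sr" inside the collection would match neither task name.
    if task in ("ns", "sr"):
        return prom_length       # output is as long as the input audio
    if vc_length is not None:
        return vc_length         # output is as long as the utterance to convert
    return predict_length()      # otherwise fall back to predicting the duration
```

For example, `pick_length("ns", 100, None, lambda: 5)` returns the prompt length without ever calling the predictor.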
--- a/vall_e/models/base.py
+++ b/vall_e/models/base.py
@ -535,11 +535,11 @@ class Base(nn.Module):
 		else:
 			self.proms_emb = AudioEmbedding(
 				[n_audio_tokens] * self.n_resp_levels, d_model,
-				sums=audio_embedding_sums,
+				sums=audio_embedding_sums == "prom" or audio_embedding_sums == True,
 			)
 			self.resps_emb = AudioEmbedding(
 				l_tokens, d_model,
-				sums=audio_embedding_sums,
+				sums=audio_embedding_sums == "resp" or audio_embedding_sums == True,
 				l_names=resp_l_names,
 			)

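
To make the `audio_embedding_sums` change above easier to follow, here is a minimal sketch of how one setting that may be a bool, `"prom"`, or `"resp"` resolves into the two per-embedding flags. The helper name `resolve_embedding_sums` is hypothetical and exists only for illustration; it uses `is True` where the hunk uses `== True`.

```python
def resolve_embedding_sums(audio_embedding_sums):
    """Return (sum_proms, sum_resps) for a setting that is a bool, "prom", or "resp"."""
    sum_proms = audio_embedding_sums == "prom" or audio_embedding_sums is True
    sum_resps = audio_embedding_sums == "resp" or audio_embedding_sums is True
    return sum_proms, sum_resps
```

Under this reading, `True` enables summing for both embeddings, a string enables it for only the named one, and `False` disables both.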
--- a/vall_e/webui.py
+++ b/vall_e/webui.py
@ -127,6 +127,9 @@ def get_speakers():
 def get_languages():
 	return list(get_lang_symmap().keys()) + ["auto"]

+def get_tasks():
+	return ["tts", "sr", "ns", "vc"]
+
 #@gradio_wrapper(inputs=layout["dataset"]["inputs"].keys())
 def load_sample( speaker ):
 	metadata_path = cfg.metadata_dir / f'{speaker}.json'
@ -208,7 +211,7 @@ def do_inference_tts( progress=gr.Progress(track_tqdm=True), *args, **kwargs ):
 	parser = argparse.ArgumentParser(allow_abbrev=False, add_help=False)
 	# I'm very sure I can procedurally generate this list
 	parser.add_argument("--text", type=str, default=kwargs["text"])
-	parser.add_argument("--task", type=str, default="tts")
+	parser.add_argument("--task", type=str, default=kwargs["task"])
 	parser.add_argument("--modality", type=str, default=kwargs["modality"])
 	parser.add_argument("--references", type=str, default=kwargs["reference"])
 	parser.add_argument("--voice-convert", type=str, default=kwargs["voice-convert"])
@ -336,7 +339,7 @@ def do_inference_stt( progress=gr.Progress(track_tqdm=True), *args, **kwargs ):

 	parser = argparse.ArgumentParser(allow_abbrev=False, add_help=False)
 	# I'm very sure I can procedurally generate this list
-	parser.add_argument("--task", type=str, default="tts")
+	parser.add_argument("--task", type=str, default="stt")
 	parser.add_argument("--references", type=str, default=kwargs["reference"])
 	parser.add_argument("--max-duration", type=int, default=0)
 	parser.add_argument("--language", type=str, default=kwargs["language"])
@ -460,6 +463,7 @@ with ui:
 						with gr.Row():
 							layout["inference_tts"]["inputs"]["text-language"] = gr.Dropdown(choices=get_languages(), label="Language (Text)", value="auto", info="Language the input text is in.")
 							layout["inference_tts"]["inputs"]["language"] = gr.Dropdown(choices=get_languages(), label="Language (Output)", value="auto", info="Target language/accent to output.")
+							layout["inference_tts"]["inputs"]["task"] = gr.Dropdown(choices=get_tasks(), label="Task", value="tts", info="")
 						with gr.Row():
 							layout["inference_tts"]["inputs"]["split-text-by"] = gr.Dropdown(choices=["sentences", "lines"], label="Text Delimiter", info="How to split the text into utterances.", value="sentences")
 							layout["inference_tts"]["inputs"]["context-history"] = gr.Slider(value=0, minimum=0, maximum=4, step=1, label="(Rolling) Context History", info="How many prior lines to serve as the context/prefix (0 to disable).")
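
The `Text Delimiter` and `(Rolling) Context History` options above can be illustrated with a short sketch of the bookkeeping involved. This is an assumption-laden example rather than the web UI's code: `split_text` and `with_context` are hypothetical helpers, the sentence splitter is a naive regex, and the prefix here is the prior text pieces, whereas the real pipeline would carry generated utterances forward.

```python
import re

def split_text(text, split_by="sentences"):
    """Split the input prompt into utterances by sentence or by line."""
    if split_by == "lines":
        pieces = text.split("\n")
    else:
        # naive splitter: break on whitespace that follows ., !, or ?
        pieces = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in pieces if p.strip()]

def with_context(utterances, context_history=0):
    """Pair each utterance with up to `context_history` prior utterances as its prefix."""
    paired = []
    for i, utterance in enumerate(utterances):
        prefix = utterances[max(0, i - context_history):i]
        paired.append((prefix, utterance))
    return paired

text = "Hello there. How are you?\nFine!"
sentences = split_text(text, "sentences")
lines = split_text(text, "lines")
rolling = with_context(sentences, context_history=1)
```

A `context_history` of 0 yields an empty prefix for every utterance, matching the slider's "0 to disable" behaviour.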