webui.py
A Gradio-based web UI is accessible by running python3 -m vall_e.webui. You can, optionally, pass:
- --yaml=./path/to/your/config.yaml: will load the targeted YAML
- --model=./path/to/your/model.sft: will load the targeted model weights
- --listen 0.0.0.0:7860: will set the web UI to listen on all IPs at port 7860. Replace the IP and port with your preference.
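As a rough sketch of how these flags combine, the snippet below assembles the launch command in Python. The helper name build_webui_command is hypothetical and not part of vall_e; only the module name and the three flags come from the text above.

```python
import shlex


def build_webui_command(yaml_path=None, model_path=None, listen=None, python="python3"):
    """Assemble the argv list for launching the web UI (hypothetical helper)."""
    cmd = [python, "-m", "vall_e.webui"]
    if yaml_path:
        cmd.append(f"--yaml={yaml_path}")
    if model_path:
        cmd.append(f"--model={model_path}")
    if listen:
        # The text above shows --listen taking a space-separated HOST:PORT value.
        cmd += ["--listen", listen]
    return cmd


if __name__ == "__main__":
    cmd = build_webui_command(
        yaml_path="./path/to/your/config.yaml",
        listen="0.0.0.0:7860",
    )
    # Print the shell-quoted command; pass `cmd` to subprocess.run to launch it.
    print(shlex.join(cmd))
```

Every argument is optional, so calling the helper with no arguments yields the bare python3 -m vall_e.webui invocation.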
Inference
Synthesizing speech is simple:
- Text:
  - Input Prompt: The guiding text prompt. Each segment will be its own generated audio to be stitched together at the end.
- Audio:
  - Audio Input: The transcription of the audio will be inserted into the Text/Input Prompt box.
    - For the vc task, this will serve as the guidance reference audio as well.
- Audio Input: The reference audio for the synthesis. Under Gradio, you can trim your clip accordingly, but leaving it as-is works fine.
  - A properly trained model can run inference without a prompt to generate a random voice (without even needing to generate a random prompt itself).
- Output: The resultant audio.
- Inference: Button to start generating the audio.
- Basic Settings: Basic sampler settings for most uses.
  - Max Steps: Number of demasking steps to perform for RVQ level 0. Applies to the NAR-len modality.
  - Max Duration: Maximum duration of the output audio.
  - Input Prompt Repeat/Trim Length: The audio prompt will be trimmed down or repeated to match this duration (although repeating might do more harm than good).
  - Language (Text): The language of the input text for phonemizing.
  - Language (Output): The target language for the output audio. Some checkpoints of the model might ignore this due to how they were trained, unfortunately. Some models might steer the output accent.
  - Task: The task to perform (in order): Text-To-Speech, Speech Removal, Noise Reduction, Voice Conversion.
  - Text Delimiter: How to split the Text/Input Prompt. Sentences will split by sentences, while lines will split by new lines.
  - (Rolling) Context History: Paired with the above, the previous N utterances will serve as the prefix to extend the generation on, allowing for consistency and stability across pieces.
- Sampler Settings: Advanced sampler settings that are common to most text LLMs, but need experimentation.
- Experimental Settings: Settings used for testing. cfg.experimental=True enables this tab.
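To make the Input Prompt Repeat/Trim Length setting concrete, here is a minimal sketch of trimming or repeating a clip to a target duration. It is an illustration only: fit_to_length is a hypothetical name, it works on a plain Python list of samples rather than on whatever audio representation vall_e uses internally, and the project's actual behavior may differ.

```python
def fit_to_length(samples, sample_rate, seconds):
    """Trim or repeat `samples` so the result lasts `seconds` (illustrative only)."""
    target = int(round(sample_rate * seconds))
    if not samples or target <= 0:
        return []
    if len(samples) >= target:
        # Clip is long enough: keep only the first `target` samples.
        return samples[:target]
    # Clip is too short: tile it enough times to cover the target, then cut.
    repeats = -(-target // len(samples))  # ceiling division
    return (samples * repeats)[:target]
```

Repeating a short clip produces audible seams where the copies meet, which is one way to picture why repetition can hurt more than it helps.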
All the additional knobs have a description that can be correlated to the inferencing CLI flags.
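The Text Delimiter and (Rolling) Context History settings under Basic Settings can be pictured with the sketch below. It is not the project's own splitter: split_text and with_rolling_context are hypothetical names, the sentence split is a naive regex, and the context here tracks only text segments, whereas the web UI extends generation on previous utterances.

```python
import re
from collections import deque


def split_text(text, delimiter="sentences"):
    """Split the input prompt into segments (illustrative, not vall_e's splitter)."""
    if delimiter == "lines":
        parts = text.split("\n")
    else:
        # Naive sentence split: break on whitespace that follows ., ! or ?
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def with_rolling_context(segments, history=2):
    """Yield (prefix, segment) pairs; prefix holds up to `history` prior segments."""
    recent = deque(maxlen=history)
    for segment in segments:
        yield list(recent), segment
        recent.append(segment)
```

With a history of N, each segment is generated with at most the N most recent segments as its prefix, and older ones fall out of the window.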
Speech-To-Text phoneme transcriptions for models that support it can be done using the Speech-to-Text tab.
Dataset
This tab currently only features exploring a dataset already prepared and referenced in your config.yaml. You can select a registered voice, and have it randomly sample an utterance.
In the future, this should contain the necessary niceties to process raw audio into a dataset to train/finetune through, without needing to invoke the above commands to prepare the dataset.
Settings
So far, this only allows you to load a different model under a different dtype, device, and/or attention mechanism without needing to restart. The previous model should seamlessly unload, and the new one will load in its place.