* `Max Steps`: The number of demasking steps to perform for RVQ level 0, under the `NAR-len` modality.
* `Max Duration`: The maximum duration of the output audio.
* `Input Prompt Repeat/Trim Length`: The audio prompt will be trimmed down or repeated to match this duration (although repeating may do more harm than good); see the sketch after this list.
* `Language (Text)`: The language of the input text, used for phonemization.
* `Language (Output)`: The target language for the output audio. Some model checkpoints may unfortunately ignore this due to how they were trained, while other models may use it to steer the output accent.
* `Task`: The task to perform (in order): Text-To-Speech, Speech Removal, Noise Reduction, Voice Conversion.
* `Text Delimiter`: How to split the `Text/Input Prompt`: splitting by sentences uses sentence boundaries, while splitting by lines uses newlines.
* `(Rolling) Context History`: Paired with the above, the previous N utterances serve as the prefix to extend the generation from, keeping the output consistent and stable across pieces.
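
Below is a minimal sketch of how a few of these knobs behave, assuming a mono audio prompt as a 1D `torch.Tensor`. The function names and the naive sentence regex are illustrative only, not the project's actual implementation:

```python
import re
import torch

def fit_prompt_to_length(waveform: torch.Tensor, sample_rate: int, seconds: float) -> torch.Tensor:
    """Input Prompt Repeat/Trim Length: trim the prompt down, or repeat it until it is long enough."""
    target = int(sample_rate * seconds)
    if waveform.shape[-1] >= target:
        return waveform[..., :target]               # trim down
    repeats = -(-target // waveform.shape[-1])      # ceiling division
    return waveform.repeat(repeats)[..., :target]   # repeat, then trim the overshoot

def split_text(text: str, delimiter: str = "sentences") -> list[str]:
    """Text Delimiter: split the Text/Input Prompt into pieces synthesized one at a time."""
    if delimiter == "lines":
        return [line.strip() for line in text.splitlines() if line.strip()]
    # naive sentence split; real sentence segmentation may differ
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def rolling_context(history: list[torch.Tensor], n: int) -> torch.Tensor | None:
    """(Rolling) Context History: concatenate the previous N utterances as the prefix for the next piece."""
    if n <= 0 or not history:
        return None
    return torch.cat(history[-n:], dim=-1)
```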
All the additional knobs have descriptions that correlate to the inferencing CLI flags.
For models that support it, Speech-To-Text phoneme transcription can be performed through the `Speech-to-Text` tab.
## Dataset
This tab currently only supports exploring a dataset that has already been prepared and referenced in your `config.yaml`. You can select a registered voice and have it randomly sample an utterance.
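
As a rough illustration of that sampling behavior, under a hypothetical `config.yaml` layout where voices map to lists of utterances (the real schema differs), picking a random utterance could look like:

```python
import random
import yaml

# Hypothetical layout: {"dataset": {"voices": {voice_name: [utterance, ...]}}}.
# This only illustrates the "pick a voice, sample an utterance" behavior of the Dataset tab.
with open("config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

voices = config["dataset"]["voices"]
voice = random.choice(list(voices.keys()))   # or the voice selected in the UI
utterance = random.choice(voices[voice])
print(voice, utterance)
```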
In the future, this *should* contain the necessary niceties to process raw audio into a dataset to train or finetune with, without needing to invoke the above commands to prepare the dataset.
## Settings

So far, this tab only allows you to load a different model under a different dtype, device, and/or attention mechanism without needing to restart. The previous model should seamlessly unload, and the new one will load in its place.
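
A minimal sketch of that unload-then-reload pattern, assuming a checkpoint that pickles a full `nn.Module` (not the project's actual loader, which also selects the attention backend through its config):

```python
import gc
import torch

def swap_model(current, checkpoint_path: str, device: str = "cuda",
               dtype: torch.dtype = torch.float16) -> torch.nn.Module:
    """Drop the old model, free its memory, then load a replacement with the requested dtype/device."""
    # release the previous model so its weights can be garbage-collected
    del current
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    # assumes the checkpoint stores a full nn.Module; the project instead loads
    # weights through its own config
    model = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
    return model.to(device=device, dtype=dtype).eval()
```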