* `--ar-temperature`: sampling temperature to use for the AR/NAR pass. A value of 0 enables greedy sampling.
  * For the AR, ~1.0 is *fine*, but lowering the temperature makes the output adhere better to the prosody of the input prompt.
  * For the AR, low temperatures require a repetition penalty to prevent outputs from degenerating.
  * For the NAR, greedy sampling is best, but the temperature can be raised up to 0.2.
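To illustrate the greedy-at-zero behavior, here is a minimal sketch of temperature sampling (this is not the project's actual sampler, and the toy logits are made up):

```python
import torch

def sample_with_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Pick a token per batch entry; a temperature of 0 falls back to greedy argmax."""
    if temperature <= 0.0:
        # Greedy: take the single most likely token.
        return logits.argmax(dim=-1)
    # Scale the logits, then sample from the resulting distribution.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

# Toy usage: one decoding step over an 8-token vocabulary.
logits = torch.randn(1, 8)
print(sample_with_temperature(logits, 0.0))   # greedy
print(sample_with_temperature(logits, 0.95))  # slightly sharpened sampling
```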
* `--input-prompt-length`: the duration of the input prompt (~6 seconds is fine; longer durations lead to slower generation for "better" accuracy). A value of 0 disables repeating/trimming.
  * If a prompt is shorter than the given duration, it is repeated until it reaches that duration.
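A sketch of the repeat-then-trim behavior described above, assuming the prompt is a 1-D waveform tensor (the function name and signature are mine, not the project's):

```python
import torch

def fit_prompt(wave: torch.Tensor, sample_rate: int, seconds: float) -> torch.Tensor:
    """Repeat a short prompt until it covers `seconds`, then trim the excess.
    A duration of 0 leaves the prompt untouched."""
    if seconds <= 0:
        return wave
    target = int(sample_rate * seconds)
    if wave.shape[-1] < target:
        repeats = -(-target // wave.shape[-1])  # ceiling division
        wave = wave.repeat(repeats)
    return wave[..., :target]

# Toy usage: a 2-second prompt repeated/trimmed to 6 seconds at 24 kHz.
prompt = torch.randn(2 * 24_000)
print(fit_prompt(prompt, 24_000, 6.0).shape)  # torch.Size([144000])
```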
* `--min-temperature`: triggers the dynamic temperature pathway, which adjusts the temperature based on the confidence of the best token. Acceptable values lie in the range `[0.0, (n)ar-temperature)`.
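One plausible reading of this pathway, sketched below: interpolate between the minimum and base temperatures using the top token's probability as the confidence signal. The exact scaling rule here is an assumption for illustration, not the project's formula:

```python
import torch

def dynamic_temperature_sample(
    logits: torch.Tensor, min_temperature: float, max_temperature: float
) -> torch.Tensor:
    """Sample with a temperature that shrinks as the model grows more confident."""
    probs = torch.softmax(logits, dim=-1)
    confidence = probs.max(dim=-1).values  # probability of the best token, in (0, 1]
    # High confidence pulls the temperature toward min_temperature (near-greedy);
    # low confidence leaves it near the base (maximum) temperature.
    temperature = max_temperature - confidence * (max_temperature - min_temperature)
    temperature = temperature.clamp_min(1e-6)  # guard against division by zero
    scaled = torch.softmax(logits / temperature.unsqueeze(-1), dim=-1)
    return torch.multinomial(scaled, num_samples=1).squeeze(-1)

logits = torch.randn(1, 8)
print(dynamic_temperature_sample(logits, min_temperature=0.5, max_temperature=1.0))
```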
* `--repetition-penalty`: modifies the probability of tokens that have already appeared. In the context of audio generation, this is a very iffy parameter to use.
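For reference, the common CTRL/Hugging Face-style formulation of a repetition penalty looks roughly like this sketch; whether the project applies it in exactly this form is not confirmed here:

```python
import torch

def apply_repetition_penalty(
    logits: torch.Tensor, previous_tokens: torch.Tensor, penalty: float
) -> torch.Tensor:
    """Penalize tokens that already appeared, CTRL-style: divide positive
    logits by `penalty` and multiply negative ones, so both become less likely."""
    logits = logits.clone()
    seen = logits[previous_tokens]
    logits[previous_tokens] = torch.where(seen > 0, seen / penalty, seen * penalty)
    return logits

# Toy usage over an 8-token vocabulary, with tokens 1 and 3 already generated.
logits = torch.randn(8)
print(apply_repetition_penalty(logits, torch.tensor([1, 3]), penalty=1.2))
```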
* `--length-penalty`: (AR only) modifies the probability of the stop token based on the current sequence length. This is ***very*** finicky to use, since the AR is already well correlated with the output length.
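A sketch of one way such a penalty might be applied; the additive form and the stop-token index are assumptions for illustration:

```python
import torch

def apply_length_penalty(
    logits: torch.Tensor, length: int, penalty: float, stop_token: int = 0
) -> torch.Tensor:
    """Nudge the stop token's logit up (or down) as the sequence grows."""
    logits = logits.clone()
    # A positive penalty makes stopping more likely the longer the output runs.
    logits[stop_token] += penalty * length
    return logits

# Toy usage: 120 tokens generated so far, gently encouraging the stop token.
logits = torch.randn(8)
print(apply_length_penalty(logits, length=120, penalty=0.01))
```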
* `--beam-width`: (AR only) specifies the number of branches to search through for beam sampling.
  + This is a very naive implementation that's effectively just greedy sampling across `B` spaces.
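For illustration, a single step of beam search in this naive style could look like the sketch below; the real decode loop, model call, and per-beam bookkeeping are elided, and all names are mine:

```python
import torch

def naive_beam_step(
    scores: torch.Tensor, logits: torch.Tensor, width: int
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """One step of naive beam search: expand every beam by the whole vocabulary,
    then keep the `width` best cumulative log-probabilities.

    scores: (width,) running log-prob per beam; logits: (width, vocab)."""
    log_probs = torch.log_softmax(logits, dim=-1)
    combined = scores.unsqueeze(-1) + log_probs  # (width, vocab)
    top_scores, flat_idx = combined.flatten().topk(width)
    beam_idx = flat_idx // logits.shape[-1]   # which beam each pick came from
    token_idx = flat_idx % logits.shape[-1]   # which token extends it
    return top_scores, beam_idx, token_idx

# Toy usage: two beams over an 8-token vocabulary (stand-ins for real model logits).
scores = torch.zeros(2)
logits = torch.randn(2, 8)
print(naive_beam_step(scores, logits, width=2))
```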
* `--mirostat-tau`: (AR only) the "surprise value" to target when performing mirostat sampling.
  + This simply uplifts the [original implementation](https://github.com/basusourya/mirostat/blob/master/mirostat.py) to perform it.
  + **!**NOTE**!**: This is incompatible with beam search sampling (for the time being, at least).
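The linked repo implements Mirostat v1; purely to illustrate the feedback idea, here is a sketch in the simpler Mirostat 2.0 style (the state layout and names are mine):

```python
import torch

def mirostat_v2_sample(logits: torch.Tensor, state: dict) -> int:
    """One Mirostat 2.0 step: drop tokens whose surprise exceeds `mu`,
    sample from the rest, then nudge `mu` toward the target surprise `tau`."""
    probs = torch.softmax(logits, dim=-1)
    surprises = -torch.log2(probs)
    keep = surprises <= state["mu"]
    if not keep.any():  # always keep at least the most likely token
        keep = surprises == surprises.min()
    filtered = torch.where(keep, probs, torch.zeros_like(probs))
    token = torch.multinomial(filtered / filtered.sum(), 1).item()
    # Feedback step: observed surprise above tau lowers mu, below tau raises it.
    state["mu"] -= state["eta"] * (surprises[token].item() - state["tau"])
    return token

state = {"tau": 3.0, "eta": 0.1, "mu": 6.0}  # mu conventionally starts at 2 * tau
logits = torch.randn(32)
print(mirostat_v2_sample(logits, state))
```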
Some arguments can be prefixed with `ar-` or `nar-` to apply that setting only to its respective pass; through the CLI, this currently includes the temperature settings (e.g. `--ar-temperature`).
Currently, the model only transcribes back into the IPA phonemes it was trained against; an additional model or external program is required to translate those IPA phonemes back into text.
* This does make a model that can phonemize and de-phonemize text more desirable in the future as a replacement for espeak (handling this as an additional task would require extra embeddings and output heads, and could harm the model, since raw text is not a modality it was trained on).