documentation update
parent: b1369e7824
commit: efeb55e1b7
@@ -29,6 +29,8 @@ Additional global-states can be found here, such as:

* the end user should not touch this, as it not only depends on the model used, but also governs which audio codec processed audio is stored under for the dataset.
* `weights_format`: the default weights format to save and load state dicts to.
  * the end user shouldn't worry about this, as SafeTensors are primarily used, but the program can easily handle any pickled dicts if requested.
* `weights_name`: the name (without the extension) to load the weights from directly. Defaults to `fp32`.
  * the end user shouldn't worry about this, but it makes regression testing much easier without needing to juggle renaming files.

On initialization, this class then validates its member variables to ensure they're instances of the classes below, rather than dicts (a sketch of this pattern follows).
* Backwards compatibility validation may be performed during this step as well.
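A minimal sketch of that validation pattern, using illustrative `Dataset`/`Config` dataclasses rather than the actual config classes:

```python
from dataclasses import dataclass, field, fields

@dataclass
class Dataset:
    workers: int = 8
    cache: bool = True

@dataclass
class Config:
    dataset: Dataset = field(default_factory=Dataset)

    def __post_init__(self):
        # coerce raw dicts (e.g. parsed from the YAML) into their dataclass types
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, dict):
                setattr(self, f.name, f.type(**value))

# a dict parsed from the YAML gets promoted to a Dataset instance on construction
cfg = Config(dataset={"workers": 4, "cache": False})
assert isinstance(cfg.dataset, Dataset)
```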
@@ -162,6 +162,8 @@ This entry in the config YAML handles knobs and features related to the dataload

* `workers`: number of worker processes to handle dataloading under PyTorch.
* `cache`: use `diskcache` to cache processed results so subsequent runs don't need to reprocess them. This handles *all* `diskcache` requests throughout the program if requested, but should only really be used under this script.
* `min_utterances`: minimum number of utterances a speaker must have to be treated as valid.
* `max_utterances`: maximum number of utterances a speaker can have. The remaining utterances are sliced off.
  * This is beneficial if a portion of your dataset has speakers with a ton of utterances, but you want to train on a plethora of speakers instead, to balance out speaker representation.
* `duration_range`: a list of two values denoting the range of durations for which a sample is valid for the dataloader (see the filtering sketch after this list).
* `sample_type`: type of sampler to use. Currently accepts `path` (an epoch is all paths in the dataset, and each index maps to a sample) or `speaker` (an epoch is all speakers in the dataset, and each index maps to a speaker).
* `sample_order`: order in which the dataloader keeps its samples. Currently accepts `interleaved` (tries to balance per speaker) and `duration` (orders by duration to keep throughput and VRAM usage consistent).
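A rough sketch of how the utterance and duration knobs might be applied when building the dataset; the `metadata` structure and helper name here are illustrative assumptions, not the actual dataloader code:

```python
# metadata: dict mapping speaker -> list of (path, duration_seconds); illustrative only
def filter_dataset(metadata, min_utterances=4, max_utterances=0, duration_range=(1.0, 30.0)):
    lo, hi = duration_range
    dataset = {}
    for speaker, utterances in metadata.items():
        # drop samples whose duration falls outside the accepted range
        kept = [(path, dur) for path, dur in utterances if lo <= dur <= hi]
        # a speaker needs enough utterances to be considered valid
        if len(kept) < min_utterances:
            continue
        # slice off any utterances beyond the cap to balance speaker representation
        if max_utterances and len(kept) > max_utterances:
            kept = kept[:max_utterances]
        dataset[speaker] = kept
    return dataset
```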
@@ -7,9 +7,10 @@ The underlying model is a robust transformer, where:

The beauty of a transformer, I feel, is that you can easily define any task for it, and it should follow through with it very well.

The inputs are automatically sequenced in a way that a given task requires, and the outputs are handled as per the class that extends the base model.
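As a purely illustrative sketch of that idea (the task names, segment names, and ordering here are assumptions, not the model's actual input format):

```python
def sequence_inputs(task, text, prompt, response=None):
    """Assemble the input segments a given task expects, in order.

    Each entry is a (name, tokens) pair; embedding and summing happen elsewhere.
    Purely illustrative -- the real model defines its own tasks and ordering.
    """
    if task == "tts":
        # text-to-speech: condition on text and an acoustic prompt, predict the response
        segments = [("text", text), ("prompt", prompt)]
    elif task == "stt":
        # speech-to-text: condition on the audio, predict the transcription
        segments = [("prompt", prompt)]
    else:
        raise ValueError(f"unknown task: {task}")
    if response is not None:
        segments.append(("response", response))
    return segments
```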

While the original paper called for a separate AR model and a NAR model, by treating the AR and the NAR as unique tasks you can actually train a unified model (`AR+NAR`) effectively for free, as the internal states of the two should overlap quite a lot.
* Additionally, you can even train a `NAR-len` model on top of an existing model.

## The AR (Autoregressive) Model

@@ -33,7 +34,7 @@ The NAR is responsible for generating the remaining RVQ levels of the audio code

As decoding is done non-autoregressively, the model can process tokens "in place" and have them attend to one another across both past and future positions, thus speeding up output and allowing for "more accurate" outputs.

Non-autoregressive training is performed by having the input tokens from the previous RVQ level predict the next level's tokens in place. The output logits are in the same position, and do not require the further modifications that the AR requires.
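A minimal sketch of that idea, assuming a toy `model` callable and pre-quantized codes of shape `[rvq_levels, seq_len]` (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def nar_training_step(model, codes, level):
    """One NAR training step for a given RVQ level.

    codes: LongTensor [rvq_levels, seq_len] of quantized audio tokens.
    level: which RVQ level to predict (1..rvq_levels-1).
    Illustrative only -- the real model also conditions on text, prompt, etc.
    """
    inputs  = codes[:level]   # tokens from the prior level(s), in place
    targets = codes[level]    # the current level's tokens, at the same positions
    logits  = model(inputs, level=level)   # [seq_len, vocab_size]
    # logits line up position-for-position with the targets; no shifting needed
    return F.cross_entropy(logits, targets)
```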

One problem exhibited by a NAR is producing artifacts ("crust") in the final waveform. I believe this is a confidence problem where the wrong token is inferred.
* Unfortunately, one solution is to simply train a separate NAR, as this should help bolster the model's NAR capabilities without the AR influencing things, as I imagine being able to decode tokens both causally and in parallel harms things.

@@ -57,8 +58,7 @@ However, having a pure NAR is challenging, as you need to both explicitly provid

The NAR-len model keeps things simple by:
* training with a fixed masking ratio (80% of the tokens are masked, and the model is trained to predict those masked tokens); see the sketch after this list.
  * [this paper](https://arxiv.org/abs/2406.05478v1) mentions a fixed ratio during training yields better results than randomly picking a masking ratio.
  * randomly picking a duration ~~is actually very ungood and harms the model during training~~ actually doesn't matter much.
    * this may only matter if swapping from training on a fixed masking ratio to a random ratio without any timestep information being added.
* not including any specific timestep embedding information
  * some solutions add in the (sinusoidally positioned) timestep embedding, either on top of the input embeddings, or as some normalization weight around the attention head (before and after).
  * it does not seem to be necessary whatsoever to require this, especially when training under a fixed masking ratio.
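A minimal sketch of training with that fixed masking ratio, assuming a toy `model` callable, a `MASK` token id, and level-0 codes of shape `[seq_len]` (all illustrative):

```python
import torch
import torch.nn.functional as F

MASK = 1024  # illustrative mask token id

def nar_len_training_step(model, codes, ratio=0.8):
    """Mask a fixed 80% of the RVQ level 0 tokens and predict them in place."""
    seq_len = codes.shape[0]
    n_masked = int(seq_len * ratio)
    masked = torch.zeros(seq_len, dtype=torch.bool)
    masked[torch.randperm(seq_len)[:n_masked]] = True
    inputs = codes.clone()
    inputs[masked] = MASK
    logits = model(inputs)     # [seq_len, vocab_size]
    # the loss is only computed over the masked positions
    return F.cross_entropy(logits[masked], codes[masked])
```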
@@ -69,9 +69,13 @@ The NAR-len model keeps things simple by:

* it could be in any base, but it's simple to just treat each token ID as a digit, then cast the string to an int.
* inferencing is a simple loop that takes the best k masked-off tokens per step, and remasks the remaining (see the sketch after this list).
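A rough sketch of both ideas, the digit-cast duration and the iterative demasking loop; the `model` callable, `MASK` id, and step schedule are illustrative assumptions:

```python
import torch

MASK = 1024  # illustrative mask token id

def duration_from_digits(tokens):
    # each predicted token ID is treated as one digit of the duration
    return int("".join(str(t) for t in tokens))

@torch.no_grad()
def demask(model, seq_len, steps=25, temperature=1.0):
    """Iteratively fill a fully-masked sequence, keeping the best tokens each step."""
    tokens = torch.full((seq_len,), MASK, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens)                     # [seq_len, vocab_size]
        probs = (logits / temperature).softmax(dim=-1)
        confidence, candidates = probs.max(dim=-1)
        # "ban" already-kept tokens from being re-masked by pinning their score high
        confidence[tokens != MASK] = float("inf")
        # keep the k most confident positions this step, remask the rest
        k = int(seq_len * (step + 1) / steps)
        keep = confidence.topk(k).indices
        new_tokens = torch.full_like(tokens, MASK)
        new_tokens[keep] = torch.where(tokens[keep] != MASK, tokens[keep], candidates[keep])
        tokens = new_tokens
    return tokens
```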

Because the model already leverages the magic of attention to derive phoneme alignment, such annotations are still not required (but they probably help with a naive sampler).

In theory, demasking for the NAR's RVQ level 0 can also be applied to the remaining RVQ levels to further improve the output from the remaining levels.
* this isn't necessary, as the model already has a strong enough relationship between the prompt, the prior levels, and the targeted level.
* this is technically already offered with `cfg.model.experimental.token_dropout_rate`, which mirrors masking, but experimentation has not been done to a large degree.
* there is a bit of a problem with properly implementing this, as the tokens aren't predicting themselves.
  * it may be a simple thing to implement anyways.

It is ***crucial*** to:
* avoid re-masking tokens that are already "good" enough (this can easily be done by "banning" them in the scoring process)
@@ -79,6 +83,8 @@ It is ***crucial*** to:

* use unfiltered/unprocessed logit scores:
  * not that crucial, but helps stability.

It is not required to train a model from scratch to use this modality, as using existing weights works just as well, if not better (as it can piggyback off the original model).

## Embeddings (and Classifiers)

The "magic" of subjugating a transformer for audio use lies within the ensemble of the embeddings. This is necessary as each piece of a sequence is fundamentally different, but an HF-compatible model can get away with treating each sequence as separate ranges within a total token sequence.
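A rough sketch of the contrast, with illustrative vocabulary sizes and names (not the model's actual embedding classes):

```python
import torch
import torch.nn as nn

d_model = 1024

# ensemble-of-embeddings approach: one embedding table per input type
text_emb   = nn.Embedding(256,  d_model)   # phoneme tokens
prompt_emb = nn.Embedding(1025, d_model)   # prompt audio codes (+ mask token)
resp_emb   = nn.Embedding(1025, d_model)   # response audio codes

def embed_ensemble(text, prompt, response):
    # each segment is embedded with its own table, then concatenated along time
    return torch.cat([text_emb(text), prompt_emb(prompt), resp_emb(response)], dim=0)

# HF-compatible approach: one big table, each input type mapped to its own ID range
vocab = nn.Embedding(256 + 1025 + 1025, d_model)

def embed_ranges(text, prompt, response):
    # offsets shift each segment into its reserved slice of the shared vocabulary
    return vocab(torch.cat([text, prompt + 256, response + 256 + 1025], dim=0))
```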
@@ -81,4 +81,12 @@ The huge caveat is that this requires tuning the parameters and thresholds per m

Additionally, one state requires injecting a CoT token, which doesn't have an analog in the audio domain.

However, this does seem to serve as a good basis to expand upon, sampling according to the entropy/varentropy of the model's current state.

### Classifier-Free Guidance

While this isn't a direct sampler type used, a helper function is provided to perform classifier-free guidance, given positive (the primary) logits and negative (the null) logits. While the `NAR-len` modality requires this at the moment, it can easily be adapted to everything else.

Rescaling is also applied to avoid clipping the logits.

Due to the logits being the full sequence, and the input lengths differing, a list of lengths is required to be passed so that only the last N logits are modified.
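A minimal sketch of such a helper, assuming a guidance `scale` and a simple rescale factor (the exact parameterization in the codebase may differ):

```python
import torch

def cfg_logits(positive, negative, lengths, scale=3.0, rescale=0.7):
    """Classifier-free guidance over batched full-sequence logits.

    positive/negative: [batch, seq_len, vocab] logits from the primary and null inputs.
    lengths: how many trailing positions of each sequence to actually modify.
    Illustrative sketch -- not the codebase's exact implementation.
    """
    guided = negative + scale * (positive - negative)
    out = positive.clone()
    for i, n in enumerate(lengths):
        pos = positive[i, -n:]
        cfg = guided[i, -n:]
        # rescale the guided logits back toward the positive logits' scale to avoid clipping
        factor = pos.std() / (cfg.std() + 1e-8)
        out[i, -n:] = rescale * (cfg * factor) + (1.0 - rescale) * cfg
    return out
```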