diff --git a/docs/config.md b/docs/config.md
index af1b257..090e902 100644
--- a/docs/config.md
+++ b/docs/config.md
@@ -72,7 +72,11 @@ This class governs the behavior during the evaluation / validation pass during t
 If `cfg.evaluation.size > 0`, then the evaluation / validation passes are triggered every `cfg.evaluation.frequency` iteration steps.

-During evaluation, a separate copy of the training dataset will be sampled and the inputs will be inferenced to generate an output, while during validation, the validation dataset is sampled from instead.
+During evaluation:
+* for the `subtrain` evaluation pass, the training dataset is directly sampled through indices, rather than through the iterator, to avoid having to duplicate the dataset.
+  * in the future, the samples during this pass should sample around the training dataloader's current position.
+* for the `val` validation pass, the validation dataset is sampled through the dataloader's iterator.
+  * currently, the validation dataloader's sampler is not stored.

 A total of `cfg.evaluation.size` samples are inferenced in no more than `cfg.evaluation.batch_size`-sized batches (no more than, because batched samplers may return different sized batches).
diff --git a/docs/data.md b/docs/data.md
index 47ace4b..7e5e639 100644
--- a/docs/data.md
+++ b/docs/data.md
@@ -7,12 +7,17 @@ Most of these settings live under `cfg.dataset`.
 ## Dataset

 The provided reference model was trained on `?`k hours of audio with a mix of:
-* LibriTTS-R's entire dataset
-* `small`+`medium`+`duplicate` portions of LibriVox
-* Emilia's German, French, and Japanese dataset
-* 12K hours of a privately sourced corpus of 425 audiobooks
-* a small portion of Emilia's English dataset
-* a personal small corpus of transcribed utterances from a selection of video games
+* 490.151 hours (out of 585 hours) of LibriTTS-R's entire dataset
+* 8362.304 hours (out of 10270 hours) of `small`+`medium`+`duplicate` portions of LibriLight
+* 4467.611 hours (out of `?` hours) of Emilia's German, French, and Japanese dataset
+* 2927.186 hours (out of `?` hours) of a privately sourced corpus of 425 audiobooks
+* 2364.799 hours (out of `?` hours) of Emilia's English dataset
+* 54.775 hours of a small personal corpus of transcribed utterances from a selection of video games
+
+These durations were reported from the training script directly:
+* Utterances under 3 seconds or above 32 seconds were culled from the duration count.
+* Metadata was *mostly* derived from the transcription metadata.
+  * LibriTTS-R's duration metadata was derived from the quantized audio size.

 ### Leverage Your Own Dataset
diff --git a/docs/export.md b/docs/export.md
index 3e174af..5ea603e 100644
--- a/docs/export.md
+++ b/docs/export.md
@@ -4,6 +4,6 @@ To export the models, run: `python -m vall_e.export --yaml=./training/config.yam
 This will export the latest checkpoints, for example, under `./training/ckpt/ar+nar-retnet-8/fp32.pth`, to be loaded on any system with PyTorch, and will include additional metadata, such as the symmap used, and training stats.

-Desite being called `fp32.pth`, you can export it to a different precision type with `--dtype=float16|bfloat16|float32`.
+Despite being called `fp32.sft` or `fp32.pth`, you can export it to a different precision type with `--dtype=float16|bfloat16|float32`.

 You can also export to `safetensors` with `--format=sft`, and `fp32.sft` will be exported instead.
\ No newline at end of file
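As a quick illustration of the export notes above, the sketch below shows one way the exported files might be inspected on another machine. It is not part of the repository; the paths reuse the example layout from `export.md`, and nothing is assumed about the exact keys stored alongside the weights (symmap, training stats).

```python
# Hypothetical inspection script for the exported checkpoints.
import torch
from safetensors.torch import load_file  # requires the `safetensors` package

# The plain PyTorch export, loadable on any system with PyTorch installed.
checkpoint = torch.load("./training/ckpt/ar+nar-retnet-8/fp32.pth", map_location="cpu")
print(type(checkpoint))

# The safetensors export, produced when `--format=sft` is passed.
tensors = load_file("./training/ckpt/ar+nar-retnet-8/fp32.sft", device="cpu")
for name, tensor in list(tensors.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```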
diff --git a/docs/inferenece.md b/docs/inferenece.md
index d6f3fcc..9de98ae 100644
--- a/docs/inferenece.md
+++ b/docs/inferenece.md
@@ -13,44 +13,51 @@ To synthesize speech: `python -m vall_e --yaml=
RVQ level 0, embedding level 2 => RVQ level 1, etc...
 * I believe this is because the model needs to "know" whether to predict ~~the next token in the sequence, or the token in the same position of the next RVQ level~~ which tokens of a given embedding.
   * In other words, the AR's RVQ level 0 embedding predicts itself, while the NAR's embeddings predict the next level's embeddings.
+    * This is evident in how RVQ level 0 can be trained causally and in parallel with its own embeddings, rather than running into limiting issues when reusing the same embedding across the two.
 * Unfortunately, providing a token for the current/target RVQ level within the input sequence doesn't seem to help? I don't remember if I experimented with this or not, but testing of a "sane" `resp` embedding proved to be unfruitful.

 The `prom` and `resp` are split since, in theory, it helps the model know better what audio to source from, and what audio is part of the output sequence. In theory.
@@ -127,7 +135,7 @@ Finally, the model *may* then sum each embedding level back down to one sequence
 * It *could* be beneficial to train a model under mixed modes, but requires experimentation.
 * The reference model was trained originally without summing, then trained with summing.

-Additionally, it's *technically* possible to instead use the embeddings from the core model used to encode the audio, but in theory this may exclude specific features the model has encoded within the embeddings.
+Additionally, it's *technically* possible to instead use the embeddings from the model used to encode the audio (for example, EnCodec's embeddings), but in theory this may exclude specific features the model has encoded within the embeddings.

 #### RVQ Level Embedding

@@ -204,7 +212,7 @@ This task will follow a reverse sequence of `
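To make the summing described in the `models.md` hunk above concrete, here is a minimal sketch, assuming hypothetical module names and dimensions, of giving each RVQ level its own embedding table and summing the per-level embeddings back down to one sequence. It is not the repository's actual implementation.

```python
# Minimal sketch: one embedding table per RVQ level, summed into a single
# sequence of vectors. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class SummedAudioEmbedding(nn.Module):
    def __init__(self, n_levels: int = 8, n_tokens: int = 1024, d_model: int = 1024):
        super().__init__()
        # a separate embedding table for each RVQ level, rather than one shared table
        self.levels = nn.ModuleList(
            [nn.Embedding(n_tokens, d_model) for _ in range(n_levels)]
        )

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: [batch, time, n_levels] of RVQ token ids
        embedded = [emb(codes[..., i]) for i, emb in enumerate(self.levels)]
        # sum across levels back down to one sequence: [batch, time, d_model]
        return torch.stack(embedded, dim=-1).sum(dim=-1)

codes = torch.randint(0, 1024, (1, 75, 8))  # ~1 second of hypothetical 8-level codes
x = SummedAudioEmbedding()(codes)
print(x.shape)  # torch.Size([1, 75, 1024])
```

A causal pass over RVQ level 0 and a parallel pass over the higher levels could both consume a summed sequence like this, which is the distinction the bullets above are describing.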