# Model Notes

To be filled.

## Emergent Behavior

The model can be prompted in creative ways to yield some interesting behaviors:
* Prompting without an input audio prompt will have the model generate a random voice, at the "cost" of some unintelligible utterance at the beginning of the output response (despite no promptless training being done).
  * Finetunes / LoRAs can benefit from this by supporting promptless synthesis, while still allowing an input audio prompt for guidance.
* Prompting with an input text prompt that is the transcription of the input audio prompt will have the response follow the input prompt very closely (despite no input=output training being done).
  * This should allow for easy transcription editing without much fuss.

# `models/*`

This folder contains scripts relating to models and code for VALL-E use, from the wrapping model to the underlying arch.

## `models/lora.py`

This script implements low-rank adapters (LoRA), to allow for cheaper and easier finetuning of existing modules.

At the moment, two approaches are offered: replacing an `nn.Linear` outright, or parametrizing an existing `nn.Linear`'s weight. The latter is used by default. A minimal sketch of the parametrized approach is included at the end of these notes.

## `models/base.py`

This script implements the core underlying model for VALL-E. It handles:
* storing its settings and features, and initializing the right modules
* processing inputs into a proper input sequence
* orchestrating running text and audio through their respective embeddings
* generating the right padding, masking, and position IDs to feed the underlying arch (if requested)
* removing padding from the logits
* performing loss calculation, both as a whole and in individual pieces, both autoregressively and non-autoregressively
* sampling the logits through the samplers provided in `./vall_e/samplers.py`, both autoregressively and non-autoregressively

This script aims to implement everything VALL-E requires agnostically, to allow for different implementations to contain little extra code.

### Tasks

The base model handles processing inputs into token sequences, per the requested task assigned to each input in a batch.

Most sequences follow a `<text><RVQ level><language><prompt><output>` ordering, but some tasks will receive the prompt as a list of tensors instead (a sketch of this assembly follows below).

The length predictor `len` task will naively output the length in base 10, followed by a stop token (see the decoding sketch below).

Speech-to-Text will follow a reverse sequence of `<audio><language><RVQ level><output>`.
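As a rough illustration of the per-task orderings above, here is a minimal sketch of how pre-embedded segments could be concatenated into one sequence. The `LAYOUTS` table and `build_sequence` helper are hypothetical names for illustration, not the actual code in `models/base.py`:

```python
import torch

# Hypothetical layouts mirroring the orderings described above; the actual
# assembly in models/base.py also handles separators, special tokens, and
# per-RVQ-level embeddings that this sketch omits.
LAYOUTS = {
    "tts": ["text", "rvq_level", "language", "prompt", "output"],
    "stt": ["audio", "language", "rvq_level", "output"],  # reversed for STT
}

def build_sequence(task: str, segments: dict) -> torch.Tensor:
    """Concatenate already-embedded segments in the order the task expects."""
    return torch.cat([segments[name] for name in LAYOUTS[task]], dim=0)
```

Here each entry in `segments` would be a `[seq_len, d_model]` tensor that has already passed through its respective embedding.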
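Since the `len` task emits the duration as base-10 digit tokens terminated by a stop token, decoding it is a simple fold. The token IDs below are assumptions for illustration only:

```python
# Assumed token IDs for illustration: 0-9 are the digits, 10 is the stop token.
STOP_TOKEN = 10

def decode_length(tokens: list) -> int:
    """Fold emitted digit tokens into an integer duration."""
    length = 0
    for token in tokens:
        if token == STOP_TOKEN:
            break
        length = length * 10 + token  # shift left one decimal place
    return length

assert decode_length([1, 2, 5, STOP_TOKEN]) == 125
```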
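Returning to `models/lora.py`: the parametrized approach mentioned there can be sketched with PyTorch's `torch.nn.utils.parametrize`, which rewrites a module's `weight` whenever it is accessed. This is a minimal sketch of the general idea, not the script's actual implementation; the class name, rank, and initialization choices are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class LoRAParametrization(nn.Module):
    """Adds a trainable low-rank delta, (B @ A) * scale, on top of a frozen weight."""
    def __init__(self, out_features, in_features, rank=8, alpha=16.0):
        super().__init__()
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero-init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, weight):
        # Invoked on every access to `.weight`; the base weight itself stays untouched.
        return weight + (self.lora_B @ self.lora_A) * self.scale

linear = nn.Linear(1024, 1024)
linear.weight.requires_grad_(False)  # freeze the original weight
parametrize.register_parametrization(linear, "weight", LoRAParametrization(1024, 1024))
```

The replace-outright alternative would instead swap each `nn.Linear` for a subclass carrying its own `lora_A`/`lora_B`; the appeal of the parametrized route is that the existing module and its weights stay in place.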