data.py
This script handles the meat of preparing the data to feed the model through the dataloader, and unfortunately accounts for quite a lot of this project's complexity.
Most of these settings live under cfg.dataset.
Dataset
The provided reference model was trained on ?k hours of audio with a mix of:
- LibriTTS-R's entire dataset
- small+medium+duplicate portions of LibriVox
- Emilia's German, French, and Japanese dataset
- 12K hours of a privately sourced corpus of 425 audiobooks
- a small portion of Emilia's English dataset
- a personal small corpus of transcribed utterances from a selection of video games
Leverage Your Own Dataset
If you already have a dataset you want to use, for example your own large corpus or a smaller set for finetuning, you can use it instead.
- Set up a venv with https://github.com/m-bain/whisperX/.
  - At the moment only WhisperX is utilized. Using other variants like faster-whisper is an exercise left to the user.
  - It's recommended to use a dedicated virtualenv specifically for transcribing, as WhisperX will break a few dependencies.
  - The following commands should work:
python3 -m venv venv-whisper
source ./venv-whisper/bin/activate
pip3 install torch torchvision torchaudio
pip3 install git+https://github.com/m-bain/whisperX/
- Populate your source voices under ./voices/{group name}/{speaker name}/.
- Run python3 -m vall_e.emb.transcribe. This will generate a transcription with timestamps for your dataset.
  - If you're interested in using a different model, edit the script's model_name and batch_size variables.
- Run python3 -m vall_e.emb.process. This will phonemize the transcriptions and quantize the audio.
  - If you're using a Descript-Audio-Codec based model, be sure to set the sample rate and audio backend accordingly.
- Run python3 -m vall_e.emb.similar. This will calculate the top-k most similar utterances for each utterance for use with sampling.
  - Doing this will help the model follow the input prompt more strongly, at the possible "cost" of the model not learning how to "infer" the target speaker AND prosody.
- Copy ./data/config.yaml to ./training/config.yaml. Customize the training configuration and populate your dataset.training list with the values stored under ./training/dataset/list.json.
  - Refer to ./vall_e/config.py for additional configuration details.
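To make the expected ./voices/{group name}/{speaker name}/ layout concrete, the sketch below walks such a folder and collects the group/speaker pairs it finds. The helper name list_speakers is hypothetical and not part of vall_e; it only illustrates how the folders are expected to be organized before transcription.

```python
from pathlib import Path

def list_speakers(voices_root):
    """Return sorted 'group/speaker' strings found under voices_root.

    Hypothetical helper (not part of vall_e): it assumes the
    ./voices/{group name}/{speaker name}/ layout described above.
    """
    root = Path(voices_root)
    speakers = []
    for group_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for speaker_dir in sorted(p for p in group_dir.iterdir() if p.is_dir()):
            speakers.append(f"{group_dir.name}/{speaker_dir.name}")
    return speakers

if __name__ == "__main__":
    voices = Path("./voices")
    if voices.is_dir():
        for entry in list_speakers(voices):
            print(entry)
```

Stray files placed directly under the root or a group folder are ignored, since only directories are treated as groups and speakers.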
Dataset Formats
Two dataset formats are supported:
- the standard way:
  - data is stored under ./training/data/{group}/{speaker}/{id}.{enc|dac} as a NumPy file, where enc is for the EnCodec/Vocos backend, and dac for the Descript-Audio-Codec backend.
  - it is highly recommended to generate metadata to speed up dataset pre-load with python3 -m vall_e.data --yaml="./training/config.yaml" --action=metadata
- using an HDF5 dataset:
  - you can convert from the standard way with the following command: python3 -m vall_e.data --yaml="./training/config.yaml" --action=hdf5 (metadata for dataset pre-load is generated alongside HDF5 creation)
  - this will shove everything into a single HDF5 file and store some metadata alongside (for now, the generated symbol map, and text/audio lengths)
  - be sure to also define use_hdf5 in your config YAML.
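As an illustration of the standard on-disk layout, the sketch below splits a path of the form {group}/{speaker}/{id}.{enc|dac} into its parts and maps the extension to the audio backend named above. The function parse_sample_path and the BACKENDS table are hypothetical names chosen for this example; they are not how vall_e itself resolves paths.

```python
from pathlib import PurePosixPath

# Extension-to-backend mapping, as described in the list above.
BACKENDS = {".enc": "EnCodec/Vocos", ".dac": "Descript-Audio-Codec"}

def parse_sample_path(path):
    """Split '.../{group}/{speaker}/{id}.{enc|dac}' into its parts.

    Hypothetical helper for illustration; not vall_e's actual loader.
    """
    p = PurePosixPath(path)
    if p.suffix not in BACKENDS:
        raise ValueError(f"unsupported extension: {p.suffix!r}")
    if len(p.parts) < 3:
        raise ValueError(f"expected group/speaker/id in path: {path!r}")
    return {
        "group": p.parts[-3],
        "speaker": p.parts[-2],
        "id": p.stem,
        "backend": BACKENDS[p.suffix],
    }

info = parse_sample_path("./training/data/LibriVox/alice/0001.enc")
print(info)
```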
Dataloader
The dataloader handles some simple yet effective features, such as:
- culling samples within a requested duration range
- grouping samples based on:
  - speakers (to keep utterances for a given speaker) and groups (to keep similar speakers within a group as defined in the dataset)
  - durations, to keep VRAM usage and throughput consistent, if requested (as training requires keeping all samples of a batch the same token length)
- further partitioning samples per GPU
- shuffling then interleaving, per the dataloader sampler settings
- saving/loading sampler states to disk
- preparing a sample in a batch with adequate data for a given task, such as:
  - picking an input prompt similar to the sampled output audio, if requested
  - picking an input prompt from the same speaker as the sample, if the above is not requested
  - preparing the input sequence for the given task (such as non-TTS tasks)
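The first few steps above (culling by duration, grouping by speaker, and partitioning per GPU) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dataloader's real code: the (speaker, sample_id, duration) tuple layout, the inclusive bounds, the point at which speakers with too few utterances are dropped, and the strided per-rank split are all choices made for this example.

```python
from collections import defaultdict

def prepare_samples(samples, duration_range, min_utterances=1):
    """Cull by duration, then group the survivors by speaker.

    `samples` is a list of (speaker, sample_id, duration_seconds) tuples;
    this tuple layout is an assumption made for the sketch.
    """
    lo, hi = duration_range
    by_speaker = defaultdict(list)
    for speaker, sample_id, duration in samples:
        if lo <= duration <= hi:
            by_speaker[speaker].append((sample_id, duration))
    # Drop speakers left with too few utterances after culling.
    return {s: u for s, u in by_speaker.items() if len(u) >= min_utterances}

def partition_for_rank(items, rank, world_size):
    """Give each GPU (rank) every world_size-th item."""
    return items[rank::world_size]

samples = [
    ("alice", "a1", 2.0),
    ("alice", "a2", 30.0),  # outside the duration range, culled
    ("bob", "b1", 4.0),
    ("bob", "b2", 5.0),
]
grouped = prepare_samples(samples, (1.0, 12.0), min_utterances=2)
paths = [sid for utts in grouped.values() for sid, _ in utts]
print(grouped)
print(partition_for_rank(paths, rank=0, world_size=2))
```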
If cfg.dataset.cache == True, the initial list of paths and duration metadata (used for sorting/bucketing) is cached through diskcache under {YAML_PATH}/.cache/{DATASET_HASH}/. To allow for seamless modifications to the loaded dataset, the DATASET_HASH relies on:
- duration range
- folders/groups in the dataset
- if using HDF5 (due to the key format differing)
Be sure to delete the resultant .cache folder, as well as the sampler.* state dicts alongside checkpoints, if you plan to modify the dataloader settings between training sessions.
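The idea of a cache key that changes whenever the relevant settings change can be sketched as below. This is a hypothetical illustration: the function dataset_hash, the JSON serialization, and the use of SHA-256 are assumptions, and the real DATASET_HASH may be derived differently.

```python
import hashlib
import json

def dataset_hash(duration_range, groups, use_hdf5):
    """Derive a stable cache key from the settings listed above.

    Hypothetical sketch: the real DATASET_HASH may combine these
    fields differently or use another hash function.
    """
    payload = json.dumps(
        {
            "duration_range": list(duration_range),
            "groups": sorted(groups),
            "use_hdf5": bool(use_hdf5),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

key = dataset_hash((1.0, 12.0), ["LibriVox/*", "Emilia/*"], use_hdf5=False)
print(f".cache/{key}/")
```

Because the key only covers these three inputs, changing any other dataloader setting would reuse a stale cache, which is why deleting the .cache folder by hand is recommended.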
Tasks
As this handles preparing the data fed into the model for training, this script needs to be aware of what tasks it should attend to, as mostly outlined under SpeechX.
This section may be covered elsewhere in the documentation, but coverage here should focus on the specifics of attending to the task, rather than what the task is.
- tts: basic and naive text-to-speech.
  - requires a text transcription, input audio prompt, and the output audio response.
- tts-c: also noted as "VALL-E Continuous"
  - this is what most other TTS solutions abide by (those that require a transcription of the input prompt)
  - this should be more accurate as it has the output adhere more strongly to the input through guidance, but doesn't seem to be necessary (to train for or run inference under).
  - naively, this requires just the text transcription and output audio response, where part of the output audio response is trimmed to serve as the input audio prompt.
  - non-naively, this requires two text transcriptions, and two output audio responses (where one of them serves as the input audio prompt).
- stt: basic and naive speech-to-text.
  - requires an input audio prompt and the output text transcription (as phonemes, unfortunately).
- ns: noise suppression.
  - requires just a text transcription and an output audio response, where the input audio prompt is just the output + noise
  - text transcription can optionally be removed to allow for training without text guidance.
- sr: speech removal.
  - requires just a text transcription and an output audio response, where the input audio prompt is just the sampled utterance + noise, and the output is just the original noise.
  - text transcription can optionally be removed to allow for training without text guidance.
- tse: target speech extraction.
  - requires a text transcription, an input audio prompt of the sampled speaker, an utterance sampled from a different speaker, and the output audio response.
  - the input prompt is appended with both the output audio and the utterance sampled from a different speaker overlaid on one another.
- cse: clean speech editing.
  - an ideal world would have phoneme-level transcriptions, but I do not have very accurate phoneme-level transcriptions.
  - to make up for this, this requires multiple samples for the prefix, the original middle, the edited portion for the middle, and the postfix sample.
  - the prefix and postfix can be randomly omitted, but keeping them in ensures better editing of speech within the middle.
  - requires four full samples.
- nse: noisy speech editing.
  - the above, but injects some noise throughout the sampled utterances.
A mystical vc for performing voice conversion is possible, but either requires a dataset to do so, or abusing an emergent property.
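To illustrate how the ns and sr tasks differ only in their target, the sketch below builds both from the same utterance and noise clip. It operates on plain lists of waveform samples for simplicity, which is an assumption: the actual dataloader works with codec codes, so mixing would involve decoding and re-encoding. The names mix_noise and build_noise_task are hypothetical, and using the scaled noise as the sr target is a choice made for this example.

```python
def mix_noise(speech, noise, noise_scale=0.25):
    """Overlay scaled noise onto speech, looping the noise if it is shorter."""
    if not noise:
        raise ValueError("noise must not be empty")
    return [s + noise_scale * noise[i % len(noise)] for i, s in enumerate(speech)]

def build_noise_task(task, speech, noise, noise_scale=0.25):
    """Return (input_prompt, output_response) waveforms for 'ns' or 'sr'."""
    noisy = mix_noise(speech, noise, noise_scale)
    if task == "ns":
        # Noise suppression: noisy input, clean speech as the target.
        return noisy, list(speech)
    if task == "sr":
        # Speech removal: noisy input, the (scaled) noise alone as the target.
        target = [noise_scale * noise[i % len(noise)] for i in range(len(speech))]
        return noisy, target
    raise ValueError(f"unsupported task: {task!r}")

speech = [0.5, -0.5, 0.25, 0.0]
noise = [0.1, -0.1]
ns_prompt, ns_target = build_noise_task("ns", speech, noise, noise_scale=0.5)
sr_prompt, sr_target = build_noise_task("sr", speech, noise, noise_scale=0.5)
print(ns_prompt, ns_target)
print(sr_prompt, sr_target)
```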
__main__
This script can be called directly to perform dataloader-related tasks.
--action=metadata
Invoking this will take processed samples (.enc for EnCodec, .dac for Descript-Audio-Codec) from {YAML_PATH}/data/, as per the YAML's cfg.dataset.{training|validation|noise} lists, and store helpful metadata under {YAML_PATH}/metadata/, to speed up dataloader preparations. Since dataloader preparations can cull based on audio durations, being able to look up a sample's duration speeds things up without needing to load the sample and read the file's metadata.
This metadata can then also be used to store similar speaker indices.
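The benefit of a duration lookup can be sketched as follows: durations are written once to a small JSON file and later read back without opening any audio. The JSON layout used here ({sample_id: {"duration": seconds}}) and the helper names are assumptions for illustration only; the metadata files vall_e actually writes may be structured differently.

```python
import json
import tempfile
from pathlib import Path

def write_metadata(path, durations):
    """Store {sample_id: {"duration": seconds}} as JSON (assumed layout)."""
    metadata = {sid: {"duration": float(d)} for sid, d in durations.items()}
    Path(path).write_text(json.dumps(metadata, indent=2), encoding="utf-8")

def load_durations(path):
    """Read durations back without touching any audio file."""
    metadata = json.loads(Path(path).read_text(encoding="utf-8"))
    return {sid: entry["duration"] for sid, entry in metadata.items()}

with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "alice.json"
    write_metadata(meta_path, {"0001": 3.2, "0002": 14.8})
    durations = load_durations(meta_path)
    # Culling now only needs the lookup table, not the samples themselves.
    in_range = [sid for sid, d in durations.items() if 1.0 <= d <= 12.0]
    print(in_range)
```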
--action=hdf5
Invoking this will take processed samples (.enc for EnCodec, .dac for Descript-Audio-Codec) from {YAML_PATH}/data/, as per the YAML's cfg.dataset.{training|validation|noise} lists, and store them within a single .h5 HDF5 file.
Additionally, this implicitly invokes --action=metadata, to create additional JSON metadata under {YAML_PATH}/metadata/, to speed up dataloader preparations.
--action=sample
Invoking this will load the dataloader, sample it, and print out the batch's contents.
This serves primarily for debugging purposes during development, and should not be necessary for the end user.
--action=validate
Invoking this will process the dataset to check for any phonemes missing from the tokenizer (as defined under cfg.tokenizer).
Any missing phonemes will be printed through logger to make mending the tokenizer dict easier.
This serves primarily for debugging purposes during development, and should not be necessary for the end user. However, additional languages may emit additional IPA symbols through phonemizer, so those training additional languages should take care to validate for missing phonemes before training, to avoid headaches.
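The check itself boils down to a set difference between the symbols seen in the dataset and the tokenizer's vocabulary. The sketch below is a simplified stand-in rather than the script's real logic: find_missing_phonemes is a hypothetical name, and it treats every character as one phoneme symbol, which a real tokenizer may not do.

```python
def find_missing_phonemes(transcriptions, vocab):
    """Return phoneme symbols seen in the dataset but absent from vocab.

    Assumes one symbol per character, which is a simplification;
    a real tokenizer may merge or split symbols differently.
    """
    seen = set()
    for phonemes in transcriptions:
        seen.update(phonemes)
    return sorted(seen - set(vocab))

vocab = {"h", "ə", "l", "o", "ʊ", " "}
transcriptions = ["həloʊ", "hɛlo"]
missing = find_missing_phonemes(transcriptions, vocab)
print(missing)
```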
cfg.dataset
This entry in the config YAML handles knobs and features related to the dataloader. This is defined as Dataset in ./vall_e/config.py.
- training: list of entries to populate the training dataset with. Wildcards are accepted, such as LibriVox/* to easily load all speakers within a group, without needing to define them individually.
- validation: the above, but for the validation dataset.
- noise: the above, but for any noise that may be sampled during dataloader sampling. Text is not required for this dataset.
- speaker_name_getter: a lambda function to evaluate, to retrieve the speaker name from a given path string.
- speaker_group_getter: a lambda function to evaluate, to retrieve the speaker's associated group from a given path string.
- speaker_languages: Deprecated. This is a dict that maps language codes to a list of speaker groups, for when the language code was not stored alongside a sample's data.
- use_hdf5: use {YAML_PATH}/{cfg.dataset.hdf5_name} to sample data from, rather than individual files on disk.
- hdf5_name: filename (or path?) to the HDF5 dataset file to load, if the above is requested.
- hdf5_flag: flag to open the above HDF5 file under. By default this is "a" to allow writing, as it's necessary for HDF5 creation, but it will automatically be set to "r" under distributed settings.
- use_metadata: references generated metadata instead of loading samples individually to acquire metadata.
- validate: cull samples that do not fall within the requested cfg.dataset.duration_range.
- workers: number of worker processes to handle dataloading under PyTorch.
- cache: use diskcache when requested to not require subsequent processing. This handles all diskcache requests throughout the program if requested, but should only really be used under this script.
- min_utterances: minimum number of utterances required to treat a speaker as valid.
- duration_range: a list of two values denoting the acceptable duration range for a sample to be valid for the dataloader.
- sample_type: type of sampler to use. Currently accepts path (an epoch is all paths in the dataset, and each index maps to each sample) or speaker (an epoch is all speakers in the dataset, and each index maps to each speaker).
- sample_order: order to keep the dataloader sample. Currently accepts interleaved (tries to balance per speaker) and duration (orders by duration to keep throughput and VRAM usage consistent).
- sample_shuffle: shuffles the dataloader sampler.
- sample_max_duration_batch: the maximum total duration a batch can be. Values > 0 will enable batch sampling, where the dataloader sampler returns batches of batches.
  - This only works under sample_order=duration and sample_type=path, and should raise an exception for any other configuration.
- prompt_duration_range: a list of two values to denote the range a sample's input prompt should be. This keeps the model trained for input prompt durations within these, and a little extra sometimes works without training for it.
- prompt_max_samples: maximum number of utterances to sample for an input prompt to combine, if needed to fill the above duration window.
- prompt_continuous_utterance_p: probability for a sample's input prompt to instead be the output prompt, and prepare the sample under "continuous" mode.
- prompt_similar_p: probability to use a sample's most similar utterance as the input prompt, rather than randomly picking another utterance of the same speaker.
  - This requires adequate metadata to be available to store the top-K similar indices.
- prompt_similar_top_k: use the top-k candidates for the above sampling.
- prompt_similar_top_k_offset: the above, but an offset (as in it will not use the top-K-offset most similar utterances).
- prompt_inject_noise: inject some noise in a sample's input prompt. Will harm dataloader throughput, as it requires re-encoding the audio.
- resps_max_samples: maximum utterances to use for the sample's input text and output response audio.
- resps_append_p: probability to append additional utterances to the sample.
- resps_pad_silence_p: probability to pad the output response audio with silence. Does not require re-encoding, unless requested through reencode_on_concat.
- tasks_list: list of task names a sample can be.
  - Currently supports: tts, stt, tts-c, ns, sr, tse, nse, cse
- reencode_on_concat: if enabled, audio will be decoded to a raw waveform, concatenated, then re-encoded, instead of naively concatenating EnCodec codes.
  - This isn't necessary, as naively concatenating introduces only trivial inaccuracies.
- reencode_device: device to load EnCodec within the dataloader.
  - technically only cpu should be supported, as loading models in dataloaders causes problems?
- noise_scale: multiplier to the noise when applying noise. Lower numbers keep it quieter.
- retokenize_text: if the text/phoneme transcription is available in the metadata, use that to re-tokenize instead of relying on the stored tokens themselves.
  - This is helpful if you modify the tokenizer dict in post, but do not want to re-process the dataset to modify the tokenized phonemes.
- _frames_per_second: overrides the internal tokens-per-second-of-audio ratio. Should never require modifying.
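To give a feel for what sample_max_duration_batch implies when combined with sample_order=duration and sample_type=path, the sketch below packs duration-sorted samples into batches whose summed duration stays under a cap. This is a hypothetical illustration, not the sampler vall_e ships: batch_by_duration is an invented name, and the real sampler may budget batches differently (for example by padded length rather than summed duration).

```python
def batch_by_duration(samples, max_duration):
    """Pack duration-sorted samples into batches capped by total duration.

    `samples` is a list of (path, duration_seconds) tuples. This is a
    hypothetical sketch, not the sampler vall_e actually ships.
    """
    if max_duration <= 0:
        raise ValueError("max_duration must be > 0 to enable batch sampling")
    batches, current, total = [], [], 0.0
    for path, duration in sorted(samples, key=lambda s: s[1]):
        if duration > max_duration:
            continue  # cannot fit in any batch
        # Start a new batch once adding this sample would exceed the cap.
        if current and total + duration > max_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(path)
        total += duration
    if current:
        batches.append(current)
    return batches

samples = [("a", 2.0), ("b", 9.0), ("c", 3.0), ("d", 4.0), ("e", 6.0)]
batches = batch_by_duration(samples, max_duration=10.0)
print(batches)
```

Sorting first keeps samples of similar length together, so short utterances end up in larger batches and long ones in smaller batches, which is what keeps VRAM usage roughly level.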