vall-e/docs/emb.md
2024-11-05 16:11:01 -06:00

3.3 KiB

emb/*

This folder contains scripts to handle the text and audio data that goes in and out of the model, as well as preparing data for the dataset.

The emb name is a relic of the original implementation used.

g2p.py

This script handles taking text of a given language, and phonemizing into IPAs.

  • This is mainly an abstraction to phonemizer.

For Japanese, text is coerced through pykakasi into kana, then phonemized, as phonemizer does not like kanji.

By default, espeak is used as the backend, but other backends can be passed through encode.

By default, punctuation, stress markers, and stripping are enabled by default, but can be disabled.

To avoid memory leaking through phonemizer, backends and instances are cached for further reuse.

qnt.py

This script handles taking audio waveforms and encoding it as code tokens to run through the model, and code tokens outputted from the model and decoding it into raw waveforms.

  • This is mainly an abstraction to the underlying quantized audio models.

Additionally, audio manipulation helper functions like trim and concat are available.

The audio backend is dependent on the model used, but by default encodec is the default backend with a sample rate of 24khz.

  • if requested, vocos is used as the decoding model, but EnCodec is still used to encode audio.

Audio does not need to be resampled and downmixed, as it should already be handled when being fed to the encode functions.

Audio Backends

For audio backends:

  • encodec: a tried-and-tested EnCodec to encode/decode audio.
  • vocos: a higher quality EnCodec decoder.
    • encoding audio will use the encodec backend automagically, as there's no EnCodec encoder under vocos
  • descript-audio-codec: boasts better compression and quality, but has issues with model convergence.
    • models at 24KHz + 8kbps will NOT converge in any manner.
    • models at 44KHz + 8kbps seems harder to model its "language", and the NAR side of the model suffers greatly.

transcribe.py

This script handles taking raw input audio, and outputting adequate metadata containing transcriptions of said audio through whisperX.

The process maintains slices whisperX thinks its best per the segments outputted.

Refer to the __main__'s arguments for usage details.

process.py

This script handles taking raw input audio and its transcribed metadata, and outputs encoded audio (NumPy) files containing encoded audio and associated metadata.

This process can utilize sliced segments within the transcription metadata, or use the entire file's audio instead for a given utterance.

Refer to the __main__'s arguments for usage details.

similar.py

This script handles taking either raw input audio, or processed encoded audio, and determines the top-K similar utterances for each sample for a given speaker (or dataset).

When processing a dataset, this requires already having accompanying metadata generated through vall_e.data --action=metadata --yaml=./your/training/config.yaml.

Refer to the __main__'s arguments for usage details.