# `metrics.py`

This file provides helper functions for computing objective metrics, such as word-error rate (WER), character-error rate (CER), phoneme-error rate (PER), and speaker similarity (SIM-O).

## WER / CER

Word-error rate (WER) is computed by transcribing the output audio for the requested input, and comparing that transcription against the target transcription.

- The transcription is produced with openai/whisper-large-v3, then cleaned up and normalized to account for inconsistencies between transcriptions and the nuances of English (a minimal sketch follows this list).
- Languages without spaces between words (e.g. Chinese, Japanese) should not rely on WER, and should instead rely on CER.
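
The cleanup step looks roughly like the following. This is a minimal sketch assuming the Hugging Face `transformers` pipeline and its `BasicTextNormalizer`; the actual pipeline in `metrics.py` may differ.

```python
# Minimal sketch: transcribe audio with whisper, then normalize the text.
# The model name comes from the text above; the normalizer choice here
# is an assumption for illustration.
from transformers import pipeline
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
normalize = BasicTextNormalizer()

def transcribe(path: str) -> str:
    text = asr(path)["text"]
    return normalize(text)  # lowercases and strips punctuation inconsistencies
```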

Character-error rate (CER) does the same thing as WER, but on a character basis rather than a word basis.
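
Both rates reduce to a Levenshtein (edit) distance over tokens: words for WER, characters for CER. A self-contained sketch (`metrics.py` may instead defer to a library for this):

```python
# Minimal sketch: WER/CER as Levenshtein distance over tokens.
# Returned un-normalized (raw edit counts) to match the note on
# normalization below; divide by the reference length for the
# conventional normalized rate.
def levenshtein(ref: list, hyp: list) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> int:
    return levenshtein(reference.split(), hypothesis.split())

def cer(reference: str, hypothesis: str) -> int:
    return levenshtein(list(reference), list(hypothesis))
```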

Phoneme-error rate (PER) does the same thing as CER, but on the phonemized transcription instead. As this is a speech model, this metric better reflects what the model actually emits than the prior metrics, but it isn't a universal metric for comparison, as most models don't report it.
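
In other words, PER is just the CER of the phoneme strings. A sketch, assuming the `phonemizer` package with the espeak backend (the repo's own phonemization path may differ), reusing `cer()` from above:

```python
# Minimal sketch: PER as character-level error rate over phonemized text.
# The `phonemizer` package and espeak backend are assumptions here.
from phonemizer import phonemize

def per(reference: str, hypothesis: str, language: str = "en-us") -> int:
    ref_ph = phonemize(reference, language=language, backend="espeak")
    hyp_ph = phonemize(hypothesis, language=language, backend="espeak")
    return cer(ref_ph, hyp_ph)  # reuse the character-level rate from above
```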

All rates are left un-normalized, because I think that's the right way to go about it? Papers aren't clear on whether they normalize their rates, but the error rates come out even more unusually low when normalized.
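
To make the distinction concrete (assuming "normalized" here means dividing the edit distance by the reference length): an edit distance of 3 against a 12-word reference gives a normalized WER of 3 / 12 = 0.25, whereas the un-normalized figure is simply the raw count, 3.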

## SIM-O

Speaker similarity (SIM-O) is computed by obtaining a speaker embedding for each utterance (the output audio and the input prompt), then computing the cosine similarity between those two embeddings.

These embeddings are obtained through a finetune of WavLM-large geared towards speaker verification.
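
The comparison itself is just a cosine similarity. A minimal sketch, where `get_embedding` is a placeholder for the speaker-verification WavLM-large finetune (the exact checkpoint and feature pipeline are not specified here):

```python
# Minimal sketch: SIM-O as cosine similarity between speaker embeddings.
# `get_embedding` stands in for the WavLM-large speaker-verification
# finetune described above; checkpoint and preprocessing are assumptions.
import torch
import torch.nn.functional as F

def sim_o(output_wav: torch.Tensor, prompt_wav: torch.Tensor, get_embedding) -> float:
    emb_out = get_embedding(output_wav)  # (D,) embedding of the generated audio
    emb_ref = get_embedding(prompt_wav)  # (D,) embedding of the input prompt
    return F.cosine_similarity(emb_out, emb_ref, dim=-1).item()
```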