# `metrics.py`

This file provides helper functions for computing objective metrics, such as word-error rate (WER), character-error rate (CER), phoneme-error rate (PER), and speaker similarity (SIM-O).

## WER / CER

Word-error rate (WER) is computed by transcribing the output audio for the requested input, and comparing that transcription against the target transcription.

- The transcription is produced with openai/whisper-large-v3, then cleaned up and normalized to account for inconsistencies between transcriptions and the nuances of English (a minimal sketch follows this list).
- Languages without spaces between words (e.g. Chinese, Japanese) should not rely on WER, and should instead rely on CER.
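
The cleanup step looks roughly like the following. This is a minimal sketch assuming the Hugging Face `transformers` pipeline and its `BasicTextNormalizer`; the actual pipeline in `metrics.py` may differ.

```python
# Minimal sketch: transcribe audio with whisper, then normalize the text.
# The model name comes from the text above; the normalizer choice here
# is an assumption for illustration.
from transformers import pipeline
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
normalize = BasicTextNormalizer()

def transcribe(path: str) -> str:
    text = asr(path)["text"]
    return normalize(text)  # lowercases and strips punctuation inconsistencies
```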

Character-error rate (CER) does the same thing as WER, but on a character basis rather than a word basis.
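
Both rates reduce to a Levenshtein (edit) distance over tokens: words for WER, characters for CER. A self-contained sketch (`metrics.py` may instead defer to a library for this):

```python
# Minimal sketch: WER/CER as Levenshtein distance over tokens.
# Returned un-normalized (raw edit counts) to match the note on
# normalization below; divide by the reference length for the
# conventional normalized rate.
def levenshtein(ref: list, hyp: list) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> int:
    return levenshtein(reference.split(), hypothesis.split())

def cer(reference: str, hypothesis: str) -> int:
    return levenshtein(list(reference), list(hypothesis))
```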

Phoneme-error rate (PER) does the same thing as CER, but on the phonemized transcription instead. As this is a speech model, this metric better reflects what the model actually emits than the prior metrics, but it isn't a universal metric for comparison, as most models don't report it.
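
In other words, PER is just the CER of the phoneme strings. A sketch, assuming the `phonemizer` package with the espeak backend (the repo's own phonemization path may differ), reusing `cer()` from above:

```python
# Minimal sketch: PER as character-level error rate over phonemized text.
# The `phonemizer` package and espeak backend are assumptions here.
from phonemizer import phonemize

def per(reference: str, hypothesis: str, language: str = "en-us") -> int:
    ref_ph = phonemize(reference, language=language, backend="espeak")
    hyp_ph = phonemize(hypothesis, language=language, backend="espeak")
    return cer(ref_ph, hyp_ph)  # reuse the character-level rate from above
```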

All rates are left un-normalized, because I think that's the right way to go about it? Papers aren't clear on whether they normalize their rates, but the error rates come out even more unusually low when normalized.
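
To make the distinction concrete (assuming "normalized" here means dividing the edit distance by the reference length): an edit distance of 3 against a 12-word reference gives a normalized WER of 3 / 12 = 0.25, whereas the un-normalized figure is simply the raw count, 3.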

## SIM-O

Speaker similarity (SIM-O) is computed by obtaining a speaker embedding for each utterance (the output audio and the input prompt), then computing the cosine similarity between those two embeddings.

These embeddings are obtained through a finetune of WavLM-large geared towards speaker verification.
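
The comparison itself is just a cosine similarity. A minimal sketch, where `get_embedding` is a placeholder for the speaker-verification WavLM-large finetune (the exact checkpoint and feature pipeline are not specified here):

```python
# Minimal sketch: SIM-O as cosine similarity between speaker embeddings.
# `get_embedding` stands in for the WavLM-large speaker-verification
# finetune described above; checkpoint and preprocessing are assumptions.
import torch
import torch.nn.functional as F

def sim_o(output_wav: torch.Tensor, prompt_wav: torch.Tensor, get_embedding) -> float:
    emb_out = get_embedding(output_wav)  # (D,) embedding of the generated audio
    emb_ref = get_embedding(prompt_wav)  # (D,) embedding of the input prompt
    return F.cosine_similarity(emb_out, emb_ref, dim=-1).item()
```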