This file provides helper functions for computing objective metrics, such as word-error rate (WER), character-error rate (CER), phoneme-error rate (PER), and speaker similarity (SIM-O).
* The transcription is cleaned up and normalized to account for inconsistencies between `openai/whisper-large-v3` transcriptions and the nuances of English.
* Languages without spaces between words (e.g. Chinese, Japanese) should not rely on WER, and should instead rely on CER (a sketch of both follows this list).
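A minimal sketch of how WER/CER over normalized text could be computed, assuming the `jiwer` package and openai-whisper's `EnglishTextNormalizer`; the actual helpers in this file may differ:

```python
# Sketch only: assumes the `jiwer` and `openai-whisper` packages.
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

def wer(reference: str, hypothesis: str) -> float:
    # Normalize both sides so stylistic choices in the Whisper transcription
    # (casing, punctuation, number spelling) don't count as errors.
    return jiwer.wer(normalizer(reference), normalizer(hypothesis))

def cer(reference: str, hypothesis: str) -> float:
    # Character-level variant; the right choice for languages without
    # spaces between words (e.g. Chinese, Japanese).
    return jiwer.cer(normalizer(reference), normalizer(hypothesis))
```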
Phoneme-error rate (PER) is computed the same way as CER, but over the phonemized transcription instead (sketched below). Since this is a speech model, PER is a more faithful metric than the text-based ones, but it isn't a universal point of comparison, as most models don't report it.
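A minimal sketch of PER, assuming the `phonemizer` package with the espeak backend; the concrete phonemizer used here may differ:

```python
# Sketch only: PER = CER over phonemized text. `phonemizer` with the
# espeak backend is an assumption, not necessarily what this file uses.
import jiwer
from phonemizer import phonemize

def per(reference: str, hypothesis: str) -> float:
    ref_ph = phonemize(reference, language="en-us", backend="espeak", strip=True)
    hyp_ph = phonemize(hypothesis, language="en-us", backend="espeak", strip=True)
    # Identical to CER, but the "characters" are now phonemes.
    return jiwer.cer(ref_ph, hyp_ph)
```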
All rates are un-normalized, as I think that's the right way to go about it; papers aren't clear on whether they normalize, but the error rates come out even more unusually low with normalization applied.
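"Normalized" is ambiguous here; one plausible reading is the choice of denominator for the edit distance. A purely illustrative sketch under that assumption:

```python
# Illustration only: one possible meaning of "(un-)normalized" rates.
# Levenshtein distance over tokens, then two choices of denominator.
def levenshtein(a: list[str], b: list[str]) -> int:
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

ref, hyp = "the cat sat".split(), "the cat sat down here".split()
dist = levenshtein(ref, hyp)                 # 2 insertions
unnormalized = dist / len(ref)               # 2/3 ~ 0.67, can exceed 1.0
normalized = dist / max(len(ref), len(hyp))  # 2/5 = 0.40, bounded to [0, 1]
```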
Speaker similarity (SIM-O) is computed by obtaining a speaker embedding for each of the two audios (the output audio and the input prompt), then taking the cosine similarity between the two embeddings.
These embeddings come from a fine-tune of WavLM-large geared towards speaker verification.
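A minimal sketch of SIM-O, with `microsoft/wavlm-base-plus-sv` standing in for the speaker-verification fine-tune of WavLM-large; the actual checkpoint and loading code may differ:

```python
# Sketch only: the checkpoint name is an assumption, as the file may load a
# different WavLM-large speaker-verification fine-tune.
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

@torch.no_grad()
def sim_o(prompt_wave, output_wave, sample_rate: int = 16_000) -> float:
    # Embed both waveforms (1-D arrays: input prompt, generated output) in one batch.
    inputs = extractor([prompt_wave, output_wave], sampling_rate=sample_rate,
                       return_tensors="pt", padding=True)
    embeddings = model(**inputs).embeddings
    # Cosine similarity between the two speaker embeddings.
    return F.cosine_similarity(embeddings[0], embeddings[1], dim=-1).item()
```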