forked from ecker/tortoise-tts
After training a similar model for a different purpose, I realized that this model is faulty: its contrastive loss only pays attention to high-frequency details that do not contribute meaningfully to output quality. I validated this by comparing no-CVVP output against a baseline using tts-scores and found no difference.

| Name | | |
|---|---|---|
| data | ||
| models | ||
| utils | ||
| __init__.py | ||
| api.py | ||
| do_tts.py | ||
| eval.py | ||
| get_conditioning_latents.py | ||
| is_this_from_tortoise.py | ||
| read.py | ||