XTTS-2 released #449


drakononov · 2023-11-13T16:48:06Z


Hey! I saw your discussions about XTTS-v1.
It felt like Coqui's implementation was a bit of a downgrade compared to your work,

so they've worked on the bugs :)
https://huggingface.co/coqui/XTTS-v2

Let's analyze their changelog:
Features

- Supports 16 languages. (no questions there)
- Voice cloning with just a 6-second audio clip. (The model's cloning capability is really strong)
- Emotion and style transfer by cloning. (Sounds interesting, but I don't think it's anything magic)
- Cross-language voice cloning. (no questions there)
- Multi-lingual speech generation. (no questions there)
- 24khz sampling rate. (Huh! Someone told them about BigVGAN)

Updates over XTTS-v1

- 2 new languages: Hungarian and Korean (easy, ok)
- Architectural improvements for speaker conditioning. (Hmm....)
- Enables the use of multiple speaker references and interpolation between speakers. (Nothing new)
- Stability improvements. (Hmmmm.....)
- Better prosody and audio quality across the board. (Huh...?)
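
For what it's worth, "multiple speaker references and interpolation between speakers" can be pictured as plain vector arithmetic on speaker embeddings. Here is a minimal sketch, assuming each reference clip has already been turned into a fixed-size embedding vector. The function names are hypothetical and this is not Coqui's actual code, just the generic idea:

```python
from typing import List, Sequence

Vector = List[float]


def mean_embedding(refs: Sequence[Vector]) -> Vector:
    """Combine several reference embeddings of ONE speaker by averaging them.

    Assumption: every reference clip was already encoded into a vector
    of the same length by some speaker encoder (not shown here).
    """
    if not refs:
        raise ValueError("need at least one reference embedding")
    dim = len(refs[0])
    if any(len(r) != dim for r in refs):
        raise ValueError("all embeddings must have the same length")
    return [sum(r[i] for r in refs) / len(refs) for i in range(dim)]


def interpolate(a: Vector, b: Vector, t: float) -> Vector:
    """Linear interpolation between two speakers.

    t = 0.0 gives speaker a, t = 1.0 gives speaker b,
    values in between give a blend of the two voices.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


# Toy 2-dimensional "embeddings", purely illustrative values
speaker_a = mean_embedding([[1.0, 2.0], [3.0, 4.0]])
speaker_b = [0.0, 0.0]
blend = interpolate(speaker_b, speaker_a, 0.5)
```

Whether XTTS-v2 blends references at the embedding level like this, or somewhere else in the conditioning path, is not something the changelog says.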

So...
Somehow they made their model's cloning abilities better. How?
A larger variety of speakers in the dataset? Or did they do something special with conditioning?

Stability:
Optimised generation parameters?

Architectural improvements for speaker conditioning?
I don't even have an idea of what they've done there.

