XTTS-2 released #449

Open
opened 2023-11-13 16:48:06 +00:00 by drakononov · 0 comments

Hey! I saw your discussions about XTTS-v1.
It felt like Coqui's implementation was a bit of a downgrade compared to your work,

so they have since worked on the bugs :)
https://huggingface.co/coqui/XTTS-v2

Let's analyze their changelog:
Features

  • Supports 16 languages. (no questions there)
  • Voice cloning with just a 6-second audio clip. (The model's cloning capability is really strong)
  • Emotion and style transfer by cloning. (Sounds interesting, but I don't think it's anything magical)
  • Cross-language voice cloning. (no questions there)
  • Multi-lingual speech generation. (no questions there)
  • 24khz sampling rate. (Huh! Someone told them about BigVGAN)

Updates over XTTS-v1

  • 2 new languages; Hungarian and Korean (easy, ok)
  • Architectural improvements for speaker conditioning. (Hmm....)
  • Enables the use of multiple speaker references and interpolation between speakers. (Nothing new)
  • Stability improvements. (Hmmmm.....)
  • Better prosody and audio quality across the board. (Huh...?)
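
To make the "multiple speaker references and interpolation between speakers" item concrete, here is a minimal sketch of the general idea: average the embeddings of several reference clips, then blend two speakers linearly. This is an assumption about how such a feature could work, not Coqui's actual code; the function names and the plain-list embeddings are hypothetical.

```python
from typing import List, Sequence


def average_references(embs: Sequence[Sequence[float]]) -> List[float]:
    """Mean of several reference-clip embeddings for one speaker (hypothetical helper)."""
    if not embs:
        raise ValueError("need at least one embedding")
    dim = len(embs[0])
    if any(len(e) != dim for e in embs):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(e[i] for e in embs) / len(embs) for i in range(dim)]


def interpolate_speakers(
    emb_a: Sequence[float], emb_b: Sequence[float], alpha: float
) -> List[float]:
    """Linear blend of two speaker embeddings: alpha=0 gives emb_a, alpha=1 gives emb_b."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real conditioning vectors are much larger.
    speaker_a = average_references([[1.0, 3.0], [3.0, 5.0]])
    speaker_b = [0.0, 0.0]
    print(interpolate_speakers(speaker_a, speaker_b, 0.5))
```

Whether XTTS-v2 blends at the embedding level like this, or somewhere else in the conditioning path, is exactly the open question.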

So...
Somehow they improved the model's cloning ability. How?
A larger variety of speakers in the dataset? Or did they do something special with the conditioning?

Stability
Optimised generation parameters?

Architectural improvements for speaker conditioning?
I have no idea what they did there.

Reference: mrq/ai-voice-cloning#449