Amazing Suno-ai generates laugh, sigh and other expressions #213

Open
opened 2023-04-21 09:29:01 +07:00 by pheonis · 2 comments

Did you check out the recently launched https://github.com/suno-ai/bark

It generates laughs and other emotions too!

Here are some examples: https://suno-ai.notion.site/Bark-Examples-5edae8b02a604b54a42244ba45ebc2e2

It would be great if we could implement something like this in here.
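For reference, Bark drives these expressions through bracketed tags in the text prompt (e.g. `[laughs]`, `[sighs]`, per its README). A minimal sketch of how that looks, assuming a working `bark` install; the prompt text and output filename here are just illustrative:

```python
# Minimal sketch of generating non-verbal expressions with suno-ai/bark.
# The bracketed tags come from Bark's README; the prompt text is illustrative.
text_prompt = "Well, that went better than expected [laughs]. Okay [sighs], moving on."

def synthesize(prompt: str, out_path: str = "bark_out.wav") -> str:
    """Generate audio for `prompt` with Bark and write it as a WAV file."""
    # Imports deferred so the prompt can be inspected without bark installed.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()                 # downloads/caches the pretrained checkpoints
    audio = generate_audio(prompt)   # numpy float array sampled at SAMPLE_RATE (24 kHz)
    write_wav(out_path, SAMPLE_RATE, audio)
    return out_path
```
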


The demos are pretty nice:

  • it does that thing I remember TorToiSe doing where it'll have non-verbal utterances like slight lip smacks and breaths.
  • it boasts an already-pretrained model, so none of the annoying pains of getting one together.
  • the multilingual-ness of VALL-E X is also apparent (I haven't listened to it yet, though).
  • the other features it touts are also neat and might come up eventually.
  • the samples seem pretty simple to get something generated, but I imagine much of the bulk for TorToiSe is the knobs (especially since inferencing with VALL-E is very streamlined because of a lack of knobs).

However:

  • I feel the actual quality of the waveform is a little lacking, but I'm probably extremely biased from all the bandaids of TorToiSe / listening to a bunch of VALL-E output (it has its own slew of quirks). I imagine it's just a matter of the sample rates (COPIUM).
  • muh ethics, as it says it can only clone from the provided synthetic voices, but I'm sure there'll be ways to hack it to allow custom voices.
  • perusing through the issues, it seems it needs more than 8GiB of VRAM. I would say I could work my magic on getting it onto smaller cards, but that magic was for TorToiSe training, not inference.

~~The last two points are the deal breaker, as cloning is the main intent of AIVC. I *suppose* I could integrate it and wait around until someone cobbles together a way to use user-provided voices.~~ Actually, [this fork](https://github.com/JonathanFly/bark) seems to be much more capable. I'll peruse through that later and see if I have any qualms with it. If I do slot it in, I'd use that fork.


Tortoise can laugh and emote if you're able to force it. Try: lower the sample count, raise the diffusion iterations, enable conditioning-free mode, and tag the text with [laughing.] HA-HA-HA-HA-HA-HA-HA-HA-HA-HA-HA-HA, [laughing] HAHAHAHAHAHAHAHAHA, etc.

Here's a sample of the voice: https://vocaroo.com/1nPeQwvNP4T9
Sample 2: https://vocaroo.com/16WoYldCQR9a

Laugh: https://vocaroo.com/18ktViQms21z
Laugh 2: https://vocaroo.com/1nxNPrqh38Zk
Laugh 3: https://vocaroo.com/1bDqP3SzuTos
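The recipe above maps onto `tortoise.api.TextToSpeech.tts()` keyword arguments. A hedged sketch, assuming a working tortoise-tts install; the exact values (16 samples, 200 iterations) and the `"random"` voice are illustrative, not a recommendation:

```python
# Sketch of coaxing a laugh out of TorToiSe per the settings in the comment
# above: fewer autoregressive samples, more diffusion iterations,
# conditioning-free enabled, and an onomatopoeic "HA-HA" transcript.
laugh_text = "[laughing] HA-HA-HA-HA-HA-HA-HA-HA!"

def synthesize_laugh(voice: str = "random", out_path: str = "laugh.wav") -> str:
    # tortoise-tts imports deferred so the transcript can be inspected
    # without the library installed.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voices

    tts = TextToSpeech()
    samples, latents = load_voices([voice])
    audio = tts.tts(
        laugh_text,
        voice_samples=samples,
        conditioning_latents=latents,
        num_autoregressive_samples=16,   # "lower the sample count"
        diffusion_iterations=200,        # "raise iterations"
        cond_free=True,                  # "enable conditioning-free"
    )
    # TorToiSe outputs 24 kHz audio.
    torchaudio.save(out_path, audio.squeeze(0).cpu(), 24000)
    return out_path
```
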

Reference: mrq/ai-voice-cloning#213