American imposter #361

Open
opened 2023-08-31 17:18:28 +00:00 by MrMustachio43 · 5 comments

My dataset of 300+ audio files is of a British dude. There is ZERO American in my dataset. It feels like I'm playing a board game, rolling dice to see if I get the trained voice or this American dude. When it does work it sounds great, though. I feel like the seed is the most controlling factor; is there any way I can make this more predictable?

For reference, I've attached my settings.
Thank you :)
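On the seed question: a fixed seed makes the sampling reproducible, so the same seed plus the same settings should land on the same "roll of the dice" each run. A minimal pure-Python illustration of that idea (this is not the TorToiSe API; `roll` is just a stand-in for any seeded sampling step):

```python
import random

def roll(seed):
    """Draw a few pseudo-random numbers from a seeded generator.
    Same seed -> identical sequence, which is why pinning the seed
    makes generation repeatable."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(3)]

assert roll(42) == roll(42)  # same seed, same outputs
assert roll(42) != roll(43)  # different seed, different outputs
```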

Owner

I would increase the temperature; 0.2 is a bit low for TorToiSe. I imagine that's what's happening because, as I recall, the base model will erase any non-American accents.

Seconded... >=0.7 will approximate your speaker better.

Besides that, it really depends on the quality of your training data; it's crucial. The less consistent your dataset, the more variable your output is going to sound.
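A rough illustration of why a very low temperature collapses onto one voice: temperature rescales the logits before softmax, and at 0.2 the distribution becomes nearly one-hot, so the model keeps picking its most dominant (American-sounding) tokens. This is plain softmax math, not TorToiSe's actual sampling code:

```python
import math

def sample_probs(logits, temperature):
    """Softmax with temperature. Lower T sharpens the distribution
    toward the single highest-scoring option; higher T flattens it,
    letting less-dominant (speaker-specific) options survive."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
low = sample_probs(logits, 0.2)   # near one-hot: top option dominates
high = sample_probs(logits, 0.7)  # flatter: other options get real mass
```

With these toy logits, the top option gets ~99% of the probability at T=0.2 but only ~74% at T=0.7, which matches the ">=0.7" advice above.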


Aside from reading through other issues, does anyone have a guide to fine-tuning? One poster here generated very good English with a Korean accent, seemingly from a low number of samples. Another issue seemed to allude to a fine-tuning guide, but the information in it wasn't as helpful to me.

I am also trying to generate English with an Eastern European accent. Now that I've read through more, maybe I didn't even change the generation settings from the one-shot TorToiSe model to the fine-tuned model. I need to play around with the chunks setting as well; that seems to have been hinted at. My epoch count was also low; maybe I need to bump it up a bit.

It seems tortoise-tts is a bit biased toward male voices. But I did another one-shot with a few samples of a female voice and it was able to get the voice accurate enough. Random generation seems to yield a male voice most of the time.

Having been introduced to this project from TorToiSe within the past couple of weeks, it seems mrq is focusing on the VALL-E backend and generating a base model based on VALL-E.

I am just a user and ecstatic to actually get training and generation running. The GUI front-end is great even for just playing around with base TorToiSe.


I set my voice chunks to 512 or 256, but I think the key is temperature: set it to 1 or very high, or at least 0.7 as someone said above. I realized this while watching people use the tool on YouTube.

I actually have a fairly decent model of a female voice with an Eastern European accent. Is it 100%? Nope, but it's 75 to 80% accurate, all based on about 5 minutes of speech.

Also, I think you need to click (Re)compute voice latents after training and before generation?
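For intuition on what the chunks setting and latent recomputation are doing: the conditioning audio is split into chunks, each chunk is encoded to a latent, and the latents are combined (roughly, averaged). A toy sketch of that shape; function names and the averaging step here are illustrative stand-ins, not the repo's actual code:

```python
def chunk(samples, n_chunks):
    """Split conditioning audio into n_chunks roughly equal pieces,
    a toy stand-in for the 'voice chunks' setting."""
    size = max(1, -(-len(samples) // n_chunks))  # ceiling division
    return [samples[i:i + size] for i in range(0, len(samples), size)]

def average_latents(latents):
    """Combine per-chunk conditioning latents by averaging. Recomputing
    after fine-tuning matters because latents encoded by the old model
    won't match the newly fine-tuned one."""
    dim = len(latents[0])
    return [sum(vec[d] for vec in latents) / len(latents) for d in range(dim)]
```

So more chunks means finer-grained conditioning over the same audio, and clicking (Re)compute voice latents refreshes these values against the current model.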


@MrMustachio43
What I'm going to say may not seem intuitive, because I don't know that it's explicitly written up in any of the documentation. You need to go to Settings and load the fine-tuned model; it is not automatically loaded when you select a voice. You could still be using the default "one-shot" model. Maybe this is written somewhere in the forum comments.

You could still be using the untrained / one-shot model. One-shot provides about 40 to 50% fidelity.

Using the fine-tuned model, I am amazed at what 3 minutes of audio can add. It seems to get you to about 75% of a voice clone.
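The pitfall above, generating with the base model because the fine-tuned checkpoint was never loaded, can be made explicit with a small check. The file names and layout here are hypothetical, not the repo's actual paths; the point is that selecting a voice alone does not swap the model in:

```python
from pathlib import Path

def pick_autoregressive_model(finetune_ckpt, base_ckpt):
    """Use the fine-tuned checkpoint only if it actually exists on disk;
    otherwise fall back to the base (one-shot) model. Both paths are
    illustrative placeholders."""
    ckpt = Path(finetune_ckpt)
    return ckpt if ckpt.exists() else Path(base_ckpt)
```

If generations sound like the stock American voice, it's worth verifying the equivalent of this check in the Settings tab: that the model actually in use is the fine-tuned one, not the default.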

Reference: mrq/ai-voice-cloning#361