Need help: trained voice just sounds like a generic male TTS voice? #129

Open
opened 2023-03-13 08:58:21 +00:00 by amanamaru · 2 comments

Attached are the voice files I'm using to train with, as well as the results I'm getting. The output doesn't come close to matching the target voice.
Author

Here are some of the files I'm using and the results:
https://mega.nz/folder/NzhS0R5A#-uzK5nMySVFH1GetYDa-5A

I have the same problem. I created a Merida dataset from wav files that are longer than 0.6 seconds and shorter than 11 seconds, and trained it up to 2160 steps. With 29 seconds' worth of samples, the output files sound like a generic British accent instead of a Scottish one. If I put in more samples, the output files have an American accent.
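For reference, the duration filtering described above (keep clips between 0.6 s and 11 s) can be sketched with the Python standard library. The function name and directory layout here are my own assumptions, not part of the ai-voice-cloning tooling:

```python
import wave
from pathlib import Path

def filter_clips(src_dir, min_s=0.6, max_s=11.0):
    """Return the wav files in src_dir whose duration falls in [min_s, max_s]."""
    kept = []
    for path in sorted(Path(src_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            # duration in seconds = frame count / sample rate
            dur = w.getnframes() / w.getframerate()
        if min_s <= dur <= max_s:
            kept.append(path)
    return kept
```

Clips outside the bounds are simply skipped, so very short fragments and long run-on recordings never make it into train.txt.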

Dataset with train.txt file
https://files.catbox.moe/u0esfz.zip

Model
https://pixeldrain.com/u/pKEfJPdV
Edit: I got it to work by following the suggested training settings linked below. The only thing I changed was setting epochs to 250 after looking at the /vsg/ AI Voice Synthesis General archives. I have 204 wav files that are 10 minutes and 40 seconds in total.
https://git.ecker.tech/mrq/ai-voice-cloning/wiki/Training#suggested-settings
https://desuarchive.org/g/thread/91867084/#q91878556
Edit 2: I made a new model with louder wav files, and the accent came out American instead of Scottish for some reason, with the same settings as the previous model. Then I made a new model with the learning rate set to 1e-3, and the sound quality of the output files was terrible. So I made another new model with the learning rate set to 1e-4: huge improvement.
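For anyone converting between epoch counts and step counts (e.g. the 2160 steps above vs. the 250 epochs that worked), the arithmetic can be sketched as below. The batch size here is an assumption for illustration; the suggested-settings wiki page drives the real value:

```python
import math

def total_steps(num_clips, epochs, batch_size):
    """Total optimizer steps for a run: epochs * ceil(dataset_size / batch_size)."""
    steps_per_epoch = math.ceil(num_clips / batch_size)
    return epochs * steps_per_epoch

# e.g. 204 clips at an assumed batch size of 64 -> 4 steps/epoch,
# so 250 epochs -> 1000 total steps
```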
Reference: mrq/ai-voice-cloning#129