Need help: trained voice just sounds like a generic male TTS voice? #129

Open
opened 2023-03-13 08:58:21 +00:00 by amanamaru · 2 comments

Attached are the voice files I'm using to train with, as well as the results I'm getting. The output doesn't come close to matching the target voice.
Author

Here are some of the files I'm using and the results:
https://mega.nz/folder/NzhS0R5A#-uzK5nMySVFH1GetYDa-5A

I have the same problem. I created a Merida dataset from wav files that are longer than 0.6 seconds and shorter than 11 seconds, and trained it up to 2160 steps. With 29 seconds' worth of samples, the output files sound like a generic British accent instead of a Scottish one. If I put in more samples, the output files have an American accent.
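For reference, the duration filtering described above (keep clips between 0.6 s and 11 s) can be sketched with the Python standard library. The function name and directory layout here are my own assumptions, not part of the ai-voice-cloning tooling:

```python
import wave
from pathlib import Path

def filter_clips(src_dir, min_s=0.6, max_s=11.0):
    """Return the wav files in src_dir whose duration falls in [min_s, max_s]."""
    kept = []
    for path in sorted(Path(src_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            # duration in seconds = frame count / sample rate
            dur = w.getnframes() / w.getframerate()
        if min_s <= dur <= max_s:
            kept.append(path)
    return kept
```

Clips outside the bounds are simply skipped, so very short fragments and long run-on recordings never make it into train.txt.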

Dataset with train.txt file
https://files.catbox.moe/u0esfz.zip

Model
https://pixeldrain.com/u/pKEfJPdV
Edit: I got it to work by following the suggested training settings linked below. The only thing I changed was setting epochs to 250 after looking at the /vsg/ AI Voice Synthesis General archives. I have 204 wav files that are 10 minutes and 40 seconds in total.
https://git.ecker.tech/mrq/ai-voice-cloning/wiki/Training#suggested-settings
https://desuarchive.org/g/thread/91867084/#q91878556
Edit 2: I made a new model with louder wav files, and the accent came out American instead of Scottish for some reason, with the same settings as the previous model. Then I made a new model with the learning rate set to 1e-3, and the sound quality of the output files was terrible. So I made another new model with the learning rate set to 1e-4: huge improvement.
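For anyone converting between epoch counts and step counts (e.g. the 2160 steps above vs. the 250 epochs that worked), the arithmetic can be sketched as below. The batch size here is an assumption for illustration; the suggested-settings wiki page drives the real value:

```python
import math

def total_steps(num_clips, epochs, batch_size):
    """Total optimizer steps for a run: epochs * ceil(dataset_size / batch_size)."""
    steps_per_epoch = math.ceil(num_clips / batch_size)
    return epochs * steps_per_epoch

# e.g. 204 clips at an assumed batch size of 64 -> 4 steps/epoch,
# so 250 epochs -> 1000 total steps
```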
Reference: mrq/ai-voice-cloning#129