Epochs, iterations, and datasets #197
Reference: mrq/ai-voice-cloning#197
I'm having a tough time wrapping my head around this process...
An epoch is one pass through the dataset, right?
Given a quality dataset, do more epochs during training equal a better clone, or is it more iterations per epoch?
Then once trained, the voices folder is used to create a "template" for how the entered text is performed?
So a happy example.wav is more likely to yield a happy performance, and a varied vocal tone yields a varied vocal performance. Are multiple wavs required in the voices folder, or just one good example? How long should these files be, or are they just pulled from the dataset?
Wish there was a discussion forum/discord/chatroom for folks to exchange experiences more easily.
AIUI more iterations per epoch just means a smaller batch size.
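A minimal sketch of the arithmetic, assuming the trainer takes one optimizer step per batch (all numbers below are made up for illustration):

```python
import math

def iterations_per_epoch(dataset_size: int, batch_size: int) -> int:
    """One epoch is one full pass over the dataset, so iterations per
    epoch is simply how many batches it takes to cover it."""
    return math.ceil(dataset_size / batch_size)

# Made-up numbers: 400 clips in the dataset.
print(iterations_per_epoch(400, 128))        # 4 steps per epoch
print(iterations_per_epoch(400, 32))         # 13 steps per epoch (smaller batch -> more iterations)

# Total optimizer steps depend on both knobs:
print(200 * iterations_per_epoch(400, 128))  # 200 epochs -> 800 total iterations
```

For a fixed number of epochs, shrinking the batch size only means more (smaller) steps per pass; the model still sees each sample the same number of times.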
Just one good example is enough.
When the latents are calculated, it uses every .wav in the folder for that voice.
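Roughly what that looks like, as a sketch: the per-clip conditioning embeddings get pooled into one set of latents for the whole voice folder. `encode_clip` below is a hypothetical stand-in for the model's conditioning encoder, not the actual tortoise API, and the mean-pooling is only illustrative:

```python
from pathlib import Path
import torch

def encode_clip(wav_path: Path) -> torch.Tensor:
    """Hypothetical stand-in for the conditioning encoder; returns a
    random embedding so the sketch runs without the real model."""
    return torch.randn(1, 1024)

def voice_latents(voice_dir: str) -> torch.Tensor:
    """Pool the embeddings of every .wav in voices/<name>/ into a single
    conditioning latent (mean-pooled here for illustration)."""
    clips = sorted(Path(voice_dir).glob("*.wav"))
    if not clips:
        raise FileNotFoundError(f"no .wav files found in {voice_dir}")
    embeddings = torch.cat([encode_clip(p) for p in clips], dim=0)
    return embeddings.mean(dim=0, keepdim=True)
```

So every clip you drop into the voice folder nudges the pooled latents, which is why one clean, representative example can be enough.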
That all makes sense. I have found that more epochs lead to cleaner audio...but I'm still getting a smoothed-out version of the voice when I generate. What I want is a thick accent like the dataset files have, but what I get is either no accent or just a light accent.
Can you make your own autoregressive.pth model with an accent? Or train your dataset back on itself to refine the accent?
Oddly, sometimes when I look for no accent, I get a slight British accent, which I see is fairly common... However, I trained on a voice with a British accent and got no accent in the end.
I tried an Indian accent and it worked well. Confusing.
I'm training anywhere from 100 to 200 epochs, and have between 300 and 500 wavs in the dataset. Always a single speaker. I feel like I'm doing something wrong.
If your dataset has a thick accent you might need to check the transcriptions to make sure that they're accurate.
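A quick way to spot-check is to sample a few random lines from the generated transcript list and listen to the matching clips. This sketch assumes the common `audio_path|transcription` line format and a made-up file path; adjust both to whatever your setup actually produces:

```python
import random
from pathlib import Path

def sample_transcripts(train_file: str, n: int = 5) -> None:
    """Print a few random (audio path, transcription) pairs so you can
    listen to each clip and confirm the text really matches."""
    lines = [l for l in Path(train_file).read_text(encoding="utf-8").splitlines() if l.strip()]
    for line in random.sample(lines, min(n, len(lines))):
        audio_path, _, text = line.partition("|")
        print(f"{audio_path}\n  -> {text}\n")

# e.g. sample_transcripts("training/myvoice/train.txt")  # hypothetical path
```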
Will check again, but at first glance, they looked good.
Transcriptions are accurate. Having the same problem generating a British accent now, which seems weird.
You could try restarting with a higher learning rate for a lower number of iterations and see if it makes a difference.
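As a generic illustration of that trade-off in plain PyTorch (not the project's actual training config, and the numbers are placeholders): restart with a larger initial learning rate and decay it over fewer total steps.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the fine-tuned model

# Restarted run: higher starting learning rate, fewer total iterations,
# with a step decay so the rate doesn't stay high for the whole run.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200, 400], gamma=0.5)

for step in range(600):  # fewer total iterations than the original run
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()  # dummy loss for the sketch
    loss.backward()
    optimizer.step()
    scheduler.step()
```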