Recommendations for generating latents and finetunes? #173
Reference: mrq/ai-voice-cloning#173
I have a very large dataset, and I want to make sure I'm doing this right.
What's the recommendation on number of epochs when training for a dataset of 200 vs something like 1000 clips (assuming they're all cut down between 1 and 11 seconds and transcribed properly)?
Also, what's the recommendation on voice clips when generating latents? Should I use a small subset of those training data clips? Or can I use the whole set?
Check this out if you haven't
https://git.ecker.tech/mrq/ai-voice-cloning/wiki/Training
I have; those settings don't really give me a good point of reference.
It depends on what you're trying to train: the more your desired result differs from what the included autoregressive model produces, the more iterations you'll need to get there. A standard native-speaker English accent won't take long at all; an ESL speaker's accent will take longer; and an accurate model of a foreign language might not be possible without writing a custom tokenizer.
I would recommend using a subset, if only to reduce the time required to calculate the latents.
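A minimal sketch of picking such a subset. This is not part of the repo's tooling; the folder layout (`voices/speaker` for the full training set, `voices/speaker-latents` for the subset) and the subset size of 50 are assumptions for illustration — adjust them to your own setup.

```python
import random
import shutil
from pathlib import Path

def pick_latent_subset(clips, k=50, seed=0):
    """Deterministically sample up to k clips for latent computation.

    A fixed seed keeps the subset stable across runs, so regenerated
    latents stay comparable. k=50 is an arbitrary illustrative default,
    not a value recommended by the wiki.
    """
    clips = sorted(str(c) for c in clips)  # stable order before sampling
    if len(clips) <= k:
        return clips
    rng = random.Random(seed)
    return sorted(rng.sample(clips, k))

if __name__ == "__main__":
    # Hypothetical layout: full training set in ./voices/speaker/,
    # subset copied to ./voices/speaker-latents/ for latent generation.
    src = Path("voices/speaker")
    dst = Path("voices/speaker-latents")
    if src.is_dir():
        dst.mkdir(parents=True, exist_ok=True)
        for clip in pick_latent_subset(src.glob("*.wav")):
            shutil.copy2(clip, dst / Path(clip).name)
```

You would then point the latent-generation step at the subset folder instead of the full training-set folder.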