Recommendations for generating latents and finetunes?
#173
I have a very large dataset, and I want to make sure I'm doing this right.
What's the recommendation on number of epochs when training for a dataset of 200 vs something like 1000 clips (assuming they're all cut down between 1 and 11 seconds and transcribed properly)?
Also, what's the recommendation on voice clips when generating latents? Should I use a small subset of those training data clips? Or can I use the whole set?
Check this out if you haven't
https://git.ecker.tech/mrq/ai-voice-cloning/wiki/Training
I have; those settings don't really give me a good point of reference.
It depends on what you're trying to train: the more your desired result differs from what the included autoregressive model produces, the more iterations you'll need to get there. A standard native-speaker English accent won't take long at all, an ESL speaker's accent will take longer, and an accurate model of a foreign language might not be possible without writing a custom tokenizer.
I would recommend using a subset, if only to reduce the time required to calculate the latents.
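If it helps, here's a minimal sketch of what "use a subset" could look like in practice: randomly sampling a fixed number of clips from the training set to feed into latent calculation. The function name, directory layout, and subset size are my own assumptions, not settings from the repo.

```python
import random
from pathlib import Path

def pick_latent_subset(clip_dir, n=32, seed=0):
    """Pick a reproducible random subset of clips for latent calculation.

    clip_dir: directory containing the (already cut and transcribed) .wav clips.
    n: how many clips to use; a few dozen is plenty for latents and keeps
       computation fast even when the full training set has 1000+ clips.
    """
    clips = sorted(Path(clip_dir).glob("*.wav"))
    random.seed(seed)
    # If the dataset is smaller than n, just use everything.
    return random.sample(clips, min(n, len(clips)))
```

Seeding makes the pick reproducible, so regenerating latents later uses the same clips.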