Recommendations for generating latents and finetunes? #173

Open
opened 2023-03-25 08:31:28 +07:00 by hman360 · 3 comments

I have a very large dataset, and I want to make sure I'm doing this right.

What's the recommendation on number of epochs when training for a dataset of 200 vs something like 1000 clips (assuming they're all cut down between 1 and 11 seconds and transcribed properly)?

Also, what's the recommendation on voice clips when generating latents? Should I use a small subset of those training data clips? Or can I use the whole set?

Check this out if you haven't https://git.ecker.tech/mrq/ai-voice-cloning/wiki/Training

I have; those settings don't really give me a good point of reference.

> What's the recommendation on number of epochs when training for a dataset of 200 vs something like 1000 clips (assuming they're all cut down between 1 and 11 seconds and transcribed properly)?

It depends on what you're trying to train: the more your desired result differs from what the included autoregressive model produces, the more iterations you'll need to get there. A standard native-speaker English accent won't take long at all, an ESL speaker's accent will take longer, and an accurate model of a foreign language might not be possible without writing a custom tokenizer.
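
Since the question is in epochs and the answer is in iterations, here is a rough sketch of how the two relate. It assumes the usual definition (one iteration is one batch, and a partial final batch still counts as a step); the batch size of 32 and the 100 epochs are made-up numbers for illustration, not recommended settings, and the trainer's own accounting may differ.

```python
import math


def iterations_for(epochs: int, dataset_size: int, batch_size: int) -> int:
    """Total optimizer steps for a run, assuming one step per batch
    and that a partial final batch is kept rather than dropped."""
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return epochs * steps_per_epoch


# Hypothetical settings: 100 epochs at batch size 32.
small = iterations_for(epochs=100, dataset_size=200, batch_size=32)
large = iterations_for(epochs=100, dataset_size=1000, batch_size=32)
print(small, large)
```

At the same epoch count the 1000-clip set runs several times as many iterations as the 200-clip set, so the larger dataset would likely need fewer epochs to reach a comparable number of steps.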

> Also, what's the recommendation on voice clips when generating latents? Should I use a small subset of those training data clips? Or can I use the whole set?

I would recommend using a subset, if only to reduce the time required to calculate the latents.
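
One possible way to pick that subset is a seeded random sample of the clip list, so the same clips are chosen on every run. This is only a sketch: `pick_latent_clips` is a hypothetical helper, not part of the repo, and the default of 20 clips is an arbitrary placeholder rather than a recommended value.

```python
import random


def pick_latent_clips(clips, max_clips=20, seed=0):
    """Return at most max_clips clip names, chosen reproducibly.

    max_clips=20 is an arbitrary placeholder, not a tuned value.
    """
    clips = sorted(str(c) for c in clips)  # stable order before sampling
    if len(clips) <= max_clips:
        return clips  # small sets are used whole
    rng = random.Random(seed)  # private RNG, leaves global state alone
    return sorted(rng.sample(clips, max_clips))


# Illustration with made-up file names standing in for a 1000-clip dataset.
all_clips = [f"clip_{i:04d}.wav" for i in range(1000)]
subset = pick_latent_clips(all_clips, max_clips=20, seed=0)
print(len(subset))
```

The chosen files could then be copied into whichever folder you point the latent generation at.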

Reference: mrq/ai-voice-cloning#173