Recommendations for generating latents and finetunes? #173
Reference: mrq/ai-voice-cloning#173
I have a very large dataset, and I want to make sure I'm doing this right.
What's the recommendation on number of epochs when training for a dataset of 200 vs something like 1000 clips (assuming they're all cut down between 1 and 11 seconds and transcribed properly)?
Also, what's the recommendation on voice clips when generating latents? Should I use a small subset of those training data clips? Or can I use the whole set?
Check this out if you haven't
https://git.ecker.tech/mrq/ai-voice-cloning/wiki/Training
I have; those settings don't really give me a good point of reference.
It depends on what you're trying to train: the more your desired result differs from what the included autoregressive model produces, the more iterations you'll need to get there. A standard native-speaker English accent won't take long at all; an ESL speaker's accent will take longer; and an accurate model of a foreign language might not be possible without writing a custom tokenizer.
I would recommend using a subset, if only to reduce the time required to calculate the latents.
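A minimal sketch of picking such a subset. This is not part of the repo's tooling; the folder layout (`voices/speaker` for the full training set, `voices/speaker-latents` for the subset) and the subset size of 50 are assumptions for illustration — adjust them to your own setup.

```python
import random
import shutil
from pathlib import Path

def pick_latent_subset(clips, k=50, seed=0):
    """Deterministically sample up to k clips for latent computation.

    A fixed seed keeps the subset stable across runs, so regenerated
    latents stay comparable. k=50 is an arbitrary illustrative default,
    not a value recommended by the wiki.
    """
    clips = sorted(str(c) for c in clips)  # stable order before sampling
    if len(clips) <= k:
        return clips
    rng = random.Random(seed)
    return sorted(rng.sample(clips, k))

if __name__ == "__main__":
    # Hypothetical layout: full training set in ./voices/speaker/,
    # subset copied to ./voices/speaker-latents/ for latent generation.
    src = Path("voices/speaker")
    dst = Path("voices/speaker-latents")
    if src.is_dir():
        dst.mkdir(parents=True, exist_ok=True)
        for clip in pick_latent_subset(src.glob("*.wav")):
            shutil.copy2(clip, dst / Path(clip).name)
```

You would then point the latent-generation step at the subset folder instead of the full training-set folder.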