Audio artifacts/repetitive words after training #296

Open
opened 2023-07-06 11:25:30 +00:00 by JoaoPimenta · 3 comments

Hey, I'm training a voice and under training/MyVoice/audio I have about 936 files, and their lengths vary quite a lot: a lot of them are only a second or two long, many run to around 7 seconds, and a few go to around 20 seconds.
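
For reference, here's a quick sketch of how I'd audit the clip lengths, assuming the training/MyVoice/audio layout above and that the soundfile package is installed; the 2 s / 11 s thresholds are illustrative, not canonical:

```python
# Quick audit of clip durations; the 2 s / 11 s cutoffs below are only
# illustrative thresholds for flagging suspiciously short or long clips.
from pathlib import Path

import soundfile as sf

durations = {
    wav.name: sf.info(str(wav)).duration
    for wav in sorted(Path("training/MyVoice/audio").glob("*.wav"))
}

too_short = [n for n, d in durations.items() if d < 2.0]   # candidates to merge or drop
too_long = [n for n, d in durations.items() if d > 11.0]   # candidates to re-split

print(f"{len(durations)} clips, {len(too_short)} under 2 s, {len(too_long)} over 11 s")
```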

I'm adding the graphs of my training below.

The audio output is not ideal. To get stable audio I'm reducing all the randomness as much as possible and increasing the penalties for length and repetition, and I still get audio artifacts. For example, if I type "How was your day?", the audio comes out like "How was your day? How was your day? How waaaaass ahhhhh".

Any idea what I'm doing wrong? Is it my audio data? Am I training for too long?

JoaoPimenta changed title from Audio artifacts/repetitve words after trainning to Audio artifacts/repetitive words after training 2023-07-06 11:28:38 +00:00

Take a look at this, it might be helpful for diagnosing your issue: [#82 (comment)](https://git.ecker.tech/mrq/ai-voice-cloning/issues/82#issuecomment-772)

> The text loss quantifies how well the predicted text tokens match the source text. This doesn't necessarily need to have too low of a loss. In fact, trainings that have it lower than the mel loss turn out unusable.
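
A minimal sketch of checking for that condition, assuming the loss curves have been exported to a CSV using the DLAS metric names loss_text_ce and loss_mel_ce (adjust the column names to however your run actually logs them):

```python
# Flag steps where the text loss dips below the mel loss.
# Column names are an assumption based on DLAS metric naming.
import csv

with open("losses.csv", newline="") as f:
    for row in csv.DictReader(f):
        text_ce = float(row["loss_text_ce"])
        mel_ce = float(row["loss_mel_ce"])
        if text_ce < mel_ce:
            print(f"step {row['step']}: text {text_ce:.3f} < mel {mel_ce:.3f} "
                  "-- per the advice above, often a sign of an unusable model")
```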


I wouldn't be surprised if those artifacts and that weirdness come from training on so many short samples.

I would also increase the temperature to over 0.5, increase the diffusion temperature to above 0.75, decrease the cond-free K, and increase the CVVP weight, but YMMV depending on the model and inference samples.
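
For concreteness, here's roughly what those settings look like through the underlying tortoise-tts API (ai-voice-cloning exposes the same knobs as sliders); the voice name "MyVoice" and the exact values are placeholders, not tuned numbers:

```python
# Sketch of the suggested knob changes via tortoise-tts directly.
# Defaults noted in the comments are the tortoise-tts defaults.
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voices

tts = TextToSpeech()
voice_samples, conditioning_latents = load_voices(["MyVoice"])

audio = tts.tts(
    "How was your day?",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    temperature=0.6,            # above 0.5, as suggested (default 0.8)
    diffusion_temperature=0.8,  # above 0.75 (default 1.0)
    cond_free_k=1.0,            # decreased from the default of 2.0
    cvvp_amount=0.5,            # increased from the default of 0.0
    repetition_penalty=2.0,     # default; keeps some pressure against looping
)
```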


Repetitive words usually mean the audio and train.txt don't match, which happens a lot even with WhisperX.
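
A minimal way to spot-check for that, assuming the LJSpeech-style `path|transcription` lines the trainer uses and that openai-whisper is installed (paths in train.txt may need to be resolved relative to your training folder):

```python
# Re-transcribe a random sample of clips and flag ones whose train.txt
# transcript barely matches what Whisper hears.
import difflib
import random

import whisper

model = whisper.load_model("base")

with open("training/MyVoice/train.txt", encoding="utf-8") as f:
    lines = [l.strip() for l in f if l.strip()]

for line in random.sample(lines, k=min(10, len(lines))):
    path, expected = line.split("|", 1)
    heard = model.transcribe(path)["text"].strip()
    ratio = difflib.SequenceMatcher(None, expected.lower(), heard.lower()).ratio()
    if ratio < 0.8:  # illustrative cutoff for a likely mismatch
        print(f"{path}: similarity {ratio:.2f}\n  train.txt: {expected}\n  whisper:   {heard}")
```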
