Audio artifacts/repetitive words after training #296

Open
opened 2023-07-06 11:25:30 +00:00 by JoaoPimenta · 3 comments

Hey, I'm training a voice and under training/MyVoice/audio I have about 936 files, and their lengths vary quite a lot: a lot of them are only a second or two long, many run to around 7 seconds, and a few go to around 20 seconds.
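
For reference, here's a quick sketch of how I'd audit the clip lengths, assuming the training/MyVoice/audio layout above and that the soundfile package is installed; the 2 s / 11 s thresholds are illustrative, not canonical:

```python
# Quick audit of clip durations; the 2 s / 11 s cutoffs below are only
# illustrative thresholds for flagging suspiciously short or long clips.
from pathlib import Path

import soundfile as sf

durations = {
    wav.name: sf.info(str(wav)).duration
    for wav in sorted(Path("training/MyVoice/audio").glob("*.wav"))
}

too_short = [n for n, d in durations.items() if d < 2.0]   # candidates to merge or drop
too_long = [n for n, d in durations.items() if d > 11.0]   # candidates to re-split

print(f"{len(durations)} clips, {len(too_short)} under 2 s, {len(too_long)} over 11 s")
```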

I'm adding the graphs of my training below.

The audio output is not ideal. To get stable audio I'm reducing all the randomness as much as possible and increasing the penalties for length and repetition, and I still get audio artifacts. For example, if I type "How was your day?", the audio comes out like "How was your day? How was your day? How waaaaass ahhhhh".

Any idea what I'm doing wrong? Is it my audio data? Am I training for too long?

JoaoPimenta changed title from Audio artifacts/repetitve words after trainning to Audio artifacts/repetitive words after training 2023-07-06 11:28:38 +00:00

Take a look at this, it might be helpful for diagnosing your issue: [#82 (comment)](https://git.ecker.tech/mrq/ai-voice-cloning/issues/82#issuecomment-772)

> The text loss quantifies how well the predicted text tokens match the source text. This doesn't necessarily need to have too low of a loss. In fact, trainings that have it lower than the mel loss turn out unusable.
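
A minimal sketch of checking for that condition, assuming the loss curves have been exported to a CSV using the DLAS metric names loss_text_ce and loss_mel_ce (adjust the column names to however your run actually logs them):

```python
# Flag steps where the text loss dips below the mel loss.
# Column names are an assumption based on DLAS metric naming.
import csv

with open("losses.csv", newline="") as f:
    for row in csv.DictReader(f):
        text_ce = float(row["loss_text_ce"])
        mel_ce = float(row["loss_mel_ce"])
        if text_ce < mel_ce:
            print(f"step {row['step']}: text {text_ce:.3f} < mel {mel_ce:.3f} "
                  "-- per the advice above, often a sign of an unusable model")
```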


I wouldn't be surprised if those artifacts and that weirdness come from training on so many short samples.

I would also increase the temperature to over 0.5, increase the diffusion temperature to above 0.75, decrease the cond-free K, and increase the CVVP weight, but YMMV depending on the model and inference samples.
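
For concreteness, here's roughly what those settings look like through the underlying tortoise-tts API (ai-voice-cloning exposes the same knobs as sliders); the voice name "MyVoice" and the exact values are placeholders, not tuned numbers:

```python
# Sketch of the suggested knob changes via tortoise-tts directly.
# Defaults noted in the comments are the tortoise-tts defaults.
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voices

tts = TextToSpeech()
voice_samples, conditioning_latents = load_voices(["MyVoice"])

audio = tts.tts(
    "How was your day?",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    temperature=0.6,            # above 0.5, as suggested (default 0.8)
    diffusion_temperature=0.8,  # above 0.75 (default 1.0)
    cond_free_k=1.0,            # decreased from the default of 2.0
    cvvp_amount=0.5,            # increased from the default of 0.0
    repetition_penalty=2.0,     # default; keeps some pressure against looping
)
```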


Repetitive words usually mean the audio and train.txt don't match, which happens a lot even with WhisperX.
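
A minimal way to spot-check for that, assuming the LJSpeech-style `path|transcription` lines the trainer uses and that openai-whisper is installed (paths in train.txt may need to be resolved relative to your training folder):

```python
# Re-transcribe a random sample of clips and flag ones whose train.txt
# transcript barely matches what Whisper hears.
import difflib
import random

import whisper

model = whisper.load_model("base")

with open("training/MyVoice/train.txt", encoding="utf-8") as f:
    lines = [l.strip() for l in f if l.strip()]

for line in random.sample(lines, k=min(10, len(lines))):
    path, expected = line.split("|", 1)
    heard = model.transcribe(path)["text"].strip()
    ratio = difflib.SequenceMatcher(None, expected.lower(), heard.lower()).ratio()
    if ratio < 0.8:  # illustrative cutoff for a likely mismatch
        print(f"{path}: similarity {ratio:.2f}\n  train.txt: {expected}\n  whisper:   {heard}")
```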
