Found some bad audio files during the middle of the training. What to do? #198

Open
opened 2023-04-11 03:01:13 +07:00 by pheonis · 9 comments

I started the training with a 10-minute mp3 (American English). After 100 epochs I checked the model: it has successfully cloned the voice, but there are some artifacts in the generated audio, as well as repetitions and mispronunciations of some words.

I then checked the audio files and found that some of them have music at the start. I want to delete those audio files.

Is this a good plan? What would you do if you found some bad audio files in the middle of training?

I'm planning to delete the audio files and modify the train.txt and whisper.json files to remove those entries.
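
Concretely, for the train.txt part I was thinking of something like this (the file name is hypothetical):

```bash
# Drop the line for a deleted clip from train.txt (lines look like "audio/NN.wav|text").
grep -v '^audio/12\.wav|' train.txt > train.txt.tmp && mv train.txt.tmp train.txt
```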

Or should I train to maybe 200 epochs?

pheonis changed title from Found some audio files not good with some music in the middle of the training. What to do? to Found some bad audio files during the middle of the training. What to do? 2023-04-11 03:02:52 +07:00

I'd restart the training with a clean dataset, just to be sure.

How do you go about it? I read your comment in [#133](https://git.ecker.tech/mrq/ai-voice-cloning/issues/133); you mentioned there that you proofread all the transcriptions and use a smaller dataset. So, do you do it manually? Like trim the audio files in Audacity and then transcribe, or do you rely fully on this repo for dataset preparation?

Also, do you recommend using the `Trim Silence` option? I used it in one of my dataset preparations and I didn't like the result: there were harsh cuts at the start and end of the audio files, as if the speaker were dropping the first letters of the opening word of a sentence.

> So, do you do it manually?

Semi-manually, I use `whisperx` to produce a timestamped transcription and then feed the timestamps into `ffmpeg` to cut things to size.

> Also, do you recommend using the `Trim Silence` option?

I've never used it.

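A minimal sketch of that transcription step, assuming a recent `whisperx` CLI (the file name and output directory are placeholders):

```bash
# Produces timestamped transcripts (srt, vtt, txt, tsv) in the output directory.
whisperx input.mp3 --language en --output_dir transcripts
```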

> Semi-manually, I use `whisperx` to produce a timestamped transcription and then feed the timestamps into `ffmpeg` to cut things to size.

Okay, I tried to follow your method:

- Installed yt-dl, whisperx, and ffmpeg.
- Downloaded the YouTube file in mp3 format.
- Used whisperx to generate the transcription; it produced transcripts in several formats (srt, vtt, txt, tsv).

Now, how do I cut the audio file into segments from the transcription using ffmpeg?

And once the audio files are segmented, how do you create the train.txt file?

The most important thing is that when you do the transcription with `whisperx` you specify `--align_model WAV2VEC2_ASR_LARGE_LV60K_960H`, or else the timestamps are going to be inaccurate. I have `ffmpeg` split the file into segments using [the segment muxer](https://ffmpeg.org/ffmpeg-formats.html#segment_002c-stream_005fsegment_002c-ssegment) based on the second column of the .tsv file (depending on your OS and version of `ffmpeg` you may need to truncate the timestamps to 3 digits after the decimal point), then output `audio/` + the file name of each segment to train.txt, followed by a `|` and the contents of the third column of the .tsv. Once all that's done, create the folder structure for the dataset under training/ and copy over everything. It would probably be faster to use a bash/<your shell of choice> loop to do it all, but it's the kind of small task that I'm too lazy to automate.
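
A rough sketch of that pipeline in bash, under a few assumptions: the whisperx output is `input.tsv` with a header row and tab-separated start/end/text columns, the times are in seconds (some whisper/whisperx versions write milliseconds, in which case divide by 1000), and the file names are placeholders:

```bash
# 1. Transcribe with the alignment model so the timestamps are accurate.
whisperx input.mp3 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --output_format tsv

# 2. Split on the end timestamps (second column of input.tsv),
#    trimmed to 3 decimal places as noted above.
times=$(awk -F'\t' 'NR > 1 { printf "%s%.3f", sep, $2; sep = "," }' input.tsv)
mkdir -p audio
ffmpeg -i input.mp3 -f segment -segment_times "$times" audio/%d.wav

# 3. Build train.txt: "audio/<segment>|<transcript>", transcript from the third column.
awk -F'\t' 'NR > 1 { printf "audio/%d.wav|%s\n", NR - 2, $3 }' input.tsv > train.txt
```

Note that cutting at N timestamps produces N+1 files, so the last segment (everything after the final end time) isn't referenced in train.txt and can be discarded.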

> and once all that's done create the folder structure for the dataset under training/ and copy over everything.

I successfully managed to segment the audio file, created the train.txt file, and moved the clips to the folder `ai-voice-cloning/training/test/audio`, with the train.txt file placed in the `training/test` folder.
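
For reference, that layout as shell commands (paths as described above; `test` is the dataset name):

```bash
mkdir -p ai-voice-cloning/training/test/audio
cp audio/*.wav ai-voice-cloning/training/test/audio/
cp train.txt ai-voice-cloning/training/test/
```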

Now, I generated train.yaml under training > generate configuration and started the training.

The training is running smoothly, but I can see these errors in the backend:

```
[Training] [2023-04-12T07:29:55.623011] 23-04-12 07:29:55.313 - INFO: Training Metrics: {"loss_text_ce": 4.016107559204102, "loss_mel_ce": 1.578368067741394, "loss_gpt_total": 1.618529200553894, "lr": 0.0001, "it": 1, "step": 1, "steps": 2, "epoch": 0, "iteration_rate": 22.862075567245483}
[Training] [2023-04-12T07:30:17.453434] 23-04-12 07:30:17.451 - INFO: Training Metrics: {"loss_text_ce": 3.944296360015869, "loss_mel_ce": 1.5434800386428833, "loss_gpt_total": 1.582923173904419, "lr": 0.0001, "it": 2, "step": 2, "steps": 2, "epoch": 0, "iteration_rate": 21.824485540390015}
[Training] [2023-04-12T07:30:18.399538] /content/ai-voice-cloning/modules/dlas/dlas/models/audio/tts/tacotron2/taco_utils.py:18: WavFileWarning: Reached EOF prematurely; finished at 49152 bytes, expected 139342 bytes from header.
[Training] [2023-04-12T07:30:18.399593]   sampling_rate, data = read(full_path)
[Training] [2023-04-12T07:30:28.375561] /content/ai-voice-cloning/modules/dlas/dlas/models/audio/tts/tacotron2/taco_utils.py:18: WavFileWarning: Reached EOF prematurely; finished at 16384 bytes, expected 81998 bytes from header.
```

It seems there is a problem reading the wav files. I have checked all the wav files, though, and they seemed fine. Is there anything I'm missing here?

I haven't run into that error before. Does running `ffprobe` on the wav files reveal any errors?
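
A quick way to sweep the whole folder, if it helps (path assumed; `-v error` prints nothing unless ffprobe actually hits a problem):

```bash
for f in audio/*.wav; do
  # Any output at the error level means the file failed to parse cleanly.
  ffprobe -v error "$f" 2>&1 | grep -q . && echo "problem: $f"
done
```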
```
Input #0, wav, from '/content/drive/MyDrive/ai-voice-cloning/training/new1test/audio/0.wav':
  Metadata:
    encoder         : Lavf58.29.100
  Duration: 00:00:01.92, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s
```

This is the output of `ffprobe`. I guess the required sample rate is 22050 Hz and these wav files are 16000 Hz, which is causing the errors. I will try changing the sample rate and trying again.

Ahh, I forgot to mention that, sorry. Appending `-ar 22050` to your `ffmpeg` command should fix it.
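
If you'd rather fix the segments you already have than re-cut them, a resampling pass along these lines should also work (directory names are placeholders):

```bash
mkdir -p audio_22k
for f in audio/*.wav; do
  # Re-encode each clip at the 22050 Hz the trainer expects, per the note above.
  ffmpeg -i "$f" -ar 22050 "audio_22k/$(basename "$f")"
done
```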