Validation.txt shows nothing #396

Closed
opened 2023-09-21 11:38:39 +00:00 by DoctorPopi · 4 comments

Hello!

I have been playing around a bit more with the training tool, especially with the validation settings:

> Validation Text Length Threshold: transcription text lengths that are below this value are culled and placed in the validation dataset. Set 0 to ignore.
> Validation Audio Length Threshold: audio lengths that are below this value are culled and placed in the validation dataset. Set 0 to ignore.

If I understand correctly, these values (number of characters for the text, number of seconds for the audio?) let you divide the dataset into a validation dataset and a training dataset. However, whichever parameters I choose, nothing appears in validation.txt, which makes me think that the dataset was not divided and that nothing was put in the validation dataset.

My audio clips are all between 6 and 11 seconds long, so I wanted to cull only the clips shorter than 8 seconds (I set the text length to 0 to ignore it). I then hit "Transcribe and Process", replaced the whisper.json with the one I had already corrected, hit "Recreate dataset" so that train.txt would be corrected as well, and went on with the other steps. But after this step, validation.txt remains empty.

Is there something I'm doing wrong? How do you know if you indeed have clips in your validation dataset, and how many there are?

Thank you :)

Owner

Should be fixed in commit 17acfee5d0. Validation culling based on audio length was commented out, but validation culling based on text length was still working.
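On the question of how to tell whether validation.txt actually holds any clips: a minimal sketch, assuming each dataset file lists one clip per line. The function name `count_entries` and that one-line-per-clip layout are assumptions for illustration, not something confirmed in this thread.

```python
from pathlib import Path


def count_entries(path):
    """Count non-empty lines in a dataset list file.

    Assumption: one clip per line. A missing file is reported as 0 entries.
    """
    p = Path(path)
    if not p.exists():
        return 0
    with p.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

Calling it on both train.txt and validation.txt after "Recreate dataset" shows at a glance whether anything was culled into the validation set.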

Author

Oh okay wonderful thank you!

For the text length, we are talking about the number of characters, right, spaces included?

Another question: if you see that some of the clips were put in validation.txt and you don't want them there, is it okay to just remove them manually from validation.txt and add them to train.txt? Does the order in which the clips are written in the two .txt files matter?

Thank you!

Owner

> For the text length, we are talking about the number of characters right, spaces included?

Yes, it just checks the length of the string. I think the DLAS script does the same thing too, rather than cull by tokens.
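The culling rule described here can be sketched as follows. This is only an illustration under stated assumptions, not the tool's actual code: the function name `split_dataset` and the `(path, text, duration_seconds)` tuple layout are hypothetical. What comes from the thread is that text length is the plain string length (spaces included), that entries below a threshold go to the validation set, and that a threshold of 0 disables that check.

```python
def split_dataset(entries, text_threshold=0, audio_threshold=0.0):
    """Split entries into (train, validation) lists.

    entries: iterable of (path, text, duration_seconds) tuples (hypothetical layout).
    An entry goes to validation if its text or its audio is below the
    corresponding threshold; a threshold of 0 means "ignore this check".
    """
    train, validation = [], []
    for path, text, duration in entries:
        # len() counts every character in the string, spaces included.
        short_text = text_threshold > 0 and len(text) < text_threshold
        short_audio = audio_threshold > 0 and duration < audio_threshold
        if short_text or short_audio:
            validation.append((path, text, duration))
        else:
            train.append((path, text, duration))
    return train, validation
```

With clips between 6 and 11 seconds, a text threshold of 0 and an audio threshold of 8, only the clips shorter than 8 seconds would land in the validation list.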

> if you see that some of the clips were put in the validation.txt and you don't want them there, is it okay to just remove them manually from the validation.txt and rewrite them in the train.txt?

Yes, you can move whatever you want in and out of the validation dataset. The validation dataset is simply a portion of your training dataset set aside to remain as outside data to extrapolate with. In other words, it's a good way to avoid overtraining on your data and to keep an eye on what real-world output quality may look like.
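Moving entries by hand works, but it can also be scripted. The sketch below assumes each line has the form `audio/path.wav|transcription`, with the audio path before the first `|`; that layout and the helper name `move_to_train` are assumptions for illustration, so check your own train.txt before relying on it.

```python
from pathlib import Path


def _read_lines(path):
    """Return the non-empty lines of a text file."""
    return [l for l in Path(path).read_text(encoding="utf-8").splitlines() if l.strip()]


def _write_lines(path, lines):
    """Write lines back, one per line, with a trailing newline if non-empty."""
    Path(path).write_text("".join(l + "\n" for l in lines), encoding="utf-8")


def move_to_train(train_path, validation_path, clip_names):
    """Move the listed clips from validation.txt to train.txt.

    Assumption: the audio path is the text before the first '|' on each line.
    Returns the number of lines moved.
    """
    wanted = set(clip_names)
    val_lines = _read_lines(validation_path)
    moved = [l for l in val_lines if l.split("|", 1)[0] in wanted]
    kept = [l for l in val_lines if l.split("|", 1)[0] not in wanted]
    # Append at the end: order does not matter, per the answer on shuffling.
    _write_lines(train_path, _read_lines(train_path) + moved)
    _write_lines(validation_path, kept)
    return len(moved)
```

Back up both files first; the function overwrites them in place.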

> Does the order in which the clips are written in both .txt matter?

Nope, it'll get shuffled around in the dataloader and when sampled.

Author

Okay thank you for these clarifications!

I'm really struggling with the validation, though: the yellow validation line starts going up almost as soon as training starts :/ I've seen this post (https://git.ecker.tech/mrq/ai-voice-cloning/issues/248), which seems to show the same problem, but I don't really know what to do for now.

I think I'll start by overhauling my dataset, maybe cut it differently, and experiment...

Anyway, that will be the subject of another post. I'm closing this one. Thank you again for your amazingly quick responses :) always a pleasure

Reference: mrq/ai-voice-cloning#396