Why so many models? And about a thousand other questions :) #384

Closed
opened 2023-09-14 18:48:54 +07:00 by DoctorPopi · 9 comments

Hello! More than an issue, this is actually a question that I'm not sure I found an answer to on the wiki.

I have finished training a model on about 700 clips (is that too much? First question haha), and I've been surprised to see that in the end I have 101 output models in the finetune folder! Why so many? Are these saves of the state of the training at regular steps?

If this is indeed the case, then I should use the last one generated, is that correct?

Thank you, as you might have guessed, I'm pretty new to machine learning :)

about 700 clips (is that too much? First question haha)

The more the better. If I remember right, I've been able to get some finetune results with ~50.

I have 101 output models in the finetune folder! Why so many? Are these saves of the state of the training at regular steps?

Previous checkpoints, yes. There should be a setting in the web UI to configure pruning old checkpoints, but I can't remember what it's called.

If this is indeed the case, then I should use the last one generated, is that correct?

Correct; use the highest-numbered one. I believe the web UI will either list the latest or all the models under `Settings` > `Autoregressive Model`.
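
If you want to do it by hand instead, here's a rough sketch of picking (and optionally pruning) checkpoints in the finetune folder; the folder path and the step-numbered `.pth` naming are assumptions, so adjust them to whatever your training output actually looks like:

```python
from pathlib import Path

# Hypothetical finetune output folder; point this at your actual training output.
finetune_dir = Path("./training/my_voice/finetune/models")

# Assumes checkpoints are named with the step count first, e.g. "500_gpt.pth",
# "1000_gpt.pth", ... — adjust the sort key if yours are named differently.
checkpoints = sorted(
    finetune_dir.glob("*.pth"),
    key=lambda p: int(p.stem.split("_")[0]),
)

latest = checkpoints[-1]
print(f"Latest checkpoint: {latest}")

# Optionally prune everything but the newest few to save disk space.
keep = 3
for old in checkpoints[:-keep]:
    old.unlink()
    print(f"Removed {old}")
```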

Hello! Thank you very much for all this information.

If you don't mind me abusing your kindness, I have a few more questions regarding the overall fine tuning of a model, and I don't really know where else to ask, so I'll just post them here:

  1. I understood that technically this is fine tuning a model. But which model are we fine tuning exactly? I'm confused between the LJSpeech, Tortoise, Bark, VALL-E models, etc. How do you know which model you're finetuning? (Again, pardon the stupid basic question)

  2. Did you get a good result with 50 clips? How many epochs did you use? I watched a video that seemed to imply that too many epochs could cause degradation of the outputs

  3. I'm actively trying to train a model of a specific voice, and the model I've trained (with the 700 clips) with all the basic settings is already giving pretty good results, honestly (when run through fast Tortoise, that is; it really doesn't perform well in the mrq interface, though I don't understand why).

But I'm looking for a way to also train it on non verbal sounds, such as sighs, "hmmm", "errr..." , "shh" or "hahaha", that sort of thing. Would you happen to have some advice to give me to achieve this?

Again, thanks a million for the information. I'm only starting out in machine learning and I'm also new to the community, so I don't really know where to go to ask questions (if you have Reddit/Discord communities you'd advise me to join, I'd be really grateful).

Have a great evening / day and please take your time answering, no emergency :)

DoctorPopi changed title from Why so many models? to Why so many models? And about a thousand other questions :) 2023-09-15 19:27:05 +07:00
  1. I understood that technically this is fine tuning a model. But which model are we fine tuning exactly?

With the default backend selected (Tortoise), you're finetuning Tortoise's autoregressive model, which is the core of generating speech. LJSpeech, in the context of my documentation, refers to a dataset format that follows LJSpeech's formatting (a text file containing `path to the audio file|transcription` per line), which the web UI handles when creating the dataset. Bark and VALL-E are additional backends, where VALL-E *can* be trained/finetuned in the web UI, but the base model isn't all there yet.
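
As a concrete illustration of that LJSpeech-style format, here's a minimal sketch of writing and reading such a file; the clip names and transcriptions are made up:

```python
# Each line of an LJSpeech-style dataset file is "audio path|transcription".
lines = [
    "audio/clip_0001.wav|Hello there, how are you today?",
    "audio/clip_0002.wav|I was not expecting that at all.",
]

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

# Reading it back: split on the first "|" only, in case a transcription
# happens to contain a pipe character itself.
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        path, text = line.rstrip("\n").split("|", 1)
        print(path, "->", text)
```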

  2. Did you get a good result with 50 clips?

Honestly, can't really remember. I think for what it was worth it was fine.

How many epochs did you use? I watched a video that seemed to imply that too many epochs could cause degradation of the outputs

It's more about seeing where your mel loss ends up. Typically too small of a loss risks overtraining the model. My original rule of thumb was a loss sub-1.0, but I think there's better wisdom on that these days.

  3. [...] when run through fast Tortoise, that is; it really doesn't perform well in the mrq interface, though I don't understand why

There are some voices that require finagling with how you're generating your conditioning latents. In the web UI, you can check a box to use the original method for calculating them, which *should* align with how base TorToiSe and the 152334H/tortoise-tts-fast fork handle it.

If you're curious whether this is the issue, you can always take the `cond_latents_[...].pth` in your `./voices/{voice}/` folder and use it in TorToiSe/the fast fork's `./voices/{voice}/` folder, but make sure it's the only file in the folder and that it's named `cond_latents.pth`. It *should* load that as the reference instead, and if the output is terrible, then it's the latents.
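
A rough sketch of that copy step; the voice name and both paths are placeholders, so point them at your actual folders:

```python
import shutil
from pathlib import Path

# Placeholder paths; point these at your actual install locations.
src = Path("./voices/my_voice")                              # mrq web UI voice folder
dst = Path("../tortoise-tts-fast/tortoise/voices/my_voice")  # fast fork voice folder (a guess)

dst.mkdir(parents=True, exist_ok=True)

# Make the latents the only file in the destination folder,
# renamed to the name the fork expects.
for f in dst.iterdir():
    f.unlink()

latents = next(src.glob("cond_latents_*.pth"))
shutil.copy(latents, dst / "cond_latents.pth")
print(f"Copied {latents} -> {dst / 'cond_latents.pth'}")
```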

You might also get better mileage if you took your voice folder and narrowed it down to a handful to use as a reference. I don't remember if I thoroughly tested between a voice folder with lots of reference clips vs a sparing amount.

But I'm looking for a way to also train it on non verbal sounds, such as sighs, "hmmm", "errr..." , "shh" or "hahaha", that sort of thing. Would you happen to have some advice to give me to achieve this?

You might get some luck with hand-labeling your dataset. I haven't dabbled in it myself to find the best way to go about it, but I'm sure it's doable.

if you have Reddit/Discord communities you'd advise me to join, I'd be really grateful

I personally don't know of or participate in any, but I'm pretty sure I've seen some floated around here.

@DoctorPopi
I was using 20 short clips, and it seemed like one off clip ruined the whole voice. So back to the drawing board. I am wondering if I should have just pressed forward with more clips rather than fewer. I am curious how well 700 did... although I am just starting out with generating a few lines via one-shot first. I don't have very clean clips either... I know a lot of people are using this for video game characters.

Thank you for both your answers! My turn:

@mrq

It's more about seeing where your mel loss ends up. Typically too small of a loss risks overtraining the model. My original rule of thumb was a loss sub-1.0, but I think there's better wisdom on that these days.

I'm not sure I understand this, but from what I've learned, you're talking about the curve that shows how well the model matches the data? When you talk about sub-1.0 loss, do you mean a gap of about 1.0 from the X axis? And finally, if I understand correctly, overtraining the model means that you'll get noise on the curve?

You might get some luck with hand-labeling your dataset. I haven't dabbled in it myself to find the best way to go about it, but I'm sure it's doable.

Do you mean that in the validation step, I should write out the actual non-verbal sounds that might not have been transcribed? Or do you mean in the name of the clip itself? To be honest, I did indeed want to try correcting the validation.txt to actually transcribe the sounds by hand. I also wonder if I could teach the model to recognize some intonations of the voice by putting ** around a sentence that is said louder in the audio, for example. I'll test that too.

I have also noticed something very strange. Basically, I have two datasets: "Voice_extracted", the folder with the 700 audio clips ripped from the video game, and "Voice2", my own batch of clips that I made by recording the audio lines from YouTube and then cleaning them up in Audacity (so it is quite far from perfect). I trained my model on Voice_extracted. But in fast Tortoise, when I feed "Voice_extracted" to my model, it performs very poorly, whereas when I feed "Voice2" to it, the results are very, very good. It seems illogical, though; shouldn't the model perform much better on the actual audio that was used to train it?

@FergasunFergie
It has indeed happened to me that a single off clip ruined it all! It's definitely true that sometimes 5 excellent clips are way better than 50 shitty ones.
My 700-clip model is quite impressive, to be honest, aside from the weird thing I mentioned just above (the fact that the model doesn't perform well when given the audio that was actually used to train it). What's really cool too is that, compared to ElevenLabs, my model actually succeeds in reproducing the character's accent and the different "colors" of the character's voice (sometimes high, sometimes very gravelly, etc.).

My next test will be on a way smaller dataset though, to see the difference, I'll let you know!

A lot of thanks to both of you :)

When you talk about sub-1.0 loss, do you mean a gap of about 1.0 from the X axis?

The average loss during training being below 1.0.

And finally, if I understand correctly, overtraining the model means that you'll get noise on the curve?

An overfit model will perform very poorly when generalizing to text outside the training set.

Setting aside some of your data to the validation dataset will help figure out if the model is overfitting or not, as that data is unseen to the model.
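
Roughly, the pattern to look for is something like this (the loss numbers are made up, just to illustrate):

```python
# Made-up per-epoch loss values, purely for illustration.
train_loss = [2.8, 2.1, 1.7, 1.4, 1.2, 1.0, 0.9, 0.8]
val_loss   = [2.9, 2.2, 1.8, 1.6, 1.5, 1.5, 1.6, 1.8]

# The point where validation loss bottoms out while training loss keeps
# falling is roughly where the model starts to overfit.
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i])
print(f"Validation loss bottoms out at epoch {best_epoch + 1}; "
      f"checkpoints past that are likely overfitting.")
```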

Do you mean that in the validation step, I should write out the actual non-verbal sounds that might not have been transcribed? Or do you mean in the name of the clip itself?

After preparing the dataset but before actually training, add the additional non-verbal terms into `train.txt`.
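
For example, a hand-labeled line might go from the plain transcription to something like the sketch below; the `*sigh*`-style tags are just one made-up convention, not anything official:

```python
# Before (straight from the transcription):
#   audio/clip_0042.wav|Well, I suppose so.
# After (hand-labeled with the non-verbal sounds actually in the clip):
#   audio/clip_0042.wav|*sigh* Well... hmmm, I suppose so.

labeled_line = "audio/clip_0042.wav|*sigh* Well... hmmm, I suppose so.\n"
with open("train.txt", "a", encoding="utf-8") as f:
    f.write(labeled_line)
```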

After preparing the dataset but before actually training, add the additional non-verbal terms into `train.txt`.

Should I also add them in the actual whisper.json transcription?

About the average loss: even after 500 epochs and 700 clips, the best I got was 1.3. Is that still too high? I can't seem to reach 1.0, much less go below it...

EDIT: a small edit to inform you that labeling the non-verbal sounds in train.txt worked really well! Some sounds are not very good yet, but I think it's because I don't have enough clips with those sounds. I'll have to build more. I have also labeled them in the whisper.json, just to be safe.

EDIT 2: I think I'm starting to understand the relationship between the loss and the number of epochs. The more epochs there are, the lower the loss, if I'm correct.

Should I also add them in the actual whisper.json transcription?

That should be better in the event you regenerate the dataset, as the transcriptions are pulled from there and would overwrite `train.txt`.
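
If you want to patch whisper.json in bulk, something like the sketch below could work; the structure I'm assuming here (a JSON object keyed by clip name with a `text` field per entry) is a guess, so check what your actual file looks like first:

```python
import json

# Assumed structure: a JSON object keyed by clip name, where each entry has
# a "text" field holding the transcription. Open your whisper.json and check
# before running anything like this.
with open("whisper.json", encoding="utf-8") as f:
    data = json.load(f)

data["clip_0042.wav"]["text"] = "*sigh* Well... hmmm, I suppose so."

with open("whisper.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```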

About the average loss: even after 500 epochs and 700 clips, the best I got was 1.3. Is that still too high? I can't seem to reach 1.0, much less go below it...

That should be fine. Larger datasets tend toward a somewhat higher loss and would require putting in a lot more epochs to bring it lower.

Thank you very much for all that information! I think I have all I need for now; I'll make another post should more questions arise. A thousand thanks for your patience and for the great work you've done here, it's awesome!
