Resume training with expanded dataset #353
Reference: mrq/ai-voice-cloning#353
Hi mrq, great work here and thank you for replying to the issues opened.
Quick question - I have already completed training on my current dataset. If I want to add more audio files (of the same voice) to this existing dataset, how can I resume training?
I see that you have some instructions here on Resuming Training, but I wasn't sure whether this applies to the same dataset or if I can resume training after I've added more audio files to the existing dataset.
Thanks in advance!
The training script should be able to resume training from the last checkpoint without needing to update anything else, even if you modified the dataset.
The "Resume Training" or whatever it was called is for specifying which model weights to start from, which is usually the existing AR model. You can also make use of finetuning existing finetunes (for example, a language finetune being finetuned on a specific voice of that language), which is where the feature comes into play.
You can use the "Resume Training" to start from your previous finetune, the only difference would be a clean set of optimizer states and starting from iteration/epoch 0, but those effectively only govern the LR scheduling.
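To illustrate that last point in generic PyTorch terms, here is a minimal sketch (not the actual DLAS/ai-voice-cloning training code; the tiny model, learning rate, and milestones are placeholders): the resumed weights carry over, while the fresh optimizer and scheduler restart their step counting, which only affects the LR schedule.

```python
import torch

# Stand-in module; in the real setup this would be the AR model whose weights
# you resume from (e.g. a previously saved *_gpt.pth state dict).
model = torch.nn.Linear(8, 8)
weights = model.state_dict()

resumed = torch.nn.Linear(8, 8)
resumed.load_state_dict(weights)  # the finetuned weights carry over unchanged

# Fresh optimizer and scheduler: no Adam moments are carried over, and the step
# counter restarts at 0, so only the LR schedule is affected by the reset.
optimizer = torch.optim.AdamW(resumed.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 200], gamma=0.5)

for step in range(3):
    optimizer.step()                       # would normally follow a backward pass
    scheduler.step()
    print(step, scheduler.get_last_lr())   # decay milestones are counted from 0 again
```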
Thanks for the reply. I've tried to resume training from the last saved state path, but I'm running into an issue where the model stops training after only 10 minutes (going from 1005_gpt.pth to 1016_gpt.pth). I'd appreciate your advice.
I've added the new input wav files into ./voices/me2.
Prepare Dataset> I enabled "Skip Already Transcribed" and clicked "Transcribe and Process".
Generate Configuration> I set the same configurations (Epochs: 200, Batch Size: 20, Gradient Accumulation: 10) as before. Resume State Path =
<path>\training\me2\finetune\training_state\1005.state
However, in the "Run Training" tab, after I clicked "Train", the model trains for 10 minutes and stops at 1016_gpt.pth, saying that training has completed. I'm not sure why it stops after 10 minutes, or why it only advances from 1005_gpt.pth to 1016_gpt.pth. My original dataset had 100 wav files (it took 8h to train the first time), and my dataset now has 233 wav files. Shouldn't this expanded dataset take ~12h to train?
I have checked in ./training/me2 and see that train.txt contains the transcription of all 233 wav files (including the newly added ones).
Am I missing something in the configuration/settings?
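For a rough sanity check of that runtime expectation, here is a back-of-the-envelope sketch; the epochs-to-iterations conversion below is only an assumption for illustration and not necessarily how the configuration generator computes its target.

```python
import math

# Assumed conversion for illustration only -- the configuration generator may
# derive its iteration target differently.
def estimated_iterations(num_files: int, batch_size: int, epochs: int) -> int:
    return epochs * math.ceil(num_files / batch_size)

old_target = estimated_iterations(100, 20, 200)   # ~1000 iterations, roughly where 1005_gpt.pth landed
new_target = estimated_iterations(233, 20, 200)   # ~2400 iterations for the expanded dataset

hours_first_run = 8.0
remaining = new_target - 1005                      # resuming from 1005.state
print(f"rough remaining time: {hours_first_run * remaining / old_target:.1f} h")  # ~11 h
```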
Screenshots for reference: Training tab, Console log, Generate Configuration tab.
Oh, I suppose this is a bit of a mismatch between what is considered an "epoch" on my side when creating the YAML and what the DLAS training script considers, and the script may or may not also be retaining that bit of information in the checkpointed states, although I'm not really too sure how.
A simple fix would be to increase your "epochs count" from 200 to something higher; that should sort it out.
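A hedged illustration of the failure mode and the fix, under the assumption that training simply stops once the iteration counter restored from the resume state reaches a target derived from the epoch setting (the targets below, other than the observed 1016, are hypothetical, and this is not the actual DLAS check):

```python
# Assumed stopping logic, for illustration only -- not the actual DLAS check.
def training_finished(current_iteration: int, target_iterations: int) -> bool:
    return current_iteration >= target_iterations

resumed_iteration = 1005   # restored from 1005.state, as reported above
old_target = 1016          # whatever the 200-epoch configuration worked out to
new_target = 2540          # hypothetical, larger target after raising the epoch count

print(training_finished(resumed_iteration + 11, old_target))  # True  -> run ends at 1016_gpt.pth
print(training_finished(resumed_iteration + 11, new_target))  # False -> training continues
```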
I see. I've increased the epochs from 200 to 500, and the training seems to work now. Pretty interesting to see the loss metric increase so sharply after adding the new data to the dataset (which I assume is expected behaviour since the model has to undergo more training to learn the new data).
Thank you so much for your quick response, I was honestly mindblown when I saw that you replied so quickly.
I'll close this issue for now, fingers crossed it trains all the way. Will reopen if otherwise.
Thanks and keep up the great work :)
I have finished this round of training, which ran for 300 more epochs (after increasing from 200 to 500 epochs in "Generate Configuration"). I've now increased the setting to 800 epochs so that it will train for another 300 epochs (800 - 500).
Question - Why does loss_mel_ce spike up when I continue training from 500 epochs? I'm not adding any new data into the dataset.
Train logs:
Stopped at 500 epochs, "loss_mel_ce": 1.9060912132263184
Continuing from 500 epochs, "loss_mel_ce": 2.170781850814819
P.S. It looks like loss_mel_ce goes back to 1.90x at epoch = 501, but I'm curious why it spikes when I continue training from epoch = 500.