Is it possible to enable saving model / training_state mid-epoch? #39

Closed
opened 2023-02-26 07:07:11 +00:00 by sakharam_gatne · 2 comments

I'm currently training on a huge 100 GB dataset with more than 500,000 audio files. Training one epoch will take about 60 hours on my poor 2060 laptop. (Even Colab will probably run out of free quota before completing an epoch.)

So is there a way to enable saving the current model and training state multiple times while training one epoch?

EDIT: Currently, I can do it by changing `print_freq` and `save_checkpoint_freq` in the .yaml file, but it seems to mess up the epoch calculations.

EDIT 2: It would be great if there were a way to remove older checkpoints and training states during training, as they can eat up disk space quickly.
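For context, here is roughly what the first EDIT amounts to in the training YAML. Only `print_freq` and `save_checkpoint_freq` come from this thread; the `logger:` nesting and the values are my assumption based on BasicSR-style configs, not copied from this repo:

```yaml
# Illustrative fragment only -- the logger: section and the values
# are assumed, not taken from this repo's generated config.
logger:
  print_freq: 500             # log training stats every 500 iterations
  save_checkpoint_freq: 5000  # save model + training state every 5000 iterations
```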

Owner

Outside of manually editing the YAML, you can specify a decimal (like 0.5) for printing and it works. I guess this is an incidental side effect of switching from the sliders to just a number box: it lets you put in any kind of number. I assume you can provide a decimal for epochs this way too.

> It would be great if there were a way to remove older checkpoints and training states during training, as they can eat up disk space quickly.

I was just thinking of tackling that yesterday while I was cleaning up the parsing. There's nothing stopping me from reusing the checkpoint check to prune old ones. I'll get around to it later today.
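If it helps to see what a fractional value amounts to: a decimal like 0.5 presumably just gets multiplied against the per-epoch iteration count before it's written into the config. A minimal sketch of that arithmetic (function and parameter names are hypothetical, not from this repo):

```python
# Hypothetical sketch: converting a fractional "save every N epochs"
# value into an iteration-based frequency. 0.5 epochs at 200,000
# iterations/epoch means a checkpoint every 100,000 iterations.
def save_freq_iters(save_every_epochs: float, iters_per_epoch: int) -> int:
    return max(1, int(save_every_epochs * iters_per_epoch))

print(save_freq_iters(0.5, 200_000))  # -> 100000
```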

Owner

> It would be great if there were a way to remove older checkpoints and training states during training, as they can eat up disk space quickly.

Sorry for the wait; I happened to get to a point where I could add it, in commit bc0d9ab3ed. It's not elegant, and I haven't extensively tested it to make sure it doesn't nuke everything, but with some dummy data I was able to get it to work.

In the `Training` > `Run Training` tab, there's a slider to set how many checkpoints (and their training states) to keep on cleanup. Cleanup is done on startup and every time it detects a save message.
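The actual implementation is in the commit above; as a rough mental model only, pruning of this sort boils down to something like the following (a minimal sketch with made-up names and a flat `*.pth` layout, not this repo's code):

```python
from pathlib import Path

def prune_checkpoints(ckpt_dir: str, keep: int = 2) -> None:
    # Sort saved files oldest-first by modification time, then delete
    # everything except the newest `keep` entries.
    ckpts = sorted(Path(ckpt_dir).glob("*.pth"), key=lambda p: p.stat().st_mtime)
    for old in ckpts[:max(len(ckpts) - keep, 0)]:
        old.unlink()

# e.g. run once on startup and again after every detected save message:
# prune_checkpoints("./training/finetune/models", keep=2)
```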

mrq closed this issue 2023-02-28 01:11:03 +00:00