Is it possible to enable saving model / training_state mid-epoch? #39
Reference: mrq/ai-voice-cloning#39
I'm currently training on a huge 100 GB dataset with more than 500,000 audio files. Training one epoch will take about 60 hours on my poor 2060 laptop. (Even Colab will probably run out of free quota before completing an epoch.)
So is there a way to save the current model and training state multiple times within a single epoch?
EDIT: Currently, I can do it by changing `print_freq` and `save_checkpoint_freq` in the .yaml file, but it seems to mess up the epoch calculations.

EDIT 2: It would be great if there was a way to remove older checkpoints and training states during training, as they can eat up disk space quickly.
Outside of manually editing the YAML, you can specify a decimal (like 0.5) for the print frequency and it works. I guess this is an incidental side effect of switching from sliders to a plain number box: it lets you put in any kind of number. I assume you can provide a decimal for epochs the same way.
I was just thinking of trying to tackle that yesterday when I was cleaning up the parsing. There's nothing stopping me from just using the checkpoint-check and pruning old ones. I'll get around to it later today.
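The pruning idea above (delete all but the newest few checkpoints whenever a save is detected) can be sketched in a few lines. This is a minimal, hypothetical illustration, not the repo's actual implementation; the directory layout, file suffix, and function name are assumptions:

```python
import os

def prune_checkpoints(directory, keep=3, suffix=".pth"):
    """Keep only the `keep` newest files ending in `suffix` in `directory`.

    A sketch of keep-N-newest pruning; the real code may locate and
    name checkpoint/training-state files differently.
    """
    files = [
        os.path.join(directory, f)
        for f in os.listdir(directory)
        if f.endswith(suffix)
    ]
    # Sort newest-first by modification time, then delete the remainder.
    files.sort(key=os.path.getmtime, reverse=True)
    for stale in files[keep:]:
        os.remove(stale)
```

Running this once on startup and again after every detected save keeps disk usage bounded regardless of how low `save_checkpoint_freq` is set.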
Sorry for the wait, I just happened to incidentally get to a point where I could add it in commit bc0d9ab3ed. It's not elegant, and I haven't extensively tested it to make sure it doesn't nuke everything, but with some dummy data I was able to get it to work.

In the `Training` > `Run Training` tab, there's a slider to set how many datasets to keep on cleanup. Cleanup is done on startup and every time a save message is detected.