Two trainings with the exact same parameters result in different curves #418

Open
opened 2023-10-19 15:23:53 +07:00 by DoctorPopi · 4 comments

Hello!

This past weekend I finally succeeded in training a model without overfitting, with two nice training and validation curves gradually going down over 500 epochs.

Today, I tried reproducing the same conditions, but the curves look different: the validation curve stagnates and even starts showing signs of overfitting well before reaching 500 epochs.

I know that there is a fair amount of randomness involved in training a model, but I was wondering if that is normal? Of course I didn't change anything; I basically just archived the previous finetuning and hit Train again. Granted, I did reload the training configuration I had used before, and hit the "Save training configuration" button once more, but everything else is the same.

Is there maybe a way to retrieve the seed or random state of the training that worked? I'm searching through the logs for something specific that would need to be constant between two trainings, but I don't see anything.

Thank you for any advice


Try checking the seed in the previous config, where everything was fine, and run the training with that seed.


Hey @epp, thank you for your answer!

I don't see the seed parameter in the previous documents; can you help me pinpoint it?


> but I was wondering if that was normal?

For model training, it *can* boil down to the initial seed, but in the scope of finetuning TorToiSe's AR model, I don't think it's worth the trouble. I can't recall the specifics, but I have read somewhere about a model for something not-TorToiSe being trained by simply picking the best seed of the bunch and continuing from there. It's not really something I'd worry about with limited compute, since the time you're throwing away training other branches can simply be put back into the initial model to achieve similar performance.

> Is there maybe a way to retrieve the seed or random state of the training that worked?

DLAS's code mentions [here](https://git.ecker.tech/mrq/DL-Art-School/src/branch/master/dlas/train.py#L118) that it'll be printed out in the log file, and given [this](https://git.ecker.tech/mrq/DL-Art-School/src/branch/master/dlas/train.py#L114), you can set a manual seed in the training YAML.
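
In rough terms, that would look something like this in the training YAML. This is only a sketch from memory; the `manual_seed` key name and where it sits in the config are assumptions on my part, so double-check them against the train.py lines linked above:

```yaml
# Sketch only: the key name and its placement are assumptions, not copied
# from a verified DLAS config. Check what the linked train.py actually reads.
train:
  # ... your existing training options ...
  manual_seed: 1234   # reuse the seed value printed in the old training log
```

As far as I can tell, if no manual seed is set, one is picked at random on each run (hence the value being printed to the log), which would explain why two otherwise identical trainings can diverge.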


Hey MRQ, thank you for pinpointing this in the DLAS output file; I found it in the log indeed! I'm going to try that.

I'm not sure I understand why you don't think it's worth it, though?

This part in particular I don't understand:

> I have read somewhere about a model for something not-TorToiSe being trained by simply picking the best seed of the bunch and continuing from there. It's not really something I'd worry about with limited compute, since the time you're throwing away training other branches can simply be put back into the initial model to achieve similar performance.
