Two trainings with exact same parameters result in different curves
#418
Hello!
Last weekend I finally succeeded in training a model without overfitting: two nice training and validation curves gradually going down over 500 epochs.
Today, I tried reproducing the same conditions, but the curves look different: the validation curve stagnates and even starts showing signs of overfitting well before reaching the 500 epochs.
I know that there is a fair amount of randomness involved in training a model, but I was wondering if this is normal? Of course I didn't change anything; basically I just archived the previous finetuning and hit Train again. Granted, I did reload the training configuration I had used before and hit the "Save training configuration" button once more, but everything else is the same.
Is there maybe a way to retrieve the seed or random state of the training that worked? I'm searching through the logs for something specific that would need to be constant between two trainings, but I don't see anything.
Thank you for any advice
Try checking the seed in the previous config, where everything was fine, and run the training with that seed.
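For context, a minimal sketch of what fixing the seed buys you: with the same seed, all the pseudo-random draws that shape a run (weight init, data shuffling, dropout masks) repeat exactly, so two runs diverge only through other sources of nondeterminism. This is illustrated with Python's stdlib `random`; in an actual PyTorch training script the equivalent calls would be `random.seed(s)`, `numpy.random.seed(s)`, and `torch.manual_seed(s)`:

```python
import random

def seeded_draws(seed, n=5):
    """Simulate the random decisions a training run makes,
    using a dedicated RNG seeded explicitly."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Same seed -> identical sequence of "random" decisions.
assert seeded_draws(1234) == seeded_draws(1234)

# Different seed -> a different trajectory, which is why two
# "identical" configs can produce visibly different curves.
assert seeded_draws(1234) != seeded_draws(5678)
```

Note that even with all seeds fixed, GPU training can still differ slightly between runs unless the framework's deterministic modes are also enabled.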
Hey @epp thank you for your answer!
I don't see the seed parameter in the previous documents, can you help me pinpoint it?
For model training, results can boil down to the initial seed, but in the scope of finetuning TorToiSe's AR model, I don't think it's worth the trouble. I can't recall the specifics, but I have read somewhere about a model (for something not-TorToiSe) being trained by simply picking the best seed of the bunch and continuing from there. It's not really something I'd worry about with limited compute, since the time you'd throw away training the other branches can simply be put back into the initial model to achieve similar performance.
DLAS's code mentions here that the seed will be printed out in the log file, and given this, you can set a manual seed in the training YAML.
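For reference, a rough sketch of what setting that in the training YAML might look like. The key name `manual_seed` and its placement at the top level are assumptions based on typical DLAS configs, so verify against your own file:

```
# Top level of the DLAS training YAML (key name assumed; check your config).
name: my_finetune
# Set this to the seed printed in the log of the run you want to reproduce.
manual_seed: 1337
```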
Hey MRQ, thank you for pinpointing this in the DLAS output file, I found it in the log indeed! I'm going to try that.
I'm not sure I understand why you don't think it's worth the trouble, though? The part about picking the best seed of the bunch in particular I don't understand.