[BUG] TFTModel.load_from_checkpoint and .fit() is returning an error. #1090
Comments
Hi @criscapdechoy, can you try the following?
When loading the model from checkpoint, it is at epoch 30 (index 29). You need to tell the model to continue training until epoch 30 + additional_n_epochs = 35 (index 34).
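A minimal sketch of why the absolute count matters, assuming PyTorch Lightning-style resume semantics (a resumed trainer continues from the checkpoint's epoch index and stops once `max_epochs`, an absolute total, is reached); the loop below only mimics that behavior, it is not the Lightning implementation:

```python
def epochs_run_after_resume(checkpoint_epoch: int, max_epochs: int) -> int:
    """Mimic a Lightning-style resume: training continues from the
    checkpoint's next epoch index and stops at max_epochs.
    max_epochs is an absolute total, not a number of extra epochs."""
    epochs_run = 0
    for _epoch in range(checkpoint_epoch, max_epochs):
        epochs_run += 1
    return epochs_run

# Checkpoint saved after 30 completed epochs -> next epoch index is 30.
print(epochs_run_after_resume(30, 5))   # passing the *additional* count: 0 epochs run
print(epochs_run_after_resume(30, 35))  # passing the absolute total: 5 epochs run
```

This is why `epochs=5` trains nothing after a 30-epoch checkpoint, while `epochs=35` trains 5 more epochs.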
Hi @dennisbader! Thank you for your fast reply. If I use what you propose, the error no longer appears! However, something I was not expecting happens. When I try to retrain, i.e. after …
Hmm, it might be that we lost automatic support for that with the new PyTorch Lightning versions.
Hello again! P.S.: To get the epochs that the model checkpoint loaded, I have to call:
However, I am not sure if it is the best way to know how many epochs the model has been trained for. I don't know if there's any other attribute on the trainer containing this info?
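For reference, one way to inspect this is to read the epoch counter stored in the checkpoint itself; a Lightning checkpoint is a plain dict whose `"epoch"` key holds the last epoch index (0-based, matching the "epoch 30, index 29" wording above). The sketch below uses a hypothetical in-memory dict rather than a real `torch.load` call:

```python
# Hypothetical checkpoint contents; real Lightning checkpoints are loaded
# with torch.load(path) and contain more keys (state_dict, optimizer, ...).
checkpoint = {"epoch": 29, "global_step": 3000}

def completed_epochs(ckpt: dict) -> int:
    # "epoch" is a 0-based index, so completed epochs = index + 1
    return ckpt["epoch"] + 1

print(completed_epochs(checkpoint))  # 30
```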
@dennisbader, what's your opinion: do you think we should rework the way we handle epochs?
Closing this as @dennisbader detailed the reason for keeping the current behavior in #1689.
Describe the bug
First of all, we train a TFTModel for 30 epochs. Then we aim to do transfer learning by re-training the previous model, loading it from the last checkpoint and executing .fit(..., epochs=additional_n_epochs), but an error occurs:
To Reproduce
Expected behavior
We aim to get a training process that departs from the epoch of the last checkpoint and continues until the total number of epochs is my_model.n_epochs + additional_n_epochs.
System (please complete the following information):