Weights of Inner Optimizers Not Saved #2094
Comments
Can you prepare a minimal PR with a new test to cover your case? That way we could check whether it is similar to #1911.
Our issues are similar, but the real problem is the missing weights of the inner RAdam optimizer.

First, I reran my program with [...]. Same as #1911. This is because the variable [...].

Second, learning rate warmup could help RAdam re-accumulate the mean and variance statistics with small steps rather than "messing up" the network weights in the first few steps after resuming. This can, to some extent, alleviate the effect of the missing RAdam weights, but it is definitely not the correct solution.

Plus, I just checked the sizes of the checkpoint files: Ranger 3381 KB and RAdam 5070 KB. With an extra slot ("slow"), the Ranger checkpoint should not be smaller, which indicates that the weights of RAdam are missing. I think the reason is evident here.

If a PR is still needed, how should the test be conducted? Would saving and loading a model with a Lookahead-wrapped optimizer with slots be enough to demonstrate the problem?
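A sketch of such a test, under assumptions not stated in the thread (toy model, random data, a /tmp checkpoint path, and reaching the inner optimizer through Lookahead's private `_optimizer` attribute): train briefly, save weights, restore into a fresh model, and check that the inner optimizer's slots survived the round trip.

```python
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa

def build_model():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    opt = tfa.optimizers.Lookahead(tfa.optimizers.RectifiedAdam())
    model.compile(optimizer=opt, loss="mse")
    return model

x = np.random.rand(8, 4).astype("float32")
y = np.random.rand(8, 1).astype("float32")

# Train briefly, then checkpoint both model and optimizer state.
model = build_model()
model.fit(x, y, epochs=1, verbose=0)
model.save_weights("/tmp/ckpt")  # TF checkpoint format

# Restore into a fresh model; run one training step first so the
# optimizer's slot variables exist and can receive restored values.
restored = build_model()
restored.train_on_batch(x, y)
restored.load_weights("/tmp/ckpt")

# Compare the inner RAdam's "m" slot directly (reaching it through the
# private _optimizer attribute is an assumption about tfa's internals).
# Before the fix, this assertion fails: only Lookahead's "slow" slots
# round-trip, and RAdam's moments come back freshly initialized.
var, rvar = model.trainable_variables[0], restored.trainable_variables[0]
m_saved = model.optimizer._optimizer.get_slot(var, "m")
m_loaded = restored.optimizer._optimizer.get_slot(rvar, "m")
np.testing.assert_allclose(m_saved.numpy(), m_loaded.numpy(), atol=1e-6)
```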
Lookahead currently has no serialization test.
Check whether some of the original author's tests could be useful: https://github.com/CyberZHG/keras-lookahead/blob/master/tests/test_optimizers.py
/cc @CyberZHG
Also check that you are recovering custom objects on load, e.g.:
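A sketch of what that might look like, assuming the model was saved as a full HDF5 model ("model.h5" is a hypothetical file name): the custom optimizer classes must be passed back in, or deserializing the optimizer configuration fails.

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Map the serialized class names back to the tfa classes on load.
model = tf.keras.models.load_model(
    "model.h5",
    custom_objects={
        "Lookahead": tfa.optimizers.Lookahead,
        "RectifiedAdam": tfa.optimizers.RectifiedAdam,
    },
)
```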
Hi @BinyanHu, thanks for investigating this. Can you provide the minimal code snippet to reproduce the issue, e.g. the way you save the model? Thank you!
I think this is because the value you pass to your optimizer is [...]. On the other hand, I feel this is the real issue here.
* Update lookahead.py: Initial fix of tensorflow#2094 tensorflow#2102
* Fix linting
* Resolve name conflict with mixed precision
* Track baseline optimizer in avg
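The last bullet points at the underlying fix: the wrapper has to track the optimizer it wraps so that checkpointing traverses into it. Until such a fix lands, a public-API workaround sketch (model, data, and path are assumed, not from the thread) is to register the inner optimizer as an explicit checkpoint dependency:

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Register the inner RAdam as its own checkpoint dependency so its
# slot variables are saved alongside the model and the wrapper.
radam = tfa.optimizers.RectifiedAdam()
ranger = tfa.optimizers.Lookahead(radam)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=ranger, loss="mse")
model.fit(tf.random.normal([8, 4]), tf.random.normal([8, 1]),
          epochs=1, verbose=0)  # slot variables are created during training

ckpt = tf.train.Checkpoint(model=model, inner_optimizer=radam)
ckpt.write("/tmp/ranger_workaround")  # path assumed
```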
System information
Describe the bug
Resuming a training process requires restoring the optimizer state so that training continues right from the previous state without any loss of accuracy. Currently, the Keras model-saving interface keras.Model.save_weights checkpoints both the network parameters and the optimizer weights. However, when an optimizer is wrapped inside another, its weights cannot be saved by this means. For example, I was trying to use the Ranger optimizer, which is constructed by wrapping RAdam with Lookahead:
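The construction snippet itself is not preserved in this thread; a minimal reconstruction with the tensorflow_addons API (hyperparameter values assumed) might look like:

```python
import tensorflow_addons as tfa

# "Ranger" = RAdam wrapped in Lookahead; the values here are assumed.
radam = tfa.optimizers.RectifiedAdam(learning_rate=1e-3)
ranger = tfa.optimizers.Lookahead(radam, sync_period=6, slow_step_size=0.5)
```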
I noticed a performance drop on resuming training, and found that the weights of the inner RAdam were not saved into the checkpoint. (I checked the .index file in the checkpoint folder: there are no variable names like "m" and "v", only "slow", which holds the weights of Lookahead.) Therefore, after loading the weights from the file and restarting fitting, the weights of RAdam are randomly reinitialized. This could be because the weights of the inner optimizer are not automatically tracked.
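One way to verify which slots actually reached the checkpoint is to list the variables recorded in it; a sketch, with the checkpoint path assumed:

```python
import tensorflow as tf

# Print every variable name/shape stored in the checkpoint. With
# Ranger, only Lookahead's "slow" slots appear; RAdam's "m"/"v" do not.
for name, shape in tf.train.list_variables("/tmp/ckpt"):
    print(name, shape)
```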
Experiments

I trained two LeNets on the FashionMNIST dataset. All configurations are the same except for the optimizers. Both training runs were interrupted in the middle and then resumed.
Fig.: TensorBoard curves. Blue: Ranger (Lookahead + RAdam); orange: RAdam.
Note the "bump" in the Ranger curve caused by the reinitialization of the RAdam weights. Apparently, the weights of the inner optimizer are not correctly saved.