This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[Torch classifier agent][bug fix]Fix optimizer loading in classifier agent #4406

Merged 1 commit on Mar 9, 2022


dexterju27
Contributor

Patch description
I encountered an issue where my jobs would crash with KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer" during final evaluation.

After some digging, I found the issue was the following:

  1. In the torch classifier agent, we didn't load the optimizer state back when loading a checkpoint; instead, we created a new optimizer from model.parameters().
  2. This breaks training in the following scenario (when the model was saved during warm-up):
  3. The optimizer was created without initial_lr; initial_lr was added by LambdaLR when last_epoch = -1 (torch/optim/lr_scheduler.py:35).
  4. The optimizer was then saved with initial_lr.
  5. However, when loading the optimizer back, we ignored the optimizer state dict and created a new optimizer that doesn't have the initial_lr key.
  6. This crashes the warm-up schedule initialization: because we are resuming with last_epoch = the number of training steps, the scheduler expects an initial_lr that is not in the optimizer.
    self.init_optim(optim_params)
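
The sequence above can be illustrated with a small plain-Python sketch. This is a toy stand-in for the scheduler's constructor check, not PyTorch's actual code: attach_scheduler and the dict-based param groups are hypothetical names used only for illustration, and the sketch assumes the scheduler adds initial_lr only on a fresh start (last_epoch = -1) and requires it on resume.

```python
# Toy stand-in for the scheduler's constructor check (assumption: simplified,
# not PyTorch's implementation). Param groups are modeled as plain dicts.
def attach_scheduler(param_groups, last_epoch=-1):
    """Mimic how a scheduler treats `initial_lr` on fresh start vs. resume."""
    if last_epoch == -1:
        # Fresh start: record the starting LR in each group.
        for group in param_groups:
            group.setdefault("initial_lr", group["lr"])
    else:
        # Resume: the key must already be present in every group.
        for i, group in enumerate(param_groups):
            if "initial_lr" not in group:
                raise KeyError(
                    f"param 'initial_lr' is not specified in param_groups[{i}] "
                    "when resuming an optimizer"
                )
    return [group["initial_lr"] for group in param_groups]


# First run: the scheduler adds initial_lr, and the checkpoint keeps it.
trained_groups = [{"lr": 0.1}]
attach_scheduler(trained_groups, last_epoch=-1)
checkpoint = [dict(group) for group in trained_groups]

# Buggy reload: brand-new groups are built and the saved state is ignored.
fresh_groups = [{"lr": 0.1}]
try:
    attach_scheduler(fresh_groups, last_epoch=100)
    resume_error = None
except KeyError as err:
    resume_error = err

# Fixed reload: restore the saved groups before attaching the scheduler.
restored_groups = [dict(group) for group in checkpoint]
restored_base_lrs = attach_scheduler(restored_groups, last_epoch=100)
```

In this sketch only the buggy reload raises, because the freshly built groups never went through the last_epoch = -1 branch that adds the key.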

Proposed changes:
Change this line to do what the torch generator agent does: load the optimizer state back instead of creating a new optimizer from model.parameters().

was_reset = self.init_optim(

Testing steps
You can reproduce the issue by setting a high number of warm-up steps, saving the model during warm-up, and loading that model back when resuming training. The issue was fixed after the proposed change.
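
The reproduction can also be approximated outside ParlAI with plain PyTorch. The following is a minimal sketch under stated assumptions, not the agent's actual code: the Linear model, the SGD optimizer, the warmup function, and the resume step of 5 are all illustrative choices. It shows a fresh optimizer failing on resume and an optimizer restored with load_state_dict succeeding.

```python
import torch


def warmup(step, warmup_steps=10):
    """Illustrative linear warm-up multiplier (hypothetical schedule)."""
    return min(1.0, (step + 1) / warmup_steps)


model = torch.nn.Linear(4, 2)

# First run: constructing LambdaLR with the default last_epoch=-1 adds
# `initial_lr` to each param group, so the saved state dict contains it.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup)
saved_optim_state = optimizer.state_dict()

# Buggy resume: a fresh optimizer has no `initial_lr`, so building the
# scheduler with last_epoch != -1 raises KeyError.
fresh = torch.optim.SGD(model.parameters(), lr=0.1)
try:
    torch.optim.lr_scheduler.LambdaLR(fresh, warmup, last_epoch=5)
    fresh_failed = False
except KeyError:
    fresh_failed = True

# Fixed resume: load the saved optimizer state first, which restores the
# param groups (including `initial_lr`), then build the scheduler.
restored = torch.optim.SGD(model.parameters(), lr=0.1)
restored.load_state_dict(saved_optim_state)
resumed = torch.optim.lr_scheduler.LambdaLR(restored, warmup, last_epoch=5)
has_initial_lr = "initial_lr" in restored.param_groups[0]
```

The ordering matters in this sketch: the state dict has to be loaded before the scheduler is constructed, since the check for initial_lr happens in the scheduler's constructor.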

@dexterju27 dexterju27 merged commit d4fded0 into main Mar 9, 2022
@dexterju27 dexterju27 deleted the fix-optim-torch-classifier branch March 9, 2022 15:21