[Torch classifier agent][bug fix]Fix optimizer loading in classifier agent #4406

dexterju27 · 2022-03-08T23:20:08Z

Patch description
I encountered an issue where my jobs would crash with KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer" during final evaluation.

After some digging, I found the issue was the following:

In the torch classifier agent, we didn't load the optimizer state back when loading a checkpoint; instead, we created a new optimizer from model.parameters.
This breaks training in the following scenario (when the model was saved during warm-up):
The optimizer was created without initial_lr; initial_lr was added by LambdaLR when last_epoch = -1 (torch/optim/lr_scheduler.py:35).
The optimizer was then saved with initial_lr.
However, when loading the optimizer back, we ignored the optimizer state dict and created a new optimizer that doesn't have the initial_lr key.
This crashes the warm-up schedule initialization: since we resume from last_epoch = training steps, the scheduler expects an initial_lr that is not in the optimizer.
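
The failure can be reproduced outside ParlAI with plain PyTorch. The sketch below is only an illustration, not the agent's code: the `Linear` model, the SGD optimizer, the `warmup` function, and the resume step of 10 are assumptions chosen for brevity.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(4, 2)  # stand-in model (assumption)


def warmup(step):
    # Hypothetical linear warm-up over 100 steps.
    return min(1.0, (step + 1) / 100)


# First run: the scheduler starts at last_epoch=-1 and adds 'initial_lr'
# to every param group of the optimizer.
first_opt = torch.optim.SGD(model.parameters(), lr=0.1)
LambdaLR(first_opt, warmup)
has_key_after_first_run = "initial_lr" in first_opt.param_groups[0]

# Resume without restoring the optimizer state: a brand-new optimizer has no
# 'initial_lr', so a scheduler resumed at last_epoch=10 raises a KeyError.
fresh_opt = torch.optim.SGD(model.parameters(), lr=0.1)
try:
    LambdaLR(fresh_opt, warmup, last_epoch=10)
    resume_error = None
except KeyError as err:
    resume_error = str(err)

print(resume_error)
```

The key point is that `initial_lr` lives in the optimizer's param groups, so it is lost whenever the saved optimizer state is discarded.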

ParlAI/parlai/core/torch_classifier_agent.py

Line 561 in 289942b

self.init_optim(optim_params)

Proposed changes:
Change this line to do what the torch generator agent does: load the optimizer state back instead of creating a new optimizer from model.parameters.

ParlAI/parlai/core/torch_generator_agent.py

Line 525 in 289942b

was_reset = self.init_optim(
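
As a rough illustration of the idea in plain PyTorch (not the ParlAI implementation), the sketch below restores the saved optimizer state before the scheduler is rebuilt. The helper `build_optimizer`, the checkpoint keys `"model"` and `"optimizer"`, the SGD optimizer, and the `warmup` function are all hypothetical.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer(model, optim_state=None):
    """Hypothetical helper: create an optimizer, then restore saved state if given."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    if optim_state is not None:
        # Restores the saved param groups (including 'initial_lr')
        # and any per-parameter state.
        optimizer.load_state_dict(optim_state)
    return optimizer


def warmup(step):
    # Hypothetical linear warm-up over 100 steps.
    return min(1.0, (step + 1) / 100)


# Training run: the scheduler adds 'initial_lr', then a checkpoint is saved.
model = torch.nn.Linear(4, 2)
optimizer = build_optimizer(model)
LambdaLR(optimizer, warmup)
checkpoint = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}

# Resume: load the optimizer state back instead of starting from scratch.
resumed_model = torch.nn.Linear(4, 2)
resumed_model.load_state_dict(checkpoint["model"])
resumed_opt = build_optimizer(resumed_model, checkpoint["optimizer"])
scheduler = LambdaLR(resumed_opt, warmup, last_epoch=10)  # no KeyError
resumed_ok = "initial_lr" in resumed_opt.param_groups[0]
```

Because the restored param groups carry `initial_lr`, the scheduler can be constructed with a non-negative `last_epoch` without raising.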

Testing steps
You can reproduce the issue by setting a long warm-up, saving a model during warm-up, and loading that model back when resuming training. The issue was fixed after the proposed change.

fix optim loading in classifier agent

ff120fe

facebook-github-bot added the CLA Signed label Mar 8, 2022

dexterju27 requested review from emilydinan, jxmsML, stephenroller and jaseweston March 8, 2022 23:20

stephenroller approved these changes Mar 9, 2022


dexterju27 merged commit d4fded0 into main Mar 9, 2022

dexterju27 deleted the fix-optim-torch-classifier branch March 9, 2022 15:21
