Fix LR scheduler cooldown #3719

Merged
stephenroller merged 3 commits into master from lrschedulemax on Jun 15, 2021

Conversation

stephenroller (Contributor) commented:

Patch description
Context:

  • Originally in ParlAI, fixed LR schedulers such as cosine and linear consumed (warmup_updates + max_lr_steps) updates, eventually cooling down to 0.
  • [train] New training options for logging/validation based on number of steps #3379 changed this so that only max_lr_steps updates are consumed, but did not make the cooldown correspondingly faster, so the cooldown no longer finished by the end of training.
  • This PR changes the schedule so that the full cooldown completes by the end of max_lr_steps (see the sketch below).
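
As a rough sketch of the behavior change described above (not ParlAI's actual scheduler code; warmup_updates and max_lr_steps mirror the options named in the bullets, and the pre-fix formula here is an illustrative assumption):

```python
def decay_mult_old_pacing(step: int, warmup_updates: int, max_lr_steps: int) -> float:
    # Original pacing: the cooldown only finished after
    # warmup_updates + max_lr_steps total updates. Once #3379 capped training
    # at max_lr_steps updates, this pacing meant the LR never reached 0.
    return max(0.0, 1.0 - step / (warmup_updates + max_lr_steps))


def decay_mult_after_fix(step: int, max_lr_steps: int) -> float:
    # Post-fix pacing: the cooldown completes exactly at step == max_lr_steps.
    return max(0.0, 1.0 - step / max_lr_steps)


# With warmup_updates = 100 and max_lr_steps = 1000, at the final update:
#   decay_mult_old_pacing(1000, 100, 1000) ≈ 0.09  (cooldown unfinished)
#   decay_mult_after_fix(1000, 1000)       == 0.0  (cooldown complete)
```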

Testing steps
Adjusted CI tests; added new assertions.

@emilydinan (Contributor) left a comment:


thanks for the fix! this lgtm

         if optim_states and saved_optim_type != opt['optimizer']:
             # we changed from adam to adamax, or sgd to adam, or similar
             logging.warning('Not loading optim state since optim class changed.')
-            return False
+            return True
         elif optim_states:
             # check for any fp16/fp32 conversions we need to do
             optimstate_fp16 = 'loss_scaler' in optim_states
Contributor:

I can't leave a comment on that line directly, but are the semantics correct for line 1099, the elif not optimstate_fp16 and self.fp16 block? Are we always returning True because of the lower-precision conversion?

stephenroller (Contributor, Author):

agreed
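
A minimal sketch of the check discussed in this thread, assuming only what the diff above shows: an fp16 optimizer state dict carries a 'loss_scaler' entry, so its presence is what distinguishes fp16-saved state from fp32 state. The checkpoint path and the 'optimizer' key are hypothetical, not ParlAI's exact checkpoint layout.

```python
import torch

# Hypothetical checkpoint file and key, for illustration only.
states = torch.load('model.checkpoint', map_location='cpu')
optim_states = states.get('optimizer', {})

# Mirrors the check in the diff above: fp16 optimizer state stores a loss scaler.
optimstate_fp16 = 'loss_scaler' in optim_states
print('saved optimizer state came from an fp16 run:', optimstate_fp16)
```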

         self.scheduler = optim.lr_scheduler.LambdaLR(optimizer, self._linear_lr)

     def _linear_lr(self, step):
         # this multiplicative factor ensures linear decay rate
         # lr_mult = float(self.max_lr_steps - step - 1) / float(self.max_lr_steps - step)
-        lr_mult = max(0.0, 1e-6 + (1.0 - step / self.max_lr_steps) * (1 - 1e-6))
+        lr_mult = max(0.0, 1.0 - step / self.max_lr_steps)
Contributor:

We don't need the 1e-6 anymore?

stephenroller (Contributor, Author):

I made an executive call to let it actually go to 0 :P
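
A small demonstration of the boundary behavior being discussed, using torch's LambdaLR the same way the diff above does; the tiny max_lr_steps, dummy parameter, and base LR of 1.0 are arbitrary choices for illustration.

```python
import torch

max_lr_steps = 4
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / max_lr_steps)
)

for step in range(max_lr_steps + 1):
    print(step, optimizer.param_groups[0]['lr'])
    optimizer.step()
    scheduler.step()

# Prints 1.0, 0.75, 0.5, 0.25, 0.0: with the old 1e-6 floor the final
# multiplier stayed at 1e-6, while the simplified version reaches exactly 0.
```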

@stephenroller stephenroller merged commit d3713fe into master Jun 15, 2021
@stephenroller stephenroller deleted the lrschedulemax branch June 15, 2021 23:53