support for native amp #1561
@mcarilli sanity check this loading?
Looks good if you fix the saving: https://github.com/PyTorchLightning/pytorch-lightning/pull/1561/files#r413418705

Like saving, loading should occur either at the very beginning of an iteration (before any training-related scaler calls for that iteration) or at the end of an iteration, after scaler.update(). It doesn't make a lot of sense to load state dicts at the end of an iteration, but if the saved state originated from a scaler.state_dict() call at the end of, say, iteration 1000 (i.e. after iteration 1000's call to scaler.update()), then it's ok to call load_state_dict at the beginning of iteration 1001 to resume.
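For illustration, a minimal sketch of that placement, assuming a plain PyTorch loop rather than Lightning's internals (the checkpoint keys other than native_amp_scaling_state are made up for the example):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

# Restore everything at the very start, before any scaler call of the new iteration.
checkpoint = torch.load("ckpt.pt")
model.load_state_dict(checkpoint["state_dict"])
optimizer.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["native_amp_scaling_state"])

for batch in (torch.randn(4, 8, device="cuda") for _ in range(10)):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch).sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    # After update() is the other consistent point in the iteration: capturing
    # (or restoring) state here also keeps model, optimizer and scaler in sync.
```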
@mcarilli sanity check this saving?
state_dict is a method, as for modules and optimizers, so checkpoint['native_amp_scaling_state'] = self.scaler.state_dict() is what you want. checkpoint['native_amp_scaling_state'] = self.scaler.state_dict would stash the bound-method object itself :P
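A tiny illustration of that distinction (the scaler and checkpoint dict here are stand-ins, not the PR's code):

```python
import torch

scaler = torch.cuda.amp.GradScaler()
checkpoint = {}

# Correct: calling the method stores the scaler's state (a plain dict).
checkpoint["native_amp_scaling_state"] = scaler.state_dict()

# Wrong: this would stash the bound method object itself, not the state.
# checkpoint["native_amp_scaling_state"] = scaler.state_dict
```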
Also you should make sure state_dict() is retrieved either at the very beginning of an iteration (before any scaler method calls) or at the very end (after scaler.update()), and that the model and optimizer state dicts are saved at that same spot. I can't tell from these lines alone if the calling code occurs at a spot that obeys those criteria.
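A sketch of the saving side under those criteria; the helper and the keys other than native_amp_scaling_state are hypothetical:

```python
import torch

def save_checkpoint(path, model, optimizer, scaler):
    # Call this either before any scaler use in an iteration or right after
    # scaler.update(), so all three state dicts describe the same point in time.
    torch.save(
        {
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "native_amp_scaling_state": scaler.state_dict(),
        },
        path,
    )
```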
i thought it was a property haha, but i guess it's consistent with the other state_dict() calls haha
lol i see. it's consistent with the rest
Another thing to consider is that with torch.cuda.amp, it's permissible to save a checkpoint from a run that uses amp and resume it without amp, and vice versa. I think your if criteria are flexible enough that both those cases can happen naturally with the appropriate user args, but I'm not sure just from looking at it.
Yeah, this code works.

Case 1: train with amp, resume with amp
works fine

Case 2: train with amp, resume without amp
in this case Lightning loads the amp state, but amp is disabled, so the user never touches it

Case 3: train regular, resume regular
works fine

Case 4: train regular, resume with amp
in this case the checkpoint has no amp state, so the model starts from its regular weights and training runs with a freshly initialized scaler
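A hedged sketch of restore logic that matches the four cases above (a hypothetical helper, not the PR's actual code):

```python
def restore_native_amp_state(checkpoint, scaler):
    if "native_amp_scaling_state" not in checkpoint:
        # Cases 3 and 4: no saved scaler state, so an enabled scaler starts fresh.
        return
    # Cases 1 and 2: restore the saved state. When amp is disabled (case 2) the
    # restored state is never exercised, because no scaler calls are made.
    scaler.load_state_dict(checkpoint["native_amp_scaling_state"])
```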