Stochastic Rounding Optimizers #17
This is only correct if the kernel's number of registers per thread is <= 32; otherwise register pressure limits your occupancy. You can recompile kernels with `-ptxas-options=-v` as an nvcc option, and nvcc will print how many registers they use (this is easiest to do with the kernels in an extension; I'm not sure how you would pass that option to nvcc in a PyTorch build).
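The 32-register threshold presumably comes from dividing an SM's register file by its maximum number of resident threads. Below is a minimal host-side C++ sketch of that arithmetic. The limits in `kAssumedSm` (65,536 registers and 2,048 resident threads per SM) are assumptions not stated in the comment, and they vary by GPU architecture. The helper names are hypothetical, and the estimate ignores register allocation granularity, shared memory, and block-size limits.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM hardware limits. These are inputs to the estimate, not queried
// from a device.
struct SmLimits {
    int registers_per_sm;    // 32-bit registers available on one SM
    int max_threads_per_sm;  // maximum resident threads on one SM
};

// Assumed values for illustration only; check the specs of your own GPU.
const SmLimits kAssumedSm{65536, 2048};

// Upper bound on resident threads per SM imposed by register usage alone.
int register_limited_threads(int regs_per_thread, const SmLimits& sm) {
    assert(regs_per_thread > 0);
    return std::min(sm.max_threads_per_sm,
                    sm.registers_per_sm / regs_per_thread);
}

// The same bound expressed as a fraction of the SM's thread capacity.
double register_limited_occupancy(int regs_per_thread, const SmLimits& sm) {
    return static_cast<double>(register_limited_threads(regs_per_thread, sm)) /
           sm.max_threads_per_sm;
}
```

Under these assumed limits, a kernel using 32 registers per thread can still reach 2,048 resident threads, while one using 64 registers per thread is capped at 1,024 threads, i.e. 50% occupancy.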
My biggest concern is, upstream will probably ask you to rewrite this with TensorIterator in some form, as @zasdfgbnm hinted.
You generate 4 random numbers and only use 2. I don't think that's a big problem, though.
Keep this comment.
Hardcoding float here is probably fine IMO, but natalia may ask you to change this to `in_type` (which might require making `get_delta_fp16` a template), and replace the `__float2half_rz` call with a wrapper function that has several overloads, where the float overload calls `__float2half_rz`.
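The overload-plus-template shape suggested here could look roughly like the host-only C++ sketch below. Everything in it is a hypothetical stand-in so that it compiles without CUDA: `HalfBits` stands in for `__half`, `float_to_half_rz_host` stands in for `__float2half_rz` (it truncates toward zero but is not guaranteed to be bit-identical to the intrinsic, e.g. for NaN), and `truncate_to_half` stands in for a templated caller such as `get_delta_fp16`, whose real body is not shown in this thread.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for CUDA's __half: raw IEEE binary16 bits.
struct HalfBits {
    std::uint16_t bits;
};

inline HalfBits make_half(std::uint32_t v) {
    return HalfBits{static_cast<std::uint16_t>(v)};
}

// Host stand-in for __float2half_rz: float -> half, rounding toward zero.
inline HalfBits float_to_half_rz_host(float x) {
    std::uint32_t f;
    std::memcpy(&f, &x, sizeof f);  // reinterpret the float's bits safely
    const std::uint32_t sign = (f >> 16) & 0x8000u;
    const std::uint32_t exp = (f >> 23) & 0xFFu;
    const std::uint32_t mant = f & 0x7FFFFFu;
    if (exp == 0xFFu)  // infinity or NaN
        return make_half(sign | (mant ? 0x7E00u : 0x7C00u));
    const int e = static_cast<int>(exp) - 127 + 15;  // rebias the exponent
    if (e >= 31)  // too large for half: clamp to the largest finite value
        return make_half(sign | 0x7BFFu);
    if (e <= 0) {  // result is subnormal or zero in half precision
        if (e < -10) return make_half(sign);
        return make_half(sign | ((mant | 0x800000u) >> (14 - e)));
    }
    // Normal case: drop the low 13 mantissa bits (truncation).
    return make_half(sign | (static_cast<std::uint32_t>(e) << 10) | (mant >> 13));
}

// Wrapper with one overload per supported input type.
inline HalfBits to_half_rz(float x) { return float_to_half_rz_host(x); }
inline HalfBits to_half_rz(HalfBits x) { return x; }  // already half: no-op

// Templated caller: overload resolution picks the conversion for in_type.
template <typename in_type>
HalfBits truncate_to_half(in_type value) {
    return to_half_rz(value);
}
```

With this stand-in, `truncate_to_half(1.0019f).bits` is `0x3C01`, whereas round-to-nearest would give `0x3C02`. Supporting another `in_type` then only requires adding an overload of `to_half_rz`, without touching the templated caller.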