-
Notifications
You must be signed in to change notification settings - Fork 7
[CUDA graphs] [JIT] Capture-safe RNG in nvfuser #593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CUDA graphs] [JIT] Capture-safe RNG in nvfuser #593
Conversation
ss << nvfuser_resources::grid_reduction_cu; | ||
ss << nvfuser_resources::broadcast_cu; | ||
ss << nvfuser_resources::welford_cu; | ||
// How to define PhiloxCudaState for nvtrc, another option: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the best option with what we have today. the new file under nvfuser_resources should be self-contained (unfortunately it can't use #include
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh oh, that means I either have to duplicate my struct definition from CUDAGeneratorImpl.h manually (brittle) or have something autogen the duplication (complicated).
out of curiosity why can't it use #include?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, you'd need to duplicate the definition. besides the options you mentioned, another idea would be to have the canonical definition under nvfuser_resources, and #include that one from somewhere else - I mention this for brainstorming only, as I think it would be a hack and I don't recommend it.
#include
doesn't work since we are missing the mechanisms to setup the nvrtc include locations (we prototyped this and it can be done, but upstream maintainers had reservations - I still hope we'll overcome the objections at some point but we're not there yet).
the alternative currently implemented is a custom preprocessor for the files under nvfuser_resources, which generates string literals from the .cu files. Technically it would be possible to do a form of C preprocessing at that point, although that would introduce some obvious (and maybe less obvious) complications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, we may have a better solution: if the common code is already in a standalone file we could give it the same treatment as we do for files under nvfuser_resources. The only constraint is that the common file should be valid for textual insertion in the kernel "preamble" - ex. it can't have #includes itself
} | ||
|
||
|
||
namespace at { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this have to be in that at namespace? this could lead to subtle odr violations which would likely be hard to troubleshoot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess not, but I don't understand why it's a problem. Classes are always defined in headers without odr violation as long as member functions in the header are inline. Why is it a problem in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the headers generally work since the definitions are identical. If the definitions diverge, even in seemingly harmless ways then it's undefined what happens - in practice the issues are hard to troubleshoot since there's no toolchain support today.
PS. even with headers you can get ODR issues if the translation units are compiled with different flags or with different macros
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it because this stuff will become a distinct .so from the similarly-defined stuff in ATen?
No, that's an ABI compatibility thing. The potential problem is if anyone ends up including both the aten header and the nvfuser_resources - it seems a bit stretched in the current setup, but let's say someone wants to prototype some CUDA changes by hand. Or if we end up adding support for real header includes to hvrtc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If CPU-compiled and nvrtc-compiled instances of PhiloxCudaState are ABI-incompatible, aren't we screwed regardless because CPU instances are bitcopied into the kernel instances by cuLaunchKernel? How does putting the definition nvrtc compiles in a different namespace save us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the other way around: putting things in the same namespace may lead to extra issues even if we're layout/abi compatible.
// If you change the definition there, you must change the definition here to match. | ||
struct PhiloxCudaState { | ||
PhiloxCudaState() = default; | ||
PhiloxCudaState(const PhiloxCudaState&) = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed
int64_t* ptr; | ||
}; | ||
|
||
uint64_t seed_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd initialize all the members here
}; | ||
|
||
namespace cuda { | ||
namespace philox { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question as the at namespace
// Copy pasted from ATen/cuda/CudaGraphsUtils.cuh, | ||
// because we don't want to codegen directly from something in ATen. | ||
// If you change the definition there, you must change the definition here to match. | ||
__device__ __forceinline__ std::tuple<uint64_t, uint64_t> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should get rid of std::tuple here (create a dedicated structure instead)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eager mode kernels always used std::tuple<uint64_t, uint64_t>, even before my changes. What's the problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, but this change bleeds significantly into eager mode kernels. It probably won't cause merge conflicts the next time you pull in upstream, but it might.
aten/src/ATen/CUDAGeneratorImpl.h
Outdated
bool captured_ = false; | ||
}; | ||
// Pulls raw PhiloxCudaState definition into at:: as expected by eager consumers | ||
#include <ATen/cuda/detail/PhiloxCudaStateRaw.cuh> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we use this refactoring strategy (which I think it's much better than copy & paste) - then I we should make the shared headers be as self contained as possible. So move the namespace at
into PhiloxCudaStateRaw.cuh instead of depending on the surrounding context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but how will that play with your earlier concerns about namespaces? Tbh im not sure how exactly you wanted me to organize things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With copy-and-paste, you want the minimal contract - even small source code changes can have unexpected consequences so you want compatible but different definitions.
With a shared "header" we can afford the same namespace. (we can still use different namespaces, but it's not a requirement to do so). Since we can afford same namespace, then the problem becomes packaging the shared file in a robust and intuitive way. We should factor the shared definition as a regular header - which means it shouldn't not depend much on the context where it's included.
Does this make sense?
…udaState because PhiloxCudaState lives in ATen, and ATen can't contain any __device__ annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a few small comments
@@ -1,4 +1,3 @@ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this line removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no reason, it was random whitespace at the top of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…_rng_jit_for_csarofeen
Don't merge this yet. After discussions with Christian, I'm going to split all the non-nvfuser diffs into a separate PR against upstream. They are simple so it should land quickly. The nvfuser diffs will stay in this PR, and because they require the non-nvfuser diffs, we'll keep this PR in limbo until the non-nvfuser diffs are pulled back from upstream into 20_12_3_devel. Ensuring all non-nvfuser diffs land upstream first alleviates the need for Facebook to review any non-nvfuser diffs when the big merge of our devel branch into upstream eventually happens. |
I moved all the diffs outside torch/csrc/jit to pytorch#51580. This PR can be merged after pytorch#51580 lands upstream and we pull its changes back in. |
…_rng_jit_for_csarofeen
Tests passed merging this one. |
(formerly pytorch#50148, @csarofeen asked me to PR nvfuser diffs here first)
Eager mode RNG kernels needed some minor changes to interact safely with cuda graphs. This PR extends those changes to the kernels generated by nvfuser.
One thing I'm unclear on is the best way to let NVRTC know the definition of PhiloxCudaState (defined in ATen/CUDAGeneratorImpl.h). I suggested two options in comments (1, 2) but im not sure.
Another thing I'm unclear on is the best way to test these diffs.