Skip to content

Conversation

mcarilli
Copy link

@mcarilli mcarilli commented Jan 7, 2021

(formerly pytorch#50148, @csarofeen asked me to PR nvfuser diffs here first)

Eager mode RNG kernels needed some minor changes to interact safely with cuda graphs. This PR extends those changes to the kernels generated by nvfuser.

One thing I'm unclear on is the best way to let NVRTC know the definition of PhiloxCudaState (defined in ATen/CUDAGeneratorImpl.h). I suggested two options in comments (1, 2) but im not sure.

Another thing I'm unclear on is the best way to test these diffs.

ss << nvfuser_resources::grid_reduction_cu;
ss << nvfuser_resources::broadcast_cu;
ss << nvfuser_resources::welford_cu;
// How to define PhiloxCudaState for nvtrc, another option:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the best option with what we have today. the new file under nvfuser_resources should be self-contained (unfortunately it can't use #include).

Copy link
Author

@mcarilli mcarilli Jan 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh oh, that means I either have to duplicate my struct definition from CUDAGeneratorImpl.h manually (brittle) or have something autogen the duplication (complicated).
out of curiosity why can't it use #include?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you'd need to duplicate the definition. besides the options you mentioned, another idea would be to have the canonical definition under nvfuser_resources, and #include that one from somewhere else - I mention this for brainstorming only, as I think it would be a hack and I don't recommend it.

#include doesn't work since we are missing the mechanisms to setup the nvrtc include locations (we prototyped this and it can be done, but upstream maintainers had reservations - I still hope we'll overcome the objections at some point but we're not there yet).

the alternative currently implemented is a custom preprocessor for the files under nvfuser_resources, which generates string literals from the .cu files. Technically it would be possible to do a form of C preprocessing at that point, although that would introduce some obvious (and maybe less obvious) complications.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, we may have a better solution: if the common code is already in a standalone file we could give it the same treatment as we do for files under nvfuser_resources. The only constraint is that the common file should be valid for textual insertion in the kernel "preamble" - ex. it can't have #includes itself

}


namespace at {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have to be in that at namespace? this could lead to subtle odr violations which would likely be hard to troubleshoot

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess not, but I don't understand why it's a problem. Classes are always defined in headers without odr violation as long as member functions in the header are inline. Why is it a problem in this case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the headers generally work since the definitions are identical. If the definitions diverge, even in seemingly harmless ways then it's undefined what happens - in practice the issues are hard to troubleshoot since there's no toolchain support today.

PS. even with headers you can get ODR issues if the translation units are compiled with different flags or with different macros

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it because this stuff will become a distinct .so from the similarly-defined stuff in ATen?

No, that's an ABI compatibility thing. The potential problem is if anyone ends up including both the aten header and the nvfuser_resources - it seems a bit stretched in the current setup, but let's say someone wants to prototype some CUDA changes by hand. Or if we end up adding support for real header includes to hvrtc.

Copy link
Author

@mcarilli mcarilli Jan 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If CPU-compiled and nvrtc-compiled instances of PhiloxCudaState are ABI-incompatible, aren't we screwed regardless because CPU instances are bitcopied into the kernel instances by cuLaunchKernel? How does putting the definition nvrtc compiles in a different namespace save us?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the other way around: putting things in the same namespace may lead to extra issues even if we're layout/abi compatible.

// If you change the definition there, you must change the definition here to match.
struct PhiloxCudaState {
PhiloxCudaState() = default;
PhiloxCudaState(const PhiloxCudaState&) = default;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed

int64_t* ptr;
};

uint64_t seed_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd initialize all the members here

};

namespace cuda {
namespace philox {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question as the at namespace

// Copy pasted from ATen/cuda/CudaGraphsUtils.cuh,
// because we don't want to codegen directly from something in ATen.
// If you change the definition there, you must change the definition here to match.
__device__ __forceinline__ std::tuple<uint64_t, uint64_t>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should get rid of std::tuple here (create a dedicated structure instead)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eager mode kernels always used std::tuple<uint64_t, uint64_t>, even before my changes. What's the problem?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. unnecessary dependency on the standard library
  2. using a specialized struct results in better ergonomics, type safety and readability

#2 is the big one, although for nvrtc #1 is also something to keep in mind

Copy link
Author

@mcarilli mcarilli Jan 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, but this change bleeds significantly into eager mode kernels. It probably won't cause merge conflicts the next time you pull in upstream, but it might.

bool captured_ = false;
};
// Pulls raw PhiloxCudaState definition into at:: as expected by eager consumers
#include <ATen/cuda/detail/PhiloxCudaStateRaw.cuh>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we use this refactoring strategy (which I think it's much better than copy & paste) - then I we should make the shared headers be as self contained as possible. So move the namespace at into PhiloxCudaStateRaw.cuh instead of depending on the surrounding context.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but how will that play with your earlier concerns about namespaces? Tbh im not sure how exactly you wanted me to organize things.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With copy-and-paste, you want the minimal contract - even small source code changes can have unexpected consequences so you want compatible but different definitions.

With a shared "header" we can afford the same namespace. (we can still use different namespaces, but it's not a requirement to do so). Since we can afford same namespace, then the problem becomes packaging the shared file in a robust and intuitive way. We should factor the shared definition as a regular header - which means it shouldn't not depend much on the context where it's included.

Does this make sense?

Michael Carilli added 4 commits January 8, 2021 16:31
…udaState because PhiloxCudaState lives in ATen, and ATen can't contain any __device__ annotations.
Copy link
Collaborator

@tlemo tlemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a few small comments

@@ -1,4 +1,3 @@

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this line removed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no reason, it was random whitespace at the top of the file.

Copy link
Owner

@csarofeen csarofeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mcarilli mcarilli changed the title [CUDA graphs] [JIT] Capture-safe RNG in nvfuser [DO NOT MERGE YET] [CUDA graphs] [JIT] Capture-safe RNG in nvfuser Jan 28, 2021
@mcarilli
Copy link
Author

Don't merge this yet.

After discussions with Christian, I'm going to split all the non-nvfuser diffs into a separate PR against upstream. They are simple so it should land quickly. The nvfuser diffs will stay in this PR, and because they require the non-nvfuser diffs, we'll keep this PR in limbo until the non-nvfuser diffs are pulled back from upstream into 20_12_3_devel.

Ensuring all non-nvfuser diffs land upstream first alleviates the need for Facebook to review any non-nvfuser diffs when the big merge of our devel branch into upstream eventually happens.

@mcarilli
Copy link
Author

mcarilli commented Feb 3, 2021

I moved all the diffs outside torch/csrc/jit to pytorch#51580.

This PR can be merged after pytorch#51580 lands upstream and we pull its changes back in.

@mcarilli mcarilli changed the title [DO NOT MERGE YET] [CUDA graphs] [JIT] Capture-safe RNG in nvfuser [CUDA graphs] [JIT] Capture-safe RNG in nvfuser Mar 19, 2021
@jjsjann123
Copy link
Collaborator

Tests passed merging this one.

@jjsjann123 jjsjann123 merged commit 14bd01e into csarofeen:20_12_3_devel Mar 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants