[CUDA graphs] [JIT] Capture-safe RNG in nvfuser #593

mcarilli · 2021-01-07T01:01:07Z

(formerly pytorch#50148, @csarofeen asked me to PR nvfuser diffs here first)

Eager mode RNG kernels needed some minor changes to interact safely with cuda graphs. This PR extends those changes to the kernels generated by nvfuser.

One thing I'm unclear on is the best way to let NVRTC know the definition of PhiloxCudaState (defined in ATen/CUDAGeneratorImpl.h). I suggested two options in comments (1, 2) but I'm not sure which is better.

Another thing I'm unclear on is the best way to test these diffs.

tlemo · 2021-01-07T01:44:22Z

torch/csrc/jit/codegen/cuda/executor_utils.cpp

  ss << nvfuser_resources::grid_reduction_cu;
  ss << nvfuser_resources::broadcast_cu;
  ss << nvfuser_resources::welford_cu;
+  // How to define PhiloxCudaState for nvtrc, another option:


this is the best option with what we have today. the new file under nvfuser_resources should be self-contained (unfortunately it can't use #include).

Uh oh, that means I either have to duplicate my struct definition from CUDAGeneratorImpl.h manually (brittle) or have something autogen the duplication (complicated).
out of curiosity why can't it use #include?

yes, you'd need to duplicate the definition. besides the options you mentioned, another idea would be to have the canonical definition under nvfuser_resources, and #include that one from somewhere else - I mention this for brainstorming only, as I think it would be a hack and I don't recommend it.

#include doesn't work since we are missing the mechanisms to setup the nvrtc include locations (we prototyped this and it can be done, but upstream maintainers had reservations - I still hope we'll overcome the objections at some point but we're not there yet).

the alternative currently implemented is a custom preprocessor for the files under nvfuser_resources, which generates string literals from the .cu files. Technically it would be possible to do a form of C preprocessing at that point, although that would introduce some obvious (and maybe less obvious) complications.

actually, we may have a better solution: if the common code is already in a standalone file we could give it the same treatment as we do for files under nvfuser_resources. The only constraint is that the common file should be valid for textual insertion in the kernel "preamble" - ex. it can't have #includes itself
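A minimal host-side sketch of the "textual insertion" idea discussed above, assuming each resource file has already been turned into a string literal. The namespace `demo_resources`, the two literals, and the helpers `hasIncludeDirective` and `buildPreamble` are hypothetical names for illustration only; they are not the actual nvfuser code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for generated string literals. In nvfuser the real
// ones live under nvfuser_resources; these contents are made up.
namespace demo_resources {
const char* const philox_state_raw = R"(
struct PhiloxState {
  unsigned long long seed;
  unsigned long long offset;
};
)";
const char* const random_numbers_cu = R"(
__device__ unsigned int next(PhiloxState& s) { return (unsigned int)(s.seed + s.offset++); }
)";
} // namespace demo_resources

// Returns true if any line of `src` starts (after leading spaces or tabs)
// with "#include". Deliberately simple: it does not handle "# include".
bool hasIncludeDirective(const std::string& src) {
  std::istringstream in(src);
  std::string line;
  while (std::getline(in, line)) {
    const auto first = line.find_first_not_of(" \t");
    if (first != std::string::npos && line.compare(first, 8, "#include") == 0) {
      return true;
    }
  }
  return false;
}

// Concatenates resource strings, in order, into one kernel preamble.
// Each resource must be valid for plain textual insertion.
std::string buildPreamble(const std::vector<const char*>& resources) {
  std::stringstream ss;
  for (const char* r : resources) {
    assert(!hasIncludeDirective(r) && "resource must not contain #include");
    ss << r;
  }
  return ss.str();
}
```

Order matters in this sketch: the shared definition has to be streamed before any resource that uses it, just as a header would be included first.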

tlemo · 2021-01-07T19:39:01Z

torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu

 }
+
+
+namespace at {


does this have to be in the at namespace? this could lead to subtle ODR violations which would likely be hard to troubleshoot

I guess not, but I don't understand why it's a problem. Classes are always defined in headers without odr violation as long as member functions in the header are inline. Why is it a problem in this case?

the headers generally work since the definitions are identical. If the definitions diverge, even in seemingly harmless ways then it's undefined what happens - in practice the issues are hard to troubleshoot since there's no toolchain support today.

PS. even with headers you can get ODR issues if the translation units are compiled with different flags or with different macros

Is it because this stuff will become a distinct .so from the similarly-defined stuff in ATen?

No, that's an ABI compatibility thing. The potential problem is if anyone ends up including both the aten header and the nvfuser_resources - it seems a bit of a stretch in the current setup, but let's say someone wants to prototype some CUDA changes by hand. Or if we end up adding support for real header includes to nvrtc.

If CPU-compiled and nvrtc-compiled instances of PhiloxCudaState are ABI-incompatible, aren't we screwed regardless because CPU instances are bitcopied into the kernel instances by cuLaunchKernel? How does putting the definition nvrtc compiles in a different namespace save us?

it's the other way around: putting things in the same namespace may lead to extra issues even if we're layout/abi compatible.
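To make the layout point in this thread concrete, here is a hedged sketch of two separately maintained definitions kept in different namespaces, with compile-time checks that they stay layout compatible. The namespaces `host_side` and `kernel_side`, the struct fields, and `bitCopy` are hypothetical; the `memcpy` only mimics the byte-wise copy of kernel arguments mentioned above and is not how the real launch path is written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

namespace host_side { // hypothetical stand-in for the definition the host compiler sees
struct PhiloxState {
  uint64_t seed = 0;
  uint64_t offset = 0;
  bool captured = false;
};
} // namespace host_side

namespace kernel_side { // hypothetical stand-in for the copy the runtime compiler sees
struct PhiloxState {
  uint64_t seed = 0;
  uint64_t offset = 0;
  bool captured = false;
};
} // namespace kernel_side

// Different names mean no ODR clash, but the two layouts must still agree.
static_assert(std::is_standard_layout<host_side::PhiloxState>::value, "host layout");
static_assert(std::is_standard_layout<kernel_side::PhiloxState>::value, "kernel layout");
static_assert(std::is_trivially_copyable<host_side::PhiloxState>::value, "host copyable");
static_assert(std::is_trivially_copyable<kernel_side::PhiloxState>::value, "kernel copyable");
static_assert(sizeof(host_side::PhiloxState) == sizeof(kernel_side::PhiloxState), "size");
static_assert(offsetof(host_side::PhiloxState, offset) ==
                  offsetof(kernel_side::PhiloxState, offset), "offset field");
static_assert(offsetof(host_side::PhiloxState, captured) ==
                  offsetof(kernel_side::PhiloxState, captured), "captured field");

// Byte-wise copy from one definition to the other, as a launch would do.
kernel_side::PhiloxState bitCopy(const host_side::PhiloxState& src) {
  kernel_side::PhiloxState dst;
  std::memcpy(static_cast<void*>(&dst), &src, sizeof(dst));
  return dst;
}
```

The checks here only work because both definitions are visible in one translation unit. With a host compiler and nvrtc compiling separately, nothing verifies this automatically, which is part of why a single shared definition is attractive.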

tlemo · 2021-01-07T19:39:28Z

torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu

+// If you change the definition there, you must change the definition here to match.
+struct PhiloxCudaState {
+  PhiloxCudaState() = default;
+  PhiloxCudaState(const PhiloxCudaState&) = default;


tlemo · 2021-01-07T19:39:46Z

torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu

+    int64_t* ptr;
+  };
+
+  uint64_t seed_;


I'd initialize all the members here
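A sketch of what initializing every member could look like, using default member initializers. `PhiloxStateSketch` is a hypothetical stand-in: only `ptr`, `seed_`, and `captured_` appear in the diffs quoted in this thread, and the remaining member names and both constructors are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the struct under review, host-only.
struct PhiloxStateSketch {
  PhiloxStateSketch() = default;

  // Non-captured case: the offset is stored by value.
  PhiloxStateSketch(uint64_t seed, uint64_t offset) : seed_(seed) {
    offset_.val = offset;
  }

  // Captured case (assumed): the offset is read through a pointer later.
  PhiloxStateSketch(uint64_t seed, int64_t* offset_extragraph, uint32_t offset_intragraph)
      : seed_(seed), offset_intragraph_(offset_intragraph), captured_(true) {
    offset_.ptr = offset_extragraph;
  }

  union Payload {
    uint64_t val;
    int64_t* ptr;
  };

  // Every member has an initializer, so a default-constructed state
  // never holds indeterminate values.
  uint64_t seed_ = 0;
  Payload offset_{}; // empty braces zero-initialize the first member, val
  uint32_t offset_intragraph_ = 0;
  bool captured_ = false;
};
```

With these initializers the defaulted constructor still works, and a default-constructed object compares predictably in tests.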

tlemo · 2021-01-07T19:40:02Z

torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu

+};
+
+namespace cuda {
+namespace philox {


same question as the at namespace

tlemo · 2021-01-07T19:40:28Z

torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu

+// Copy pasted from ATen/cuda/CudaGraphsUtils.cuh,
+// because we don't want to codegen directly from something in ATen.
+// If you change the definition there, you must change the definition here to match.
+__device__ __forceinline__ std::tuple<uint64_t, uint64_t>


we should get rid of std::tuple here (create a dedicated structure instead)

Eager mode kernels always used std::tuple<uint64_t, uint64_t>, even before my changes. What's the problem?

1. unnecessary dependency on the standard library
2. using a specialized struct results in better ergonomics, type safety and readability

#2 is the big one, although for nvrtc #1 is also something to keep in mind

done, but this change bleeds significantly into eager mode kernels. It probably won't cause merge conflicts the next time you pull in upstream, but it might.
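A hedged sketch of the dedicated-struct alternative to `std::tuple<uint64_t, uint64_t>` requested in this thread. `StateSketch`, `SeedOffset`, and `unpackSketch` are hypothetical names, the `__device__ __forceinline__` annotations are omitted so the example compiles on the host, and the captured-branch arithmetic is an assumption about how the offset is resolved rather than a copy of the real helper.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified state; not the real PhiloxCudaState.
struct StateSketch {
  union Payload {
    uint64_t val;
    int64_t* ptr;
  };
  uint64_t seed_ = 0;
  Payload offset_{};
  uint32_t offset_intragraph_ = 0;
  bool captured_ = false;
};

// Named fields instead of std::get<0> / std::get<1>, and no <tuple> needed.
struct SeedOffset {
  uint64_t seed;
  uint64_t offset;
};

// Resolves the state into a concrete (seed, offset) pair.
inline SeedOffset unpackSketch(const StateSketch& arg) {
  if (arg.captured_) {
    // Assumed behaviour: the base offset is read through the pointer and the
    // intra-graph offset is added to it.
    return {arg.seed_,
            static_cast<uint64_t>(*arg.offset_.ptr + arg.offset_intragraph_)};
  }
  return {arg.seed_, arg.offset_.val};
}
```

A caller then writes `auto so = unpackSketch(state);` and reads `so.seed` and `so.offset`, which cannot be swapped silently the way two positional tuple elements of the same type can.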

torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu

tlemo · 2021-01-08T23:27:08Z

aten/src/ATen/CUDAGeneratorImpl.h

-  bool captured_ = false;
-};
+// Pulls raw PhiloxCudaState definition into at:: as expected by eager consumers
+#include <ATen/cuda/detail/PhiloxCudaStateRaw.cuh>


if we use this refactoring strategy (which I think is much better than copy & paste), then we should make the shared headers as self-contained as possible. So move the namespace at into PhiloxCudaStateRaw.cuh instead of depending on the surrounding context.

Done, but how will that play with your earlier concerns about namespaces? Tbh I'm not sure how exactly you wanted me to organize things.

With copy-and-paste, you want the minimal contract - even small source code changes can have unexpected consequences so you want compatible but different definitions.

With a shared "header" we can afford the same namespace (we can still use different namespaces, but it's not a requirement to do so). Since we can afford the same namespace, the problem becomes packaging the shared file in a robust and intuitive way. We should factor the shared definition as a regular header, which means it shouldn't depend much on the context where it's included.

Does this make sense?

tlemo

LGTM with a few small comments

aten/src/ATen/cuda/detail/PhiloxCudaStateRaw.cuh

aten/src/ATen/cuda/detail/UnpackRaw.cuh

torch/csrc/jit/codegen/cuda/codegen.cpp

torch/csrc/jit/codegen/cuda/executor_kernel_arg.h

tlemo · 2021-01-11T22:37:46Z

torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu

@@ -1,4 +1,3 @@
-


why was this line removed?

no reason, it was random whitespace at the top of the file.

csarofeen

LGTM

mcarilli · 2021-01-28T02:59:55Z

Don't merge this yet.

After discussions with Christian, I'm going to split all the non-nvfuser diffs into a separate PR against upstream. They are simple so it should land quickly. The nvfuser diffs will stay in this PR, and because they require the non-nvfuser diffs, we'll keep this PR in limbo until the non-nvfuser diffs are pulled back from upstream into 20_12_3_devel.

Ensuring all non-nvfuser diffs land upstream first alleviates the need for Facebook to review any non-nvfuser diffs when the big merge of our devel branch into upstream eventually happens.

mcarilli · 2021-02-03T00:04:46Z

I moved all the diffs outside torch/csrc/jit to pytorch#51580.

This PR can be merged after pytorch#51580 lands upstream and we pull its changes back in.

jjsjann123 · 2021-03-25T16:01:28Z

Tests passed merging this one.

nvfuser diffs

d892f7a

mcarilli requested review from kevinstephano and csarofeen January 7, 2021 01:01

tlemo reviewed Jan 7, 2021

Copying in Philox utility definitions

823ad0e

tlemo reviewed Jan 7, 2021

Michael Carilli added 7 commits January 8, 2021 12:58

give PhiloxCudaState and unpack their own files for nvfuser codegen

dbbd9a8

try with cuh

6fd3407

forgot executor_utils.cpp

7db88ae

CMakeLists.txt

3e58cd0

cargo culting other resource files

82cf408

presumably in-kernel usages should look at local namespace?

2dbbf60

CMAKE_CURRENT_SOURCE_DIR

778a866

tlemo reviewed Jan 8, 2021

Michael Carilli added 4 commits January 8, 2021 16:31

fix path

9bae448

namespaces in raw headers

a297eb1

final change

76f5809

Struct for seed and offset. Can't put the logic in getters in PhiloxC…

8477146

…udaState because PhiloxCudaState lives in ATen, and ATen can't contain any __device__ annotations.

tlemo approved these changes Jan 11, 2021

Michael Carilli added 5 commits January 11, 2021 16:48

Comment why no pragma once in raw headers

fd0d07e

Remove CUDAGeneratorImpl.h from codegen.cpp and remove ULongArg

31f1c78

forgot to remove push ULongArg

8584a92

Resolving conflict

efb56b5

should have accepted TORCH_CUDA_CPP_API for conflicting diff

48d61d8

csarofeen approved these changes Jan 27, 2021

Merge remote-tracking branch 'csarofeen/20_12_3_devel' into graphable…

ed813dc

…_rng_jit_for_csarofeen

mcarilli changed the title ~~[CUDA graphs] [JIT] Capture-safe RNG in nvfuser~~ [DO NOT MERGE YET] [CUDA graphs] [JIT] Capture-safe RNG in nvfuser Jan 28, 2021

Michael Carilli added 2 commits February 2, 2021 16:47

Removing eager mode changes

18ad987

Merge remote-tracking branch 'csarofeen/20_12_3_devel' into graphable…

170c26e

…_rng_jit_for_csarofeen

Michael Carilli added 8 commits March 2, 2021 23:03

Return to original in-kernel api

5c3e4e4

Merge remote-tracking branch 'csarofeen/20_12_3_devel' into graphable…

8699704

…_rng_jit_for_csarofeen

Test passes, but i don't see fusions in profile

daee448

Warmup calls and remove for loop in script??

e7345d7

Manually unpack philox_args to avoid std::stuff in at::cuda::philox::…

280b694

…unpack

Full test passes!

fac66c6

Clean up test

1333182

remove nvtx

fe4f221

mcarilli changed the title ~~[DO NOT MERGE YET] [CUDA graphs] [JIT] Capture-safe RNG in nvfuser~~ [CUDA graphs] [JIT] Capture-safe RNG in nvfuser Mar 19, 2021

jjsjann123 and others added 2 commits March 25, 2021 08:36

Merge remote-tracking branch 'csarofeen/20_12_3_devel' into HEAD

e881d65

clang-format

a934e19

jjsjann123 merged commit 14bd01e into csarofeen:20_12_3_devel Mar 25, 2021
