
Separate kernel compilation API from kernel execution API #1914

Merged: 146 commits merged into devel from async_compilation on Aug 26, 2022

Conversation

@jjsjann123 (Collaborator) commented Aug 17, 2022

  1. Mostly mechanical changes to refactor our stack to use KernelArgumentHolder instead of using at::Tensor/IValue directly.
    Note: we are still holding a ref-counted at::Tensor within the kernel argument holder for tensor entries, simply because we want to forward it in the case of an aliased output. This is quite unsatisfying, but to properly strip the framework Tensor from the codegen stack we need a fair amount of refactoring to abstract away ownership of memory and the allocator. That's for future PRs.
  2. Separate compilation from execution of kernels, currently via FusionExecutorCache::compileFusion and FusionExecutorCache::runFusionWithInputs. Note that the compilation API is still experimental; we currently kick off compilation on a separate thread. This part still needs to be exposed and integrated into our python API. (A usage sketch follows this list.)
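
As a rough usage sketch of the split API (method names follow this PR's description and tests; the exact signatures and the fusion-building step are elided assumptions, not a definitive example):

```cpp
// Build a fusion once, hand it to the cache, then compile and run as two
// separate steps. `aten_inputs` is the usual std::vector<c10::IValue> of
// at::Tensor / scalar inputs used by the tests in this PR.
auto fusion = std::make_unique<Fusion>();
// ... define the fusion IR on `fusion` ...
FusionExecutorCache executor_cache(std::move(fusion));

// Experimental: kicks off compilation on a separate thread.
executor_cache.compileFusionAsync(aten_inputs);

// Execution now reuses the compiled kernel instead of compiling lazily on the
// first call.
auto outputs = executor_cache.runFusionWithInputs(aten_inputs);
```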

TODOs for follow-up PRs:

  1. trivial forwarding of inputs to outputs.
  2. output inference should switch from meta tensor to fake tensor in order to preserve device.
  3. segmented fusions should/could be compiled in parallel, since we can infer outputs without a compiled kernel.
  4. inputs_id_lookup should be refactored into KernelArgumentHolder, since we currently use args for passing inputs around.
  5. index mode is currently per fusion, which is not necessary and could be refactored to be per segmented fusion instead.
  6. binding kernel inputs should also try to bind cpu scalars with int type, since the runtime value can also be used in shape inference. Generally speaking, cpu scalar dtype should also be checked during validation.
  7. the high water mark could be refactored to use the occupancy API after compilation, so we don't recompile when we don't have to (a rough sketch follows this list).
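
For item 7, a minimal sketch of what "using the occupancy API after compilation" could look like (this is not the PR's code; how the returned value would replace the current high-water-mark bookkeeping is an assumption):

```cpp
#include <cuda.h>

// Ask the driver how many blocks per SM the compiled kernel can sustain for a
// given block size and dynamic shared memory footprint, instead of tracking a
// launch-parameter high water mark and recompiling when it is exceeded.
int maxActiveBlocksPerSM(CUfunction func, int block_size, size_t dyn_smem_bytes) {
  int num_blocks = 0;
  CUresult err = cuOccupancyMaxActiveBlocksPerMultiprocessor(
      &num_blocks, func, block_size, dyn_smem_bytes);
  return (err == CUDA_SUCCESS) ? num_blocks : 0;
}
```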

TODO: add ways to discard/skip memory allocation and a proper check to safeguard that.
KernelArgumentHolder

TODO: expand this to KernelPrecomputedIntegers
@csarofeen (Owner)

Could you list out and give a breakdown of all the runtime components? I always find it confusing what role the different parts play,
i.e.:
Fusion Executor
Fusion Executor Cache
Fusion Kernel Runtime

I would really like to see a cleanup of the overall architecture of handling fusions, segmentation, heuristics, and running. I would also really like to see this logic placed in its own directory.

bool isCompiled() {
std::unique_lock<std::mutex> lock0(mutex_, std::try_to_lock);
std::unique_lock<std::mutex> lock1(compiling_, std::try_to_lock);
if (!lock0.owns_lock() || !lock0.owns_lock()) {
@shmsong commented Aug 22, 2022

nit: typo?

Suggested change
if (!lock0.owns_lock() || !lock0.owns_lock()) {
if (!lock0.owns_lock() || !lock1.owns_lock()) {


Also would you need a lock to check this? Compiled can only go from false to true.

jjsjann123 (Collaborator Author):

Oops, good catch~~

This part needs to be thought out later; I don't think what I have here is right. I should try_to_lock on compiling_ and lock mutex_.
Two separate mutexes are needed because: 1. we want iterating over executors_ to be safe while no other thread is writing to it, hence the need for mutex_; 2. we don't want to duplicate compilation of the same kernel, so if a compilation is in flight we might not want to compile another kernel.

I need to go back and double check our heuristic search before refactoring this. But the typo is real and I'll fix it.
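
A minimal sketch of the intended scheme described above (member names follow the discussion; the executors_ element type and the final check are placeholders, not the PR's actual code):

```cpp
#include <mutex>
#include <vector>

class RuntimeSketch {
 public:
  bool isCompiled() {
    // mutex_ makes the read of executors_ safe against concurrent writers.
    std::lock_guard<std::mutex> guard(mutex_);
    // compiling_ is only try-locked: if a compilation thread currently holds
    // it, report "not compiled yet" instead of blocking the caller.
    std::unique_lock<std::mutex> compile_lock(compiling_, std::try_to_lock);
    if (!compile_lock.owns_lock()) {
      return false;
    }
    return !executors_.empty();  // placeholder for the real per-executor check
  }

 private:
  std::mutex mutex_;            // protects executors_
  std::mutex compiling_;        // held for the duration of a kernel compilation
  std::vector<int> executors_;  // stand-in for the real executor list
};
```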

// auto func = [args = std::move(args_old), lock = std::move(unique_lock),
// this] () mutable {
auto func = [args = std::move(args_old), this]() mutable {
std::lock_guard<std::mutex> guard(compiling_);

nit: is this double locking/unlocking compiling_?


Oh I see, the new thread would wait for compileAsync to return... Is that correct/intended?

jjsjann123 (Collaborator Author):

That's not intended and is unnecessary.

I just haven't worked out how to move the lock into the lambda, which is really bizarre...
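
For what it's worth, a move-only lock capture does work with std::thread, which only requires a MoveConstructible callable; it fails once the lambda has to be copyable (e.g. if it gets wrapped in std::function). A minimal sketch under that assumption, not this PR's code:

```cpp
#include <mutex>
#include <thread>

// Launch compilation on a worker thread while holding `compiling` for the
// whole duration of the compile.
void compileAsyncSketch(std::mutex& compiling) {
  // Take the lock up front so observers see "compilation in progress" before
  // the worker thread is even scheduled.
  std::unique_lock<std::mutex> lock(compiling);

  // Move the lock into the lambda; `mutable` lets the captured lock release
  // when the lambda body finishes.
  auto func = [lock = std::move(lock)]() mutable {
    // ... compile the kernel here; `lock` unlocks when func returns ...
  };

  // std::thread accepts move-only callables; wrapping `func` in std::function
  // would not compile, which is a common reason this move "doesn't work".
  std::thread(std::move(func)).detach();
}
```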

"compile kernel shouldn't hit a pre-existing cache");
FUSER_PERF_SCOPE("ExecutorRunFusion::ValidateAndInitialize");
// TODO: enable checks
// executor_utils::validateKernelInputs(fusion_, inputs, options_.device);
jjsjann123 (Collaborator Author):

Need to re-enable this. More plumbing!

@@ -811,16 +964,17 @@ std::vector<at::Tensor> FusionExecutor::runFusion(
// code path to take when either:
// 1. no opt_code is provided or
// 2. `executor_entry` is not initialized
executor_utils::validateKernelInputs(fusion_, inputs, options_.device);
// TODO: re-enable validate kernel inputs
// executor_utils::validateKernelInputs(fusion_, inputs, options_.device);
jjsjann123 (Collaborator Author):

ditto

// TODO: FIX this short-cut
// If the output is just trivially the input, just "copy" it over.
if (kernel->outputs()[out_i]->isFusionInput()) {
TORCH_INTERNAL_ASSERT(false, "trivial input forwarding NOT IMPLEMENTED");
jjsjann123 (Collaborator Author):

We can support this if we actually feed arguments to this function. But I'm not sure whether we actually support this feature; at least there's no cpp test verifying it.

I'll double check it with another cpp test.

csarofeen (Owner):

Can we do this at the same point as input/output aliasing? Seems really easy to move this logic there since it's almost the exact same thing.

csarofeen (Owner):

If no test is testing it, do it in a follow-up.

jjsjann123 (Collaborator Author):

There is a subtle difference: the trivially forwarded output here needs to be kept as a real output, whereas an aliased output is only used to update an input tensor and should be dropped.

Currently I don't see the trivial forward hooked up when we allocate outputs, so I suspect it just isn't working at all. And it's also not tested anywhere 😆

It can easily be made to work though. I just want to double check that the codegen API actually supports marking an input tensor as an output. I'll handle it in a follow-up PR.

csarofeen (Owner):

> an aliased output is only used to update an input tensor and should be dropped

Does the kernel not need it to be plumbed through as an output? I thought the output name would be different in the kernel than the input name, so we'd still need to make sure it's positioned correctly in the kernel arguments.

jjsjann123 (Collaborator Author):

Yes, that's how it works within the kernel. We do have outputs with different names that point to the same memory.

But when we collect outputs to be returned to the framework, we strip aliased outputs away; e.g. BN doesn't return running stats as outputs but just silently updates them. Unlike the forwarded output here, which does get returned to the framework as an output of the fusion.
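
A hedged illustration of that distinction (not actual nvfuser code; the container types and index bookkeeping are made up for clarity):

```cpp
#include <ATen/ATen.h>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Aliased outputs write in place into an existing input (e.g. BN running
// stats) and are dropped from what we hand back to the framework, while a
// trivially forwarded output *is* the input tensor and must still be returned.
std::vector<at::Tensor> collectReturnedOutputs(
    const std::vector<at::Tensor>& all_outputs,
    const std::unordered_set<size_t>& aliased_output_indices,
    const std::unordered_map<size_t, at::Tensor>& forwarded_inputs) {
  std::vector<at::Tensor> returned;
  for (size_t i = 0; i < all_outputs.size(); ++i) {
    if (aliased_output_indices.count(i)) {
      continue;  // updated in place on the input; not returned
    }
    auto it = forwarded_inputs.find(i);
    returned.push_back(
        it != forwarded_inputs.end() ? it->second : all_outputs[i]);
  }
  return returned;
}
```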


@csarofeen (Owner) left a comment

Pushing comments. Didn't finish executor_utils.cpp as the reordering of the functions made it hard to parse. I don't see any major issues. I just think we really need to clean up these sections of the code base and push them into a runtime directory. As I mentioned in a comment but will mention again here, it's really hard to understand the interaction, responsibility, and roles of the different parts of this codebase. It'd be really nice to do iterations to just organize the design. I don't think there's a lot of redundancy; I just think the organization is not clear/clean. It's fine to finish this PR up, get it in, and go back over to do some refactoring/organizing.

int64_t tensor_most_negative_index = 0;
for (auto dim_i = 0; dim_i < tensor_input.ndimension(); dim_i++) {
// Ignore broadcast dimensions
if (tensor_input.size(dim_i) > 1) {
csarofeen (Owner):

NIT: seems unnecessary since multiplying stride by 0 won't accumulate anything (unless somehow we get a negative size)

jjsjann123 (Collaborator Author):

I think you are right about broadcast dimensions where size == 1. Technically I think we never get size == 0, so we shouldn't be accumulating a negative stride extent there. But I'd probably be more comfortable with a conditional of if (tensor_input.size(dim_i) > 0), which is about the same as what we have here 😆

So I tend to keep it as-is.
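
For reference, a self-contained reconstruction of what this loop is accumulating (an approximation, not the exact PR code): the lowest element offset reachable through negative strides. A broadcast dimension has size 1, so (size - 1) * stride is 0 and the size > 1 guard only skips work that would contribute nothing anyway.

```cpp
#include <cstdint>
#include <vector>

int64_t mostNegativeIndex(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides) {
  int64_t tensor_most_negative_index = 0;
  for (size_t dim_i = 0; dim_i < sizes.size(); ++dim_i) {
    // Ignore broadcast dimensions (size 1) and only accumulate dimensions
    // whose negative stride can move the pointer below the base address.
    if (sizes[dim_i] > 1 && strides[dim_i] < 0) {
      tensor_most_negative_index += (sizes[dim_i] - 1) * strides[dim_i];
    }
  }
  return tensor_most_negative_index;
}
```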

executor_utils::validateVectorizedTensors(
lowered_.get()->kernel(), args, {}, compileTimeDataCache(), expr_eval);

auto alias_indices_entry = executor_utils::caching::ExecutorCompileTimeEntry<
csarofeen (Owner):

Can we use this pattern more often? I really like this pattern for cached objects.

jjsjann123 (Collaborator Author):

Credit goes to @shmsong, IIRC.

I'll try to use more of this then~

@@ -951,15 +1078,12 @@ std::vector<at::Tensor> FusionExecutor::runFusion(
}
}

KernelArgumentHolder kernel_arguments(options_.index_mode);
{
FUSER_PERF_SCOPE("ExecutorRunFusion::FillKernelArgStructure");
csarofeen (Owner):

Nit: I like the cleanliness of pushing these args at the same time. It makes the order of what's going into the argument holder really clear.

@jjsjann123 (Collaborator Author)
> I just think we really need to clean up these sections of the code base and push them into a runtime directory. [...] It's fine to finish this PR up, get it in, and go back over to do some refactoring/organizing.

Thanks for pointing this out again. I totally agree that it needs some cleanup.

The old design doc no longer necessarily reflects what we have: https://github.com/csarofeen/pytorch/blob/devel/torch/csrc/jit/codegen/cuda/kernel_cache.h#L253-L303. I'll rewrite it there to describe what's currently in the stack and clean things up in follow-ups.


// This should probably move to EvaluationContext as we may want to bind
// input values frequently. Bind fusion input values to runtime values.
for (const auto i : c10::irange(fusion->inputs().size())) {
csarofeen (Owner):

This is just de-duplication right?

jjsjann123 (Collaborator Author):

Yes, just a mechanical code merge.

@csarofeen (Owner) left a comment

A few more minor comments. Approving.
I'd like to see a list of all the TODOs somewhere (you can put it in a follow-up PR, or PRs, as you start tackling them one by one).
I would also like to understand the async design and mutex design in more detail. It's still a bit fuzzy exactly what you're going for, what the mutexes are protecting, etc.

void FusionExecutor::setUsedTVs() {
auto used_vals = fusion_->usedMathVals();
auto used_tvs = ir_utils::filterByType<TensorView>(used_vals);
used_tvs_.insert(used_tvs_.begin(), used_tvs.begin(), used_tvs.end());
csarofeen (Owner):

You removed used_tvs_.clear() before this call. I think generally we wouldn't be calling this function more than once, but I think it's good to leave it in for safety.
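
A sketch of the function with the suggested clear() kept in, so a repeated call doesn't append duplicates (this relies on the surrounding FusionExecutor members shown in the diff above; it is not standalone code):

```cpp
void FusionExecutor::setUsedTVs() {
  auto used_vals = fusion_->usedMathVals();
  auto used_tvs = ir_utils::filterByType<TensorView>(used_vals);
  used_tvs_.clear();  // safety: makes repeated calls idempotent
  used_tvs_.insert(used_tvs_.begin(), used_tvs.begin(), used_tvs.end());
}
```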

@@ -202,6 +230,12 @@ class TORCH_CUDA_CU_API FusionExecutor : public NonCopyable {
return &compile_time_info_cache_;
}

//! ignoring aliased outputs, pushing scalar int 0 as a place holder
KernelArgumentHolder evaluateOutputSizes(
csarofeen (Owner):

I think this function could still use a better comment. I know what you mean but it might not be clear to other people.
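
One possible wording for a fuller comment, pieced together from the existing one-liner and the PR description's output-inference notes (the described behavior is an inference, not taken from the code; the parameter list is left unchanged):

```cpp
//! Evaluates the sizes of all fusion outputs from the given arguments without
//! compiling or launching a kernel, so outputs can be inferred/allocated ahead
//! of compilation. Aliased outputs, which only update an existing input in
//! place, are skipped; a scalar int 0 is pushed into the returned
//! KernelArgumentHolder in their position as a placeholder so output indices
//! stay aligned.
KernelArgumentHolder evaluateOutputSizes(/* unchanged parameters */);
```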

return getKernelRuntimeFor(args)->isCompiled();
}

void FusionExecutorCache::compileFusion(const at::ArrayRef<IValue>& inputs) {
csarofeen (Owner):

This should also have async in the name.

@jjsjann123 (Collaborator Author)
linking to #1893

jjsjann123 mentioned this pull request on Aug 26, 2022.
jjsjann123 merged commit b247dcf into devel on Aug 26, 2022.
jjsjann123 deleted the async_compilation branch on August 26, 2022 at 05:46.

executor_cache.compileFusionAsync(aten_inputs);

while (!executor_cache.isCompiled(aten_inputs)) {
Collaborator:

I think we should also have a timeout in case something got broken.

jjsjann123 (Collaborator Author):

Good point~ I added a TODO in the issue tracking follow-ups to this PR: #1893 (comment)
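
A minimal sketch of the suggested timeout around the wait loop from the test snippet above (the deadline and poll interval are arbitrary choices, not values from the PR, and the fragment reuses the test's executor_cache/aten_inputs context):

```cpp
#include <chrono>
#include <thread>

executor_cache.compileFusionAsync(aten_inputs);

const auto deadline =
    std::chrono::steady_clock::now() + std::chrono::seconds(60);
while (!executor_cache.isCompiled(aten_inputs)) {
  // Fail loudly instead of spinning forever if async compilation breaks.
  TORCH_INTERNAL_ASSERT(
      std::chrono::steady_clock::now() < deadline,
      "async compilation did not finish within the timeout");
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
```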

jjsjann123 added a commit that referenced this pull request Oct 2, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
i. improved view support on pointwise and transpose scheduler
ii. grouped grid welford added for better outer-norm grid persistence in normalization

- misc:
i. new composite ops added: variance_mean, arange
ii. fixes misaligned address for transpose scheduler
iii. refactor on separation of compilation API from execution API to prepare us for async compilation
iv. double type support on expression evaluator
v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits in this PR from the devel branch:
```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler  (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: pytorch#84626
Approved by: https://github.com/malfet