
Welford Scheduling Support #561

Merged
merged 48 commits into 20_12_3_devel from multi_output_scan on Feb 18, 2021

Conversation

@shmsong commented Dec 8, 2020

This PR implements Welford scheduling in the fuser pipeline.

(This PR will introduce non-trivial merge conflicts with #586; I will rebase after #586 has merged.)

  • Welford fusion IR interface
  • Serial Welford kernel generation
  • Block Parallel Welford kernel generation
  • Grid Parallel Welford kernel generation
  • Welford with Rfactor

More scheduling tests and the Welford scheduler will follow in a subsequent PR.

Example math print containing Welford:

```
%kernel_math {
T1[ iS2{i1}, iS3{i3} ] compute_at( T3, 2 )
   = T0[ iS0{i1}, iS1{i3} ]
   * double(1);
T2[ iS4{i1}, rS5{i3} ] compute_at( T3, 2 )(Var), T3[ ithreadIdx.x6{blockDim.x}, rblockIdx.x7{gridDim.x} ] compute_at( T4, 2 )(Avg), T4[ iS8{i1}, rS9{i3} ](Count) = Welford ( T1[ iS2{i1}, iS3{i3} ] compute_at( T3, 2 ) )
}
```

Example kernel containing Welford:

```cpp
__global__ void kernel1(Tensor<float, 2> T0, Tensor<float, 1> T2, Tensor<float, 1> T3, Tensor<int64_t, 1> T4, Tensor<float, 1> kT116, Tensor<float, 1> kT121, Tensor<int64_t, 1> kT126, Tensor<int64_t, 1> kT131) {
  alignas(8) extern __shared__ char array[];
  void* shared_mem = array;
  size_t block_size = blockDim.x * blockDim.y * blockDim.z;
  int64_t* shared_mem_var = static_cast<int64_t*>(shared_mem);
  int64_t* shared_mem_avg = shared_mem_var + block_size;
  int64_t* shared_mem_n = shared_mem_avg + block_size;
  T4[(threadIdx.x * T4.stride[0])] = 0;
  T3[(threadIdx.x * T3.stride[0])] = 0;
  T2[(threadIdx.x * T2.stride[0])] = 0;
  float T1[1];
  T1[0]
    = T0[(threadIdx.x * T0.stride[0]) + (blockIdx.x * T0.stride[1])]
    * 1;
  bool T3_pred;
  // Allocate global tensor kT116
  // Allocate global tensor kT121
  // Allocate global tensor kT126
  // Allocate global tensor kT131
  T3_pred = welford::gridWelford<true, false, false, true, true, true>(
    T2[(threadIdx.x * T2.stride[0])],
    T3[(threadIdx.x * T3.stride[0])],
    T4[(threadIdx.x * T4.stride[0])],
    (float) 0,
    T1[0],
    (int64_t)1,
    &kT116[0],
    &kT121[0],
    &kT126[0],
    kT131,
    reinterpret_cast<float*>(shared_mem_var),
    reinterpret_cast<float*>(shared_mem_avg),
    reinterpret_cast<int64_t*>(shared_mem_n),
    true,
    float(0));
}
```
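For readers unfamiliar with the algorithm: the avg/var/N triple the kernel maintains corresponds to Welford's running-mean update plus the pairwise combine that the block- and grid-parallel variants apply across threads and blocks. A minimal host-side sketch of that math (illustrative only; the struct and function names here are hypothetical, not the fuser's `welford::gridWelford` implementation):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Running Welford state: mean, sum of squared deviations (M2), and count.
// The "var" buffers in the kernel above hold the un-normalized M2 quantity.
struct WelfordTriple {
  double avg = 0.0;
  double m2 = 0.0;
  int64_t n = 0;
};

// Serial update: fold a single value into the running state.
void welfordUpdate(WelfordTriple& a, double x) {
  a.n += 1;
  double delta = x - a.avg;
  a.avg += delta / a.n;
  a.m2 += delta * (x - a.avg);
}

// Pairwise combine of two partial states. This is the associative operation
// a parallel Welford performs across threads (block) and blocks (grid).
WelfordTriple welfordCombine(const WelfordTriple& a, const WelfordTriple& b) {
  WelfordTriple out;
  out.n = a.n + b.n;
  if (out.n == 0) {
    return out;
  }
  double delta = b.avg - a.avg;
  out.avg = a.avg + delta * b.n / out.n;
  out.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / out.n;
  return out;
}
```

Combining the partial results of two halves of a data set with `welfordCombine` yields the same triple as a single serial pass, which is what makes the rfactor and cross-block paths in this PR possible.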

@shmsong shmsong changed the title [WIP] Welford Scheduling Support Welford Scheduling Support Jan 8, 2021
@shmsong shmsong requested review from csarofeen and naoyam January 8, 2021 23:52
@@ -918,6 +970,15 @@ generateIndexAndExtentMap(
loops.pop_back();
}

if (tv->definition()->isA<WelfordOp>()) {
Collaborator:

Does this assume WelfordOp is the only expression type with multiple outputs? If so, would it be possible to generalize it so that it could work with any future expressions with multiple outputs?

Author (shmsong):

Yes, WelfordOp is currently the only multi-output use case we have considered so far. This PR tries to support WelfordOp with minimal generalization, but if we encounter other multi-output cases we can generalize.

On the index compute side the implementation is more temporary than architectural, due to the limitation that the loop variable can currently only be mapped to one of the outputs. This part will be refactored after we switch to index compute based on local compute-at and domain maps (@csarofeen). I'd prefer adding multi-output support at that point, if we do decide to generalize.

Collaborator:

I think it's okay to start with something specific to Welford and generalize it later, as long as it is guarded with assertion about the assumption. For example, if something is only meant to work with Welford, then it should be preceded by TORCH_INTERNAL_ASSERT.
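The suggested guard might look like the following minimal sketch. The `Expr`/`WelfordOp` stand-ins and the assert macro stub are hypothetical simplifications; the real code uses the fuser's IR classes and the actual `TORCH_INTERNAL_ASSERT` macro from c10:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the real macro, which also formats a diagnostic message.
#define TORCH_INTERNAL_ASSERT(cond, ...) assert(cond)

// Hypothetical, heavily simplified IR node; the real fuser Expr is richer.
struct Expr {
  virtual ~Expr() = default;
  virtual bool isWelford() const { return false; }
  virtual size_t numOutputs() const { return 1; }
};

struct WelfordOp : Expr {
  bool isWelford() const override { return true; }
  size_t numOutputs() const override { return 3; }  // avg, var, N
};

// Indexing logic that currently only handles Welford among multi-output ops:
// guard the assumption so a future multi-output expr fails loudly at the
// assertion instead of silently producing wrong indexing.
void indexMultiOutput(const Expr& e) {
  if (e.numOutputs() > 1) {
    TORCH_INTERNAL_ASSERT(
        e.isWelford(),
        "Multi-output indexing is only implemented for WelfordOp");
    // ... Welford-specific index mapping would go here ...
  }
}
```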

Author (shmsong):

Added TORCH_INTERNAL_ASSERT for the multi-output case. Thanks.

shmsong commented Jan 19, 2021

> I left my comments. None of them are particularly critical, but some cleanup seems possible.

Thanks for the detailed review and helpful suggestions! 👍

@@ -46,6 +46,31 @@ TORCH_CUDA_API TensorView* reductionOp(
TensorView* v1,
bool keep_dim = false);

//! Auxiliary struct holding the result of
//! a single Welford op on a TensorView
struct TORCH_CUDA_API WelfordResult {
Collaborator:

nitpick: use class instead of struct for anything with methods.

@naoyam (Collaborator) left a comment:

LGTM. Gave a couple of minor comments.

@csarofeen (Owner) left a comment:

Looks good to me; just one comment on rfactor I'd like to see addressed, then I can approve.

namespace {

template <typename T>
kir::Allocate* allocGlobalBuffer(
Owner (csarofeen):

Can we use this to simplify the grid reduction code as well? If so, it would make more sense to do that in a follow-up.

Author (shmsong):

Yes, I think so. Thanks for pointing this out. I will put further simplifications in a follow-up.
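The reuse being discussed can be illustrated with a simplified stand-in. The `BufferInfo` record and `allocWelfordBuffers` helper below are hypothetical; the real `allocGlobalBuffer` builds `kir::Allocate` nodes in the kernel IR:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an allocation record (the real code produces
// kir::Allocate nodes rather than a plain struct).
struct BufferInfo {
  std::string name;
  size_t bytes;
};

// One templated helper can serve both grid reduction (a single work buffer)
// and grid Welford (avg/var/N work buffers), avoiding duplicated code paths.
template <typename T>
BufferInfo allocGlobalBuffer(const std::string& name, size_t numElements) {
  return BufferInfo{name, numElements * sizeof(T)};
}

// Grid Welford needs three work buffers with a matching element count.
std::vector<BufferInfo> allocWelfordBuffers(size_t numElements) {
  return {allocGlobalBuffer<float>("work_avg", numElements),
          allocGlobalBuffer<float>("work_var", numElements),
          allocGlobalBuffer<int64_t>("work_n", numElements)};
}
```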

@shmsong shmsong requested a review from csarofeen February 18, 2021 18:47
@csarofeen (Owner) left a comment:

LGTM

@shmsong shmsong merged commit 2bcc6a9 into 20_12_3_devel Feb 18, 2021
@csarofeen csarofeen deleted the multi_output_scan branch June 9, 2021 13:40