
[CodeGen][CUDA] Enhance CUDA codegen for SelectNode #4983

Merged 1 commit into apache:master on Mar 11, 2020

Conversation

wpan11nv
Contributor

@wpan11nv wpan11nv commented Mar 4, 2020

  • This patch allows the CUDA backend to emit correct code for
    selects with vector conditions, which may be produced by
    floordiv op lowering, etc.

  • This already works for the LLVM backend, as the LLVM select
    instruction supports vector conditions.

Signed-off-by: Wei Pan weip@nvidia.com
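CUDA C has no counterpart to LLVM's vector `select` instruction, so a select with a vector condition has to be scalarized into one ternary per lane (the `.x`/`.y`/`.z`/`.w` components). A minimal Python model of that per-lane expansion; the `vector_select` name and the lane-list representation are illustrative, not TVM's actual API:

```python
# Illustrative model of per-lane select expansion: vectors are plain
# lane lists here, standing in for CUDA's int4/ushort4 components.
def vector_select(cond, then_v, else_v):
    """Apply a scalar select independently to each lane."""
    assert len(cond) == len(then_v) == len(else_v)
    return [t if c else f for c, t, f in zip(cond, then_v, else_v)]

cond = [True, False, True, False]   # like a ushort4 of lane predicates
then_v = [10, 11, 12, 13]           # "then" lanes
else_v = [20, 21, 22, 23]           # "else" lanes
print(vector_select(cond, then_v, else_v))  # -> [10, 21, 12, 23]
```

A scalar ternary applied to the whole vector would pick all four lanes from one side; the per-lane form is what makes the result correct.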


@wpan11nv
Contributor Author

wpan11nv commented Mar 4, 2020

The test fails without this patch. The issue is also exposed by #4968, where a simple kernel fails during CUDA codegen:

// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 98
// attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 1024
T_relu[ramp(((blockIdx.x*2048) + (threadIdx.x*2)), 1, 2)] = max((placeholder[ramp(((blockIdx.x*2048) + (threadIdx.x*2)), 1, 2)] + placeholder[floordiv(ramp(((blockIdx.x*2048) + (threadIdx.x*2)), 1, 2), x2(3136))]), x2(0f))
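The `floordiv` above is what introduces the select: floor division on signed integers is typically lowered to a truncated (C-style) division plus a select-based correction, and with vector operands the select condition becomes a vector too. A rough scalar Python model of that lowering pattern (an assumption about the general scheme, not TVM's exact code):

```python
def trunc_div(a, b):
    # C-style division: truncates toward zero (Python's // floors instead)
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def trunc_mod(a, b):
    # C-style remainder, consistent with trunc_div
    return a - trunc_div(a, b) * b

def floordiv_lowered(a, b):
    # The correction select: keep the truncated quotient when the
    # remainder has the divisor's sign (or is zero), else subtract one.
    cond = (b >= 0 and trunc_mod(a, b) >= 0) or (b < 0 and trunc_mod(a, b) <= 0)
    return trunc_div(a, b) if cond else trunc_div(a, b) - 1

# Agrees with Python's floor division across sign changes
assert all(floordiv_lowered(a, 37) == a // 37 for a in range(-80, 80))
```

When `a` is a ramp (a vector of lane indices), `cond` is computed per lane, which is exactly the vector-condition select the CUDA codegen previously mishandled.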

@wpan11nv wpan11nv force-pushed the select_cg branch 2 times, most recently from 7c48a26 to 290c58a on March 4, 2020 at 19:46
@wpan11nv
Contributor Author

wpan11nv commented Mar 9, 2020

Kindly ping. Can someone help review this PR?

@jmorrill
Contributor

jmorrill commented Mar 9, 2020

While I am not qualified to review this, I applied the changes from this PR and was able to compile an MXNet model to a CUDA TVM graph runtime.

AutoTVM also appears to be working correctly.

@jmorrill
Contributor

jmorrill commented Mar 9, 2020

> Kindly ping. Can someone help review this PR?

@wpan11nv, I think you have to tag some reviewers from here:
https://github.com/apache/incubator-tvm/blob/master/CONTRIBUTORS.md#reviewers

@wpan11nv
Contributor Author

wpan11nv commented Mar 9, 2020

@vinx13 Could you help review this PR?

@wpan11nv
Contributor Author

wpan11nv commented Mar 9, 2020

> While I am not qualified to review this, I applied the changes from this PR and was able to compile an MXNet model to a CUDA TVM graph runtime.
>
> AutoTVM also appears to be working correctly.

@jmorrill Thanks for confirming this fix!

@masahi masahi self-assigned this Mar 10, 2020
@masahi
Member

masahi commented Mar 10, 2020

@wpan11nv I'll take a look (@vinx13 is a grad student and currently busy). I have commit rights.

@masahi
Member

masahi commented Mar 11, 2020

LGTM, but it seems indentation is broken in the CUDA source codegen. Not important, but it would be nice to clean it up.

extern "C" __global__ void default_function_kernel0( float* __restrict__ B,  float* __restrict__ A) {
    float4 _1;
            int4 _2 = make_int4(37, 37, 37, 37);
            int4 _3 = make_int4(0, 0, 0, 0);
            ushort4 _4;
            _4.x = (_2.x>=_3.x);
            _4.y = (_2.y>=_3.y);
            _4.z = (_2.z>=_3.z);
            _4.w = (_2.w>=_3.w);
              int4 _5 = (make_int4)(((((int)threadIdx.x) * 4))+(1*0), ((((int)threadIdx.x) * 4))+(1*1), ((((int)threadIdx.x) * 4))+(1*2), ((((int)threadIdx.x) * 4))+(1*3));
              int4 _6 = make_int4(37, 37, 37, 37);
              int4 _7;
              _7.x = (_5.x%_6.x);
              _7.y = (_5.y%_6.y);
              _7.z = (_5.z%_6.z);
              _7.w = (_5.w%_6.w);
            int4 _8 = make_int4(0, 0, 0, 0);
            ushort4 _9;
            _9.x = (_7.x>=_8.x);
            _9.y = (_7.y>=_8.y);
            _9.z = (_7.z>=_8.z);
            _9.w = (_7.w>=_8.w);
          ushort4 _10;
          _10.x = (_4.x&&_9.x);
          _10.y = (_4.y&&_9.y);
          _10.z = (_4.z&&_9.z);
          _10.w = (_4.w&&_9.w);
            int4 _11 = make_int4(37, 37, 37, 37);
            int4 _12 = make_int4(0, 0, 0, 0);
            ushort4 _13;
            _13.x = (_11.x<_12.x);
            _13.y = (_11.y<_12.y);
            _13.z = (_11.z<_12.z);
            _13.w = (_11.w<_12.w);
              int4 _14 = (make_int4)(((((int)threadIdx.x) * 4))+(1*0), ((((int)threadIdx.x) * 4))+(1*1), ((((int)threadIdx.x) * 4))+(1*2), ((((int)threadIdx.x) * 4))+(1*3));
              int4 _15 = make_int4(37, 37, 37, 37);
              int4 _16;
              _16.x = (_14.x%_15.x);
              _16.y = (_14.y%_15.y);
              _16.z = (_14.z%_15.z);
              _16.w = (_14.w%_15.w);
            int4 _17 = make_int4(0, 0, 0, 0);
            ushort4 _18;
            _18.x = (_16.x<=_17.x);
            _18.y = (_16.y<=_17.y);
            _18.z = (_16.z<=_17.z);
            _18.w = (_16.w<=_17.w);
          ushort4 _19;
          _19.x = (_13.x&&_18.x);
          _19.y = (_13.y&&_18.y);
          _19.z = (_13.z&&_18.z);
          _19.w = (_13.w&&_18.w);
        ushort4 _20;
        _20.x = (_10.x||_19.x);
        _20.y = (_10.y||_19.y);
        _20.z = (_10.z||_19.z);
        _20.w = (_10.w||_19.w);
        int4 _21 = (make_int4)(((((int)threadIdx.x) * 4))+(1*0), ((((int)threadIdx.x) * 4))+(1*1), ((((int)threadIdx.x) * 4))+(1*2), ((((int)threadIdx.x) * 4))+(1*3));
        int4 _22 = make_int4(37, 37, 37, 37);
        int4 _23;
        _23.x = (_21.x/_22.x);
        _23.y = (_21.y/_22.y);
        _23.z = (_21.z/_22.z);
        _23.w = (_21.w/_22.w);
          int4 _24 = (make_int4)(((((int)threadIdx.x) * 4))+(1*0), ((((int)threadIdx.x) * 4))+(1*1), ((((int)threadIdx.x) * 4))+(1*2), ((((int)threadIdx.x) * 4))+(1*3));
          int4 _25 = make_int4(37, 37, 37, 37);
          int4 _26;
          _26.x = (_24.x/_25.x);
          _26.y = (_24.y/_25.y);
          _26.z = (_24.z/_25.z);
          _26.w = (_24.w/_25.w);
        int4 _27 = make_int4(1, 1, 1, 1);
        int4 _28;
        _28.x = (_26.x-_27.x);
        _28.y = (_26.y-_27.y);
        _28.z = (_26.z-_27.z);
        _28.w = (_26.w-_27.w);
      int4 _29;
      _29.x = (bool(_20.x)?_23.x:_28.x);
      _29.y = (bool(_20.y)?_23.y:_28.y);
      _29.z = (bool(_20.z)?_23.z:_28.z);
      _29.w = (bool(_20.w)?_23.w:_28.w);
    _1.x = A[_29.x];
    _1.y = A[_29.y];
    _1.z = A[_29.z];
    _1.w = A[_29.w];
  (( float4*)(B + ((((int)threadIdx.x) * 4))))[0] = _1;
}
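Indentation aside, the lane math in the dump is the floordiv lowering at work: each lane computes `floordiv(i, 37)` for `i = threadIdx.x*4 + lane` via the select, then gathers `A` at that index. Replayed in scalar Python (`kernel_lane_index` is a hypothetical name that mirrors the `_20`/`_23`/`_28` temporaries):

```python
def kernel_lane_index(i):
    # _23/_28 use C truncating division; int(i / 37) truncates toward
    # zero and is exact for the small indices used here
    q = int(i / 37)
    r = i - q * 37                  # truncated remainder, like C's %
    # _20: the select condition, two sign tests combined with ||
    cond = (37 >= 0 and r >= 0) or (37 < 0 and r <= 0)
    return q if cond else q - 1

A = list(range(100))
threadIdx_x = 5
# lanes 20..23 all floor-divide by 37 to 0, so every lane gathers A[0]
B = [A[kernel_lane_index(threadIdx_x * 4 + lane)] for lane in range(4)]
print(B)  # -> [0, 0, 0, 0]
```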

@masahi masahi merged commit afa8417 into apache:master Mar 11, 2020
@masahi
Member

masahi commented Mar 11, 2020

Thanks @wpan11nv @jmorrill @vinx13

@wpan11nv
Contributor Author

> LGTM, but it seems indentation is broken in the CUDA source codegen. Not important, but it would be nice to clean it up.

Yes, I noticed that indentation issue too. I will have a look. Thanks!

@wpan11nv wpan11nv deleted the select_cg branch March 19, 2020 18:18
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Apr 16, 2020
zhiics pushed a commit to neo-ai/tvm that referenced this pull request Apr 17, 2020