
[Refactor] Remove dead code from depthwise_conv2d for Intel graphics #8381

Merged
merged 1 commit into from
Jul 2, 2021

Conversation

echuraev
Contributor

@echuraev echuraev commented Jul 1, 2021

After fix a66186b, I noticed that the same fix was needed for depthwise_conv2d for Intel graphics. The removed code was never used and simply duplicates cuda/depthwise_conv2d.py, so we can fall back to the CUDA implementation when it is needed.


@echuraev echuraev force-pushed the echuraev/refactor_depthwise_conv2d branch from cbd8d5d to 3a632b7 on July 1, 2021 09:21
@echuraev echuraev force-pushed the echuraev/refactor_depthwise_conv2d branch from 3a632b7 to d84a9d8 on July 1, 2021 09:34
@masahi masahi merged commit 6f600f1 into apache:main Jul 2, 2021
@masahi
Member

masahi commented Jul 2, 2021

@echuraev BTW, is the intel_graphics target supposed to be better than the generic opencl target? I ran an auto-scheduling experiment on an Ice Lake iGPU using generic opencl and got good performance results (comparing relative performance against CUDA + an NV GPU).

@echuraev
Contributor Author

echuraev commented Jul 2, 2021

@elvin-n did some experiments with Intel graphics. Maybe he can give more accurate information about it.

@elvin-n
Contributor

elvin-n commented Jul 2, 2021

@masahi
I observed much better performance on Iris Xe graphics with intel_graphics than with the generic GPU target.
On ONNX MobileNet v2: generic OpenCL with the auto-scheduler - 14 ms; AutoTVM IG conv2d_NCHWc - 16 ms; auto-scheduler conv_packed_data - 9 ms; after some fixes in IG conv2d_NCHWc and the auto-scheduler - 4 ms. I have not created a PR yet since I have some questions about the IG primitives and wanted to verify everything on the previous generation of Intel graphics.
Eventually I was able to find an older Intel GPU (UHD Graphics 630) and the situation there was much less optimistic; evaluations are still ongoing.

  1. Intel graphics introduces two convolutions - NCHWc and packed_data (in fact a convolution blocked by output channels, with a further mandatory conversion back to planar layout). Both of these convolutions had accuracy issues; I fixed it only for packed_data and PR #8201 was accepted, but have not fixed the accuracy of NCHWc yet.
  2. The compute for IG conv2d_NCHWc assumes different flows depending on the scheduling parameters selected by AutoTVM: it sometimes adds a repack over the spatial dimensions and sometimes does not, and AutoTVM's default tuner aborts tuning, complaining about a differing number of features. This is another reason why I have not created a PR yet - NCHWc needs to be significantly modified, and I wanted to make sure I do not break anything valuable.
  3. I would say that in many cases the NCHWc kernels themselves are better than generic OpenCL, but the additional NCHWc <-> NCHW data conversion often eats the whole performance boost and can even make the topology run slower.
  4. I have not finished the comparison on UHD630. At the same time I see a 2x slowdown on this platform compared to competitors, and wanted to take a look into this as well.
  5. It would probably be more correct to add a model distinguishing UHD and Iris Xe in addition to -device=intel_graphics, but there are several independent places in TVM dealing with device/model, and it is hard for me to figure out the right scope of changes for adding a model.
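Point 3 refers to the cost of repacking activations between planar NCHW and blocked NCHWc layouts around each kernel. A minimal NumPy sketch of that round trip (hypothetical shapes and a block size of c=4, not the actual TVM/TOPI code) shows the extra reshapes/transposes that bracket the kernel:

```python
import numpy as np

# Hypothetical tensor: batch 1, 8 channels, 2x2 spatial, channel block c=4.
N, C, H, W, c = 1, 8, 2, 2, 4
x_nchw = np.arange(N * C * H * W).reshape(N, C, H, W)

# NCHW -> NCHWc: split channels into C//c outer blocks of c inner channels,
# and move the inner channel to the innermost (fastest-varying) axis.
x_nchwc = x_nchw.reshape(N, C // c, c, H, W).transpose(0, 1, 3, 4, 2)
assert x_nchwc.shape == (N, C // c, H, W, c)

# NCHWc -> NCHW: the mandatory conversion back to planar layout.
x_back = x_nchwc.transpose(0, 1, 4, 2, 3).reshape(N, C, H, W)
assert (x_back == x_nchw).all()
```

Each direction touches every element once, which is why the conversions can swallow the gain from a faster blocked kernel on small layers.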

A few more questions for you:

  1. which topologies have you used for your experiment with Skylake graphics?
  2. can you try to tune with target="opencl -device=intel_graphics"?

@masahi
Member

masahi commented Jul 2, 2021

Interesting! Since our intel_graphics work was done a couple of years ago, for the older iGPU, I didn't expect it to be relevant today.

  1. I tested the following models/inputs, with Gen11 (1.1 fp32 peak TFLOPS):
mlperf ssd-resnet34: input shape (1, 3, 1200, 1200)
DETR (https://github.com/facebookresearch/detr): input shape (1, 3, 750, 800) 

The result was compared against GTX 1070 ti, with peak 8 TFLOPS.
All tuning was done with auto scheduler and NHWC layout.

End to end result:
mlperf ssd resnet 34

gen11: 1116.40 ms
1070 ti: 99.34 ms

DETR

gen11: 292.5 ms
1070 ti: 33.1 ms

I consider these results excellent for Gen11, given the hardware peak difference and the fact that the auto-scheduler was hyper-optimized for CUDA + dGPU.
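Quick arithmetic on the figures above backs this up: the measured latency ratios are close to the peak-FLOPS ratio between the two GPUs (numbers taken from this thread):

```python
# Peak-FLOPS ratio: GTX 1070 Ti (8 TFLOPS) vs Gen11 (1.1 TFLOPS).
peak_ratio = 8.0 / 1.1                 # ~7.27x theoretical advantage

# Measured end-to-end latency ratios from the thread.
ssd_ratio = 1116.40 / 99.34            # mlperf ssd-resnet34, ~11.24x
detr_ratio = 292.5 / 33.1              # DETR, ~8.84x

print(round(peak_ratio, 2), round(ssd_ratio, 2), round(detr_ratio, 2))
```

DETR lands within about 20% of the hardware-peak ratio, i.e. the iGPU is nearly as well utilized as the dGPU despite the CUDA-centric tuning.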

  2. I can certainly try tuning with intel_graphics. Thanks for the suggestion!

@elvin-n
Contributor

elvin-n commented Jul 2, 2021

The problem is that the scheduling for the intel_graphics convolutions use(d) the "warp" storage scope, which is defined for neither OpenCL nor Metal. I do not know how it could have worked previously; I have never seen the warp scope supported in TVM's OpenCL backend.

@echuraev echuraev deleted the echuraev/refactor_depthwise_conv2d branch September 24, 2021 10:38
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Mar 4, 2022