
[Refactor] Remove dead code from depthwise_conv2d for Intel graphics #8381

Merged
merged 1 commit into from
Jul 2, 2021

Conversation

echuraev
Contributor

@echuraev echuraev commented Jul 1, 2021

After fix a66186b, I noticed that the same fix was needed for depthwise_conv2d for Intel graphics. The removed code was never used and simply duplicates cuda/depthwise_conv2d.py, so we can fall back to the CUDA implementation when it is needed.


@echuraev echuraev force-pushed the echuraev/refactor_depthwise_conv2d branch from cbd8d5d to 3a632b7 on July 1, 2021 09:21
@echuraev echuraev force-pushed the echuraev/refactor_depthwise_conv2d branch from 3a632b7 to d84a9d8 on July 1, 2021 09:34
@masahi masahi merged commit 6f600f1 into apache:main Jul 2, 2021
@masahi
Member

masahi commented Jul 2, 2021

@echuraev BTW, is the intel_graphics target supposed to be better than the generic opencl target? I ran an auto-scheduling experiment on an Ice Lake iGPU using generic opencl and got good performance results (comparing relative performance against CUDA + an NV GPU).

@echuraev
Contributor Author

echuraev commented Jul 2, 2021

@elvin-n did some experiments with Intel graphics. Maybe he can give more accurate information about it.

@elvin-n
Contributor

elvin-n commented Jul 2, 2021

@masahi
I observed much better performance on Iris Xe graphics with intel_graphics than with the generic GPU target.
On ONNX MobileNet v2: generic OpenCL with the auto-scheduler - 14 ms; AutoTVM IG conv2d_NCHWc - 16 ms; auto-scheduler conv_packed_data - 9 ms; after some fixes in IG conv2d_NCHWc and the auto-scheduler - 4 ms. I have not created a PR yet since I have some questions about the IG primitives and wanted to verify everything on the previous generation of Intel graphics.
Eventually I was able to find an older Intel GPU (UHD Graphics 630) and the situation there was much less optimistic; evaluations are still ongoing.

  1. Intel graphics introduces two convolutions - NCHWc and packed_data (in fact a convolution blocked by output channels, with a further mandatory conversion back to planar layout). Both of these convolutions had accuracy issues; I fixed it only for packed_data and PR #8201 was accepted, but have not fixed the accuracy of NCHWc yet.
  2. The compute for IG conv2d_NCHWc assumes different flows depending on the scheduling parameters selected by AutoTVM: it sometimes adds a repack over the spatial dimensions and sometimes does not, and AutoTVM's default tuner aborts tuning, complaining about a differing number of features. This is another reason why I have not created a PR yet - NCHWc needs to be significantly modified, and I wanted to make sure I do not break anything valuable.
  3. I would say that in many cases the NCHWc kernels themselves are better than generic OpenCL, but the additional NCHWc <-> NCHW data conversion often eats the whole performance boost and can even make the topology run slower.
  4. I have not finished the comparison on UHD630. At the same time I see a 2x slowdown on this platform compared to competitors, and wanted to take a look into this as well.
  5. It would probably be more correct to add a model distinguishing UHD and Iris Xe in addition to -device=intel_graphics, but there are several independent places in TVM dealing with device/model, and it is hard for me to figure out the right scope of changes for adding a model.
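Point 3 refers to the cost of repacking activations between planar NCHW and blocked NCHWc layouts around each kernel. A minimal NumPy sketch of that round trip (hypothetical shapes and a block size of c=4, not the actual TVM/TOPI code) shows the extra reshapes/transposes that bracket the kernel:

```python
import numpy as np

# Hypothetical tensor: batch 1, 8 channels, 2x2 spatial, channel block c=4.
N, C, H, W, c = 1, 8, 2, 2, 4
x_nchw = np.arange(N * C * H * W).reshape(N, C, H, W)

# NCHW -> NCHWc: split channels into C//c outer blocks of c inner channels,
# and move the inner channel to the innermost (fastest-varying) axis.
x_nchwc = x_nchw.reshape(N, C // c, c, H, W).transpose(0, 1, 3, 4, 2)
assert x_nchwc.shape == (N, C // c, H, W, c)

# NCHWc -> NCHW: the mandatory conversion back to planar layout.
x_back = x_nchwc.transpose(0, 1, 4, 2, 3).reshape(N, C, H, W)
assert (x_back == x_nchw).all()
```

Each direction touches every element once, which is why the conversions can swallow the gain from a faster blocked kernel on small layers.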

A few more questions for you:

  1. which topologies have you used for your experiment with Skylake graphics?
  2. can you try to tune with target="opencl -device=intel_graphics"?

@masahi
Member

masahi commented Jul 2, 2021

Interesting! Since our intel_graphics work was done a couple of years ago, for the older iGPU, I didn't expect it to be relevant today.

  1. I tested the following models/inputs, with Gen11 (1.1 fp32 peak TFLOPS):
mlperf ssd-resnet34: input shape (1, 3, 1200, 1200)
DETR (https://github.com/facebookresearch/detr): input shape (1, 3, 750, 800) 

The result was compared against GTX 1070 ti, with peak 8 TFLOPS.
All tuning was done with auto scheduler and NHWC layout.

End to end result:
mlperf ssd resnet 34

gen11: 1116.40 ms
1070 ti: 99.34 ms

DETR

gen11: 292.5 ms
1070 ti: 33.1 ms

I consider these results excellent for Gen11, given the hardware peak difference and the fact that the auto-scheduler was hyper-optimized for CUDA + dGPU.
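Quick arithmetic on the figures above backs this up: the measured latency ratios are close to the peak-FLOPS ratio between the two GPUs (numbers taken from this thread):

```python
# Peak-FLOPS ratio: GTX 1070 Ti (8 TFLOPS) vs Gen11 (1.1 TFLOPS).
peak_ratio = 8.0 / 1.1                 # ~7.27x theoretical advantage

# Measured end-to-end latency ratios from the thread.
ssd_ratio = 1116.40 / 99.34            # mlperf ssd-resnet34, ~11.24x
detr_ratio = 292.5 / 33.1              # DETR, ~8.84x

print(round(peak_ratio, 2), round(ssd_ratio, 2), round(detr_ratio, 2))
```

DETR lands within about 20% of the hardware-peak ratio, i.e. the iGPU is nearly as well utilized as the dGPU despite the CUDA-centric tuning.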

  2. I can certainly try tuning with intel_graphics. Thanks for the suggestion!

@elvin-n
Contributor

elvin-n commented Jul 2, 2021

The problem is that the scheduling for the intel_graphics convolutions use(d) the "warp" storage scope, which is defined for neither OpenCL nor Metal. I do not know how it could have worked previously; I have never seen the warp scope supported in TVM's OpenCL backend.

@echuraev echuraev deleted the echuraev/refactor_depthwise_conv2d branch September 24, 2021 10:38
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Mar 4, 2022