[ARM][Performance] Improve ARM CPU depthwise convolution performance #2345

FrozenGene · 2018-12-27T13:28:59Z

Original discussion about this PR: #2028

This PR re-implements ARM CPU depthwise convolution by leveraging the existing spatial pack schedule and adding a tunable compute_at knob. For now, the schedule is named contrib_spatial_pack, as discussed in the original PR #2028.

On my A53@2.0GHz ARM CPU (MTK6763), this schedule gives about a 1.6X speedup over the previous depthwise convolution in the Mobilenet V1 model (I have also checked the correctness of this schedule). In addition, @yzhliu has shown that this PR also improves performance on x86 CPU.

The following is the AutoTVM tuning log (GFLOPS) for the TensorFlow Mobilenet V1 model:
Currently:
[Task 2/20] Current/Best: 0.98/ 2.32 GFLOPS | Progress: (1427/2000) | 2679.82 s Done.
[Task 4/20] Current/Best: 0.56/ 1.15 GFLOPS | Progress: (1072/2000) | 2461.27 s Done.
[Task 6/20] Current/Best: 1.08/ 2.78 GFLOPS | Progress: (1084/2000) | 1987.91 s Done.
[Task 8/20] Current/Best: 0.39/ 1.19 GFLOPS | Progress: (1815/2000) | 2744.70 s Done.
[Task 10/20] Current/Best: 1.09/ 2.33 GFLOPS | Progress: (1222/2000) | 1866.02 s Done.
[Task 12/20] Current/Best: 0.42/ 0.90 GFLOPS | Progress: (1716/2000) | 2528.94 s Done.
[Task 14/20] Current/Best: 1.89/ 2.63 GFLOPS | Progress: (1284/2000) | 2288.55 s Done.
[Task 16/20] Current/Best: 0.47/ 0.96 GFLOPS | Progress: (1467/2000) | 2282.65 s Done.
[Task 18/20] Current/Best: 1.43/ 2.61 GFLOPS | Progress: (1007/2000) | 1525.76 s Done.
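
For anyone who wants to compare runs, the log lines above can be summarized with a small script. This is only a sketch: the regular expression is my own guess at the line layout shown in this comment, not an official AutoTVM log format, and the names `LOG`, `LINE_RE` and `parse_log` are hypothetical.

```python
import re

# Log lines copied verbatim from the tuning output in this comment.
LOG = """\
[Task 2/20] Current/Best: 0.98/ 2.32 GFLOPS | Progress: (1427/2000) | 2679.82 s Done.
[Task 4/20] Current/Best: 0.56/ 1.15 GFLOPS | Progress: (1072/2000) | 2461.27 s Done.
[Task 6/20] Current/Best: 1.08/ 2.78 GFLOPS | Progress: (1084/2000) | 1987.91 s Done.
[Task 8/20] Current/Best: 0.39/ 1.19 GFLOPS | Progress: (1815/2000) | 2744.70 s Done.
[Task 10/20] Current/Best: 1.09/ 2.33 GFLOPS | Progress: (1222/2000) | 1866.02 s Done.
[Task 12/20] Current/Best: 0.42/ 0.90 GFLOPS | Progress: (1716/2000) | 2528.94 s Done.
[Task 14/20] Current/Best: 1.89/ 2.63 GFLOPS | Progress: (1284/2000) | 2288.55 s Done.
[Task 16/20] Current/Best: 0.47/ 0.96 GFLOPS | Progress: (1467/2000) | 2282.65 s Done.
[Task 18/20] Current/Best: 1.43/ 2.61 GFLOPS | Progress: (1007/2000) | 1525.76 s Done.
"""

# Assumed layout: [Task i/n] Current/Best: a/ b GFLOPS | Progress: (k/m) | t s
LINE_RE = re.compile(
    r"\[Task\s+(\d+)/\s*(\d+)\]\s+Current/Best:\s*([\d.]+)/\s*([\d.]+)\s+GFLOPS"
    r"\s+\|\s+Progress:\s+\((\d+)/(\d+)\)\s+\|\s+([\d.]+)\s+s"
)


def parse_log(text):
    """Return one dict per matched log line."""
    rows = []
    for m in LINE_RE.finditer(text):
        rows.append({
            "task": int(m.group(1)),
            "best_gflops": float(m.group(4)),
            "trials": int(m.group(5)),
            "seconds": float(m.group(7)),
        })
    return rows


rows = parse_log(LOG)
fastest = max(rows, key=lambda r: r["best_gflops"])
total_seconds = sum(r["seconds"] for r in rows)

for r in rows:
    print(f"task {r['task']:>2}: best {r['best_gflops']:.2f} GFLOPS after {r['trials']} trials")
print(f"highest best GFLOPS: task {fastest['task']}")
print(f"total tuning time: {total_seconds / 3600:.2f} h")
```

Note that GFLOPS values are only comparable within the same task, since each task has a different workload shape.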

The total depthwise convolution execution time on a single A53@2.0GHz core drops from 45.3839 ms to 28.1945 ms.
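
As a quick sanity check, the two timings can be turned into a speedup figure. This is plain arithmetic on the numbers quoted in this comment; the variable names are only for illustration.

```python
# Total depthwise convolution time reported in this comment, in milliseconds.
before_ms = 45.3839  # previous schedule
after_ms = 28.1945   # contrib_spatial_pack schedule

speedup = before_ms / after_ms
reduction = (before_ms - after_ms) / before_ms

print(f"speedup: {speedup:.2f}x")
print(f"latency reduction: {reduction:.1%}")
```

The ratio comes out at roughly 1.6, which matches the 1.6X figure quoted for the A53.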

One thing you must note when using this schedule: you MUST set the XGBTuner constructor's feature type argument to feature_type='knob', i.e. XGBTuner(tsk, loss_type='rank', feature_type='knob'). Otherwise your program may hang forever.
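
A minimal sketch of how a tuning script might apply this rule. The helpers `xgb_tuner_kwargs` and `build_tuner` are hypothetical names, not part of TVM; only the `XGBTuner(tsk, loss_type='rank', feature_type='knob')` call comes from this PR. The TVM import is done lazily so the argument selection can be checked without TVM installed.

```python
def xgb_tuner_kwargs(template_key):
    """Pick XGBTuner keyword arguments for a schedule template key.

    Hypothetical helper: for 'contrib_spatial_pack' it forces
    feature_type='knob', as this PR requires. For other templates it
    leaves feature_type unset so the tuner's own default applies.
    """
    kwargs = {"loss_type": "rank"}
    if template_key == "contrib_spatial_pack":
        kwargs["feature_type"] = "knob"
    return kwargs


def build_tuner(tsk, template_key):
    """Create an XGBTuner for an existing autotvm task (requires TVM)."""
    from tvm import autotvm  # imported lazily; only needed when tuning

    return autotvm.tuner.XGBTuner(tsk, **xgb_tuner_kwargs(template_key))


print(xgb_tuner_kwargs("contrib_spatial_pack"))
```

With a real task `tsk` created for the contrib_spatial_pack template, `build_tuner(tsk, 'contrib_spatial_pack')` is then equivalent to the constructor call shown above.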

This schedule is not the default schedule (i.e. direct) for ARM CPU / x86 CPU depthwise convolution. I will update the ARM CPU auto-tuning tutorial in a follow-up PR to show how to use the contrib_spatial_pack schedule.

@merrymercy @yzhliu @tqchen please review.

FrozenGene · 2019-01-03T03:55:10Z

Hi @merrymercy, could you spend some time reviewing? Thanks.

merrymercy · 2019-01-09T11:23:05Z

nnvm/src/top/nn/convolution.cc

@@ -73,15 +73,17 @@ inline bool Conv2DInferShape(const nnvm::NodeAttrs& attrs,
  CHECK_EQ(param.channels % param.groups, 0U)
      << "output channels must divide group size";

-  TShape wshape({param.channels / param.groups,
+  // Restore depthwise conv2d kernel layout


I think the old code is strange and incorrect while your code is straightforward and correct. We can delete this comment.

merrymercy · 2019-01-09T11:26:17Z

topi/python/topi/arm_cpu/depthwise_conv2d.py

+    # Currently, Mali schedule doesn't use it like conv2d.
+    if cfg.is_fallback:
+        ref_log = autotvm.tophub.load_reference_log('arm_cpu', 'rk3399', 'depthwise_conv2d_nchw',
+                                                    'direct')


Suggested change:

-                                                    'direct')
+                                                    'contrib_spatial_pack')

merrymercy

Overall looks good. However, we recently updated alter_op_layout to support Relay (#2356, which introduces a new argument 'F'). Please resolve the conflict.

FrozenGene · 2019-01-10T10:26:40Z

@merrymercy I have modified the code as you suggested. Please review it again.

FrozenGene mentioned this pull request Dec 27, 2018

[ARM][Performance]Improve ARM CPU depthwise convolution performance #2028

Closed

tqchen assigned merrymercy Dec 27, 2018

merrymercy reviewed Jan 9, 2019

merrymercy requested changes Jan 9, 2019

FrozenGene force-pushed the arm_cpu_depthwise_convolution branch from f26cdfd to 4cfc239 Compare January 10, 2019 09:59

FrozenGene requested review from Huyuwei, Laurawly, nhynes and phisiart as code owners January 10, 2019 09:59

FrozenGene force-pushed the arm_cpu_depthwise_convolution branch from 4cfc239 to 1467f34 Compare January 10, 2019 10:19

FrozenGene force-pushed the arm_cpu_depthwise_convolution branch 3 times, most recently from 6fccb6d to ff30450 Compare January 10, 2019 10:55

Add sptialpack schedule for arm cpu depthwise convolution

0e86fe0

FrozenGene force-pushed the arm_cpu_depthwise_convolution branch from ff30450 to 0e86fe0 Compare January 10, 2019 10:55

Supply comments.

fab4729

merrymercy approved these changes Jan 11, 2019

merrymercy merged commit 394cf9f into apache:master Jan 11, 2019

FrozenGene mentioned this pull request Jan 14, 2019

[Doc][Tutorial] Add the instructions how to use contrib_spatial_pack #2427

Merged

ZihengJiang mentioned this pull request Feb 1, 2019

TVM 0.5 Release Note #2448

Closed

wweic pushed a commit to neo-ai/tvm that referenced this pull request Feb 20, 2019

[ARM][Performance] Improve ARM CPU depthwise convolution performance (a…

2a60ff6

…pache#2345) * Add sptialpack schedule for arm cpu depthwise convolution * Supply comments.

wweic pushed a commit to neo-ai/tvm that referenced this pull request Feb 20, 2019

[ARM][Performance] Improve ARM CPU depthwise convolution performance (a…

97b0d2a

…pache#2345) * Add sptialpack schedule for arm cpu depthwise convolution * Supply comments.

FrozenGene deleted the arm_cpu_depthwise_convolution branch September 10, 2019 13:39

tqchen unassigned merrymercy Nov 4, 2019

FrozenGene mentioned this pull request Jul 22, 2020

Improve NHWC depthwise convolution for AArch64 #6095

Merged
