Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ARM][Performance] Improve ARM CPU depthwise convolution performance #2345

Merged
merged 2 commits into from
Jan 11, 2019

Conversation

FrozenGene
Copy link
Member

@FrozenGene FrozenGene commented Dec 27, 2018

original discussion about this PR:
#2028

To leverage existing spatial pack schedule and add tunable compute_at knob to re-implement ARM CPU's depthwise convolution. Currently, this PR named this schedule as contrib_spatial_pack as discussed in the original PR #2028.

On my A53@2.0GHz ARM CPU (MTK6763), which can boost 1.6X performance compared with previous depthwise convolution in the Mobilenet V1 model (I have also checked the correctness of this schedule). However, @yzhliu has also proved that this PR can also boost the performance on x86 CPU.

The following is the Tensorflow Mobilenet V1 model auto tvm training GFLOPS log:
Currently:
[Task 2/20] Current/Best: 0.98/ 2.32 GFLOPS | Progress: (1427/2000) | 2679.82 s Done.
[Task 4/20] Current/Best: 0.56/ 1.15 GFLOPS | Progress: (1072/2000) | 2461.27 s Done.
[Task 6/20] Current/Best: 1.08/ 2.78 GFLOPS | Progress: (1084/2000) | 1987.91 s Done.
[Task 8/20] Current/Best: 0.39/ 1.19 GFLOPS | Progress: (1815/2000) | 2744.70 s Done.
[Task 10/20] Current/Best: 1.09/ 2.33 GFLOPS | Progress: (1222/2000) | 1866.02 s Done.
[Task 12/20] Current/Best: 0.42/ 0.90 GFLOPS | Progress: (1716/2000) | 2528.94 s Done.
[Task 14/20] Current/Best: 1.89/ 2.63 GFLOPS | Progress: (1284/2000) | 2288.55 s Done.
[Task 16/20] Current/Best: 0.47/ 0.96 GFLOPS | Progress: (1467/2000) | 2282.65 s Done.
[Task 18/20] Current/Best: 1.43/ 2.61 GFLOPS | Progress: (1007/2000) | 1525.76 s Done.

After this PR optimization:
[Task 2/20] Current/Best: 0.00/ 4.83 GFLOPS | Progress: (1682/2000) | 1470.40 s Done.
[Task 4/20] Current/Best: 1.35/ 3.17 GFLOPS | Progress: (1257/2000) | 1032.80 s Done.
[Task 6/20] Current/Best: 2.04/ 5.49 GFLOPS | Progress: (1904/2000) | 1623.10 s Done.
[Task 8/20] Current/Best: 0.75/ 3.15 GFLOPS | Progress: (1885/2000) | 1546.22 s Done.
[Task 10/20] Current/Best: 2.09/ 6.07 GFLOPS | Progress: (2000/2000) | 1640.41 s Done.
[Task 12/20] Current/Best: 2.99/ 3.80 GFLOPS | Progress: (1853/2000) | 1547.13 s Done.
[Task 14/20] Current/Best: 4.59/ 6.06 GFLOPS | Progress: (1355/2000) | 1091.93 s Done.
[Task 16/20] Current/Best: 1.96/ 4.01 GFLOPS | Progress: (2000/2000) | 1586.18 s Done.
[Task 18/20] Current/Best: 2.33/ 4.63 GFLOPS | Progress: (2000/2000) | 1599.89 s Done.

The depthwise convolution total execution time on single A53@2.0GHz time can be from 45.3839ms to 28.1945ms.

One thing you must notice to use this schedule: You MUST make the XGBTunner constructor’s feature type argument be feature_type= 'knob'. i.e. XGBTuner(tsk, loss_type='rank', feature_type='knob'). Otherwise your program maybe hang forever.

This schedule is not default schedule (i.e. direct) of arm cpu / x86 cpu depthwise convolution. I will update the auto tuning of ARM CPU tutorial to show how to use this contrib_spatial_pack schedule in the following PR.

@merrymercy @yzhliu @tqchen pls review it.

@FrozenGene
Copy link
Member Author

Hi @merrymercy, could you spend some time reviewing? Thanks.

@@ -73,15 +73,17 @@ inline bool Conv2DInferShape(const nnvm::NodeAttrs& attrs,
CHECK_EQ(param.channels % param.groups, 0U)
<< "output channels must divide group size";

TShape wshape({param.channels / param.groups,
// Restore depthwise conv2d kernel layout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old code is strange and incorrect while your code is straightforward and correct. We can delete this comment.

# Currently, Mali schedule doesn't use it like conv2d.
if cfg.is_fallback:
ref_log = autotvm.tophub.load_reference_log('arm_cpu', 'rk3399', 'depthwise_conv2d_nchw',
'direct')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'direct')
'contrib_spatial_pack')

Copy link
Member

@merrymercy merrymercy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. But recently we updated the alter_op_layout to support relay (#2356, a new argument 'F' in introduced). Please resolve the conflict.

@FrozenGene FrozenGene force-pushed the arm_cpu_depthwise_convolution branch from f26cdfd to 4cfc239 Compare January 10, 2019 09:59
@FrozenGene FrozenGene force-pushed the arm_cpu_depthwise_convolution branch from 4cfc239 to 1467f34 Compare January 10, 2019 10:19
@FrozenGene
Copy link
Member Author

@merrymercy Have modified code as you suggest. Please review it again.

@FrozenGene FrozenGene force-pushed the arm_cpu_depthwise_convolution branch 3 times, most recently from 6fccb6d to ff30450 Compare January 10, 2019 10:55
@FrozenGene FrozenGene force-pushed the arm_cpu_depthwise_convolution branch from ff30450 to 0e86fe0 Compare January 10, 2019 10:55
@merrymercy merrymercy merged commit 394cf9f into apache:master Jan 11, 2019
@ZihengJiang ZihengJiang mentioned this pull request Feb 1, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Feb 20, 2019
…pache#2345)

* Add sptialpack schedule for arm cpu depthwise convolution

* Supply comments.
wweic pushed a commit to neo-ai/tvm that referenced this pull request Feb 20, 2019
…pache#2345)

* Add sptialpack schedule for arm cpu depthwise convolution

* Supply comments.
@FrozenGene FrozenGene deleted the arm_cpu_depthwise_convolution branch September 10, 2019 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants