[RFC] Winograd NNPACK for the ARM_CPU backend #2692

Closed

hlu1 opened this issue Feb 28, 2019 · 15 comments

@hlu1
Contributor

hlu1 commented Feb 28, 2019

--with @ajtulloch
We're working on shipping TVM on Android for some internal products that use 3x3 convs heavily. We found the winograd_nnpack + TVM approach to be the best option for shipping to a broad variety of Android devices, for the following reasons:

  • We're restricted to AOT compilation and can only ship one model (packed with TVM-generated code) to all Android devices.
  • We did autotuning with the direct and winograd implementations on a Raspberry Pi (Cortex-A53) and found that NNPACK actually outperforms the best autotuned schedules for most of the layers.
  • In the cases where AutoTVM does find better schedules on Cortex-A53, the performance does not necessarily transfer to other microarchitectures. Autotuning works best for a fixed CPU microarchitecture, but when we ship a model we care about its performance on a wide variety of devices (Cortex-A7, A9, A35, A53, A57, A72, A73, A75, Qualcomm Kryo, Samsung Mongoose M1, M2, and Meerkat M3, to name a few). It is very hard to get a schedule that performs well across the board.

In the end, we decided to use NNPACK Winograd for all 3x3 convs and TVM for the rest of the layers (so we can fuse and parallelize them). This gives us the best overall performance. At the same time, we're building up the infrastructure to ship models targeted at specific CPU microarchitectures so we can leverage AutoTVM to get the best performance.

Here is our work in progress: https://github.com/hlu1/tvm/tree/winograd-nnpack-ARM. We would like to contribute it back if there's interest from the community.

@yidawang
Contributor

Thanks for proposing it. This will probably serve as a great example of leveraging external libraries in TVM. On the other hand, your observation indicates that within TVM we still have much room to improve the performance. BTW, do you have any performance comparison numbers between AutoTVM and NNPACK on Winograd? Also, does NNPACK use one implementation for a wide variety of devices?

@tqchen
Member

tqchen commented Feb 28, 2019

Thanks for the proposal. Can you clarify whether the winograd AutoTVM results use the template from mainline? It would also be really great if we could work together to improve AutoTVM itself for ARM.

@hlu1
Contributor Author

hlu1 commented Feb 28, 2019

I did the autotuning (AutoTVM with winograd from TVM master) a while ago; I'm going to redo it to get more up-to-date numbers. NNPACK has two implementations for convolution with the Winograd transform, wt8x8 and wt8x8_fp16. The fp16 implementation is usually 20%-30% faster than the fp32 version. On ARM CPUs with fp16 compute support (only Cortex-A76, I believe), NNPACK runs the microkernel with fp16 compute. On other ARM CPUs that support fp16 storage but not fp16 compute, NNPACK uses fp16 for intermediate storage only (e.g., the transformed input and kernel, see https://github.com/Maratyszcza/NNPACK/blob/master/src/init.c#L469-L491). It'll be hard to beat NNPACK without fp16.
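
For readers who want to try the two paths, here is a minimal sketch (not from this thread) of driving them through tvm.contrib.nnpack. It assumes TVM is built with NNPACK support and that the convolution_inference signature and ConvolutionAlgorithm constants match your TVM version:

```python
# Sketch: exercising NNPACK's two Winograd paths through tvm.contrib.nnpack.
# Assumes TVM is built with USE_NNPACK and that convolution_inference takes
# (data, kernel, bias, padding, stride, ..., algorithm=...) with a 4D NCHW
# input; double-check python/tvm/contrib/nnpack.py for your TVM version.
import tvm
from tvm import te
from tvm.contrib import nnpack

N, C, H, W, K = 1, 32, 56, 56, 64                      # NCHW data, 3x3 kernels
data = te.placeholder((N, C, H, W), name="data")
kernel = te.placeholder((K, C, 3, 3), name="kernel")
bias = te.placeholder((K,), name="bias")

def build(algorithm):
    out = nnpack.convolution_inference(
        data, kernel, bias, padding=[1, 1, 1, 1], stride=[1, 1],
        algorithm=algorithm)
    s = te.create_schedule(out.op)
    # Built for the host here; cross-compile with an ARM target in practice.
    return tvm.build(s, [data, kernel, bias, out], target="llvm")

f_fp32 = build(nnpack.ConvolutionAlgorithm.WT_8x8)        # wt8x8: fp32 Winograd
f_fp16 = build(nnpack.ConvolutionAlgorithm.WT_8x8_FP16)   # wt8x8_fp16: fp16 storage
```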

@hlu1
Contributor Author

hlu1 commented Mar 4, 2019

I redid autotuning with the ARM winograd implementation in master on a fixed-frequency Raspberry Pi cluster (600 MHz, single-threaded). The network is a unet with 3x3 convs (https://github.com/hlu1/tvm/blob/nnpack-precompute-ARM/unet/unet.py). Here is the result:
[benchmark chart "tvm_winograd": NNPACK winograd fp16/fp32 vs. TVM winograd]
The difference between the NNPACK fp16 and fp32 implementations is not that big. The gap between winograd_nnpack and the TVM winograd is almost 2x.
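
For anyone reproducing this, the tuning setup looks roughly like the following sketch (not the exact script: the device key "rasp3b", tracker address, trial count, and the single-conv stand-in network are placeholders, and autotvm signatures have shifted between TVM releases):

```python
# Rough sketch of an AutoTVM run against a Raspberry Pi registered with an
# RPC tracker. The real workload is the unet linked above; a single 3x3 conv
# stands in for it here so the example is self-contained.
import tvm
from tvm import autotvm, relay

data = relay.var("data", shape=(1, 32, 56, 56))
weight = relay.var("weight", shape=(64, 32, 3, 3))
net = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], net))

target = tvm.target.arm_cpu("rasp3b")
tasks = autotvm.task.extract_from_program(
    mod["main"], params={}, target=target,
    ops=(relay.op.get("nn.conv2d"),))

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner("rasp3b", host="127.0.0.1", port=9190,
                             number=4, repeat=3, timeout=10))

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=1000,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file("unet_arm.log")])
```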

@tqchen
Member

tqchen commented Mar 4, 2019

It would be interesting to see if we can improve TVM's fp32 winograd by reusing some of NNPACK's tricks (looking into the ukernel). In the meantime, we can add the NNPACK support to TVM as an external lib.

@hlu1
Contributor Author

hlu1 commented Mar 4, 2019

PR: #2721

@kevinthesun
Contributor

@hlu1 If I understand correctly, in your use case you care about the performance across a variety of microarchitectures, given a single schedule template. If this is the case, we might want a different optimization target for autotvm.

@hlu1
Contributor Author

hlu1 commented Mar 5, 2019

We added a template, 'winograd_nnpack', for autotvm, similar to 'direct' and 'winograd'.
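
For context, the registration looks roughly like the sketch below. This is illustrative only, using the 2019-era autotvm decorators; the real change is in PR #2721, and the exact decorator signatures and padding/bias handling may differ:

```python
# Illustrative 2019-era AutoTVM registration (not the literal diff): an extra
# template key is registered for arm_cpu conv2d next to 'direct' and
# 'winograd', and its compute simply delegates to NNPACK's Winograd kernel.
import tvm
from tvm import autotvm
from tvm.contrib import nnpack
import topi

@autotvm.register_topi_compute(topi.nn.conv2d, "arm_cpu", ["winograd_nnpack_fp32"])
def conv2d_arm_cpu_winograd_nnpack(cfg, data, kernel, strides, padding,
                                   dilation, layout, out_dtype):
    # Hand the 3x3 convolution to NNPACK instead of emitting a TVM compute;
    # bias handling and padding normalization are elided in this sketch.
    return nnpack.convolution_inference(
        data, kernel, None, padding=[padding, padding, padding, padding],
        stride=strides, algorithm=nnpack.ConvolutionAlgorithm.WT_8x8)

@autotvm.register_topi_schedule(topi.generic.schedule_conv2d_nchw,
                                "arm_cpu", ["winograd_nnpack_fp32"])
def schedule_conv2d_arm_cpu_winograd_nnpack(cfg, outs):
    # The extern call is opaque to TVM, so the schedule is essentially a no-op.
    return tvm.create_schedule([x.op for x in outs])
```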

@hlu1
Contributor Author

hlu1 commented Mar 5, 2019

One thing that I'm not quite sure of is how to handle fp16. For the 'winograd_fp16' implementation, the transformed weights are stored in fp16 format. For us it's fine because we transform the weights from fp32 to fp16 at runtime in C++. However, I get incorrect results when testing from Python.
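
A minimal illustration of the fp16 storage step (hypothetical shapes, not the actual test): the transformed weights have to be converted to, and passed around as, genuine float16 buffers; if the Python side hands the runtime fp32 bits, or views an fp16 buffer with the wrong dtype, the kernel reads garbage, which would match the symptom above.

```python
# Hypothetical illustration: convert the transformed weights from fp32 to fp16
# once, and keep the dtype as float16 end to end when handing them to the runtime.
import numpy as np
import tvm

w_fp32 = np.random.randn(64, 32, 8, 8).astype("float32")   # e.g. 8x8 transformed tiles
w_fp16 = w_fp32.astype("float16")                           # storage-only conversion

w_nd = tvm.nd.array(w_fp16)
assert w_nd.dtype == "float16"

# The conversion itself only introduces a small storage error:
print("max abs storage error:", np.abs(w_fp16.astype("float32") - w_fp32).max())
```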

@FrozenGene
Member

Hi @hlu1, I have one question: if we use NNPACK FP16, does it have a correctness problem? Our convolution's inputs/weights are FP32.

@hlu1
Contributor Author

hlu1 commented Mar 5, 2019

It depends on the application. There is an accuracy drop with the winograd_fp16 implementation; we find it acceptable for most of our mobile CV models. It's not uncommon to run fp16 inference with fp32 weights, e.g., Apple's Metal Performance Shaders library uses fp16 compute on the GPU.

@FrozenGene
Member

FrozenGene commented Mar 5, 2019

@hlu1 In my opinion, I would suggest we have an option to control whether tvm.contrib.nnpack.ConvolutionAlgorithm.WT_8x8_FP16 is turned on during FP32 computation, because we don't know other people's requirements and applications; maybe they cannot afford the accuracy drop.

Additionally, I am interested in FP16's performance. Most ARM CPUs don't support FP16 arithmetic directly (if I remember correctly) the way GPUs do, so the CPU has to emulate it and should be slower, yet I find it has better performance. Is the reason that the bandwidth is half of FP32? Could you help explain it?

@hlu1
Contributor Author

hlu1 commented Mar 5, 2019

How about having one template, winograd_nnpack_fp16, for the fp16 implementation and winograd_nnpack_fp32 for the fp32 one? That way users have the option of turning fp16 on or off.

The performance benefit of fp16 comes from:

  1. reduced memory bandwidth
  2. better cache efficiency due to a smaller memory footprint (fp16 is used for the input/output/kernel transforms)

For most ARM CPUs that don't support fp16 arithmetic intrinsics, NNPACK uses fp16 only for intermediate storage. It automatically runs the ukernel with fp16 compute on CPUs that do support fp16 intrinsics.
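
To make the storage-only fp16 scheme concrete, here is a toy numpy illustration (my own example, not NNPACK code): keep data in fp16 to halve the bytes moved, widen to fp32 for the arithmetic, and narrow the result back.

```python
# Toy model of fp16 storage with fp32 compute: the operands live in fp16
# (half the memory traffic and cache footprint), are widened to fp32 for the
# multiply-accumulate, and the result is narrowed back to fp16.
import numpy as np

a = np.random.randn(64, 64).astype("float32")
b = np.random.randn(64, 64).astype("float32")

a16, b16 = a.astype("float16"), b.astype("float16")      # fp16 storage: 2 bytes/elem
c_ref = a @ b                                            # pure fp32 reference
c_mixed = (a16.astype("float32") @ b16.astype("float32")).astype("float16")

print("bytes per operand:", a.nbytes, "->", a16.nbytes)  # 2x less data moved
print("max rel error:",
      np.abs(c_mixed.astype("float32") - c_ref).max() / np.abs(c_ref).max())
```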

@FrozenGene
Member

FrozenGene commented Mar 5, 2019

@hlu1 Yes, I think winograd_nnpack_fp16 and winograd_nnpack_fp32 are OK. Could you update your PR #2721 accordingly?

For NNPACK FP16: if the CPU doesn't support FP16 arithmetic, does NNPACK compute in FP32 (extending FP16 to FP32 and truncating back to FP16 when the computation is complete), or something else? If so, I think I can understand why FP16 performs better and why it doesn't beat FP32 by much.

@hlu1
Contributor Author

hlu1 commented Mar 7, 2019

I'll update my PR to reflect this change.
Yes, NNPACK does fp32 compute when fp16 compute is not available (it extends FP16 to FP32 and truncates back to FP16 when the computation is complete).

hlu1 closed this as completed Mar 26, 2019