[RFC] Winograd NNPACK for the ARM_CPU backend #2692

Closed

hlu1 opened this issue Feb 28, 2019 · 15 comments

@hlu1
Contributor

hlu1 commented Feb 28, 2019

--with @ajtulloch
We're working on shipping TVM on Android for some internal products that use 3x3 convs heavily. We found the winograd_nnpack + TVM approach to be the best option for shipping to a broad variety of Android devices, for the following reasons:

  • We're restricted to AOT compilation and can only ship one model (packed with TVM-generated code) to all Android devices.
  • We did autotuning with the direct and winograd implementations on a Raspberry Pi (Cortex-A53) and found that NNPACK actually outperforms the best autotuned schedules for most of the layers.
  • In the cases where AutoTVM does find better schedules on Cortex-A53, the performance does not necessarily transfer to other microarchitectures. Autotuning works best for a fixed CPU microarchitecture, but when we ship a model we care about its performance on a wide variety of devices (Cortex-A7, A9, A35, A53, A57, A72, A73, A75, Qualcomm Kryo, Samsung Mongoose M1, M2, and Meerkat M3, to name a few). It is very hard to get a schedule that performs well across the board.

In the end, we decided to use NNPACK Winograd for all 3x3 convs and TVM for the rest of the layers (so we can fuse and parallelize them). This gives us the best overall performance. At the same time, we're building up the infrastructure to ship models targeted at specific CPU microarchitectures so we can leverage AutoTVM to get the best performance.

Here is our work in progress: https://github.com/hlu1/tvm/tree/winograd-nnpack-ARM. We would like to contribute it back if there's interest from the community.

@yidawang
Contributor

Thanks for proposing it. This will probably serve as a great example of leveraging external libraries in TVM. On the other hand, your observation indicates that within TVM we still have much room to improve the performance. BTW, do you have any performance comparison numbers between AutoTVM and NNPACK on Winograd? Also, does NNPACK use one implementation for a wide variety of devices?

@tqchen
Member

tqchen commented Feb 28, 2019

Thanks for the proposal. Can you clarify whether the winograd AutoTVM results use the template from mainline? It would also be really great if we could work together to improve AutoTVM itself for ARM.

@hlu1
Contributor Author

hlu1 commented Feb 28, 2019

I did the autotuning (AutoTVM with winograd from TVM master) a while ago; I'm going to redo it to get more up-to-date numbers. NNPACK has two implementations for convolution with the Winograd transform, wt8x8 and wt8x8_fp16. The fp16 implementation is usually 20%-30% faster than the fp32 version. On ARM CPUs with fp16 compute support (only Cortex-A76, I believe), NNPACK runs the microkernel with fp16 compute. On other ARM CPUs that support fp16 storage but not fp16 compute, NNPACK uses fp16 for intermediate storage only (e.g., the transformed input and kernel, see https://github.com/Maratyszcza/NNPACK/blob/master/src/init.c#L469-L491). It'll be hard to beat NNPACK without fp16.
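
For readers who want to try the two paths, here is a minimal sketch (not from this thread) of driving them through tvm.contrib.nnpack. It assumes TVM is built with NNPACK support and that the convolution_inference signature and ConvolutionAlgorithm constants match your TVM version:

```python
# Sketch: exercising NNPACK's two Winograd paths through tvm.contrib.nnpack.
# Assumes TVM is built with USE_NNPACK and that convolution_inference takes
# (data, kernel, bias, padding, stride, ..., algorithm=...) with a 4D NCHW
# input; double-check python/tvm/contrib/nnpack.py for your TVM version.
import tvm
from tvm import te
from tvm.contrib import nnpack

N, C, H, W, K = 1, 32, 56, 56, 64                      # NCHW data, 3x3 kernels
data = te.placeholder((N, C, H, W), name="data")
kernel = te.placeholder((K, C, 3, 3), name="kernel")
bias = te.placeholder((K,), name="bias")

def build(algorithm):
    out = nnpack.convolution_inference(
        data, kernel, bias, padding=[1, 1, 1, 1], stride=[1, 1],
        algorithm=algorithm)
    s = te.create_schedule(out.op)
    # Built for the host here; cross-compile with an ARM target in practice.
    return tvm.build(s, [data, kernel, bias, out], target="llvm")

f_fp32 = build(nnpack.ConvolutionAlgorithm.WT_8x8)        # wt8x8: fp32 Winograd
f_fp16 = build(nnpack.ConvolutionAlgorithm.WT_8x8_FP16)   # wt8x8_fp16: fp16 storage
```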

@hlu1
Contributor Author

hlu1 commented Mar 4, 2019

I redid autotuning with the ARM winograd implementation in master on a fixed-frequency Raspberry Pi cluster (600 MHz, single-threaded). The network is a unet with 3x3 convs (https://github.com/hlu1/tvm/blob/nnpack-precompute-ARM/unet/unet.py). Here is the result:
[benchmark chart "tvm_winograd": NNPACK winograd fp16/fp32 vs. TVM winograd]
The difference between the NNPACK fp16 and fp32 implementations is not that big. The gap between winograd_nnpack and the TVM winograd is almost 2x.
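
For anyone reproducing this, the tuning setup looks roughly like the following sketch (not the exact script: the device key "rasp3b", tracker address, trial count, and the single-conv stand-in network are placeholders, and autotvm signatures have shifted between TVM releases):

```python
# Rough sketch of an AutoTVM run against a Raspberry Pi registered with an
# RPC tracker. The real workload is the unet linked above; a single 3x3 conv
# stands in for it here so the example is self-contained.
import tvm
from tvm import autotvm, relay

data = relay.var("data", shape=(1, 32, 56, 56))
weight = relay.var("weight", shape=(64, 32, 3, 3))
net = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], net))

target = tvm.target.arm_cpu("rasp3b")
tasks = autotvm.task.extract_from_program(
    mod["main"], params={}, target=target,
    ops=(relay.op.get("nn.conv2d"),))

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner("rasp3b", host="127.0.0.1", port=9190,
                             number=4, repeat=3, timeout=10))

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=1000,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file("unet_arm.log")])
```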

@tqchen
Member

tqchen commented Mar 4, 2019

It would be interesting to see if we can improve TVM's fp32 winograd by reusing some of NNPACK's tricks (looking into the ukernel). In the meantime, we can add the NNPACK support to TVM as an external lib.

@hlu1
Contributor Author

hlu1 commented Mar 4, 2019

PR: #2721

@kevinthesun
Contributor

@hlu1 If I understand correctly, in your use case you care about the performance across a variety of microarchitectures, given a single schedule template. If this is the case, we might want a different optimization target for autotvm.

@hlu1
Contributor Author

hlu1 commented Mar 5, 2019

We added a template, 'winograd_nnpack', for autotvm, similar to 'direct' and 'winograd'.
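
For context, the registration looks roughly like the sketch below. This is illustrative only, using the 2019-era autotvm decorators; the real change is in PR #2721, and the exact decorator signatures and padding/bias handling may differ:

```python
# Illustrative 2019-era AutoTVM registration (not the literal diff): an extra
# template key is registered for arm_cpu conv2d next to 'direct' and
# 'winograd', and its compute simply delegates to NNPACK's Winograd kernel.
import tvm
from tvm import autotvm
from tvm.contrib import nnpack
import topi

@autotvm.register_topi_compute(topi.nn.conv2d, "arm_cpu", ["winograd_nnpack_fp32"])
def conv2d_arm_cpu_winograd_nnpack(cfg, data, kernel, strides, padding,
                                   dilation, layout, out_dtype):
    # Hand the 3x3 convolution to NNPACK instead of emitting a TVM compute;
    # bias handling and padding normalization are elided in this sketch.
    return nnpack.convolution_inference(
        data, kernel, None, padding=[padding, padding, padding, padding],
        stride=strides, algorithm=nnpack.ConvolutionAlgorithm.WT_8x8)

@autotvm.register_topi_schedule(topi.generic.schedule_conv2d_nchw,
                                "arm_cpu", ["winograd_nnpack_fp32"])
def schedule_conv2d_arm_cpu_winograd_nnpack(cfg, outs):
    # The extern call is opaque to TVM, so the schedule is essentially a no-op.
    return tvm.create_schedule([x.op for x in outs])
```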

@hlu1
Contributor Author

hlu1 commented Mar 5, 2019

One thing that I'm not quite sure of is how to handle fp16. For the 'winograd_fp16' implementation, the transformed weights are stored in fp16 format. For us it's fine because we transform the weights from fp32 to fp16 at runtime in C++. However, I get incorrect results when testing from Python.
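
A minimal illustration of the fp16 storage step (hypothetical shapes, not the actual test): the transformed weights have to be converted to, and passed around as, genuine float16 buffers; if the Python side hands the runtime fp32 bits, or views an fp16 buffer with the wrong dtype, the kernel reads garbage, which would match the symptom above.

```python
# Hypothetical illustration: convert the transformed weights from fp32 to fp16
# once, and keep the dtype as float16 end to end when handing them to the runtime.
import numpy as np
import tvm

w_fp32 = np.random.randn(64, 32, 8, 8).astype("float32")   # e.g. 8x8 transformed tiles
w_fp16 = w_fp32.astype("float16")                           # storage-only conversion

w_nd = tvm.nd.array(w_fp16)
assert w_nd.dtype == "float16"

# The conversion itself only introduces a small storage error:
print("max abs storage error:", np.abs(w_fp16.astype("float32") - w_fp32).max())
```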

@FrozenGene
Member

Hi @hlu1, I have one question: if we use NNPACK FP16, does it have a correctness problem? Our convolution's inputs/weights are FP32.

@hlu1
Contributor Author

hlu1 commented Mar 5, 2019

It depends on the application. There is an accuracy drop with the winograd_fp16 implementation; we find it acceptable for most of our mobile CV models. It's not uncommon to run fp16 inference with fp32 weights, e.g., Apple's Metal Performance Shaders library uses fp16 compute on the GPU.

@FrozenGene
Member

FrozenGene commented Mar 5, 2019

@hlu1 In my opinion, I would suggest we have an option to control whether tvm.contrib.nnpack.ConvolutionAlgorithm.WT_8x8_FP16 is turned on during FP32 computation, because we don't know other people's requirements and applications; maybe they cannot afford the accuracy drop.

Additionally, I am interested in FP16's performance. Most ARM CPUs don't support FP16 arithmetic directly (if I remember correctly) the way GPUs do, so the CPU has to emulate it and should be slower, yet I find it has better performance. Is the reason that the bandwidth is half of FP32? Could you help explain it?

@hlu1
Contributor Author

hlu1 commented Mar 5, 2019

How about having one template, winograd_nnpack_fp16, for the fp16 implementation and winograd_nnpack_fp32 for the fp32 one? That way users have the option of turning fp16 on or off.

The performance benefit of fp16 comes from:

  1. reduced memory bandwidth
  2. better cache efficiency due to a smaller memory footprint (fp16 is used for the input/output/kernel transforms)

For most ARM CPUs that don't support fp16 arithmetic intrinsics, NNPACK uses fp16 only for intermediate storage. It automatically runs the ukernel with fp16 compute on CPUs that do support fp16 intrinsics.
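
To make the storage-only fp16 scheme concrete, here is a toy numpy illustration (my own example, not NNPACK code): keep data in fp16 to halve the bytes moved, widen to fp32 for the arithmetic, and narrow the result back.

```python
# Toy model of fp16 storage with fp32 compute: the operands live in fp16
# (half the memory traffic and cache footprint), are widened to fp32 for the
# multiply-accumulate, and the result is narrowed back to fp16.
import numpy as np

a = np.random.randn(64, 64).astype("float32")
b = np.random.randn(64, 64).astype("float32")

a16, b16 = a.astype("float16"), b.astype("float16")      # fp16 storage: 2 bytes/elem
c_ref = a @ b                                            # pure fp32 reference
c_mixed = (a16.astype("float32") @ b16.astype("float32")).astype("float16")

print("bytes per operand:", a.nbytes, "->", a16.nbytes)  # 2x less data moved
print("max rel error:",
      np.abs(c_mixed.astype("float32") - c_ref).max() / np.abs(c_ref).max())
```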

@FrozenGene
Member

FrozenGene commented Mar 5, 2019

@hlu1 Yes, I think winograd_nnpack_fp16 and winograd_nnpack_fp32 are OK. Could you update your PR #2721 accordingly?

For NNPACK FP16: if the CPU doesn't support FP16 arithmetic, does NNPACK compute in FP32 (extending FP16 to FP32 and truncating back to FP16 when the computation is complete), or something else? If so, I think I can understand why FP16 performs better and why it doesn't beat FP32 by much.

@hlu1
Contributor Author

hlu1 commented Mar 7, 2019

I'll update my PR to reflect this change.
Yes, NNPACK does fp32 compute when fp16 compute is not available (it extends FP16 to FP32 and truncates back to FP16 when the computation is complete).

hlu1 closed this as completed Mar 26, 2019