[RFC] Winograd NNPACK for the ARM_CPU backend #2692
Comments
Thanks for proposing it. This will probably serve as a great example of leveraging external libraries in TVM. On the other hand, your observation indicates that within TVM we still have much room to improve the performance. BTW, do you have any performance comparison numbers between AutoTVM and NNPACK on Winograd? Also, does NNPACK use one implementation for a wide variety of devices?
Thanks for the proposal. Can you clarify whether the Winograd AutoTVM is using the template from mainline? It would be really great if we could also work together to improve AutoTVM itself for ARM.
I did the autotuning (AutoTVM with Winograd from TVM master) a while ago. I'm going to redo it to get more up-to-date numbers. NNPACK has two implementations for convolution with the Winograd transform, wt8x8 and wt8x8_fp16. The fp16 implementation is usually 20-30% faster than the fp32 version. On ARM CPUs with fp16 compute support (only Cortex-A76, I believe), NNPACK runs the microkernel with fp16 compute. On other ARM CPUs with fp16 support but not fp16 compute, NNPACK uses fp16 for intermediate storage only (e.g., the transformed input and kernel, see https://github.com/Maratyszcza/NNPACK/blob/master/src/init.c#L469-L491). It'll be hard to beat NNPACK without fp16.
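For readers new to the external-library path, here is a rough sketch of how NNPACK's Winograd convolution can be called through tvm.contrib.nnpack and how the wt8x8 / wt8x8_fp16 algorithm is selected. The shapes are made up, the argument layout follows the 2019-era contrib wrapper (tvm.placeholder / tvm.create_schedule; newer releases moved these under tvm.te), and it needs a TVM build with NNPACK enabled, so treat it as an illustration rather than the exact API.

```python
import tvm
from tvm.contrib import nnpack

# Made-up NCHW shapes; a 3x3 stride-1 convolution is the case wt8x8 targets.
data = tvm.placeholder((1, 64, 56, 56), name="data", dtype="float32")
kernel = tvm.placeholder((64, 64, 3, 3), name="kernel", dtype="float32")
bias = tvm.placeholder((64,), name="bias", dtype="float32")

# WT_8x8 is the fp32 Winograd path; WT_8x8_FP16 keeps the transformed tiles in
# fp16 (and uses fp16 arithmetic only on CPUs that support it).
out = nnpack.convolution_inference(
    data, kernel, bias,
    padding=[1, 1, 1, 1],          # assumed [top, bottom, left, right]
    stride=[1, 1],
    algorithm=nnpack.ConvolutionAlgorithm.WT_8x8_FP16)

s = tvm.create_schedule(out.op)
# f = tvm.build(s, [data, kernel, bias, out], target="llvm -device=arm_cpu")
```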
I redid the autotuning with the ARM Winograd implementation in master on a fixed-frequency Raspberry Pi cluster (600 MHz, single-threaded). The network is a unet with 3x3 convs (https://github.com/hlu1/tvm/blob/nnpack-precompute-ARM/unet/unet.py). Here is the result:
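For anyone who wants to reproduce a run like this, here is a rough sketch of the AutoTVM flow for tuning the conv2d tasks of a small Relay program against Raspberry Pis registered with an RPC tracker. The device key "rpi", the target string, the one-layer stand-in network, the trial count, and the log file name are all illustrative, and the API names follow the 2019-era tutorials, so they may need adjusting for other TVM versions.

```python
import tvm
from tvm import autotvm, relay

# Stand-in workload: a single 3x3 conv layer (the real script tunes the unet above).
data = relay.var("data", shape=(1, 64, 56, 56), dtype="float32")
weight = relay.var("weight", shape=(64, 64, 3, 3), dtype="float32")
func = relay.Function([data, weight],
                      relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1)))

target = tvm.target.create("llvm -device=arm_cpu -target=armv7l-linux-gnueabihf")
tasks = autotvm.task.extract_from_program(func, target=target, params={},
                                          ops=(relay.op.nn.conv2d,))

# Build locally, measure on devices registered under the key "rpi" with the tracker.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner("rpi", host="0.0.0.0", port=9190,
                             number=4, repeat=3, timeout=10))

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=1000,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file("conv2d_rpi.log")])
```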
It would be interesting to see if we can improve TVM's fp32 Winograd by reusing some of NNPACK's tricks (looking into the ukernel). In the meantime, we can add NNPACK support to TVM as an external lib.
PR: #2721
@hlu1 If I understand correctly, in your use case you care about performance across a variety of microarchitectures, given a single schedule template. If this is the case, we might want a different optimization target for AutoTVM.
We added a template, 'winograd_nnpack', for AutoTVM, similar to 'direct' and 'winograd'.
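For context, here is a hedged sketch of what registering such a template key next to 'direct' and 'winograd' might look like with the 2019-era AutoTVM/TOPI decorators. The decorator signatures follow that era's topi.arm_cpu code, but the function body (zero bias, fixed WT_8x8 choice, no matching schedule registration shown) is an illustration, not the actual code from #2721.

```python
import tvm
from tvm import autotvm
from tvm.contrib import nnpack
from topi.nn import conv2d   # topi was a separate package in 2019-era TVM

@autotvm.register_topi_compute(conv2d, 'arm_cpu', ['winograd_nnpack'])
def conv2d_arm_cpu_winograd_nnpack(cfg, data, kernel, strides, padding,
                                   dilation, layout, out_dtype):
    """Hand a 3x3 NCHW convolution off to NNPACK's Winograd kernels."""
    assert layout == "NCHW"
    # A cfg knob could select WT_8x8 vs WT_8x8_FP16; fixed to fp32 here.
    out_channels = kernel.shape[0]
    zero_bias = tvm.compute((out_channels,), lambda c: tvm.const(0.0, out_dtype),
                            name="zero_bias")
    return nnpack.convolution_inference(
        data, kernel, zero_bias,
        padding=[padding[0], padding[0], padding[1], padding[1]],
        stride=list(strides),
        algorithm=nnpack.ConvolutionAlgorithm.WT_8x8)

# A matching @autotvm.register_topi_schedule entry for 'arm_cpu' is needed as
# well so the graph-level code can schedule the external call.
```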
One thing that I'm not quite sure of is how to handle fp16. For the 'winograd_fp16' implementation, the transformed weights are stored in fp16 format. For us that's fine because we transform the weights from fp32 to fp16 at runtime in C++. However, I get incorrect results when testing from Python.
Hi @hlu1, I have one question: if we use NNPACK FP16, does it have correctness problems? Our convolution's input / weights are FP32.
It depends on the application. There is an accuracy drop with the winograd_fp16 implementation. We find it acceptable for most of our mobile CV models. It's not uncommon to run fp16 inference with fp32 weights, e.g., Apple's Metal Performance Shaders library uses fp16 compute on the GPU.
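As an aside, the source of that accuracy drop is easy to quantify with plain numpy (independent of NNPACK): the weights are rounded once to fp16 for storage and widened back to fp32 before the multiply-adds, so the error is just fp16 rounding of the stored values. A minimal check, with made-up weight shapes:

```python
import numpy as np

# Pretend these are the fp32 weights (or transformed weights) of a 3x3 conv layer.
w = np.random.randn(64, 64, 3, 3).astype(np.float32)

w_fp16 = w.astype(np.float16)        # what gets stored
w_back = w_fp16.astype(np.float32)   # what the fp32 arithmetic actually sees

# Worst-case rounding error relative to the largest weight: about 5e-4,
# since fp16 keeps roughly 11 significant bits.
print(np.abs(w - w_back).max() / np.abs(w).max())
```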
@hlu1 In my opinion, I would suggest we have an option to control whether we turn on tvm.contrib.nnpack.ConvolutionAlgorithm.WT_8x8_FP16 during FP32 computation, because we don't know other people's requirements / applications; maybe they cannot afford the accuracy drop. Additionally, I am interested in FP16's performance. Since most ARM CPUs don't support FP16 arithmetic directly (if I remember correctly) the way GPUs do, the CPU has to rely on conversion intrinsics to compute and should have worse performance; however, I find it has better performance. Is the reason that the bandwidth could be half of FP32? Could you help explain it?
How about having one template? The performance benefit of fp16 comes from:
- On most ARM CPUs, which don't support fp16 arithmetic intrinsics, NNPACK uses fp16 only for intermediate storage.
- On CPUs that do support fp16 intrinsics, it automatically runs the ukernel with fp16 compute.
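A back-of-the-envelope illustration of the storage point (all shapes assumed): NNPACK's wt8x8 transforms 8x8 input tiles that each yield a 6x6 output tile, and those transformed tiles are what gets streamed through the per-tile multiplications, so storing them in fp16 halves that traffic.

```python
import math
import numpy as np

# Transformed-activation footprint for one 64-channel, 56x56 feature map.
channels = 64
tiles = math.ceil(56 / 6) ** 2   # 8x8 tiles with 6x6 outputs -> 100 tiles per channel
tile_elems = 8 * 8

fp32_kib = channels * tiles * tile_elems * np.dtype(np.float32).itemsize / 1024
fp16_kib = channels * tiles * tile_elems * np.dtype(np.float16).itemsize / 1024
print(fp32_kib, fp16_kib)        # 1600.0 KiB vs 800.0 KiB to read and write
```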
@hlu1 Yes. I think for NNPACK FP16, if the CPU doesn't support FP16 arithmetic, NNPACK will use FP32 to compute (extend FP16 to FP32 and truncate it back to FP16 when the computation is complete), or something else? If it is the former, I think I can understand why FP16 has better performance and why it doesn't beat FP32 by much.
I'll update my PR to reflect this change.
(with @ajtulloch)
We're working on shipping TVM on Android for some internal products that use 3x3 convs heavily. We found the winograd_nnpack + TVM approach to be the best for shipping to a broad variety of Android devices. Here are the reasons:
In the end, we decided to use NNPACK Winograd for all 3x3 convs and use TVM for the rest of the layers (so we can fuse the layers and parallelize them). This gives us the best overall performance. At the same time, we're building up the infra to ship models targeted at specific CPU microarchitectures, so we can leverage AutoTVM to get the best performance.
Here is our work in progress: https://github.com/hlu1/tvm/tree/winograd-nnpack-ARM. We would like to contribute back if there's interest from the community.