Cache reorder tensors #331
base: ideep_pytorch
Conversation
Reordered weights are cached in an LRU cache on aarch64, avoiding unnecessary reordering. To take advantage of this feature, the IDEEP_CACHE_MATMUL_REORDERS environment variable needs to be set to 1 and LRU_CACHE_CAPACITY needs to be set to a meaningful amount.

Speedups on aarch64:
gemma 1 thread - 3.27x
gemma 4 threads - 3.64x
gemma 16 threads - 3.73x
clip 1 thread - 1.04x
clip 4 threads - 1.06x
clip 16 threads - 1.09x

Memory usage increase on aarch64:
gemma - 1.36x
clip - 1.08x

Change-Id: I59e53c63bb706708aa15ec712909dfbac75d9f3f
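A minimal sketch of how the cache might be exercised in eager mode is shown below. The capacity value and the toy model are placeholders, and it is assumed that ideep reads both variables at matmul time, so they are set before inference runs:

```python
import os

# Enable caching of reordered matmul weights and give the LRU cache room.
# The capacity value here is a placeholder; pick one that covers the model's weights.
os.environ["IDEEP_CACHE_MATMUL_REORDERS"] = "1"
os.environ["LRU_CACHE_CAPACITY"] = "1024"

import torch

# Toy model standing in for a real workload such as gemma or clip.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).eval()

x = torch.randn(1, 4096)
with torch.inference_mode():
    for _ in range(10):
        model(x)  # repeated calls reuse the cached reordered weights instead of reordering again
```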
Hi, could @Xia-Weiwen, @yanbing-j, @jgong5 have a look at this please? Thank you
Hi @Ryo-not-rio. Thanks for your PR. I have two questions. Weights are supposed to be reordered ahead of time so that reordering can be avoided at runtime. Why doesn't this work in your case? And will your change also benefit X86?
@Xia-Weiwen In the model we've tested, it doesn't seem to be the case that weights are reordered ahead of time since the
The upfront reorder happens in the upper-level software stack, in inductor, when the model is frozen: https://github.com/pytorch/pytorch/blob/6bddfb95463332070cec0aaedd103f42a74564d0/torch/_inductor/fx_passes/mkldnn_fusion.py#L1028 Please note that we will be moving ideep into PyTorch core. Adding runtime packing support in ideep that depends on special environment variables complicates things. Can we leverage the upfront weight packing design in PyTorch?
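To illustrate the path being described, here is a rough sketch of compiling a model with inductor freezing enabled so the weight reorder happens once at compile time (assuming the torch._inductor.config.freezing flag; the toy model is a placeholder):

```python
import torch
import torch._inductor.config as inductor_config

# Freezing constant-folds parameters so the mkldnn fusion pass can
# pre-pack (reorder) the weights once during compilation.
inductor_config.freezing = True

model = torch.nn.Linear(1024, 1024).eval()
compiled = torch.compile(model)

x = torch.randn(8, 1024)
with torch.inference_mode():
    compiled(x)  # first call compiles, freezes, and reorders the weights
    compiled(x)  # subsequent calls reuse the pre-packed weights
```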
@Ryo-not-rio You will need to run
@jgong5 This change is to improve performance in eager mode, so we can't take advantage of inductor. @Xia-Weiwen I've also checked and it looks like
In my opinion, if we want to do that for eager mode, it is more Pythonic to leverage PyTorch frontend APIs (e.g. https://github.com/pytorch/pytorch/blob/7c93c4f8cf343774d26486bedc5a85892df08024/torch/utils/mkldnn.py, https://github.com/pytorch/pytorch/blob/7c93c4f8cf343774d26486bedc5a85892df08024/torch/fx/experimental/optimization.py#L131) to let the user opt in, rather than a low-level environment variable that controls the behavior of the underlying library. Another benefit of a frontend API is that you don't have to keep two copies of the weights: the old weights can be swapped out.
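For illustration, a minimal sketch of that kind of frontend opt-in via torch.utils.mkldnn.to_mkldnn (the module and shapes are placeholders; treat the exact API surface as an assumption based on the links above):

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

linear = torch.nn.Linear(1024, 1024).eval()

# The weight is reordered into the oneDNN blocked format once, here;
# the converted module keeps only the packed copy.
packed = mkldnn_utils.to_mkldnn(linear)

x = torch.randn(4, 1024)
with torch.no_grad():
    y = packed(x)  # the packed weight is reused on every call, with no runtime reorder
```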
Hi @Ryo-not-rio. On X86, we reorder weights ahead of time, which is called prepacking. Examples are https://github.com/pytorch/pytorch/blob/cf11fc0dcbb9c907cf6e851109b92f4157e445c9/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp#L292 and https://github.com/pytorch/pytorch/blob/cf11fc0dcbb9c907cf6e851109b92f4157e445c9/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp#L493
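As a rough sketch of that prepacking pattern on the quantized linear path (scales and zero points are arbitrary; this assumes a quantized CPU backend such as fbgemm is available):

```python
import torch

w = torch.randn(64, 128)
b = torch.randn(64)
qw = torch.quantize_per_tensor(w, scale=0.1, zero_point=0, dtype=torch.qint8)

# Ahead of time: reorder/prepack the quantized weight once.
packed = torch.ops.quantized.linear_prepack(qw, b)

x = torch.randn(4, 128)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=64, dtype=torch.quint8)

# At runtime: every call reuses the packed weight, so no reordering is needed.
qy = torch.ops.quantized.linear(qx, packed, 0.2, 64)
```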
@Xia-Weiwen this patch is targeting aarch64 machines, where we don't prepack and the expected weight format is not known at the PyTorch level.
Why not? I didn't see any difference between aarch64 and x86 when you call these APIs. Are there any particular reasons that prevent you from doing that?
You don't have to know the format. Just provide some info and let oneDNN determine it.
@Xia-Weiwen I have implemented your suggestion in pytorch/pytorch#139387. Please have a look and let me know if you have any feedback. Thank you!
@Ryo-not-rio Sure. Also CC @jgong5 for review.