Cache reorder tensors #331

Closed

Conversation

@Ryo-not-rio commented Aug 21, 2024

Reordered weights are cached in an LRU cache on aarch64, avoiding unnecessary repeat reordering. To take advantage of this feature, the IDEEP_CACHE_MATMUL_REORDERS environment variable needs to be set to 1 and LRU_CACHE_CAPACITY needs to be set to a meaningful value.
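For illustration, a minimal sketch of how a workload might opt in, assuming the two variables named above are read by an ideep build containing this change (the capacity value is purely illustrative):

```python
import os

# Opt in to caching reordered matmul weights and give the LRU cache a
# meaningful capacity; both values below are illustrative.
os.environ["IDEEP_CACHE_MATMUL_REORDERS"] = "1"
os.environ["LRU_CACHE_CAPACITY"] = "1024"

import torch  # imported after setting the variables so the backend sees them

model = torch.nn.Linear(4096, 4096).eval()
x = torch.randn(8, 4096)
with torch.no_grad():
    for _ in range(10):
        model(x)  # repeated matmuls can reuse the cached reordered weight
```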

Speedups on aarch64:
gemma 1 thread - 3.27x
gemma 4 threads - 3.64x
gemma 16 threads - 3.73x

clip 1 thread - 1.04x
clip 4 threads - 1.06x
clip 16 threads - 1.09x

Memory usage increase on aarch64:
gemma - 1.36x
clip - 1.08x

@Ryo-not-rio (Author)

Hi, could @Xia-Weiwen, @yanbing-j, @jgong5 have a look at this please? Thank you

@Xia-Weiwen (Contributor)

Hi @Ryo-not-rio, thanks for your PR. I have two questions. Weights are supposed to be reordered ahead of time so that reordering can be avoided at runtime. Why does that not work in your case? And will your change also benefit x86?

@Ryo-not-rio (Author) commented Aug 21, 2024

@Xia-Weiwen In the model we've tested, it doesn't seem that weights are reordered ahead of time, since the do_compute function is called at runtime. Could you point me to the part where the reorders should be happening ahead of time? I suspect it would also benefit x86 if we decide to enable it there.

@jgong5 commented Aug 22, 2024

> @Xia-Weiwen In the model we've tested, it doesn't seem that weights are reordered ahead of time, since the do_compute function is called at runtime. Could you point me to the part where the reorders should be happening ahead of time? I suspect it would also benefit x86, but I will post performance numbers for x86 shortly.

The upfront reorder happens in the upper-level SW stack, in Inductor, when the model is frozen: https://github.com/pytorch/pytorch/blob/6bddfb95463332070cec0aaedd103f42a74564d0/torch/_inductor/fx_passes/mkldnn_fusion.py#L1028
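For context, a rough sketch of that upfront path, assuming a PyTorch version where Inductor freezing is controlled by torch._inductor.config.freezing (verify the flag name for your version):

```python
import torch
import torch._inductor.config as inductor_config

# With freezing enabled, Inductor constant-folds and prepacks weights at
# compile time, so no reorder is needed inside the compiled forward pass.
inductor_config.freezing = True

model = torch.nn.Linear(4096, 4096).eval()
compiled = torch.compile(model)

with torch.no_grad():
    out = compiled(torch.randn(8, 4096))
```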

Please note that we will be moving ideep into PyTorch core. Adding runtime packing support in ideep that depends on special environment variables complicates things. Can we leverage the upfront weight-packing design in PyTorch?

@Xia-Weiwen (Contributor)

@Ryo-not-rio You will need to run prepare before compute; the weight is reordered when you call prepare. You may also try what Jiong suggested above.

@Ryo-not-rio (Author)

@jgong5 This change is meant to improve performance in eager mode, so we can't take advantage of Inductor. @Xia-Weiwen I've also checked, and it looks like prepare() is being called, but since reorder_if_differ() lives in the do_compute() function, which is called after prepare(), reorders are still happening.

@jgong5 commented Aug 23, 2024

> @jgong5 This change is meant to improve performance in eager mode, so we can't take advantage of Inductor. @Xia-Weiwen I've also checked, and it looks like prepare() is being called, but since reorder_if_differ() lives in the do_compute() function, which is called after prepare(), reorders are still happening.

In my opinion, if we want to do that for eager mode, it is more Pythonic to leverage PyTorch frontend APIs (e.g. https://github.com/pytorch/pytorch/blob/7c93c4f8cf343774d26486bedc5a85892df08024/torch/utils/mkldnn.py, https://github.com/pytorch/pytorch/blob/7c93c4f8cf343774d26486bedc5a85892df08024/torch/fx/experimental/optimization.py#L131) to let the user opt in, instead of via a low-level environment variable that controls the behavior of the underlying library. Another benefit of a frontend API is that you don't have to keep two copies of the weights: the old weights can be swapped out.
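As a sketch of that kind of frontend opt-in, using the existing torch.utils.mkldnn conversion utility (module coverage is limited and availability should be confirmed for the build in question):

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
).eval()

# to_mkldnn() swaps supported modules (e.g. Linear, Conv2d) for MKLDNN-backed
# versions whose weights are reordered once, up front, so the original dense
# weights do not need to be kept around.
mkldnn_model = mkldnn_utils.to_mkldnn(model)

with torch.no_grad():
    y = mkldnn_model(torch.randn(8, 1024))
```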

@Ryo-not-rio (Author)

@jgong5 Unfortunately, I don't think we can leverage the frontend API, since we don't know the required weight format at the frontend level for aarch64: the weight format is determined inside oneDNN after requesting it from Compute Library here.

@Xia-Weiwen (Contributor)

Hi @Ryo-not-rio. On x86, we reorder weights ahead of time, which is called prepacking. Examples are https://github.com/pytorch/pytorch/blob/cf11fc0dcbb9c907cf6e851109b92f4157e445c9/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp#L292 and https://github.com/pytorch/pytorch/blob/cf11fc0dcbb9c907cf6e851109b92f4157e445c9/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp#L493
Prepacking can be done when you prepare/quantize the model for inference, or on the first inference run, so we don't have to reorder the weight at runtime in most cases. May I know what difficulties prevent you from doing that?
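For illustration, a minimal sketch of that ahead-of-time path using the quantized linear prepack op referenced above (scales and zero points are arbitrary, and a quantized engine such as fbgemm or qnnpack must be available):

```python
import torch

weight = torch.randn(256, 512)
bias = torch.randn(256)

# Quantize and prepack the weight once, ahead of inference.
qweight = torch.quantize_per_tensor(weight, scale=0.05, zero_point=0, dtype=torch.qint8)
packed = torch.ops.quantized.linear_prepack(qweight, bias)

# At runtime every quantized linear call reuses the packed weight, so no
# further reordering is needed.
x = torch.randn(8, 512)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=64, dtype=torch.quint8)
out = torch.ops.quantized.linear(qx, packed, 0.2, 0)  # output scale, zero point
```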

@Ryo-not-rio (Author)

@Xia-Weiwen This patch targets aarch64 machines, where we don't prepack and the expected weight format is not known at the PyTorch level.

@Xia-Weiwen (Contributor)

> we don't prepack

Why not? I didn't see any difference between aarch64 and x86 when you call these APIs. Are there any particular reasons that prevent you from doing that?

> the expected weight format is not known at the PyTorch level

You don't have to know the format; just provide some info and let oneDNN determine it.

@Ryo-not-rio (Author)

@Xia-Weiwen I have implemented your suggestion in pytorch/pytorch#139387. Please have a look and let me know if you have any feedback, thank you

@Xia-Weiwen (Contributor)

@Ryo-not-rio Sure. Also cc @jgong5 for review.

@Ryo-not-rio (Author)

Hi @jgong5 @Xia-Weiwen,
We would like to revisit this PR now that we've conducted extensive research and determined that the caching approach is the most effective solution going forward. Previously, we experimented with two PyTorch-based strategies.
The first approach introduced a new frontend function that would allow users to opt in to packing the weights, like so:

model = torch.nn.Module()
model = pack_linear_weights(model)

This was met with a request to integrate it as a feature that automatically packs the weights, rather than as an opt-in: pytorch/pytorch#139387 (review). It seems that PyTorch is strictly against adding a new frontend API for this functionality.
The second approach therefore integrated the packing into torch.no_grad(): the weights are packed when a user enters a with torch.no_grad() block and unpacked when the user exits the block. This allows us to pack the weights without adding a new frontend API.
However, there are three fundamental issues with implementing this functionality in PyTorch. First, as pointed out here, changing the state of a model goes against the philosophy of eager-mode execution, where everything is expected to happen as it is requested.
Second, packing is unnecessary for models that issue GEMV calls rather than GEMMs. In PyTorch, GEMM calls are sent to Compute Library, so they require reorders and therefore benefit from packing, whereas GEMVs are sent to OpenBLAS, which does not require packing. This becomes a problem when a change in input shapes turns a GEMM into a GEMV at runtime, as happens with autoregressive models: the weights then have to be unpacked, adding overhead, and we observed packed autoregressive models running even slower than non-packed ones. We also tried keeping the weights packed and sending the packed GEMVs to Compute Library as well, but, as expected, this caused a regression.
Finally, this should be an opt-in feature, as mentioned in pytorch/pytorch#139387 (review): if a user ran the model on a GPU, for example, or ran a packed model on a machine without MKLDNN support, the packed model would crash.
Our caching approach in ideep addresses all of these issues without conflicting with PyTorch's guiding principles, offering both flexibility and performance benefits. We truly believe this is the optimal path forward and would greatly appreciate your reconsidering this PR for a prompt merge.

@yanbing-j (Contributor)

Also cc @mingfeima to review.

@Xia-Weiwen (Contributor) commented Feb 12, 2025

Hi @Ryo-not-rio, thanks for your feedback. Can you describe the usage scenario a little more? We don't usually expect peak performance from eager mode; it is more likely to be used in cases where high performance is not required, and to get the best performance one should consider Inductor or the JIT. If your use case is not a common one, would it be OK to implement this simply in the frontend, in your own Python script or project, rather than in PyTorch core?

@Ryo-not-rio (Author)

Closing this PR, as we will be trying a new approach.
