Cache reorder tensors #331

Closed

Conversation

@Ryo-not-rio commented Aug 21, 2024

Reordered weights are cached in an LRU cache on aarch64, avoiding unnecessary repeat reordering. To take advantage of this feature, the IDEEP_CACHE_MATMUL_REORDERS environment variable needs to be set to 1 and LRU_CACHE_CAPACITY needs to be set to a meaningful value.
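For illustration, a minimal sketch of how a workload might opt in, assuming the two variables named above are read by an ideep build containing this change (the capacity value is purely illustrative):

```python
import os

# Opt in to caching reordered matmul weights and give the LRU cache a
# meaningful capacity; both values below are illustrative.
os.environ["IDEEP_CACHE_MATMUL_REORDERS"] = "1"
os.environ["LRU_CACHE_CAPACITY"] = "1024"

import torch  # imported after setting the variables so the backend sees them

model = torch.nn.Linear(4096, 4096).eval()
x = torch.randn(8, 4096)
with torch.no_grad():
    for _ in range(10):
        model(x)  # repeated matmuls can reuse the cached reordered weight
```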

Speedups on aarch64:
gemma 1 thread - 3.27x
gemma 4 threads - 3.64x
gemma 16 threads - 3.73x

clip 1 thread - 1.04x
clip 4 threads - 1.06x
clip 16 threads - 1.09x

Memory usage increase on aarch64:
gemma - 1.36x
clip - 1.08x

@Ryo-not-rio (Author)

Hi, could @Xia-Weiwen, @yanbing-j, @jgong5 have a look at this please? Thank you

@Xia-Weiwen (Contributor)

Hi @Ryo-not-rio, thanks for your PR. I have two questions. Weights are supposed to be reordered ahead of time so that reordering can be avoided at runtime. Why does that not work in your case? And will your change also benefit x86?

@Ryo-not-rio (Author) commented Aug 21, 2024

@Xia-Weiwen In the model we've tested, it doesn't seem that weights are reordered ahead of time, since the do_compute function is called at runtime. Could you point me to the part where the reorders should be happening ahead of time? I suspect it would also benefit x86 if we decide to enable it there.

@jgong5 commented Aug 22, 2024

> @Xia-Weiwen In the model we've tested, it doesn't seem that weights are reordered ahead of time, since the do_compute function is called at runtime. Could you point me to the part where the reorders should be happening ahead of time? I suspect it would also benefit x86, but I will post performance numbers for x86 shortly.

The upfront reorder happens in the upper-level SW stack, in Inductor, when the model is frozen: https://github.com/pytorch/pytorch/blob/6bddfb95463332070cec0aaedd103f42a74564d0/torch/_inductor/fx_passes/mkldnn_fusion.py#L1028
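For context, a rough sketch of that upfront path, assuming a PyTorch version where Inductor freezing is controlled by torch._inductor.config.freezing (verify the flag name for your version):

```python
import torch
import torch._inductor.config as inductor_config

# With freezing enabled, Inductor constant-folds and prepacks weights at
# compile time, so no reorder is needed inside the compiled forward pass.
inductor_config.freezing = True

model = torch.nn.Linear(4096, 4096).eval()
compiled = torch.compile(model)

with torch.no_grad():
    out = compiled(torch.randn(8, 4096))
```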

Please note that we will be moving ideep into PyTorch core. Adding runtime packing support in ideep that depends on special environment variables complicates things. Can we leverage the upfront weight-packing design in PyTorch?

@Xia-Weiwen (Contributor)

@Ryo-not-rio You will need to run prepare before compute; the weight is reordered when you call prepare. You may also try what Jiong suggested above.

@Ryo-not-rio (Author)

@jgong5 This change is meant to improve performance in eager mode, so we can't take advantage of Inductor. @Xia-Weiwen I've also checked, and it looks like prepare() is being called, but since reorder_if_differ() lives in the do_compute() function, which is called after prepare(), reorders are still happening.

@jgong5 commented Aug 23, 2024

> @jgong5 This change is meant to improve performance in eager mode, so we can't take advantage of Inductor. @Xia-Weiwen I've also checked, and it looks like prepare() is being called, but since reorder_if_differ() lives in the do_compute() function, which is called after prepare(), reorders are still happening.

In my opinion, if we want to do that for eager mode, it is more Pythonic to leverage PyTorch frontend APIs (e.g. https://github.com/pytorch/pytorch/blob/7c93c4f8cf343774d26486bedc5a85892df08024/torch/utils/mkldnn.py, https://github.com/pytorch/pytorch/blob/7c93c4f8cf343774d26486bedc5a85892df08024/torch/fx/experimental/optimization.py#L131) to let the user opt in, instead of via a low-level environment variable that controls the behavior of the underlying library. Another benefit of a frontend API is that you don't have to keep two copies of the weights: the old weights can be swapped out.
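As a sketch of that kind of frontend opt-in, using the existing torch.utils.mkldnn conversion utility (module coverage is limited and availability should be confirmed for the build in question):

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
).eval()

# to_mkldnn() swaps supported modules (e.g. Linear, Conv2d) for MKLDNN-backed
# versions whose weights are reordered once, up front, so the original dense
# weights do not need to be kept around.
mkldnn_model = mkldnn_utils.to_mkldnn(model)

with torch.no_grad():
    y = mkldnn_model(torch.randn(8, 1024))
```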

@Ryo-not-rio (Author)

@jgong5 Unfortunately, I don't think we can leverage the frontend API, since we don't know the required weight format at the frontend level for aarch64: the weight format is determined inside oneDNN after requesting it from Compute Library here.

@Xia-Weiwen (Contributor)

Hi @Ryo-not-rio. On x86, we reorder weights ahead of time, which is called prepacking. Examples are https://github.com/pytorch/pytorch/blob/cf11fc0dcbb9c907cf6e851109b92f4157e445c9/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp#L292 and https://github.com/pytorch/pytorch/blob/cf11fc0dcbb9c907cf6e851109b92f4157e445c9/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp#L493
Prepacking can be done when you prepare/quantize the model for inference, or on the first inference run, so we don't have to reorder the weight at runtime in most cases. May I know what difficulties prevent you from doing that?
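For illustration, a minimal sketch of that ahead-of-time path using the quantized linear prepack op referenced above (scales and zero points are arbitrary, and a quantized engine such as fbgemm or qnnpack must be available):

```python
import torch

weight = torch.randn(256, 512)
bias = torch.randn(256)

# Quantize and prepack the weight once, ahead of inference.
qweight = torch.quantize_per_tensor(weight, scale=0.05, zero_point=0, dtype=torch.qint8)
packed = torch.ops.quantized.linear_prepack(qweight, bias)

# At runtime every quantized linear call reuses the packed weight, so no
# further reordering is needed.
x = torch.randn(8, 512)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=64, dtype=torch.quint8)
out = torch.ops.quantized.linear(qx, packed, 0.2, 0)  # output scale, zero point
```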

@Ryo-not-rio (Author)

@Xia-Weiwen This patch targets aarch64 machines, where we don't prepack and the expected weight format is not known at the PyTorch level.

@Xia-Weiwen (Contributor)

> we don't prepack

Why not? I didn't see any difference between aarch64 and x86 when you call these APIs. Are there any particular reasons that prevent you from doing that?

> the expected weight format is not known at the PyTorch level

You don't have to know the format; just provide some info and let oneDNN determine it.

@Ryo-not-rio (Author)

@Xia-Weiwen I have implemented your suggestion in pytorch/pytorch#139387. Please have a look and let me know if you have any feedback, thank you

@Xia-Weiwen (Contributor)

@Ryo-not-rio Sure. Also cc @jgong5 for review.

@Ryo-not-rio (Author)

Hi @jgong5 @Xia-Weiwen,
We would like to revisit this PR now that we've conducted extensive research and determined that the caching approach is the most effective solution going forward. Previously, we experimented with two PyTorch-based strategies.
The first approach introduced a new frontend function that would allow users to opt in to packing the weights, like so:

model = torch.nn.Module()
model = pack_linear_weights(model)

This was met with a request to integrate it as a feature that automatically packs the weights, rather than as an opt-in: pytorch/pytorch#139387 (review). It seems that PyTorch is strictly against adding a new frontend API for this functionality.
The second approach therefore integrated the packing into torch.no_grad(): the weights are packed when a user enters a with torch.no_grad() block and unpacked when the user exits the block. This allows us to pack the weights without adding a new frontend API.
However, there are three fundamental issues with implementing this functionality in PyTorch. First, as pointed out here, changing the state of a model goes against the philosophy of eager-mode execution, where everything is expected to happen as it is requested.
Second, packing is unnecessary for models that issue GEMV calls rather than GEMMs. In PyTorch, GEMM calls are sent to Compute Library, so they require reorders and therefore benefit from packing, whereas GEMVs are sent to OpenBLAS, which does not require packing. This becomes a problem when a change in input shapes turns a GEMM into a GEMV at runtime, as happens with autoregressive models: the weights then have to be unpacked, adding overhead, and we observed packed autoregressive models running even slower than non-packed ones. We also tried keeping the weights packed and sending the packed GEMVs to Compute Library as well, but, as expected, this caused a regression.
Finally, this should be an opt-in feature, as mentioned in pytorch/pytorch#139387 (review): if a user ran the model on a GPU, for example, or ran a packed model on a machine without MKLDNN support, the packed model would crash.
Our caching approach in ideep addresses all of these issues without conflicting with PyTorch's guiding principles, offering both flexibility and performance benefits. We truly believe this is the optimal path forward and would greatly appreciate your reconsidering this PR for a prompt merge.

@yanbing-j (Contributor)

Also cc @mingfeima to review.

@Xia-Weiwen (Contributor) commented Feb 12, 2025

Hi @Ryo-not-rio, thanks for your feedback. Can you describe the usage scenario a little more? We don't usually expect peak performance from eager mode; it is more likely to be used in cases where high performance is not required, and to get the best performance one should consider Inductor or the JIT. If your use case is not a common one, would it be OK to implement this simply in the frontend, in your own Python script or project, rather than in PyTorch core?

@Ryo-not-rio (Author)

Closing this PR, as we will be trying a new approach.
