
rfcs: graph api: Constant Block Weight Mechanism Graph API design #2280

Open
wants to merge 1 commit into base: rfcs

Conversation

xiang1guo (Contributor):

The RFC proposes a design for handling the constant block weight mechanism in the graph API.

Rendered version: link

xiang1guo added the "RFC: A design document" label on Dec 17, 2024
xiang1guo requested a review from a team on Dec 17, 2024 at 16:02
xiang1guo self-assigned this on Dec 17, 2024
xiang1guo requested a review from a team as a code owner on Dec 17, 2024 at 16:02
> graph library cannot directly store the constant weight cache in the compiled
> partition cache. Instead, it needs to distinguish between these different
> constant weight tensors. However, distinguishing constant weights requires
> additional information from the user side. This raises new considerations for
Contributor:

The problematic case is not quite clear to me: is it that different data may be supplied to a single compiled partition, so that caching it is not possible? When exactly does re-use of a compiled partition happen? What's the real-world scenario for it?

xiang1guo (Contributor, Author) replied on Dec 18, 2024:

Hi, Dima, thanks for the review!

> Is it that different data may be supplied to a single compiled partition, so that caching it is not possible? When exactly does re-use of a compiled partition happen?

Consider a whole graph/model that contains two subgraphs. The first subgraph compiles and caches its compiled partition together with a constant weight cache. The compilation of the second subgraph may then hit the first compiled partition in the cache and reuse the constant tensor cache stored in that compiled partition cache (with the constant cache feature enabled). However, the second subgraph has a different weight input, so the constant tensor cache for the weight input in the first compiled partition cannot be reused, which leads to an accuracy issue.
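
A minimal C++ sketch of this hazard, assuming the oneDNN v3.x graph API and illustrative shapes/IDs (tensor creation and execution are elided; the comments state the assumed cache behavior, this is not the actual integration code):

```cpp
#include <vector>

#include "oneapi/dnnl/dnnl_graph.hpp"

namespace dg = dnnl::graph;

// Build and compile a single-MatMul graph whose weight is marked as a
// constant logical tensor. Both call sites below use the *same* IDs
// (0, 1, 2), as a per-op integration would.
static dg::compiled_partition compile_matmul(const dnnl::engine &eng) {
    dg::logical_tensor src {0, dg::logical_tensor::data_type::f32, {8, 16},
            dg::logical_tensor::layout_type::strided};
    dg::logical_tensor wei {1, dg::logical_tensor::data_type::f32, {16, 32},
            dg::logical_tensor::layout_type::strided,
            dg::logical_tensor::property_type::constant};
    dg::logical_tensor dst {2, dg::logical_tensor::data_type::f32, {8, 32},
            dg::logical_tensor::layout_type::strided};

    dg::op matmul {0, dg::op::kind::MatMul, {src, wei}, {dst}, "matmul"};

    dg::graph g {dnnl::engine::kind::cpu};
    g.add_op(matmul);
    g.finalize();

    auto parts = g.get_partitions();
    return parts[0].compile({src, wei}, {dst}, eng);
}

int main() {
    dnnl::engine eng {dnnl::engine::kind::cpu, 0};

    // "Subgraph A": the first compile populates the CP cache; its first
    // execution would also process and cache the constant weight.
    auto cp_a = compile_matmul(eng);

    // "Subgraph B": identical IDs, shapes, and dtypes, so this compile can
    // hit the CP cache and return a compiled partition that shares A's
    // constant tensor cache, even though B's weight *data* is different.
    auto cp_b = compile_matmul(eng);

    // Executing cp_b with B's weights may then silently reuse A's cached,
    // already-processed weights: the accuracy issue described above.
    (void)cp_a; (void)cp_b;
    return 0;
}
```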

> What's the real-world scenario for it?

Currently we are integrating the graph API into PyTorch for direct optimization of the SDPA op. In that workflow, each SDPA compilation and execution with the graph API happens in every iteration of the PyTorch SDPA operation call. Thus a compiled partition cache hit and reuse may occur, but with different inputs.

But we are safe for now, since SDPA doesn't involve any constant tensor cache (constant weights), so compiled partition cache reuse won't introduce any issues. However, we are planning for other ops that may need the constant tensor cache, such as the MLP optimization on the OpenVINO side.
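
For context, the constant tensor cache mentioned here is a global toggle. A small sketch, assuming the `dnnl::graph::set_constant_tensor_cache` / `get_constant_tensor_cache` entry points of recent oneDNN releases, of the workaround a framework could use today (not what the RFC proposes):

```cpp
#include <iostream>

#include "oneapi/dnnl/dnnl_graph.hpp"

int main() {
    // Global toggle for the constant tensor cache: 0 = disabled, 1 = enabled.
    dnnl::graph::set_constant_tensor_cache(1);
    std::cout << "constant tensor cache enabled: "
              << dnnl::graph::get_constant_tensor_cache() << "\n";

    // A framework could disable the cache around ops whose "constant"
    // weights actually change between calls, but then the weights are
    // reprocessed on every execution; the RFC's mechanism aims to avoid
    // exactly this trade-off.
    dnnl::graph::set_constant_tensor_cache(0);
    return 0;
}
```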

Contributor:

> The compilation of the second subgraph may then hit the first compiled partition in the cache

Doesn't a compiled partition have an ID which should be checked by the CP cache? Would the underlying logical tensors' IDs be ignored in the case of this potential hit as well?

xiang1guo (Contributor, Author) replied on Dec 18, 2024:

> The compilation of the second subgraph may then hit the first compiled partition in the cache
>
> Doesn't a compiled partition have an ID which should be checked by the CP cache? Would the underlying logical tensors' IDs be ignored in the case of this potential hit as well?

We do have IDs (both logical tensor and op IDs) hashed into the key of the CP cache. This works very well for the legacy integration in IPEX and ITEX, since the whole model is mapped to the graph API with unique IDs.

We are trying to design a new API/mechanism for the new integration approach of direct op optimization. Such an integration lives inside a single operation, so different call sites may use the same IDs to create their ops and logical tensors, and the CP cache will hit. In fact, for this integration approach compilation happens on every iteration, so more CP cache hits help to eliminate compilation time. That's why we want to address the issue without hurting the CP cache hit rate.
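
To illustrate why identical IDs collide, here is a self-contained toy model of a compiled-partition cache key; the real oneDNN key is internal, and the fields and comparison below are hypothetical:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Toy model of a CP-cache key: op IDs and logical tensor IDs feed the key,
// so two call sites that both number their entities 0, 1, 2, ... produce
// identical keys. (In practice shapes, data types, attributes, and engine
// kind also contribute.)
struct cp_cache_key {
    std::vector<std::size_t> op_ids;
    std::vector<std::size_t> lt_ids; // logical tensor IDs

    bool operator==(const cp_cache_key &o) const {
        return op_ids == o.op_ids && lt_ids == o.lt_ids;
    }
};

int main() {
    // Call sites A and B, each created with local IDs starting from 0,
    // as in a per-op integration.
    cp_cache_key a {{0}, {0, 1, 2}};
    cp_cache_key b {{0}, {0, 1, 2}};

    // Identical keys: B's compile returns A's cached compiled partition,
    // including any constant tensor cache attached to it, even though B's
    // weight data behind logical tensor ID 1 is different.
    std::cout << (a == b ? "cache hit" : "cache miss") << "\n";
    return 0;
}
```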
