rfcs: graph api: Constant Block Weight Mechanism Graph API design #2280
base: rfcs
Conversation
> graph library cannot directly store the constant weight cache in the compiled
> partition cache. Instead, it needs to distinguish between these different
> constant weight tensors. However, distinguishing constant weights requires
> additional information from the user side. This raises new considerations for
The problematic case is not quite clear to me: is the issue that different data can be supplied to a single compiled partition, so caching it is not possible? When exactly does reuse of a compiled partition happen? What is the real-world scenario for it?
Hi, Dima, thanks for the review!
> is the issue that different data can be supplied to a single compiled partition, so caching it is not possible? When exactly does reuse of a compiled partition happen?
Consider a model that contains two subgraphs. The first subgraph is compiled, and its compiled partition, including the constant weight cache, is stored in the compiled partition cache. With the constant cache feature enabled, compilation of the second subgraph may hit that cache entry (it's possible) and reuse the constant tensor cache stored inside it. However, the second subgraph has a different weight input, so the cached constant tensor from the first compiled partition cannot be reused for it, leading to an accuracy issue.
What's the real-world scenario for it?
We are currently integrating the graph API in PyTorch for direct SDPA op optimization. In this workflow, each SDPA call compiles and executes through the graph API in every iteration, so a compiled partition cache hit and reuse may occur with different inputs.
We are safe for now, since SDPA does not contain any constant tensor cache (constant weights), so compiled partition cache reuse introduces no issues. However, we are planning other ops that may need the constant tensor cache, such as the MLP optimization on the OpenVINO side.
> the second subgraph compilation may hit (it's possible) the first compiled partition cache

Doesn't a compiled partition have an ID that should be checked by the CP cache? Would the underlying logical tensors' IDs also be ignored in such a potential hit?
> Doesn't a compiled partition have an ID that should be checked by the CP cache? Would the underlying logical tensors' IDs also be ignored in such a potential hit?
We do hash IDs (both logical tensor and op IDs) into the CP cache key. This works very well for the legacy integrations in IPEX and ITEX, because there the whole model is mapped to the graph API with unique IDs.
We are designing a new API/mechanism for the new integration approach of direct op optimization. Such an integration lives inside a single operation, so the same IDs may be reused to create ops and logical tensors, and the CP cache will hit. In fact, for this integration approach compilation happens in every iteration, so more CP cache hits help eliminate compilation time. That is why we want to address the issue without hurting the CP cache hit rate.
Force-pushed from 6d686ac to 2560609.
This RFC proposes a design for handling the constant block weight mechanism in the graph API.
Rendered version: link