Leverage update_cache op to reduce overhead from cache update #69
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
378d613 to 05c39f1
The perf boost seems significant! Thanks for adding the custom KV cache to HF models
| "-m", "--model", type=str, required=True, help="Model ID on huggingface.co or path on disk to load model from." | ||
| "-m", | ||
| "--model", | ||
| type=str, | ||
| required=True, | ||
| help="Model ID on huggingface.co or path on disk to load model from.", |
Can you run pip install '.[quality]' and then make style? That runs the linter and fixes the Code Quality failures in the CI.
Is the style requirement here following the hf/optimum guidelines or the ET ones? I don't know whether we are running different linters between ET and here.
Yeah, the guidelines are the hf/optimum ones: https://github.com/huggingface/optimum/blob/main/CONTRIBUTING.md#how-to-create-a-pull-request
05c39f1 to d6d33eb
Thanks for iterating on this!
Can you rerun the tests? Some of the failures suggest there is no space left on the device. I will fix the code quality one.
3d0deb8 to 0f02dc9
| @pytest.mark.skipif( | ||
| parse(transformers.__version__) < parse("4.52.0"), | ||
| reason="Only available on transformers >= 4.52.0", | ||
| ) |
BTW, I noticed this test is skipped in the CI: https://github.com/huggingface/optimum-executorch/actions/runs/15429343459/job/43424006619?pr=69#step:5:876
It's skipped because the installed version is 4.52.0.dev0 (transformers is installed from source in the CI because 4.52 had not been released yet when I enabled gemma3), and a dev version compares as lower than 4.52.0. To fix this, compare either the base version or against the dev0 version. Example here:
optimum-executorch/tests/models/test_modeling_gemma3.py
Lines 43 to 46 in 34cece4
| @pytest.mark.skipif( | |
| is_transformers_version("<", "4.52.0.dev0"), | |
| reason="Only available on transformers >= 4.52.0.dev0", | |
| ) |
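To illustrate why the original condition skips the test, here is a minimal sketch using `packaging.version` directly. The variable names are made up for illustration, and `is_transformers_version` from the snippet above is not called here; this only shows the ordering that the two suggested fixes rely on.

```python
from packaging.version import parse

installed = parse("4.52.0.dev0")  # version string reported in the CI run
required = parse("4.52.0")

# Under PEP 440 ordering a dev release sorts before the final release,
# so the original skipif condition is true and the test gets skipped.
skipped_before = installed < required

# Option 1: compare only the base version, which drops the ".dev0" suffix.
skipped_with_base = parse(installed.base_version) < required

# Option 2: compare against the dev release itself.
skipped_with_dev = installed < parse("4.52.0.dev0")

print(skipped_before, skipped_with_base, skipped_with_dev)
```

Either option makes the condition false for a source install of 4.52.0.dev0, so the test would run.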
ok let me fix that. BTW there are other test failures that seem unrelated
0f02dc9 to 975143f
975143f to 2ddcaa4
Summary:
This is likely a short-lived optimization: in the future we can replace the update_cache op with the index_put_ op.
That is what the original StaticCache does; however, it requires a cache transpose for custom_sdpa (which can also be fixed).
We will leverage the custom cache for now, although it should not be needed in the near future. This option will, however, let us bypass any transposes if the need continues.
Test Plan:
CI
Reviewers:
Subscribers:
Tasks:
Tags:
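A rough sketch of the layout trade-off described in the summary, in plain PyTorch. It assumes the StaticCache-style buffer is laid out as [batch, heads, seq, head_dim] and that the attention op wants [batch, seq, heads, head_dim]; neither layout is stated in this thread, the sizes are hypothetical, and the update_cache op itself is not called.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
batch, heads, max_seq, head_dim = 1, 3, 8, 4

# New key states for two tokens at positions 3 and 4, laid out as
# [batch, heads, tokens, head_dim] (assumed layout).
cache_position = torch.tensor([3, 4])
new_k = torch.randn(batch, heads, cache_position.numel(), head_dim)

# Variant A: cache kept as [batch, heads, seq, head_dim]; index_put_ writes
# along the sequence dimension (dim 2). None entries behave like ":".
cache_bhsd = torch.zeros(batch, heads, max_seq, head_dim)
torch.ops.aten.index_put_(cache_bhsd, [None, None, cache_position], new_k)

# An attention op that wants [batch, seq, heads, head_dim] then needs a
# transpose of the whole cache on every step.
cache_for_sdpa = cache_bhsd.transpose(1, 2)

# Variant B: keep the cache in [batch, seq, heads, head_dim] from the start
# and write along dim 1, so only the small new slice is transposed.
cache_bshd = torch.zeros(batch, max_seq, heads, head_dim)
torch.ops.aten.index_put_(cache_bshd, [None, cache_position], new_k.transpose(1, 2))

# Both variants hold the same data once the layouts are aligned.
same = torch.equal(cache_for_sdpa, cache_bshd)
```

Variant B is the "bypass any transposes" case: the cache already matches the layout the attention op consumes.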