[Executorch][llm] Add ring buffer based kv cache and mask calculation to MHA #10609

kimishpatel · 2025-05-01T17:19:11Z

Stack from ghstack (oldest at bottom):

Building on previous work, MHA can now use a ring buffer KV cache. When the ring buffer cache is used,
we query the attention mask from the KV cache and pass it to SDPA instead of using the precalculated mask.

In the process we had to adjust the ring buffer implementation so that it keeps the context of the
full sliding window. See the comments in the code for details.
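
For readers unfamiliar with the idea, here is a minimal sketch of how a mask can be derived from a ring buffer cache. It is not the code in this PR: the class `RingPositions`, the helper `visible_positions`, and the choice of a capacity larger than the window are all hypothetical, and plain Python lists stand in for tensors. Each slot records the absolute token position it currently holds, and a query may attend to a slot only if that position is not in the future and lies within the sliding window.

```python
from typing import List


class RingPositions:
    """Hypothetical helper: tracks the absolute position stored in each cache slot."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # -1 marks a slot that has never been written.
        self.slot_pos: List[int] = [-1] * capacity

    def update(self, start_pos: int, seq_len: int) -> None:
        # Write seq_len new tokens, wrapping around the buffer.
        for i in range(seq_len):
            pos = start_pos + i
            self.slot_pos[pos % self.capacity] = pos

    def mask(self, start_pos: int, seq_len: int, window: int) -> List[List[bool]]:
        # One row per query token, one column per cache slot.
        rows = []
        for i in range(seq_len):
            q = start_pos + i
            rows.append([0 <= p <= q and q - p < window for p in self.slot_pos])
        return rows


def visible_positions(cache: RingPositions, start_pos: int, seq_len: int,
                      window: int) -> List[List[int]]:
    # Translate each mask row back into the sorted positions it exposes.
    m = cache.mask(start_pos, seq_len, window)
    return [sorted(p for p, ok in zip(cache.slot_pos, row) if ok) for row in m]


window = 4

# Capacity equal to the window: a 2-token update overwrites position 1
# before the query at position 4 has been able to attend to it.
small = RingPositions(capacity=window)
small.update(0, 4)
small.update(4, 2)
tight = visible_positions(small, 4, 2, window)   # [[2, 3, 4], [2, 3, 4, 5]]

# A larger capacity (here 2x the window, an assumption) keeps the full window.
large = RingPositions(capacity=2 * window)
large.update(0, 4)
large.update(4, 2)
full = visible_positions(large, 4, 2, window)    # [[1, 2, 3, 4], [2, 3, 4, 5]]
```

In this sketch, a buffer exactly as large as the window loses context as soon as more than one token is written per step, which is one plausible reading of why the ring buffer needed adjusting; the PR's code comments remain the authoritative explanation.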

Differential Revision: D73891425


pytorch-bot · 2025-05-01T17:19:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10609

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 7d3fdd7 with merge base 1ae8c2c:

NEW FAILURE - The following job has failed:

.github/workflows/build-presets.yml (gh)

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / android / run-emulator (gh) (trunk failure)
The process '/usr/bin/sh' failed with exit code 255

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-05-01T17:19:36Z

This pull request was exported from Phabricator. Differential Revision: D73891425


facebook-github-bot · 2025-05-05T14:07:59Z

This pull request was exported from Phabricator. Differential Revision: D73891425


facebook-github-bot · 2025-05-07T04:04:14Z

This pull request was exported from Phabricator. Differential Revision: D73891425

kimishpatel requested review from lucylq and jackzhxng as code owners May 1, 2025 17:19

facebook-github-bot added the CLA Signed label May 1, 2025

facebook-github-bot added the fb-exported label May 1, 2025

kimishpatel added the release notes: examples label May 5, 2025

digantdesai approved these changes May 5, 2025
