
[Docs][V1] Prefix caching design #12598

Merged (4 commits) on Jan 31, 2025
Conversation

@comaniac (Collaborator) commented Jan 31, 2025

  • Create v1 design document section in docs.
  • Add prefix caching design doc.

@WoosukKwon @ywang96

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the documentation Improvements or additions to documentation label Jan 31, 2025
@WoosukKwon (Collaborator) left a comment

LGTM! The doc is very clear and contains all the details. 👍 👍 👍

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
@comaniac comaniac enabled auto-merge (squash) January 31, 2025 20:04
@comaniac comaniac added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 31, 2025
@simon-mo simon-mo disabled auto-merge January 31, 2025 20:30
@simon-mo simon-mo merged commit 60bcef0 into vllm-project:main Jan 31, 2025
18 of 35 checks passed
@ywang96 (Member) left a comment

Thanks for putting the effort in making these figures!

@comaniac comaniac deleted the v1-apc branch January 31, 2025 23:14
mawong-amd added a commit to ROCm/vllm that referenced this pull request Feb 1, 2025
commit 5d5071c
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Sat Feb 1 01:13:23 2025 +0000

    reduce split kv amount

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 5fe1d1d
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Sat Feb 1 00:56:45 2025 +0000

    format

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 0d66687
Author: Simon Mo <simon.mo@hey.com>
Date:   Fri Jan 31 16:39:19 2025 -0800

    Update loader.py

    Co-authored-by: Michael Goin <mgoin64@gmail.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 5002734
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Sat Feb 1 00:14:14 2025 +0000

    simplification

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit fac827f
Merge: db2c583 44bbca7
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Sat Feb 1 00:09:36 2025 +0000

    Merge remote-tracking branch 'origin/main' into mla-fp8

commit db2c583
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Sat Feb 1 00:06:10 2025 +0000

    filter compressed tensor models better

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit e144da8
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Fri Jan 31 18:41:35 2025 -0500

    Update vllm/model_executor/model_loader/loader.py

    Co-authored-by: Simon Mo <simon.mo@hey.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 1621381
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Fri Jan 31 18:41:22 2025 -0500

    Update vllm/model_executor/model_loader/loader.py

    Co-authored-by: Simon Mo <simon.mo@hey.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 9829fae
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 23:40:12 2025 +0000

    misc

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 44bbca7
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date:   Fri Jan 31 17:38:48 2025 -0600

    [Doc] int4 w4a16 example (vllm-project#12585)

    Based on a request by @mgoin, with @kylesayrs we have added an example
    doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
    quantization example and the one available in
    [`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py).

    FIX #n/a (no issue created)

    @kylesayrs and I have discussed a couple additional improvements for the
    quantization docs. We will revisit at a later date, possibly including:
    - A section for "choosing the correct quantization scheme/ compression
    technique"
    - Additional vision or audio calibration datasets

    ---------

    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>

commit 60808bd
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date:   Fri Jan 31 23:38:35 2025 +0000

    [Doc] Improve installation signposting (vllm-project#12575)

    - Make device tab names more explicit
    - Add comprehensive list of devices to
    https://docs.vllm.ai/en/latest/getting_started/installation/index.html
    - Add `attention` blocks to the intro of all devices that don't have
    pre-built wheels/images

    ---------

    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

commit fc54214
Author: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com>
Date:   Fri Jan 31 18:37:30 2025 -0500

    [Feature] Fix guided decoding blocking bitmask memcpy (vllm-project#12563)

    **[Guided decoding performance optimization]** Sending the guided
    decoding bitmask in xgrammar to the GPU
    (`self.token_bitmask.to(scores.device)`) is a blocking operation that
    prevents the CPU from pre-launching the sampler kernels. The CPU waits
    until decode is complete, then copies the bitmask over. This PR makes
    the copy asynchronous by setting `non_blocking=True`.

    (Current) The CPU blocks on a `cudaStreamSynchronize` and only
    launches the sampling kernels after the bitmask is applied. Below is
    the Nsys profile for one decode phase of Llama 3.1 8B.

    ![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824)

    With the optimization, this is no longer the case:

    ![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7)

    ---------

    Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
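
The change described in this commit can be sketched as follows. This is an illustration, not vLLM's actual code: the tensor shape and variable names are made up, and on a CPU-only machine the flag is simply a no-op.

```python
import torch

# Hypothetical bitmask shape; the point is the transfer, not the values.
token_bitmask = torch.zeros(4, 1024, dtype=torch.int32)

device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    # Pinned (page-locked) host memory is required for the host-to-device
    # copy to be truly asynchronous with respect to the CPU.
    token_bitmask = token_bitmask.pin_memory()

# non_blocking=True enqueues the copy on the CUDA stream and returns
# immediately, so the CPU can keep pre-launching the sampler kernels
# instead of stalling on a cudaStreamSynchronize.
mask_on_device = token_bitmask.to(device, non_blocking=True)
```

Note that an async copy only pays off when the source is pinned; a pageable-memory copy silently falls back to a synchronous path.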

commit eb5741a
Author: Tyler Michael Smith <tyler@neuralmagic.com>
Date:   Fri Jan 31 18:29:11 2025 -0500

    [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (vllm-project#12587)

    Integrates the block-quantized kernels introduced in
    vllm-project#11868 for use in linear
    layers.

    Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

commit 145c2ff
Author: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Date:   Fri Jan 31 18:28:47 2025 -0500

    [Bugfix] Revert MoE Triton Config Default (vllm-project#12629)

    SUMMARY:
    * previous PR for pulling in block configs also changed defaults
    (https://github.com/vllm-project/vllm/pull/11589/files) for FP8
    * this broke L4 MoE since there was not enough SHM for the default
    configuration
    * this reverts the non-block example to the default

    Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

commit 415f194
Author: Kevin H. Luu <kevin@anyscale.com>
Date:   Fri Jan 31 13:39:36 2025 -0800

    [release] Add input step to ask for Release version (vllm-project#12631)

    Adds an input step that asks for the release version, instead of
    having to create a new build with the release version passed in as an
    env var.

commit 4251506
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 21:26:13 2025 +0000

    fixes

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit c9d72cb
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 21:17:23 2025 +0000

    more cleanup

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 3cdd2ce
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 21:16:42 2025 +0000

    cleanup

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 89003c4
Author: Chen Zhang <zhangch99@outlook.com>
Date:   Sat Feb 1 05:13:04 2025 +0800

    [v1][Bugfix] Add extra_keys to block_hash for prefix caching (vllm-project#12603)

    This PR adds extra keys to the block hash so that two blocks with the
    same token IDs but different extra_keys in their parent blocks get
    different hash values. For example, it generates different hash values
    for the second block of the following two requests:
    ```python
    request1 = make_request(
        request_id=0,
        prompt_token_ids=[_ for _ in range(6)],
        mm_positions=[{"offset": 0, "length": 3},
                      {"offset": 3, "length": 3}],
        mm_hashes=["hash1", "hash2"],
    )
    request2 = make_request(
        request_id=1,
        prompt_token_ids=[_ for _ in range(6)],
        mm_positions=[{"offset": 0, "length": 3},
                      {"offset": 3, "length": 3}],
        mm_hashes=["hash3", "hash2"],
    )
    ```
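
A minimal sketch of the idea in this commit, with the caveat that it is not vLLM's implementation: each block's hash chains the parent block's hash with the block's token IDs and any extra keys (e.g. multi-modal input hashes), so identical blocks under different parents hash differently.

```python
import hashlib
import pickle
from typing import Any, Optional

def hash_block(parent_hash: Optional[bytes],
               token_ids: tuple[int, ...],
               extra_keys: tuple[Any, ...] = ()) -> bytes:
    """Chain the parent block's hash with this block's token IDs and
    extra keys. Illustrative only; vLLM's real scheme differs."""
    payload = pickle.dumps((parent_hash, token_ids, extra_keys))
    return hashlib.sha256(payload).digest()

# First blocks: same tokens but different mm hashes -> different hashes.
h1a = hash_block(None, (0, 1, 2), ("hash1",))
h2a = hash_block(None, (0, 1, 2), ("hash3",))

# Second blocks: identical tokens AND identical extra keys, but the
# parent hashes differ, so the chained hashes still differ.
h1b = hash_block(h1a, (3, 4, 5), ("hash2",))
h2b = hash_block(h2a, (3, 4, 5), ("hash2",))
assert h1b != h2b
```

Without the parent hash in the payload, the two second blocks would collide and the cache could serve a block computed under the wrong multi-modal context.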

    ---------

    Signed-off-by: Chen Zhang <zhangch99@outlook.com>

commit f51cbe0
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 21:04:22 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 3d12a04
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 20:45:14 2025 +0000

    working but messy

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 60bcef0
Author: Cody Yu <hao.yu.cody@gmail.com>
Date:   Fri Jan 31 12:30:46 2025 -0800

    [Docs][V1] Prefix caching design (vllm-project#12598)

    - Create v1 design document section in docs.
    - Add prefix caching design doc.

    @WoosukKwon @ywang96

    ---------

    Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

commit 847f883
Author: Cody Yu <hao.yu.cody@gmail.com>
Date:   Fri Jan 31 12:30:33 2025 -0800

    [Git] Automatically sign-off commits (vllm-project#12595)

    It's very annoying when I forget to add `-s` in `git commit` to
    sign off, because I then need to run `git rebase HEAD~1 --signoff` and
    `git push -f` to fix the DCO. This PR adds a hook that signs off
    commits automatically when `-s` is missing. The only user-facing
    change is that users now have to install two hooks, so instead of just

    ```
    pre-commit install
    ```

    Now we need to

    ```
    pre-commit install --hook-type pre-commit --hook-type commit-msg
    ```

    Note that even if users still only install the pre-commit hook, they
    won't get any error in `git commit`. Just the sign-off hook won't run.

    cc @hmellor @youkaichao

    ---------

    Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
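
As a hedged illustration of what such a commit-msg hook does (this is not the hook pre-commit installs, and the function name is made up): it appends a Signed-off-by trailer only when the message doesn't already contain one, so re-running it is harmless.

```shell
# Sketch of a commit-msg hook body: append a Signed-off-by trailer to
# the message file unless one is already present (idempotent).
add_signoff() {
  msg_file="$1"
  signoff="$2"
  grep -qs '^Signed-off-by:' "$msg_file" || printf '\n%s\n' "$signoff" >> "$msg_file"
}
```

A real commit-msg hook would be invoked by git with the message file as `$1` and would build the trailer from `git config user.name` and `git config user.email`.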

commit 325f679
Author: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Date:   Fri Jan 31 15:06:39 2025 -0500

    [BugFix] Fix Torch.Compile For DeepSeek (vllm-project#12594)

    Co-authored-by: simon-mo <xmo@berkeley.edu>

commit 548ec44
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 19:13:22 2025 +0000

    simon changes

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit a57cd3d
Merge: 076cbe5 cabaf4e
Author: simon-mo <simon.mo@hey.com>
Date:   Fri Jan 31 07:52:26 2025 +0000

    Merge branch 'main' of github.com:vllm-project/vllm into mla-fp8

commit 076cbe5
Merge: 0ccbcce a1fc18c
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Thu Jan 30 23:31:41 2025 -0500

    Merge branch 'main' into mla-fp8

commit 0ccbcce
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 04:29:17 2025 +0000

    deepseek v3 support

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 645622c
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 03:08:36 2025 +0000

    cleanup

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 2d61054
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 03:03:07 2025 +0000

    cleanup

    Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit f2b2500
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 02:47:05 2025 +0000

    Fix TP > 1 cuda graphs

    Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 433322b
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Fri Jan 31 02:26:11 2025 +0000

    Revert "add cuda graph support"

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 31c34bf
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 23:06:09 2025 +0000

    ci fix

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 54ba87d
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 21:23:09 2025 +0000

    add cuda graph support

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 5afc1bf
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 20:58:53 2025 +0000

    fix mypy

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit cfb2d26
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 19:42:36 2025 +0000

    fix mypy

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 37e39f4
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 18:04:58 2025 +0000

    fix failing test

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 0881475
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 17:18:55 2025 +0000

    disable MLA for v3 for now

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 4a46014
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Thu Jan 30 11:12:48 2025 -0500

    Update vllm/attention/backends/mla/utils.py

    Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 09d814c
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 15:11:58 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 8bdc14a
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 14:09:46 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit d27826d
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Thu Jan 30 08:51:42 2025 -0500

    Update vllm/config.py

    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 7487429
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 04:00:26 2025 +0000

    renaming for consistency

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 634eee6
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 03:52:59 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 31b802c
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Wed Jan 29 22:51:37 2025 -0500

    Update vllm/attention/backends/mla/utils.py

    Co-authored-by: Michael Goin <mgoin64@gmail.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 068e672
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Wed Jan 29 22:46:43 2025 -0500

    Update utils.py

    Co-authored-by: Michael Goin <mgoin64@gmail.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit f2cac91
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 03:11:43 2025 +0000

    more cleanups

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit c34e5ca
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 03:02:58 2025 +0000

    fix VLLM_MLA_PERFORM_MATRIX_ABSORPTION=0

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit 27ad92c
Author: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Date:   Thu Jan 30 02:29:40 2025 +0000

    squashed commits

    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
    Co-authored-by: simon-mo <simon.mo@hey.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Isotr0py pushed a commit to Isotr0py/vllm that referenced this pull request Feb 2, 2025
youngkent pushed a commit to youngkent/vllm that referenced this pull request Feb 3, 2025
srikanthsrnvs pushed a commit to srikanthsrnvs/vllm that referenced this pull request Feb 3, 2025
sahelib25 pushed a commit to krai/vllm that referenced this pull request Feb 3, 2025
fxmarty-amd pushed a commit to fxmarty-amd/vllm that referenced this pull request Feb 7, 2025
NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Feb 7, 2025
ShangmingCai pushed a commit to ShangmingCai/vllm that referenced this pull request Feb 10, 2025
GWS0428 pushed a commit to GWS0428/VARserve that referenced this pull request Feb 12, 2025
Labels
documentation Improvements or additions to documentation ready ONLY add when PR is ready to merge/full CI is needed
4 participants