
Conversation

@vMaroon (Contributor) commented Jul 5, 2025

Purpose

As part of [RFC]: KV-Cache Management Standardization for Interop #20492, and to support the development of llm-d's vLLM-native global KV-Cache indexer, prefix-cache block hashing must be reproducible: given the same prefix-cache key input and configuration, the block hash should be identical, with no constraints on technology-stack choices.

This PR introduces:

  • A new block hashing function, sha256_cbor, which serializes input objects using canonical CBOR (via cbor2) and hashes them with SHA-256 (see the sketch below)
    • The result is truncated to 64 bits to match the current KVEvents schema, which does not yet support full 256-bit hash keys
      • Even so, 64 bits provide extremely low collision odds for practical KV-cache sizes (e.g., a 1M-token cache with 16-token chunking -> ~62k blocks -> ~1 in 10 billion collision probability by the birthday bound), while keeping KVEvents traffic compact
  • A change to the global NONE_HASH initialization logic to use the configured hash function

These changes make the prefix hashing logic reproducible, non-language-specific, and aligned with future cross-system KV-Cache interoperability goals outlined in the RFC.
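
For concreteness, here is a minimal sketch of what such a hashing function might look like (a sketch only; the exact name, signature, and truncation details in the PR may differ):

    import hashlib
    import struct
    from typing import Any

    import cbor2

    def sha256_cbor_64bit(value: Any) -> int:
        # Canonical CBOR yields a single deterministic byte encoding per
        # logical value, independent of the producing language or runtime.
        encoded = cbor2.dumps(value, canonical=True)
        digest = hashlib.sha256(encoded).digest()
        # Keep only the first 8 bytes (64 bits) to fit the KVEvents schema.
        return struct.unpack(">Q", digest[:8])[0]

On the collision estimate above: with n ≈ 62,500 blocks in a 64-bit space, the birthday bound gives p ≈ n²/2^65 ≈ 3.9·10^9 / 3.7·10^19 ≈ 1.1·10^-10, i.e. roughly 1 in 10 billion.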

Test Plan

The relevant test files were updated; no new ones are needed:

  • tests/v1/core/test_kv_cache_utils.py
  • tests/v1/core/test_prefix_caching.py

Profiling

The total difference when hashing a 50k-token request (block size 16) is negligible.

=== System Information ===
Platform: macOS-15.5-arm64-arm-64bit-Mach-O
Processor: arm
Python version: 3.13.5
CPU count: 8
RAM: 32.0 GB
=========================

=== Hash Function Profiling Summary ===
AI workload equivalent per run: 50,000 tokens processed
Profiling config: 1000 runs, 3125 blocks/run, block_size=16
---------------------------------------
hash: mean=0.0012s, std=0.0020s
    Mean time per token: 0.00000002s
sha256: mean=0.0054s, std=0.0003s
    Mean time per token: 0.00000011s
sha256_cbor_64bit: mean=0.0171s, std=0.0044s
    Mean time per token: 0.00000034s
---------------------------------------
Comparison (relative slowdown, higher is slower):
    hash: 1.00x (baseline) mean diff: (0.0000s per 50,000 tokens, 0.00000000s per token)
    sha256: 4.62x  mean diff: (+0.0042s per 50,000 tokens, +0.00000008s per token)
    sha256_cbor_64bit: 14.73x  mean diff: (+0.0159s per 50,000 tokens, +0.00000032s per token)
=======================================

code: https://pastebin.com/7thahB9Y

Test Results

All updated tests pass.


@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @vMaroon, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reproducibility and cross-system compatibility of prefix-cache block hashing. It achieves this by introducing a new sha256_cbor hashing algorithm that leverages canonical CBOR serialization, and by making the initial hash seed (NONE_HASH) configurable to use the selected hashing function. These changes are crucial for supporting future global KV-Cache interoperability goals.

Highlights

  • New Hashing Algorithm for Prefix Caching: Introduced a new sha256_cbor hashing function for prefix-cache block hashing. This function serializes input objects using canonical CBOR (via the cbor2 library) and then hashes them with SHA-256. The resulting 256-bit hash is truncated to 64 bits to align with the current KVEvents schema, ensuring reproducibility across different systems and programming languages.
  • Configurable NONE_HASH Initialization: The global NONE_HASH initialization logic has been refactored into a new init_none_hash function. This allows the initial hash seed for prefix block sequences to be generated using the currently configured hashing algorithm (e.g., sha256_cbor, sha256, or Python's hash), making the system more flexible and consistent.
  • Dependency and Configuration Updates: The cbor2 library has been added as a new dependency in requirements/common.txt. Additionally, the PrefixCachingHashAlgo literal type in vllm/config.py has been updated to include the new sha256_cbor option, along with an expanded docstring explaining its behavior and benefits.
  • Expanded Test Coverage: Existing unit tests in tests/v1/core/test_kv_cache_utils.py and tests/v1/core/test_prefix_caching.py have been updated. They now include sha256_cbor in their parametrization, ensuring that the new hashing algorithm is thoroughly tested across various scenarios, including NONE_HASH initialization, block hashing, and request token hashing.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new reproducible hashing mechanism using CBOR and SHA-256 for prefix caching. My review focuses on the potential risks of hash truncation, an inconsistency in NONE_HASH initialization, and missing test setup calls that could lead to flaky tests.

@vMaroon (Author) commented Jul 5, 2025

The force push adds a sign-off per DCO and addresses Gemini's suggestions.

@yinghai (Contributor) left a comment

Do you see any performance difference with the change of hash function? What if there is multimodal content?

A contributor left a comment:

Have you tested that it generates the same hash in another language, e.g. Rust?

@vMaroon (Author) commented Jul 6, 2025

This was tested in Golang; see this llm-d-kv-cache-manager PR, which reproduces these hashes independently.

Any implementation of the canonical specs should be fine, while also paying attention to endianness.
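
As a rough illustration (reusing the hypothetical sha256_cbor_64bit sketch from the PR description above), reproducibility simply means the same logical input always maps to the same 64-bit key, whatever the stack:

    # Same logical input -> same key, across runs, processes, and any
    # language that implements canonical CBOR + SHA-256 with 64-bit truncation.
    key_a = sha256_cbor_64bit((None, (101, 102, 103), None))
    key_b = sha256_cbor_64bit((None, (101, 102, 103), None))
    assert key_a == key_b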

@vMaroon (Author) commented Jul 6, 2025

@yinghai I haven't profiled the hashing/serialization (this will follow later in a different work-stream), but note that:

  1. SHA256 is already supported as a hashing algorithm in vLLM
  2. CBOR for serialization is widely used and is comparable to MessagePack in speed according to some blogs

I think vLLM should have a few more canonical options for both serialization and hashing, but this combination is a good kickoff. Once I have profiling data and benchmarks I will share them, though I don't think this is a blocker, considering that:

  • The only effect of this PR on vLLM configurations that do not use the new algorithm is in the calculation of NONE_HASH, which no longer uses sha256 when the builtin algorithm is chosen
    • The reasoning is that I believe it is cleaner this way; besides, the person who added sha256 as an option and made NONE_HASH always use sha256 seems mostly interested in that configuration
  • Such a feature is inevitable if considering the mentioned RFC ([RFC]: KV-Cache Interoperability API Standardization #20492)

@orozery (Contributor) commented Jul 8, 2025

Before this PR, and still after this PR, the hashing specification seems very cumbersome to me.

Why is the input for hashing a list[Any]? (cc @comaniac)
I understand that it's easiest to simply not care and throw this to some general serializer (pickle before this PR, cbor2 after).
But on the other hand:

  1. This still makes it complex to match external implementations (like llm-d) and to keep implementations in sync.
  2. It may be more efficient to serialize if you know what your input looks like.

I would prefer to align the input to list[int], where we know the int size (I think it should be determined by vllm_config.model_config.get_vocab_size()), and then simply use struct.pack.
I also think we should keep the block hashes as bytes (instead of int) to save serialization time when chaining a block's hash into the next block's input.
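
A minimal sketch of this proposal (hypothetical names; it assumes token IDs fit in 4 bytes, i.e. a vocab size below 2^32):

    import hashlib
    import struct

    def hash_block_bytes(parent_digest: bytes, token_ids: list[int]) -> bytes:
        # Fixed-width big-endian packing: no general-purpose serializer is
        # needed, and the byte layout is trivial to reproduce in any language.
        packed = struct.pack(f">{len(token_ids)}I", *token_ids)
        # Keeping digests as bytes avoids int round-trips when chaining
        # a block's hash into its successor's input.
        return hashlib.sha256(parent_digest + packed).digest()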

Another thing I don't understand is how this works given that we have a non-deterministic initial NONE_HASH. Why do we need non-deterministic initialization in the first place?

@vMaroon (Author) commented Jul 8, 2025

Thanks Or @orozery, I agree with all these points. I wanted to limit the scope of this PR to adding a reproducible hashing algorithm and follow-up with minor refactoring of the hashing functions and KVEvents schema:

  1. On input typing to the hash function:

    def hash_block_tokens(
            hash_function: Callable,
            parent_block_hash: Optional[int],
            curr_block_token_ids: Sequence[int],
            extra_keys: Optional[tuple[Any, ...]] = None) -> BlockHash:
        ...
        hash_function((parent_block_hash, curr_block_token_ids_tuple, extra_keys))
    

    extra_keys should be explicitly typed and the whole input explicitly defined for robustness and clarity.
    This is part of the mentioned RFC in goals:

    1. Ensure Reproducible Block Hashing Across Languages:
      • Defined structure for input objects (e.g., token arrays, extra_keys)
  2. NONE_HASH is non-deterministic if you do not explicitly set PYTHONHASHSEED. This is common practice for seeding the builtin hash function, but I think an env var with a better name could be introduced

  3. The KVEvents block_hashes type must also change to bytes to allow >64-bit hashes through (msgpack limits ints to 64 bits)
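
For context, a rough sketch (hypothetical; the PR's actual init_none_hash may differ) of the configurable NONE_HASH initialization described in point 2:

    import os
    from typing import Callable

    NONE_HASH: int

    def init_none_hash(hash_fn: Callable) -> None:
        # Deterministic only when PYTHONHASHSEED is set; otherwise a random
        # seed is drawn, which is the non-determinism discussed above.
        global NONE_HASH
        seed = os.getenv("PYTHONHASHSEED")
        if seed is None:
            NONE_HASH = int.from_bytes(os.urandom(8), "big")
        else:
            NONE_HASH = hash_fn(seed)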

@njhill requested a review from heheda12345 July 8, 2025 13:29
@njhill (Member) commented Jul 8, 2025

I think the primary consideration is performance. This is the reason that sha256 wasn't made the default - see discussion in #15297.

cc @comaniac @dr75 @heheda12345

@dr75 (Contributor) commented Jul 8, 2025

Yes, it wasn't made the default because of performance (in my opinion the impact is small enough to accept sha256; see the measurements made with a 50k-token context in #15297).

The reason for the non-determinism is to prevent exploitation of hash collisions when using hash() (which predates sha256) to extract the context of other users in a multi-tenant environment. See #12621. @russellb

With sha256 that should not be an issue. With sha256 truncated to only 64 bits it might be a problem, I guess.

@vMaroon (Author) commented Jul 8, 2025

I think we should have better configurability of serializers, hashers, and truncation than this PR introduces, but I expect such work to require a larger time window for acceptance.

The truncation here was done because the result remains a valid algorithm while avoiding a bug in KVEvents (my immediate interest in this work) without changing the schema (also for scoping and end-to-end time-investment considerations).

Migrating to canonical and reproducible algorithms as defaults is part of the RFC but is not urgent.

@vMaroon (Author) commented Jul 12, 2025

The force push rebases on main.

@tlrmchlsmth (Member) left a comment

LGTM

@heheda12345 (Collaborator) commented:

Can you import cbor2 inside the sha256_cbor_64bit function to fix the doc build failure? https://app.readthedocs.org/projects/vllm/builds/28834408/#278229768--52

@vMaroon (Author) commented Jul 13, 2025

@heheda12345 (Collaborator) commented:

Retrying.

@heheda12345 enabled auto-merge (squash) July 14, 2025 02:22
@heheda12345 merged commit 66f6fbd into vllm-project:main Jul 14, 2025
99 checks passed