@linzebing linzebing commented Aug 26, 2025

Purpose

The current BlockHash design stores the hash value alongside a tuple of all token IDs and any extra keys to prevent collisions, and BlockHashWithGroupId layers a group identifier on top of that structure. Every block therefore retains a tuple of Python integers for its tokens, creating many short‑lived objects. This design reduces collision risk, but it can significantly increase memory pressure and trigger frequent garbage collection.
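A minimal sketch of the before/after object shapes described here. The NamedTuple field layout is an assumption for illustration, not vLLM's exact definition:

```python
from typing import NamedTuple, Optional

# Illustrative pre-change structures (names from the PR text; the exact
# field layout here is assumed, not copied from vLLM).
class BlockHash(NamedTuple):
    hash_value: int                      # output of the hash function
    token_ids: tuple                     # all token IDs in the block
    extra_keys: Optional[tuple] = None   # e.g. LoRA ID, multimodal hashes

class BlockHashWithGroupId(NamedTuple):
    block_hash: BlockHash
    group_id: int

# Each cached block keeps a tuple of Python ints plus two NamedTuples:
# many small, GC-tracked heap objects.
old = BlockHashWithGroupId(BlockHash(12345, tuple(range(16))), group_id=0)

# After the change, a block hash is a single flat bytes object.
new = b"\x9f" * 32  # e.g. a 32-byte SHA-256 digest
```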


@Jialin's analysis: on average there are floor(1500/16) = 93 blocks per request when the input prompt is 1500 tokens and block_size is 16. As shown above, 4 objects must be collected per block, so there are 372 objects to collect per request. The default generation-0 GC threshold is 700 (i.e. a collection may be kicked off once more than 700 objects have been allocated), so roughly every 700/372 ≈ 1.88 requests can trigger a GC event, which is too frequent.
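The back-of-envelope numbers in that analysis can be reproduced directly:

```python
# Back-of-envelope from the analysis in the PR description.
input_len, block_size = 1500, 16
blocks_per_request = input_len // block_size           # floor(1500/16) = 93
objects_per_block = 4                                  # per the profiling above
objects_per_request = blocks_per_request * objects_per_block  # 372

# CPython's default generation-0 threshold (gc.get_threshold()[0]) is 700:
# a collection may run once ~700 more objects are allocated than freed.
gen0_threshold = 700
requests_per_gc = gen0_threshold / objects_per_request
print(f"{blocks_per_request} blocks -> {objects_per_request} objects; "
      f"GC roughly every {requests_per_gc:.2f} requests")
# -> 93 blocks -> 372 objects; GC roughly every 1.88 requests
```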

Replacing these classes with raw bytes hashes (e.g., a SHA-256 digest plus the group id) would eliminate the per-block token tuple and nested NamedTuples, greatly reducing object counts. A full 256-bit SHA-256 digest makes collisions virtually impossible, so token IDs and extra keys no longer need to be embedded in the hash structure.

A bytes digest is preferred over str because it avoids the extra work of converting the SHA-256 digest into a string and is more memory efficient. Replacing hash_fn from the built-in hash with sha256 incurs extra CPU overhead, but benchmarking shows it is worth the effort: it greatly reduces the number of short-lived objects and therefore the GC overhead.
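A hypothetical sketch of such a bytes-based scheme; the function name and exact payload layout below are assumptions for illustration, not vLLM's actual implementation:

```python
import hashlib
import pickle

# Hypothetical sketch: fold the parent block's digest, the block's token
# IDs, and any extra keys into one flat bytes object.
def hash_block_tokens(parent_hash: bytes, token_ids: tuple, extra_keys=None) -> bytes:
    payload = pickle.dumps((parent_hash, token_ids, extra_keys),
                           protocol=pickle.HIGHEST_PROTOCOL)
    # .digest() returns 32 raw bytes; no hex-string conversion is needed,
    # and no per-block tuple of token IDs has to be retained afterwards.
    return hashlib.sha256(payload).digest()

root = hash_block_tokens(b"", tuple(range(16)))
child = hash_block_tokens(root, tuple(range(16, 32)))  # chained block hashes
```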

Test Plan

Performance (H100)

Used @Jialin's Jialin#4 for GC analysis.

CUDA_VISIBLE_DEVICES=7 VLLM_USE_V1=1 HF_HUB_DISABLE_XET=1 python3 benchmarks/benchmark_throughput.py \
    --model facebook/opt-125m \
    --backend vllm \
    --input-len 1500 \
    --output-len 150 \
    --num-prompts 10000
CUDA_VISIBLE_DEVICES=7 VLLM_USE_V1=1 HF_HUB_DISABLE_XET=1 python3 benchmarks/benchmark_throughput.py \
    --model facebook/opt-125m \
    --backend vllm \
    --input-len 1500 \
    --output-len 1 \
    --num-prompts 20000

Hash methods benchmark: https://gist.github.com/linzebing/bd51f09a183bfd53c7b856ebf412fe68
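A minimal re-creation of the kind of comparison in the linked gist (timings are machine-dependent; treat any numbers as illustrative only):

```python
import hashlib
import timeit

data_bytes = b"\x01" * 32       # 32-byte bytes object
data_tuple = tuple(range(32))   # 32-int tuple
N = 10_000

# Average per-op cost of each hash method over N iterations.
t_sha256 = timeit.timeit(lambda: hashlib.sha256(data_bytes).digest(), number=N)
t_builtin = timeit.timeit(lambda: hash(data_tuple), number=N)

print(f"sha256:        {t_sha256 / N * 1e6:.2f} us/op")
print(f"built-in hash: {t_builtin / N * 1e6:.2f} us/op")
```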

Eval

HF_HUB_DISABLE_XET=1 lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1 --tasks gsm8k  --num_fewshot 5 --batch_size auto --limit 250

Test Result

Performance

input len = 1500, output len = 150

5.33% throughput improvement
main: Throughput: 105.25 requests/s, 173665.94 total tokens/s, 15787.94 output tokens/s
linzebing/gc_bytes: Throughput: 110.86 requests/s, 182910.25 total tokens/s, 16628.46 output tokens/s

Both the number of GC events and the number of outlier GC events decreased.

input len = 1500, output len = 1

Ran the above command 3 times: 1.43% throughput improvement
main: Throughput: 235.92 requests/s, 354106.22 total tokens/s, 235.92 output tokens/s
linzebing/gc_bytes: Throughput: 239.29 requests/s, 359170.53 total tokens/s, 239.29 output tokens/s

Both the number of GC events and the number of outlier GC events decreased.

benchmark_hash_methods.py results:

============================================================
HASH FUNCTION PERFORMANCE BENCHMARK
============================================================
Test data: (32-byte bytes object, 32-int tuple)
Iterations: 10,000
============================================================

Results:
  SHA256:     1.17 ±   0.57 μs
  SHA256_CBOR_64BIT:     6.37 ±   0.85 μs
  Built-in hash:     0.18 ±   0.14 μs

============================================================
SUMMARY
============================================================
• Built-in hash() is 6.4x faster than SHA256
• Built-in hash() is 34.8x faster than SHA256_CBOR_64BIT
• SHA256: 1.2μs per hash (cryptographically secure)
• SHA256_CBOR_64BIT: 6.4μs per hash (cryptographically secure, cross-language compatible)
• Built-in: 0.2μs per hash (fast, not secure)

So sha256 is around 1 μs slower per hash; however, the GC savings far outweigh this cost.
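Using the benchmark numbers above, the extra hashing cost can be put in per-request terms (a rough, machine-dependent estimate):

```python
# Back-of-envelope: extra hashing cost per request at input_len=1500,
# block_size=16, using the per-hash timings measured above.
blocks_per_request = 1500 // 16            # 93
extra_us_per_hash = 1.17 - 0.18            # sha256 minus built-in hash, in us
extra_ms_per_request = blocks_per_request * extra_us_per_hash / 1000
print(f"~{extra_ms_per_request:.3f} ms extra hashing per request")
# -> ~0.092 ms extra hashing per request
```

Roughly 0.1 ms of extra CPU per request, against GC pauses that fire about every other request without this change.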

Eval

main:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.792 | ± 0.0257 |
| | | strict-match | 5 | exact_match | 0.764 | ± 0.0269 |

linzebing/gc_bytes:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.792 | ± 0.0257 |
| | | strict-match | 5 | exact_match | 0.764 | ± 0.0269 |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the documentation (Improvements or additions to documentation) and v1 labels Aug 26, 2025
@linzebing linzebing changed the title [Core] Use sha256 bytes instead of BlockHash and BlockHashWithGroupId [Core] Use sha256 bytes instead of BlockHash to reduce GC overhead Aug 26, 2025
Signed-off-by: linzebing <linzebing1995@gmail.com>
@linzebing

Thanks @njhill !! Addressed comments.


@njhill njhill left a comment


Thanks @linzebing! Great work!

@njhill njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 8, 2025

njhill commented Sep 8, 2025

@linzebing looks like there are some other tests that need adjusting: https://buildkite.com/vllm/ci/builds/29810#019929f3-0b94-47a6-af38-31ed247f0e5a

Signed-off-by: linzebing <linzebing1995@gmail.com>
@linzebing

@njhill: done. The distributed-tests-2-gpus failures seem to be a test-infra issue; v1/e2e/test_spec_decode.py::test_ngram_correctness doesn't seem related, and I can't repro it locally. That test is non-deterministic, and the matched count can drop below the heuristic-based threshold when unlucky.

@vllm-bot vllm-bot merged commit 82dfb12 into vllm-project:main Sep 9, 2025
39 of 42 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
MengqingCao pushed a commit to vllm-project/vllm-ascend that referenced this pull request Sep 10, 2025
### What this PR does / why we need it?
1. Initial support disable tp for integrating with
[vllm-commit](vllm-project/vllm#23024)
2. [vllm@commit](vllm-project/vllm#23673) now
use `bytes` to save the `BlockHash` to reduce GC overhead, this pr add
the integration

- vLLM version: main
- vLLM main:
vllm-project/vllm@e408272

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
offline893 pushed a commit to offline893/vllm-ascend that referenced this pull request Sep 16, 2025
wangxiaoteng888 pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Sep 25, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…llm-project#23673)

Signed-off-by: linzebing <linzebing1995@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

documentation, performance, ready, v1

6 participants