@linzebing linzebing commented Aug 26, 2025

Purpose

The current BlockHash design stores the hash value alongside a tuple of all token IDs and any extra keys to prevent collisions, and BlockHashWithGroupId layers a group identifier on top of that structure. Every block therefore retains a tuple of Python integers for its tokens, creating many short‑lived objects. This design reduces collision risk, but it can significantly increase memory pressure and trigger frequent garbage collection.
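A minimal sketch of the before/after object shapes described here. The NamedTuple field layout is an assumption for illustration, not vLLM's exact definition:

```python
from typing import NamedTuple, Optional

# Illustrative pre-change structures (names from the PR text; the exact
# field layout here is assumed, not copied from vLLM).
class BlockHash(NamedTuple):
    hash_value: int                      # output of the hash function
    token_ids: tuple                     # all token IDs in the block
    extra_keys: Optional[tuple] = None   # e.g. LoRA ID, multimodal hashes

class BlockHashWithGroupId(NamedTuple):
    block_hash: BlockHash
    group_id: int

# Each cached block keeps a tuple of Python ints plus two NamedTuples:
# many small, GC-tracked heap objects.
old = BlockHashWithGroupId(BlockHash(12345, tuple(range(16))), group_id=0)

# After the change, a block hash is a single flat bytes object.
new = b"\x9f" * 32  # e.g. a 32-byte SHA-256 digest
```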


@Jialin's analysis: on average there are floor(1500/16) = 93 blocks per request when the input prompt is 1500 tokens and block_size is 16. As shown above, 4 objects must be collected per block, so there are 372 objects to collect per request. The default generation-0 GC threshold is 700 (i.e. a collection may be kicked off once more than 700 objects have been allocated), so roughly every 700/372 ≈ 1.88 requests can trigger a GC event, which is too frequent.
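The back-of-envelope numbers in that analysis can be reproduced directly:

```python
# Back-of-envelope from the analysis in the PR description.
input_len, block_size = 1500, 16
blocks_per_request = input_len // block_size           # floor(1500/16) = 93
objects_per_block = 4                                  # per the profiling above
objects_per_request = blocks_per_request * objects_per_block  # 372

# CPython's default generation-0 threshold (gc.get_threshold()[0]) is 700:
# a collection may run once ~700 more objects are allocated than freed.
gen0_threshold = 700
requests_per_gc = gen0_threshold / objects_per_request
print(f"{blocks_per_request} blocks -> {objects_per_request} objects; "
      f"GC roughly every {requests_per_gc:.2f} requests")
# -> 93 blocks -> 372 objects; GC roughly every 1.88 requests
```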

Replacing these classes with raw bytes hashes (e.g., a SHA-256 digest plus the group id) would eliminate the per-block token tuple and nested NamedTuples, greatly reducing object counts. A full 256-bit SHA-256 digest makes collisions virtually impossible, so token IDs and extra keys no longer need to be embedded in the hash structure.

A bytes digest is preferred over str because it avoids the extra work of converting the SHA-256 digest into a string and is more memory efficient. Replacing hash_fn from the built-in hash with sha256 incurs extra CPU overhead, but benchmarking shows it is worth the effort: it greatly reduces the number of short-lived objects and therefore the GC overhead.
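A hypothetical sketch of such a bytes-based scheme; the function name and exact payload layout below are assumptions for illustration, not vLLM's actual implementation:

```python
import hashlib
import pickle

# Hypothetical sketch: fold the parent block's digest, the block's token
# IDs, and any extra keys into one flat bytes object.
def hash_block_tokens(parent_hash: bytes, token_ids: tuple, extra_keys=None) -> bytes:
    payload = pickle.dumps((parent_hash, token_ids, extra_keys),
                           protocol=pickle.HIGHEST_PROTOCOL)
    # .digest() returns 32 raw bytes; no hex-string conversion is needed,
    # and no per-block tuple of token IDs has to be retained afterwards.
    return hashlib.sha256(payload).digest()

root = hash_block_tokens(b"", tuple(range(16)))
child = hash_block_tokens(root, tuple(range(16, 32)))  # chained block hashes
```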

Test Plan

Performance (H100)

Used @Jialin's Jialin#4 for GC analysis.

CUDA_VISIBLE_DEVICES=7 VLLM_USE_V1=1 HF_HUB_DISABLE_XET=1 python3 benchmarks/benchmark_throughput.py \
    --model facebook/opt-125m \
    --backend vllm \
    --input-len 1500 \
    --output-len 150 \
    --num-prompts 10000
CUDA_VISIBLE_DEVICES=7 VLLM_USE_V1=1 HF_HUB_DISABLE_XET=1 python3 benchmarks/benchmark_throughput.py \
    --model facebook/opt-125m \
    --backend vllm \
    --input-len 1500 \
    --output-len 1 \
    --num-prompts 20000

Hash methods benchmark: https://gist.github.com/linzebing/bd51f09a183bfd53c7b856ebf412fe68
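A minimal re-creation of the kind of comparison in the linked gist (timings are machine-dependent; treat any numbers as illustrative only):

```python
import hashlib
import timeit

data_bytes = b"\x01" * 32       # 32-byte bytes object
data_tuple = tuple(range(32))   # 32-int tuple
N = 10_000

# Average per-op cost of each hash method over N iterations.
t_sha256 = timeit.timeit(lambda: hashlib.sha256(data_bytes).digest(), number=N)
t_builtin = timeit.timeit(lambda: hash(data_tuple), number=N)

print(f"sha256:        {t_sha256 / N * 1e6:.2f} us/op")
print(f"built-in hash: {t_builtin / N * 1e6:.2f} us/op")
```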

Eval

HF_HUB_DISABLE_XET=1 lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1 --tasks gsm8k  --num_fewshot 5 --batch_size auto --limit 250

Test Result

Performance

input len = 1500, output len = 150

5.33% throughput improvement
main: Throughput: 105.25 requests/s, 173665.94 total tokens/s, 15787.94 output tokens/s
linzebing/gc_bytes: Throughput: 110.86 requests/s, 182910.25 total tokens/s, 16628.46 output tokens/s

Both the number of GC events and the number of outlier GC events decreased.

input len = 1500, output len = 1

Ran the above command 3 times: 1.43% throughput improvement
main: Throughput: 235.92 requests/s, 354106.22 total tokens/s, 235.92 output tokens/s
linzebing/gc_bytes: Throughput: 239.29 requests/s, 359170.53 total tokens/s, 239.29 output tokens/s

Both the number of GC events and the number of outlier GC events decreased.

benchmark_hash_methods.py results:

============================================================
HASH FUNCTION PERFORMANCE BENCHMARK
============================================================
Test data: (32-byte bytes object, 32-int tuple)
Iterations: 10,000
============================================================

Results:
  SHA256:     1.17 ±   0.57 μs
  SHA256_CBOR_64BIT:     6.37 ±   0.85 μs
  Built-in hash:     0.18 ±   0.14 μs

============================================================
SUMMARY
============================================================
• Built-in hash() is 6.4x faster than SHA256
• Built-in hash() is 34.8x faster than SHA256_CBOR_64BIT
• SHA256: 1.2μs per hash (cryptographically secure)
• SHA256_CBOR_64BIT: 6.4μs per hash (cryptographically secure, cross-language compatible)
• Built-in: 0.2μs per hash (fast, not secure)

So sha256 is around 1 μs slower per hash; however, the GC savings far outweigh this cost.
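Using the benchmark numbers above, the extra hashing cost can be put in per-request terms (a rough, machine-dependent estimate):

```python
# Back-of-envelope: extra hashing cost per request at input_len=1500,
# block_size=16, using the per-hash timings measured above.
blocks_per_request = 1500 // 16            # 93
extra_us_per_hash = 1.17 - 0.18            # sha256 minus built-in hash, in us
extra_ms_per_request = blocks_per_request * extra_us_per_hash / 1000
print(f"~{extra_ms_per_request:.3f} ms extra hashing per request")
# -> ~0.092 ms extra hashing per request
```

Roughly 0.1 ms of extra CPU per request, against GC pauses that fire about every other request without this change.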

Eval

main:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.792 | ± 0.0257 |
| | | strict-match | 5 | exact_match | 0.764 | ± 0.0269 |

linzebing/gc_bytes:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.792 | ± 0.0257 |
| | | strict-match | 5 | exact_match | 0.764 | ± 0.0269 |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the documentation (Improvements or additions to documentation) and v1 labels Aug 26, 2025
@linzebing linzebing changed the title [Core] Use sha256 bytes instead of BlockHash and BlockHashWithGroupId [Core] Use sha256 bytes instead of BlockHash to reduce GC overhead Aug 26, 2025
Signed-off-by: linzebing <linzebing1995@gmail.com>
@linzebing

Thanks @njhill !! Addressed comments.


@njhill njhill left a comment


Thanks @linzebing! Great work!

@njhill njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 8, 2025

njhill commented Sep 8, 2025

@linzebing looks like there are some other tests that need adjusting: https://buildkite.com/vllm/ci/builds/29810#019929f3-0b94-47a6-af38-31ed247f0e5a

Signed-off-by: linzebing <linzebing1995@gmail.com>
@linzebing

@njhill: done. The distributed-tests-2-gpus failures seem to be a test-infra issue; v1/e2e/test_spec_decode.py::test_ngram_correctness doesn't seem related, and I can't repro it locally. That test is non-deterministic, and the matched count can drop below the heuristic-based threshold when unlucky.

@vllm-bot vllm-bot merged commit 82dfb12 into vllm-project:main Sep 9, 2025
39 of 42 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
MengqingCao pushed a commit to vllm-project/vllm-ascend that referenced this pull request Sep 10, 2025
### What this PR does / why we need it?
1. Initial support disable tp for integrating with
[vllm-commit](vllm-project/vllm#23024)
2. [vllm@commit](vllm-project/vllm#23673) now
use `bytes` to save the `BlockHash` to reduce GC overhead, this pr add
the integration

- vLLM version: main
- vLLM main:
vllm-project/vllm@e408272

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
offline893 pushed a commit to offline893/vllm-ascend that referenced this pull request Sep 16, 2025
wangxiaoteng888 pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Sep 25, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…llm-project#23673)

Signed-off-by: linzebing <linzebing1995@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

documentation, performance, ready, v1

6 participants