[Core] Use sha256 bytes instead of BlockHash to reduce GC overhead #23673
Conversation
Thanks @njhill!! Addressed comments.
Thanks @linzebing! Great work!
@linzebing looks like there are some other tests that need adjusting: https://buildkite.com/vllm/ci/builds/29810#019929f3-0b94-47a6-af38-31ed247f0e5a
@njhill: done.
…llm-project#23673) Signed-off-by: linzebing <linzebing1995@gmail.com>
### What this PR does / why we need it? 1. Initial support disable tp for integrating with [vllm-commit](vllm-project/vllm#23024) 2. [vllm@commit](vllm-project/vllm#23673) now use `bytes` to save the `BlockHash` to reduce GC overhead, this pr add the integration - vLLM version: main - vLLM main: vllm-project/vllm@e408272 --------- Signed-off-by: wangli <wangli858794774@gmail.com>
Purpose
The current `BlockHash` design stores the hash value alongside a tuple of all token IDs and any extra keys to prevent collisions, and `BlockHashWithGroupId` layers a group identifier on top of that structure. Every block therefore retains a tuple of Python integers for its tokens, creating many short-lived objects. This design reduces collision risk, but it can significantly increase memory pressure and trigger frequent garbage collection.

@Jialin's analysis: with an input prompt of 1500 tokens and a block_size of 16, there are floor(1500/16) = 93 blocks per request on average. As shown above, 4 objects are collected per block, so 372 objects are collected per request. Given that the default GC threshold is 700 (i.e., a GC collection may be kicked off once more than 700 objects have accumulated), roughly every 700/372 ≈ 1.88 requests can kick off a GC collection event, which is too frequent.

Replacing these classes with `bytes` hashes (e.g., the raw digest, and the digest with `<group_id>` appended) eliminates the per-block token tuple and the nested NamedTuples, greatly reducing object counts. A full 256-bit SHA-256 digest makes collisions virtually impossible, so token IDs and extra keys no longer need to be embedded in the hash structure.

A `bytes` digest is preferred over `str` as it avoids the extra work of converting the SHA-256 digest into a string, while also being more memory efficient. Replacing hash_fn from `hash` to `sha256` incurs extra CPU overhead, but benchmarking shows that it is worth the effort, as it greatly reduces the number of short-lived objects and therefore reduces GC overhead.

Test Plan
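A minimal sketch of the bytes-based hashing described above. The function names and the pickle-based serialization here are illustrative assumptions, not the exact vLLM implementation:

```python
import hashlib
import pickle
from typing import Optional


def hash_block_tokens(parent_hash: Optional[bytes],
                      token_ids: tuple,
                      extra_keys=None) -> bytes:
    # Chain the parent block's digest so identical token windows under
    # different prefixes still produce distinct hashes. The result is a
    # flat 32-byte value instead of a NamedTuple holding a token tuple.
    payload = (parent_hash, token_ids, extra_keys)
    return hashlib.sha256(pickle.dumps(payload)).digest()


def with_group_id(block_hash: bytes, group_id: int) -> bytes:
    # The grouped variant is just the digest with the group id appended,
    # replacing the nested BlockHashWithGroupId structure.
    return block_hash + group_id.to_bytes(4, "little")
```

Because the digest already commits to the token IDs and extra keys, no per-block tuple of Python ints needs to be kept alive after hashing.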
Performance (H100)
Used @Jialin's Jialin#4 for GC analysis.
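A minimal way to observe GC frequency in the spirit of that analysis (this is hypothetical instrumentation, not the code from Jialin#4):

```python
import gc
import time

gc_events = []


def on_gc(phase, info):
    # CPython invokes registered callbacks at the start and stop of
    # every collection; info carries the generation being collected.
    if phase == "start":
        on_gc.t0 = time.perf_counter()
    else:  # phase == "stop"
        gc_events.append((info["generation"],
                          time.perf_counter() - on_gc.t0))


gc.callbacks.append(on_gc)

# Simulate per-request allocation of short-lived per-block objects
# (93 blocks per request, several container objects per block).
for _ in range(100):
    blocks = [((i, i + 1), [i] * 16) for i in range(93)]

gc.collect()  # force a final collection so at least one event is recorded
gc.callbacks.remove(on_gc)
print(f"{len(gc_events)} collections observed")
```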
Hash methods benchmark: https://gist.github.com/linzebing/bd51f09a183bfd53c7b856ebf412fe68
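A rough microbenchmark along the lines of the linked gist (the payloads and methodology in the gist may differ; this only illustrates the per-call comparison):

```python
import hashlib
import pickle
import timeit

tokens = tuple(range(16))  # one block's worth of token IDs


def builtin_hash():
    return hash(tokens)


def sha256_bytes():
    return hashlib.sha256(pickle.dumps(tokens)).digest()


n = 100_000
t_hash = timeit.timeit(builtin_hash, number=n) / n
t_sha = timeit.timeit(sha256_bytes, number=n) / n
print(f"hash():   {t_hash * 1e6:.2f} us/call")
print(f"sha256(): {t_sha * 1e6:.2f} us/call")
```

The per-call cost of `sha256` is higher, but it is paid once per block, while the object-count savings apply to every subsequent GC pass.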
Eval
Test Result
Performance
input len = 1500, output len = 150
5.33% throughput improvement
main:
Throughput: 105.25 requests/s, 173665.94 total tokens/s, 15787.94 output tokens/s
linzebing/gc_bytes:
Throughput: 110.86 requests/s, 182910.25 total tokens/s, 16628.46 output tokens/s
Both the number of GC events and the number of outlier GC events decreased.

input len = 1500, output len = 1
Ran the above command 3 times: 1.43% throughput improvement
main:
Throughput: 235.92 requests/s, 354106.22 total tokens/s, 235.92 output tokens/s
linzebing/gc_bytes:
Throughput: 239.29 requests/s, 359170.53 total tokens/s, 239.29 output tokens/s
Both the number of GC events and the number of outlier GC events decreased.

benchmark_hash_methods.py results: `sha256` is around 1 μs slower per call, however the GC savings are much greater.

Eval

main:

linzebing/gc_bytes: