[Bugfix] Fix several issues with p2p xPyD in GET type #23993

Csrayz · 2025-08-30T15:39:45Z

Purpose

Fix several issues with p2p xPyD in GET type:

In GET mode, when the configured GPU buffer size is right at the size of the KV cache, a request whose tensor is larger than the buffer evicts every stored tensor. The eviction loop then calls next(iter(self.send_store)) on an empty store, which raises StopIteration.
The call to recv_tensor omits the remote_address of the P node, so kv_cache is always None.
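
The first failure can be reproduced in isolation. The sketch below is a simplified stand-in, not the vLLM code: `send_store` is modelled as a plain dict mapping a request id to a byte count, and `evict_until_fits`, the sizes, and the threshold are all hypothetical.

```python
def evict_until_fits(send_store: dict[str, int], used: int,
                     incoming: int, threshold: int) -> int:
    """Evict the oldest entries until `incoming` bytes fit under `threshold`.

    Illustrative model of the pre-fix eviction loop; names are assumptions.
    """
    while used + incoming > threshold:
        # Once every entry has been evicted, iter(send_store) is exhausted
        # and next() raises StopIteration.
        oldest = next(iter(send_store))
        used -= send_store.pop(oldest)
    return used


store = {"req-0": 40, "req-1": 40}
try:
    # The incoming tensor (120) exceeds the threshold (100) on its own,
    # so no amount of eviction can make room for it.
    evict_until_fits(store, used=80, incoming=120, threshold=100)
    outcome = "ok"
except StopIteration:
    outcome = "StopIteration"

print(outcome, store)
```

Because the incoming tensor alone is larger than the threshold, the loop condition stays true after the store has been drained, which is why the store ends up empty before the exception surfaces.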

Test Plan

Test Result


gemini-code-assist

Code Review

This pull request effectively addresses a crash caused by an empty send_store during tensor eviction in GET mode. The core of the fix is the introduction of a pre-check that rejects tensors larger than the buffer threshold, which correctly prevents the problematic eviction loop. The addition of an assert to ensure the store is not empty during eviction is a good defensive measure. Furthermore, the code is cleaner due to the correction of several typos in variable names. The changes are logical and correctly resolve the bug.
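
A minimal sketch of the guard described in this review, under stated assumptions: the store is modelled as an `OrderedDict` of byte counts, and `try_store` and its parameters are hypothetical names rather than the connector's actual API.

```python
from collections import OrderedDict


def try_store(send_store: "OrderedDict[str, int]", used: int, tensor_id: str,
              tensor_size: int, threshold: int) -> tuple[bool, int]:
    """Return (stored, used_bytes). Hypothetical helper, not the vLLM API."""
    if tensor_size > threshold:
        # Pre-check: the tensor can never fit, so reject it up front and
        # leave the existing buffer contents untouched.
        return False, used
    while used + tensor_size > threshold:
        # Defensive check: with the pre-check in place the store should
        # never run empty while space is still needed.
        assert len(send_store) > 0, "send_store is empty during eviction"
        _, evicted_size = send_store.popitem(last=False)  # evict the oldest
        used -= evicted_size
    send_store[tensor_id] = tensor_size
    return True, used + tensor_size


store = OrderedDict([("req-0", 40), ("req-1", 40)])
rejected = try_store(store, 80, "too-big", 120, 100)
accepted = try_store(store, 80, "req-2", 50, 100)
print(rejected, accepted, list(store))
```

Rejecting the oversized tensor before the loop means eviction only runs when it can succeed, so the assert documents an invariant instead of acting as the primary safeguard.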

Csrayz · 2025-08-30T15:58:32Z

@Abatom Please review.

Additionally, I have a question. While we understand the GET type is not recommended due to performance concerns, I've observed that when using the GET type, TensorMemoryPool is not utilized as a secondary cache for GPU buffers. Furthermore, in this configuration, TensorMemoryPool is still instantiated, consuming some memory. Could you please let me know if there are any plans to implement TensorMemoryPool as a secondary cache for GPU buffers in future releases?

Abatom · 2025-08-31T01:01:03Z

Thanks, I'll review this PR shortly.

Abatom · 2025-08-31T01:08:10Z

> @Abatom Please review.
>
> Additionally, I have a question. While we understand the GET type is not recommended due to performance concerns, I've observed that when using the GET type, TensorMemoryPool is not utilized as a secondary cache for GPU buffers. Furthermore, in this configuration, TensorMemoryPool is still instantiated, consuming some memory. Could you please let me know if there are any plans to implement TensorMemoryPool as a secondary cache for GPU buffers in future releases?

Currently, the memory pool size can be reduced via the mem_pool_size_gb configuration. Because GET mode performs worse than PUT_ASYNC, we haven’t prioritized letting P instances use the memory pool yet; we’ll add this capability when time allows.

Csrayz · 2025-09-03T13:44:45Z

any update?

Abatom · 2025-09-03T23:51:45Z

@Csrayz Have you ever run the GET mode locally yourself?

Have you stress-tested it?

Did you encounter any garbled text?

During chunked prefill, was there any corruption?

When pre-emption happened, did you see garbled output or crashes?

If everything above looks good, just let me know and I’ll run the tests locally.

Csrayz · 2025-09-04T02:36:54Z

The code was submitted only after it ran correctly locally. The test environment is 4x NVIDIA A10.
We tested with vllm benchmark_server.py at several request rates, covering both low-load and overload conditions.
We saw no garbled text in the responses to curl requests. When using benchmark_server, we did not inspect the response content.
We found no data corruption during chunked prefill, and we did not check whether any preemption occurred during testing.
The original GET implementation had a bug that prevented KV transfer between P and D instances. With the fix applied, testing with curl and scripts produced no crashes.

Abatom · 2025-09-04T08:47:33Z

@Csrayz I'll run this PR.

Abatom · 2025-09-11T01:44:28Z

@Csrayz Thanks, LGTM. I ran this PR and it works. cc @simon-mo

Csrayz · 2025-09-16T11:01:13Z

All work is complete, waiting for your review. @NickLucche

NickLucche

Looks good. I would just point out that having tests here would go a long way toward ensuring the functionality of these changes.

Csrayz · 2025-09-22T10:22:47Z

Yes, unit tests are needed to ensure that the program handles boundary cases correctly. However, I have been quite busy recently, so I may need to add the relevant unit tests separately at a later time.

NickLucche · 2025-09-22T13:04:56Z

I think this is fine for now since it's a bugfix, but please revisit it later, whenever suits you best. Let's get this merged.
Thanks for contributing @Csrayz !

cc @Abatom

Csrayz added 3 commits August 30, 2025 23:11

[FIX] Skip when the send buffer is insufficient to hold the current tensor (5b7244c)
Signed-off-by: Csrayz <jover@cmbchina.com>

[FIX] Fix tensor spelling errors (fa1b02e)
Signed-off-by: Csrayz <jover@cmbchina.com>

[FEAT] Check in advance whether tensor_size is greater than buffer_size_threshold (cf5d306)
This can avoid clearing the buffer due to the tensor being too large.
Signed-off-by: Csrayz <jover@cmbchina.com>

gemini-code-assist bot reviewed Aug 30, 2025

Merge branch 'main' into fix_p2pconn

dca0746

ivyilike force-pushed the fix_p2pconn branch from 2b982d4 to 2fc9cde Compare September 1, 2025 08:03

[Bugfix] fix p2p nccl's bug with "GET" model

e1efc27

Signed-off-by: ivyilike <pww123@cmbchina.com>

ivyilike force-pushed the fix_p2pconn branch from 2fc9cde to e1efc27 Compare September 1, 2025 08:36

Csrayz changed the title ~~[Bugfix] Fix the program crash caused by send_store being empty.~~ [Bugfix] Fix several issues with p2p xPyD in GET type Sep 1, 2025

[Bugfix] fix p2p nccl's bug with "GET" model

ba8e20e

Signed-off-by: ivyilike <pww123@cmbchina.com>

Abatom reviewed Sep 10, 2025

vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py (2 review threads, outdated, resolved)

[FEAT] Set the log level to an appropriate level

b15a634

Signed-off-by: Csrayz <jover@cmbchina.com>

Merge branch 'main' into fix_p2pconn

e684126

Csrayz requested a review from NickLucche as a code owner September 15, 2025 14:18

NickLucche approved these changes Sep 16, 2025

mergify bot added the kv-connector label Sep 18, 2025

NickLucche enabled auto-merge (squash) September 22, 2025 13:05

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 22, 2025

NickLucche merged commit c10101a into vllm-project:main Sep 22, 2025
59 checks passed
