fix: NIXL connector transfers partial block to pass full multi-modal context #21074
Conversation
Code Review
This pull request fixes an issue where partial KV blocks were not transferred in the NIXL connector, potentially leading to incomplete multi-modal context. The changes correctly adjust the logic to include partial blocks in the transfer, both on the producer and consumer side. The implementation looks correct and addresses the described problem. I've pointed out one critical issue regarding a potentially incorrect assertion that could lead to a crash in certain scenarios involving prefix caching.
@robertgshaw2-redhat I saw you wrote the initial implementation and I would love to know your opinion on changing the behavior from skipping the partial last block to transferring it.
Do you mean the encoder cache here? Or are you referring to the non-llava impl with cross blocks (mllama)?
@NickLucche I haven't considered the above for this change; can you elaborate or point me to the places where I can learn more about it? My understanding of multi-modal inference is that after the prefill stage, the context of the prompt (with the image template applied) and the image is captured in the KV blocks allocated for the request. Therefore, as long as the downstream (decode) worker can access all the blocks from the upstream (prefill) worker, it has sufficient information to continue the generation stage. This change ensures that all blocks are transferred, even if the last block is partially used, and it does let the decode worker produce a proper response.
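To make the full vs. partial block distinction concrete, here is a small illustrative sketch (not vLLM code; the block size and token count are made-up numbers):

```python
# Illustrative only: how many KV blocks a prompt occupies, and how many of
# them are completely full.
block_size = 16
num_prompt_tokens = 50  # made-up example

num_full_blocks = num_prompt_tokens // block_size       # 3 full blocks (48 tokens)
num_total_blocks = -(-num_prompt_tokens // block_size)  # 4 blocks total (ceil division)

# The decode worker needs access to all 4 blocks to have the complete
# context; the last block holds the remaining 2 tokens.
print(num_full_blocks, num_total_blocks)  # 3 4
```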
(force-pushed from 4d4e68b to eaadf83)
NickLucche
left a comment
Yes, after the encoder is run, the language_model part works exactly the same way.

> However, for decode worker, the multi-modal input will not be passed,
Had a hard time figuring out what you meant here.
I see now you mean you want to transfer the blocks regardless of whether a full page is reached because image tokens are blocking.
I am not sure this is the best default for regular non-MM models where this optimization was saving a few transfers.
We probably want to run a few benchmarks or just do this for mm.
Thanks for spotting this and for contributing!
Let me know what is the best way to proceed at this point. I do see that this optimization saves the last transfer in exchange for letting decode re-compute the last block of tokens. Yes, we should run a few benchmarks to understand the trade-off numerically. But for disaggregation, the ISL is typically long for it to be beneficial, which means there will already be a few full blocks to transfer, so adding the last partial block to the transfer may not be that bad.
After thinking a bit more, I think the point is for the decode worker to successfully reconstruct the same KV cache as what is in the prefill worker. The optimization works perfectly fine for non-MM models because the decode worker can re-compute the missing KV cache from the prompt, whereas in MM models the encoding stage can't be redone, unless the decode worker also receives the request with the MM input? But I am afraid that means repeating the MM input processing in the decode worker. Given that, I will still advocate for always transferring all KV cache from prefill, which better decouples the language model part from any processing before it (i.e. the encoding). Then, for disaggregation, the decode worker only needs the final prompt processed by prefill (from prefill_response.prompt) and the full KV cache, no matter whether the model is an MM model or not.
I agree, what you're saying makes a lot of sense for MM.
I had to do a similar thing last week to transfer all blocks, including the partial ones. Your implementation is, I think, incomplete. You need to turn this piece (vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py, lines 255 to 257 in eaadf83) into this:

```python
len_prompt = len(request.prompt_token_ids)
num_computed_tokens = len_prompt - num_external_tokens
assert num_computed_tokens % self.block_size == 0
start_block = num_computed_tokens // self.block_size
local_block_ids = flatten_2d_lists(blocks.get_block_ids())[start_block:]
```

You'll also need to add this import at the top, of course:

```python
from vllm.utils import flatten_2d_lists
```
@hasB4K Do you mind helping me understand this better? Am I understanding it correctly that the issue is this: when the decode worker has previously processed a request with a shorter prompt (say 48 tokens and block size 32), the 2nd block will have a hash value and will not be returned by this function, and then when processing another request with the same shorter prompt as a prefix (say 72 tokens), the 2nd block will not be transferred?
I am also curious about your use case, is it also related to multi-modal?
Yes exactly 😉
No, I was just trying to optimize the last block transfer to avoid re-doing the prefill. But since you raised this issue with multi-modal, I'm pretty sure your PR is mandatory.
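For reference, a tiny sketch of the block arithmetic in the scenario above (illustrative only, using the token counts from the question):

```python
block_size = 32

def block_layout(prompt_len: int, block_size: int):
    """Return (num_full_blocks, tokens_in_partial_last_block)."""
    return prompt_len // block_size, prompt_len % block_size

print(block_layout(48, block_size))  # (1, 16): block 0 full, block 1 half full
print(block_layout(72, block_size))  # (2, 8):  blocks 0-1 full, block 2 partial
```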
@NickLucche I have gathered some simple runs with
Focusing on TTFT, as that is what sending the partial block or not affects the most. You can see that the numbers before and after the change are not significantly different.
After change (transfer all)
Before change (skip partial)
Hi @hasB4K, after going deeper into the implementation, I think the change you proposed may not be needed. Let me know if there is a misunderstanding in my reasoning, and it would be great if there is a simple test case I can experiment with to visualize what I have missed, if the change is indeed needed.
This pull request has merge conflicts that must be resolved before it can be merged.
…i-modal context to downstream worker Signed-off-by: GuanLuo <gluo@nvidia.com>
Signed-off-by: GuanLuo <gluo@nvidia.com>
Signed-off-by: GuanLuo <gluo@nvidia.com>
(force-pushed from eaadf83 to c9cbab6)
@NickLucche do you mind giving another round of review when you have a chance? It would be great if this could be merged this week, as it affects correctness and we want to include the fix in our next release.
I double-checked by running some tests by adding this:
And you are right, it seems that my implementation was not necessary, since it's always equal 😅.
NickLucche
left a comment
Hey @GuanLuo thanks for running the benchmarks and apologies for the late response! Are you running on IB-connected nodes btw?
Let's get a second opinion on this @robertgshaw2-redhat , otherwise this LGTM thanks for the work!
Yes
njhill
left a comment
Thanks @GuanLuo! It looks mostly good to me.
I think a change here may also be required for the intermediate-host-buffer case (used for TPU). cc @juncgu
Also is there any chance you could add a test / extend the existing nixl connector tests to cover this? I.e. that tokens from partial blocks are now transferred rather than recomputed on decode side?
@njhill I have updated the test cases to reflect the new behavior. I made a somewhat big change to test_cannot_schedule_after_recv.
Previously the test was testing the old behavior where the last partial block is not transferred, so after the transfer finishes, the immediate step (step 4) requests a new block (for the rest of the tokens) when looping through waiting requests and gets skipped due to insufficient blocks. In other words, the request stays in the waiting list. But with the new behavior, once the transfer is initiated, the request has allocated blocks for all tokens, so it is moved to the running list as soon as the transfer finishes. This introduces new steps where the request gets moved but can't proceed, as it needs a new block for the generated token, and gets moved back to the waiting list, so we have (waiting -> running -> waiting). Note that I changed the prompt size for the remote request as well so that the request follows the above flow; if the prompt size resulted in a partial block, the remote request could be scheduled immediately after the transfer, as the partial block can hold the generated tokens.
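To illustrate that last point, a minimal sketch (not the actual test code; the block size and prompt lengths are made-up):

```python
block_size = 16

def needs_new_block_for_first_decode_token(prompt_len: int) -> bool:
    # A prompt that exactly fills its blocks leaves no room for the first
    # generated token, so the scheduler must allocate a new block; a partial
    # last block still has free slots for it.
    return prompt_len % block_size == 0

print(needs_new_block_for_first_decode_token(48))  # True  -> may bounce waiting -> running -> waiting
print(needs_new_block_for_first_decode_token(50))  # False -> can keep running right after the transfer
```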
Signed-off-by: GuanLuo <gluo@nvidia.com>
NickLucche
left a comment
Left a few comments for minor stuff.
Re: test_cannot_schedule_after_recv, apologies if I misread your changes, but I think the test is now functionally different.
Can we either:
- a) increase tokens so that we're actually testing the not-enough-blocks case
- b) do a separate case with the last partial block now being transferred after your changes, but keep a) to test the old intended behavior?
Thanks for the great work here!
I think it makes sense to add (a) in addition to the current test cases; with the current change, the KV transfer will not be scheduled until the previous request is finished, due to insufficient blocks.
Signed-off-by: GuanLuo <gluo@nvidia.com>
Signed-off-by: GuanLuo <gluo@nvidia.com>
NickLucche
left a comment
Awesome!
I think the name could be a bit less general than test_cannot_recv but happy with the change :)
Awesome! @njhill @robertgshaw2-redhat can I get another round of review from you?
njhill
left a comment
Thanks @GuanLuo!
…context (vllm-project#21074) Signed-off-by: GuanLuo <gluo@nvidia.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
…context (vllm-project#21074) Signed-off-by: GuanLuo <gluo@nvidia.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
…context (vllm-project#21074) Signed-off-by: GuanLuo <gluo@nvidia.com>
…context (vllm-project#21074) Signed-off-by: GuanLuo <gluo@nvidia.com>
…context (vllm-project#21074) Signed-off-by: GuanLuo <gluo@nvidia.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
…context (vllm-project#21074) Signed-off-by: GuanLuo <gluo@nvidia.com>
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
In the disaggregation case, the prefill worker receives the multi-modal input (say, an image), and the embedding is processed and stored in the KV cache. However, the multi-modal input is not passed to the decode worker, and in that case the decode worker relies on the KV cache transfer to obtain the multi-modal context.
The previous implementation only transferred full KV blocks, which may result in incomplete context when part of it sits in the last, partially filled block. This change simply always transfers all blocks to work around that.
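For illustration, a minimal sketch of the behavioral change (not the actual connector code; names such as `num_blocks_to_transfer` are hypothetical and the numbers are made up):

```python
import math

block_size = 16
num_prompt_tokens = 50

# Old behavior: only full blocks were transferred, leaving the trailing
# partial block (tokens 48-49 here) to be recomputed by the decode worker,
# which is not possible for multi-modal context.
old_num_blocks_to_transfer = num_prompt_tokens // block_size            # 3

# New behavior: all allocated blocks are transferred, including the partial
# last block, so the decode worker receives the complete context.
new_num_blocks_to_transfer = math.ceil(num_prompt_tokens / block_size)  # 4
```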
Test Plan
Test Result
(Optional) Documentation Update