[Feature] Support Prefill Context Parallel (PCP) for GQA flashinfer #26864
base: main
Conversation
💡 Codex Review
vllm/v1/worker/tpu_worker.py, lines 327 to 329 in 9f8290e:
```python
ensure_model_parallel_initialized(
    parallel_config.tensor_parallel_size,
    parallel_config.pipeline_parallel_size)
```
ensure_model_parallel_initialized now expects a context_model_parallel_size positional argument, but the TPU worker still calls it with only tensor and pipeline sizes. On TPU this call will immediately raise TypeError: ensure_model_parallel_initialized() missing 1 required positional argument, so the worker cannot start even when context parallelism is left at its default of 1. Forward the context-parallel size (e.g., parallel_config.context_parallel_size) or restore a default to keep TPU initialization functional.
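A minimal sketch of one possible fix, assuming the new third positional parameter described above and that parallel_config exposes a context_parallel_size attribute (falling back to 1 otherwise); this is not the actual patch:
```python
# Sketch only: forward the context-parallel size so the TPU worker call matches
# the updated ensure_model_parallel_initialized signature. getattr keeps TPU
# initialization working even if the attribute is absent, leaving context
# parallelism at its default of 1.
ensure_model_parallel_initialized(
    parallel_config.tensor_parallel_size,
    parallel_config.pipeline_parallel_size,
    getattr(parallel_config, "context_parallel_size", 1),
)
```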
Code Review
This pull request introduces Prefill Context Parallelism (PCP) to enhance long-sequence inference by partitioning the sequence dimension. The changes are extensive, touching distributed state management, attention backends, KV cache coordination, and the model runner. The overall approach seems sound and consistent with existing parallelism strategies in vLLM. However, I found a critical issue in the GPU model runner where a CPU tensor is used as a mask for a GPU tensor, which will lead to a runtime error. This needs to be addressed.
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
cp_unpad_mask = self.cp_unpad_mask_cpu_tensor[
    :total_num_scheduled_tokens * self.cp_world_size]
```
The boolean mask cp_unpad_mask is on the CPU, while it's being used to index a GPU tensor cp_padded_slot_mapping on line 1302. This will cause a RuntimeError: Boolean mask must be on the same device as the self tensor. The mask should be moved to the GPU before being used for indexing.
Suggested change:
```python
cp_unpad_mask = self.cp_unpad_mask_cpu_tensor[
    :total_num_scheduled_tokens * self.cp_world_size].to(self.device, non_blocking=True)
```
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for the contribution! Excited for this!
I'll do a more thorough review later but my initial comment is:
Can we try to reduce the amount of changes overall? I don't think this needs to be this invasive, especially in the gpu_model_runner. The GPU model runner is already a very complex piece of code, making it very hard for contributors to keep up with; we should try our best not to add to the complexity.
```python
tp_size: int
pp_size: int
dcp_size: int
cp_size: int
```
I think we should call this pcp_size; thoughts?
```python
str(pp_size),
"--decode-context-parallel-size",
str(dcp_size),
"--context-parallel-size",
```
ditto earlier comment; I think this should be --prefill-context-parallel-size
Yes, I have renamed variables like cp_*/context_parallel_* to pcp_*/prefill_context_parallel_* to ensure they are distinct from dcp.
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
       arange,
       out=positions_np)
req_indices_for_slotmapping = req_indices
positions_np_for_slotmapping = positions_np
```
This feels a bit messy; maybe something like this would be cleaner:
```python
num_scheduled_tokens = np.array(tokens, dtype=np.int32)
...
# E.g., [2, 5, 3] -> [0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
req_indices = np.repeat(self.arange_np[:num_reqs],
                        num_scheduled_tokens)
...
np.add(self.input_batch.num_computed_tokens_cpu[req_indices],
       arange,
       out=positions_np)
self.input_batch.block_table.compute_slot_mapping(
    req_indices, positions_np)
self.input_batch.block_table.commit_slot_mapping(
    total_num_scheduled_tokens)
...
num_scheduled_tokens, positions_np = self._update_tokens_for_cp(  # This would have to be modified to filter
    num_scheduled_tokens, positions_np)
```
Thank you for the review. Moving compute_slot_mapping up to _update_tokens_for_cp indeed makes it clearer. I have made the changes and look forward to further review.
Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com>
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
Co-authored-by: LookAround <lixushi@huawei.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Force-pushed from e120ccb to d09cda7
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
I cloned the code and compiled from source. The server run command on H20 is as follows: (APIServer pid=206092) INFO: 10.72.1.61:51920 - "POST /v1/chat/completions HTTP/1.1" 200 OK. Is the pcp_pr code branch correct?
Could you please set
@FENP, I ran the command on the server and got the same error. The key error is
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
@riou-chen Thanks for trying it out and reporting the bug! This issue occurred because the size of our pre-allocated buffer was aligned with
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
@riou-chen During our GPU performance testing, we also identified performance degradation issues. Preliminary analysis suggests two potential causes:
Therefore, the current GPU implementation still has room for optimization. Additionally, based on our implementation and performance testing on NPU, DCP & PCP incur performance losses for short sequences (comparing CP vs. DP under the same TP parallelism). Performance gains are only observed in scenarios with long-sequence inputs exceeding 32K, and the benefits become more pronounced as the sequence length increases.
I'd like to add that CUDA Graph significantly affects performance, and it is not currently supported by PCP (WIP). [Update] PIECEWISE CUDA graphs have been supported since 1598b45.
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>




Purpose
This PR adds the Prefill Context Parallelism (PCP) feature, the prefill-stage counterpart of DCP. For specific implementation details, please refer to the RFC #25749.
TL;DR: PCP enhances long-sequence inference capabilities by partitioning the sequence dimension during the prefill stage.
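For intuition only, here is a minimal, self-contained sketch of the underlying idea rather than the PR's actual partitioning code; the names pcp_rank and pcp_world_size are illustrative:
```python
import numpy as np

def split_prefill_tokens(token_ids: np.ndarray, pcp_rank: int, pcp_world_size: int) -> np.ndarray:
    """Illustrative only: give each PCP rank a slice of the prompt along the sequence dimension.

    The real implementation (see the RFC and the ModelRunner changes) also handles
    padding, slot mapping, and cross-rank attention; this just shows the
    sequence-dimension split that PCP builds on.
    """
    chunks = np.array_split(token_ids, pcp_world_size)
    return chunks[pcp_rank]

# Example: a 10-token prompt prefilled across 2 PCP ranks.
prompt = np.arange(10)
print(split_prefill_tokens(prompt, pcp_rank=0, pcp_world_size=2))  # [0 1 2 3 4]
print(split_prefill_tokens(prompt, pcp_rank=1, pcp_world_size=2))  # [5 6 7 8 9]
```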
The current implementation primarily includes the following changes:
- ModelRunner.py for CP partitioning logic for tokens;
- flashinfer.py to adapt the FlashInfer backend for GQA to PCP;
- block_tables.py to extend the KV cache storage based on DCP & PCP;
- cp_group for PCP.
Test Plan
Qwen/Qwen3-32B
Test Result
gsm8k eval
TP2, batch size 256, max_out_len 1024
PCP1
PCP2
TODOs
Feature works (these items will be tackled in follow-up PRs; community contributions are warmly welcomed):