
Conversation


@LookAround0301 LookAround0301 commented Oct 15, 2025

Purpose

This PR adds the Prefill Context Parallelism (PCP) feature, the prefill-side counterpart of DCP. For specific implementation details, please refer to the RFC #25749.
TL;DR: PCP enhances long-sequence inference capabilities by partitioning the sequence dimension during the prefill stage.

The current implementation primarily includes the following changes:

  • Modified ModelRunner.py to add the CP token-partitioning logic;
  • Modified flashinfer.py to adapt the FlashInfer GQA backend to PCP;
  • Modified block_tables.py to extend KV cache storage for DCP & PCP;
  • Added a communication group cp_group for PCP;
  • Added the command-line arguments needed to control PCP parallelism.
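
For intuition, here is a minimal, hypothetical sketch of sequence-dimension partitioning during prefill. It is illustrative only: split_prefill_tokens and the pad-then-contiguous-shard scheme are assumptions for the example, not the partitioning actually implemented in ModelRunner.py.

import numpy as np

def split_prefill_tokens(token_ids: np.ndarray, pcp_rank: int, pcp_world_size: int) -> np.ndarray:
    # Pad the prompt to a multiple of the PCP world size, then give each rank
    # one contiguous shard of the sequence dimension.
    pad = (-len(token_ids)) % pcp_world_size
    padded = np.concatenate([token_ids, np.zeros(pad, dtype=token_ids.dtype)])
    shard_len = len(padded) // pcp_world_size
    return padded[pcp_rank * shard_len : (pcp_rank + 1) * shard_len]

# A 10-token prompt on 2 PCP ranks: no padding needed, 5 tokens per rank.
print(split_prefill_tokens(np.arange(10), pcp_rank=1, pcp_world_size=2))  # [5 6 7 8 9]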

Test Plan

Qwen/Qwen3-32B

export VLLM_ATTENTION_BACKEND='FLASHINFER'
vllm serve Qwen/Qwen3-32B --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --context-parallel-size 2

Test Result

gsm8k eval

TP2, batch size 256, max_out_len 1024

PCP1

dataset        version   metric     mode   vllm-api-stream-chat
gsm8kdataset   -         accuracy   gen    88.02

PCP2

dataset        version   metric     mode   vllm-api-stream-chat
gsm8kdataset   -         accuracy   gen    87.87

TODOs

  • UT for PCP;
  • Make PCP of flashinfer compatible with DCP after PR #25438 is merged;
  • Make block-level interleaved KV cache storage compatible after PR #26696 is merged;

Future work (these items will be tackled in follow-up PRs; community contributions are warmly welcomed):

  • PCP support for MLA and other backends;
  • PCP support for chunked-prefill and prefix caching features;
  • PCP support for MTP;
  • PCP support for CUDAFullGraph;
  • Ring-CP style attention backend algorithm, ref RFC #26133.


mergify bot commented Oct 15, 2025

⚠️ The sha of the head commit of this PR conflicts with #25852. Mergify cannot evaluate rules on this PR. ⚠️


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

ensure_model_parallel_initialized(
parallel_config.tensor_parallel_size,
parallel_config.pipeline_parallel_size)

P1: Pass context parallel size when initializing TPU groups

ensure_model_parallel_initialized now expects a context_model_parallel_size positional argument, but the TPU worker still calls it with only tensor and pipeline sizes. On TPU this call will immediately raise TypeError: ensure_model_parallel_initialized() missing 1 required positional argument, so the worker cannot start even when context parallelism is left at its default of 1. Forward the context-parallel size (e.g., parallel_config.context_parallel_size) or restore a default to keep TPU initialization functional.
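
A minimal sketch of the fix described above, assuming the new positional argument corresponds to parallel_config.context_parallel_size as the review suggests (not necessarily the exact code landed in this PR):

ensure_model_parallel_initialized(
    parallel_config.tensor_parallel_size,
    parallel_config.pipeline_parallel_size,
    parallel_config.context_parallel_size,  # forward the CP size so the TPU call matches the new signature
)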


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces Prefill Context Parallelism (PCP) to enhance long-sequence inference by partitioning the sequence dimension. The changes are extensive, touching distributed state management, attention backends, KV cache coordination, and the model runner. The overall approach seems sound and consistent with existing parallelism strategies in vLLM. However, I found a critical issue in the GPU model runner where a CPU tensor is used as a mask for a GPU tensor, which will lead to a runtime error. This needs to be addressed.

Comment on lines 1299 to 1300
cp_unpad_mask = self.cp_unpad_mask_cpu_tensor[
:total_num_scheduled_tokens*self.cp_world_size]

critical

The boolean mask cp_unpad_mask is on the CPU, while it's being used to index a GPU tensor cp_padded_slot_mapping on line 1302. This will cause a RuntimeError: Boolean mask must be on the same device as the self tensor. The mask should be moved to the GPU before being used for indexing.

Suggested change
- cp_unpad_mask = self.cp_unpad_mask_cpu_tensor[
-     :total_num_scheduled_tokens*self.cp_world_size]
+ cp_unpad_mask = self.cp_unpad_mask_cpu_tensor[
+     :total_num_scheduled_tokens*self.cp_world_size].to(self.device, non_blocking=True)
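
For reference, a minimal standalone illustration of the device-mismatch issue and the fix; the tensors here are made up for the example and are not vLLM code:

import torch

values = torch.arange(8, device="cuda")
mask_cpu = torch.tensor([True, False, True, False, True, False, True, False])
# Indexing a CUDA tensor with a CPU boolean mask raises a RuntimeError, as noted above;
# moving the mask to the same device first avoids it.
mask_gpu = mask_cpu.to(values.device, non_blocking=True)
print(values[mask_gpu])  # tensor([0, 2, 4, 6], device='cuda:0')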

@FENP FENP mentioned this pull request Oct 15, 2025
@mergify mergify bot added the v1 label Oct 15, 2025

mergify bot commented Oct 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LookAround0301.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Collaborator

@LucasWilkinson LucasWilkinson left a comment


Thanks for the contribution! Excited for this!

I'll do a more thorough review later but my initial comment is:

Can we try to reduce the amount of changes overall? I don't think this needs to be this invasive, especially in the gpu_model_runner. The GPU model runner is already a very complex piece of code, making it very hard for contributors to keep up with; we should try our best not to add to the complexity.

tp_size: int
pp_size: int
dcp_size: int
cp_size: int

I think we should call this pcp_size, thoughts?

str(pp_size),
"--decode-context-parallel-size",
str(dcp_size),
"--context-parallel-size",

ditto earlier comment; I think this should be --prefill-context-parallel-size


Yes, I have renamed variables like cp_*/context_parallel_* to pcp_*/prefill_context_parallel_* so they are distinct from dcp.

arange,
out=positions_np)
req_indices_for_slotmapping = req_indices
positions_np_for_slotmapping = positions_np

this feels a bit messy; maybe something like this would be cleaner?:

        num_scheduled_tokens = np.array(tokens, dtype=np.int32)
        ...
        # E.g., [2, 5, 3] -> [0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
        req_indices = np.repeat(self.arange_np[:num_reqs],
                                num_scheduled_tokens)
        ...

        np.add(self.input_batch.num_computed_tokens_cpu[req_indices],
               arange,
               out=positions_np)

        self.input_batch.block_table.compute_slot_mapping(
            req_indices, positions_np)
        self.input_batch.block_table.commit_slot_mapping(
            total_num_scheduled_tokens)

        ...
        num_scheduled_tokens, positions_np = self._update_tokens_for_cp( # This would have to be modified to filter 
             num_scheduled_tokens, positions_np)


Thank you for the review. Moving compute_slot_mapping ahead of _update_tokens_for_cp indeed makes it clearer. I have made the changes and look forward to further review.

@mergify mergify bot removed the needs-rebase label Oct 16, 2025
Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com>
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
Co-authored-by: LookAround <lixushi@huawei.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
@pisceskkk pisceskkk force-pushed the pcp_pr branch 2 times, most recently from e120ccb to d09cda7 on October 17, 2025 08:12
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>

riou-chen commented Oct 22, 2025

I cloned the code and compiled from source. The server run command on H20 is as follows:
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m vllm.entrypoints.openai.api_server --served-model-name QwQ3-32B --model huggingface/Qwen3-32B --swap-space 0 --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --prefill-context-parallel-size 2 --port 6005 --no-enable-prefix-caching
The client sends 8K/1K requests from multiple processes, but the server crashes. The bug info is as follows:

(APIServer pid=206092) INFO: 10.72.1.61:51920 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=206092) INFO: 10.72.1.61:51922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=206092) INFO: 10.72.1.61:51924 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=206092) INFO: 10.72.1.61:51926 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=206092) INFO 10-22 16:08:27 [loggers.py:181] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707] WorkerProc hit an exception.
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707] Traceback (most recent call last):
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]   File "/home/PCP/vllm/vllm/v1/executor/multiproc_executor.py", line 702, in worker_busy_loop
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]     output = func(*args, **kwargs)
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]   File "/home/PCP/vllm/vllm/v1/worker/worker_base.py", line 375, in execute_model
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]     return self.worker.execute_model(scheduler_output, *args, **kwargs)
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]     return func(*args, **kwargs)
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]   File "/home/PCP/vllm/vllm/v1/worker/gpu_worker.py", line 477, in execute_model
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]     output = self.model_runner.execute_model(scheduler_output, intermediate_tensors)
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]     return func(*args, **kwargs)
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]   File "/home/PCP/vllm/vllm/v1/worker/gpu_model_runner.py", line 2608, in execute_model
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]     ) = self._prepare_inputs(scheduler_output)
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]   File "/home/PCP/vllm/vllm/v1/worker/gpu_model_runner.py", line 1196, in _prepare_inputs
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]     num_scheduled_tokens, positions_cp = self._update_tokens_for_pcp(
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]   File "/home/PCP/vllm/vllm/v1/worker/gpu_model_runner.py", line 992, in _update_tokens_for_pcp
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707]     self.pcp_unpad_mask_cpu[: pcp_padded_arange.shape[0]] = (
(Worker_PCP0_TP1 pid=206364) ERROR 10-22 16:08:28 [multiproc_executor.py:707] ValueError: could not broadcast input array from shape (8194,) into shape (8192,)
(Worker_PCP1_TP0 pid=206365) ERROR 10-22 16:08:28 [multiproc_executor.py:707] WorkerProc hit an exception.
(Worker_PCP1_TP0 pid=206365) ERROR 10-22 16:08:28 [multiproc_executor.py:707] ValueError: could not broadcast input array from shape (8194,) into shape (8192,)

Is the pcp_pr code branch the right one to use?

Contributor

FENP commented Oct 22, 2025


Could you please set --max-num-batched-tokens (e.g. --max-num-batched-tokens 32768) and try again? This PR is not compatible with chunked prefill, so issues may occur when the sequence is too long.

@riou-chen

@FENP, I ran the command on the server and got the same error:
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m vllm.entrypoints.openai.api_server --served-model-name QwQ3-32B --model huggingface/Qwen3-32B --swap-space 0 --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --prefill-context-parallel-size 2 --max-num-batched-tokens 32768 --port 6005 --no-enable-prefix-caching --no-enable-chunked-prefill

The key error is
ValueError: could not broadcast input array from shape (32778,) into shape (32768,)
in multiproc_executor.py:707

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
@pisceskkk


@riou-chen Thanks for trying it out and reporting the bug! This issue occurred because the size of our pre-allocated buffer was aligned with max-num-batched-tokens, but the actual number of tokens after padding could exceed this value. We have fixed it and submitted a new commit. Please give it another try!
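
A hypothetical sketch of the sizing issue described above; the names and numbers are made up for illustration (consistent with the 8194-vs-8192 error), not the actual buffer code:

import numpy as np

pcp_world_size = 2
max_num_batched_tokens = 8192

# Per-request scheduled token counts that sum exactly to the batch limit.
num_scheduled_tokens = np.array([8191, 1], dtype=np.int64)

# Each request is padded up to a multiple of pcp_world_size before sharding,
# so the padded total can exceed a buffer sized only to max_num_batched_tokens.
padded = ((num_scheduled_tokens + pcp_world_size - 1) // pcp_world_size) * pcp_world_size
print(int(padded.sum()), ">", max_num_batched_tokens)  # 8194 > 8192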


mergify bot commented Oct 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LookAround0301.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 22, 2025
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
@riou-chen

Good, it runs successfully after merging the new commit, but performance decreases compared to the base.
I ran the PCP test with the command above, and the client sent requests with 8K/1K, 2K/1K, and 1K/1K input/output. The test results are as follows:

The 1K/1K base data:
image

The 1K/1K PCP data:
image

The 2K/1K base data:
image

The 2K/1K PCP data:
image

Is there any error in my testing method? Why isn't PCP effective?

FENP and others added 2 commits October 23, 2025 15:50
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
@pisceskkk

but performance decreases compared to the base.

@riou-chen During our GPU performance testing, we also identified performance degradation issues. Preliminary analysis suggests two potential causes:

  1. The custom mask computation introduces significant overhead. Although we have implemented initial optimizations (please refer to this commit), some performance loss still remains.
  2. We suspect that flashinfer may have performance degradation when handling custom mask computations (though there is no concrete evidence yet).

Therefore, the current GPU implementation still has room for optimization.

Additionally, based on our implementation and performance testing on NPU, DCP & PCP incur performance losses for short sequences (comparing CP vs. DP under the same TP parallelism). Performance gains are only observed in scenarios with long-sequence inputs exceeding 32K, and the benefits become more pronounced as the sequence length increases.

Contributor

FENP commented Oct 23, 2025


I'd like to add that CUDA Graph significantly affects performance, and it is not currently supported with PCP (WIP).

[Update] PIECEWISE CUDA graph has been supported since 1598b45.

FENP and others added 4 commits October 23, 2025 16:49
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>