[Feature] Support Prefill Context Parallel (PCP) for GQA flashinfer #26864
base: main
Conversation
💡 Codex Review
vllm/v1/worker/tpu_worker.py, lines 327 to 329 in 9f8290e:
```python
ensure_model_parallel_initialized(
    parallel_config.tensor_parallel_size,
    parallel_config.pipeline_parallel_size)
```
ensure_model_parallel_initialized now expects a context_model_parallel_size positional argument, but the TPU worker still calls it with only tensor and pipeline sizes. On TPU this call will immediately raise TypeError: ensure_model_parallel_initialized() missing 1 required positional argument, so the worker cannot start even when context parallelism is left at its default of 1. Forward the context-parallel size (e.g., parallel_config.context_parallel_size) or restore a default to keep TPU initialization functional.
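A minimal sketch of one possible fix, assuming the new third positional parameter described above and that parallel_config exposes a context_parallel_size attribute (falling back to 1 otherwise); this is not the actual patch:
```python
# Sketch only: forward the context-parallel size so the TPU worker call matches
# the updated ensure_model_parallel_initialized signature. getattr keeps TPU
# initialization working even if the attribute is absent, leaving context
# parallelism at its default of 1.
ensure_model_parallel_initialized(
    parallel_config.tensor_parallel_size,
    parallel_config.pipeline_parallel_size,
    getattr(parallel_config, "context_parallel_size", 1),
)
```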
Code Review
This pull request introduces Prefill Context Parallelism (PCP) to enhance long-sequence inference by partitioning the sequence dimension. The changes are extensive, touching distributed state management, attention backends, KV cache coordination, and the model runner. The overall approach seems sound and consistent with existing parallelism strategies in vLLM. However, I found a critical issue in the GPU model runner where a CPU tensor is used as a mask for a GPU tensor, which will lead to a runtime error. This needs to be addressed.
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
cp_unpad_mask = self.cp_unpad_mask_cpu_tensor[
    :total_num_scheduled_tokens * self.cp_world_size]
```
The boolean mask cp_unpad_mask is on the CPU, while it's being used to index a GPU tensor cp_padded_slot_mapping on line 1302. This will cause a RuntimeError: Boolean mask must be on the same device as the self tensor. The mask should be moved to the GPU before being used for indexing.
Suggested change:
```python
cp_unpad_mask = self.cp_unpad_mask_cpu_tensor[
    :total_num_scheduled_tokens * self.cp_world_size].to(self.device, non_blocking=True)
```
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for the contribution! Excited for this!
I'll do a more thorough review later but my initial comment is:
Can we try to reduce the amount of changes overall? I don't think this needs to be this invasive, especially in the gpu_model_runner. The GPU model runner is already a very complex piece of code, making it very hard for contributors to keep up with; we should try our best not to add to the complexity.
```python
tp_size: int
pp_size: int
dcp_size: int
cp_size: int
```
I think we should call this pcp_size; thoughts?
```python
str(pp_size),
"--decode-context-parallel-size",
str(dcp_size),
"--context-parallel-size",
```
ditto earlier comment; I think this should be --prefill-context-parallel-size
Yes, I have renamed variables like cp_*/context_parallel_* to pcp_*/prefill_context_parallel_* to ensure they are distinct from dcp.
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
       arange,
       out=positions_np)
req_indices_for_slotmapping = req_indices
positions_np_for_slotmapping = positions_np
```
This feels a bit messy; maybe something like this would be cleaner:
```python
num_scheduled_tokens = np.array(tokens, dtype=np.int32)
...
# E.g., [2, 5, 3] -> [0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
req_indices = np.repeat(self.arange_np[:num_reqs],
                        num_scheduled_tokens)
...
np.add(self.input_batch.num_computed_tokens_cpu[req_indices],
       arange,
       out=positions_np)
self.input_batch.block_table.compute_slot_mapping(
    req_indices, positions_np)
self.input_batch.block_table.commit_slot_mapping(
    total_num_scheduled_tokens)
...
num_scheduled_tokens, positions_np = self._update_tokens_for_cp(  # This would have to be modified to filter
    num_scheduled_tokens, positions_np)
```
Thank you for the review. Moving compute_slot_mapping up to _update_tokens_for_cp indeed makes it clearer. I have made the changes and look forward to further review.
Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com>
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
Co-authored-by: LookAround <lixushi@huawei.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Force-pushed from e120ccb to d09cda7
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
I cloned the code and compiled from source. The server run command on H20 is as follows: (APIServer pid=206092) INFO: 10.72.1.61:51920 - "POST /v1/chat/completions HTTP/1.1" 200 OK. Is the pcp_pr code branch correct?
Could you please set
@FENP, I ran the command on the server and got the same error. The key error is
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
@riou-chen Thanks for trying it out and reporting the bug! This issue occurred because the size of our pre-allocated buffer was aligned with
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
@riou-chen During our GPU performance testing, we also identified performance degradation issues. Preliminary analysis suggests two potential causes:
Therefore, the current GPU implementation still has room for optimization. Additionally, based on our implementation and performance testing on NPU, DCP & PCP incur performance losses for short sequences (comparing CP vs. DP under the same TP parallelism). Performance gains are only observed in scenarios with long-sequence inputs exceeding 32K, and the benefits become more pronounced as the sequence length increases.
I'd like to add that CUDA Graph significantly affects performance, and it is not currently supported by PCP (WIP). [Update] PIECEWISE CUDA graphs have been supported since 1598b45.
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>




Purpose
This PR adds the Prefill Context Parallelism (PCP) feature, the prefill-stage counterpart of DCP. For specific implementation details, please refer to the RFC #25749.
TL;DR: PCP enhances long-sequence inference capabilities by partitioning the sequence dimension during the prefill stage.
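For intuition only, here is a minimal, self-contained sketch of the underlying idea rather than the PR's actual partitioning code; the names pcp_rank and pcp_world_size are illustrative:
```python
import numpy as np

def split_prefill_tokens(token_ids: np.ndarray, pcp_rank: int, pcp_world_size: int) -> np.ndarray:
    """Illustrative only: give each PCP rank a slice of the prompt along the sequence dimension.

    The real implementation (see the RFC and the ModelRunner changes) also handles
    padding, slot mapping, and cross-rank attention; this just shows the
    sequence-dimension split that PCP builds on.
    """
    chunks = np.array_split(token_ids, pcp_world_size)
    return chunks[pcp_rank]

# Example: a 10-token prompt prefilled across 2 PCP ranks.
prompt = np.arange(10)
print(split_prefill_tokens(prompt, pcp_rank=0, pcp_world_size=2))  # [0 1 2 3 4]
print(split_prefill_tokens(prompt, pcp_rank=1, pcp_world_size=2))  # [5 6 7 8 9]
```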
The current implementation primarily includes the following changes:
- ModelRunner.py for CP partitioning logic for tokens;
- flashinfer.py to adapt the FlashInfer backend for GQA to PCP;
- block_tables.py to extend the KV cache storage based on DCP & PCP;
- cp_group for PCP.
Test Plan
Qwen/Qwen3-32B
Test Result
gsm8k eval
TP2, batch size 256, max_out_len 1024
PCP1
PCP2
TODOs
Feature works (these items will be tackled in follow-up PRs; community contributions are warmly welcomed):