
Conversation

@zhangsicheng5 (Contributor) commented Nov 29, 2025:

  1. support pcp + mtp in full graph
  2. pcp/dcp-related mtp bugfixes
  3. support pcp + mtpx

@github-actions bot commented:

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot left a comment:


Code Review

This pull request introduces support for PCP (Prefill Context Parallelism) and MTP (Multi-Token Prediction) in full graph mode, along with several related bug fixes. The changes correctly generalize PCP-only logic to accommodate DCP (Decode Context Parallelism) as well. A notable improvement is the handling of variable query lengths in speculative decoding batches, which replaces assumptions of fixed lengths with more robust logic. However, I've identified one critical issue in the implementation that needs to be addressed.
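
To make the fixed-length vs. variable-length point concrete, here is a minimal sketch — not the PR's actual code; `decode_threshold` and `query_lens` mirror the diff quoted below, and all values are illustrative:

```python
import torch

# Illustrative values; in the PR these come from the scheduler/runner.
num_decode_reqs = 3
decode_threshold = 2                       # assumed max tokens per decode request
query_lens = torch.tensor([1, 2, 2, 5])    # host tensor of per-request query lengths

# Fixed-length assumption: every decode request contributes exactly
# decode_threshold tokens -- wrong once speculative draft lengths vary.
num_tokens_fixed = num_decode_reqs * decode_threshold          # 6

# Variable-length logic: count the tokens actually scheduled for the
# decode requests at the front of the batch.
num_tokens_actual = query_lens[:num_decode_reqs].sum().item()  # 5

print(num_tokens_fixed, num_tokens_actual)  # 6 5
```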

@zhangsicheng5 force-pushed the mtp branch 3 times, most recently from b08104c to b330b75, on December 1, 2025 02:11.
@weijinqian0 added the ready (read for review) and ready-for-test (start test by label for PR) labels on Dec 1, 2025.
```diff
  # prefill target_hidden_states: pcp split
- num_tokens_d = num_decode_reqs * self.decode_threshold
+ query_lens_d = self.runner.query_lens[:num_decode_reqs]
+ num_tokens_d = query_lens_d.sum().item()
```
Contributor commented:

Is `query_lens_d` a device tensor? If so, calling `query_lens_d.sum().item()` will incur a blocking host-device sync; please fix it.

@zhangsicheng5 (Contributor, Author) replied:

This is a host tensor; see model_runner_v1.py: `self.query_lens = torch.from_numpy(num_scheduled_tokens)`. It therefore does not trigger a host-device sync. The `_d` suffix is an abbreviation of `_decode`.
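
To illustrate the distinction, a minimal sketch assuming a PyTorch environment (the array contents are made up):

```python
import numpy as np
import torch

num_scheduled_tokens = np.array([1, 4, 4, 2], dtype=np.int64)

# As in model_runner_v1.py: torch.from_numpy shares the NumPy buffer and
# always produces a CPU (host) tensor.
query_lens = torch.from_numpy(num_scheduled_tokens)
assert query_lens.device.type == "cpu"

num_decode_reqs = 3
query_lens_d = query_lens[:num_decode_reqs]   # _d = _decode

# On a host tensor, .sum().item() is an ordinary CPU read; no device
# stream is involved, so nothing blocks.
num_tokens_d = query_lens_d.sum().item()
print(num_tokens_d)  # 9

# By contrast, if query_lens lived on an accelerator (e.g. after a
# .to(device) copy), .item() would stall the host until the device
# caught up -- the sync the review comment warns about.
```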

@github-actions bot commented Dec 2, 2025:

This pull request has conflicts; please resolve them before we can evaluate the pull request.

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>