forked from vllm-project/vllm
Implement Pipeline Parallelism support for HPU. #1000
Merged
Conversation
kwisniewski98 approved these changes on Apr 3, 2025
LGTM
/run-gaudi-tests
michalkuligowski approved these changes on Apr 7, 2025
/run-gaudi-tests
…lism. Signed-off-by: jmaksymczuk <jmaksymczuk@habana.ai>
/run-gaudi-tests
jmaksymczuk pushed a commit that referenced this pull request on Apr 9, 2025
This PR implements HPU support for pipeline parallelism. Tested accuracy and it matches TP accuracy on:
- Llama3.1-70b-Instruct
- Llama3.2-3b-Instruct
- Mixtral-8x7b

To serve with PP:

`VLLM_DECODE_BS_BUCKET_MIN=384 VLLM_DECODE_BLOCK_BUCKET_MAX=896 vllm serve /mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-70B-Instruct/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-num-seqs 384 --disable-log-requests --dtype bfloat16 --gpu-memory-util 0.9 --disable-log-stats --num_scheduler_steps 1 --max-num-batched-tokens 2048 --max-model-len 256 --block-size 128`

Known issues:
* Since with pipeline parallelism max_num_seqs acts as a microbatch for a single virtual_engine, a larger batch_size hits a very specific corner case and produces a flat_pa error -> set batch_size to approximately the batch size you would use with TP, divided by pp_size (see the sketch after this message).
* Delayed sampling is not yet compatible with pipeline parallelism.
* The virtual_engine ID is passed to HPUGraph, which results in pp_size times the number of graphs.

---------

Signed-off-by: jmaksymczuk <jmaksymczuk@habana.ai>
Co-authored-by: Rafal Litka <rlitka@habana.ai>
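A minimal sketch of the batch-size rule of thumb from the known issues above; the helper function and the example numbers are illustrative assumptions, not part of this PR or of vLLM's API.

```python
# Rule of thumb from the known issues above: with pipeline parallelism,
# max_num_seqs is the microbatch per virtual_engine, so start from the
# batch size you would use with TP and divide it by pp_size.
def suggested_max_num_seqs(tp_batch_size: int, pp_size: int) -> int:
    # Illustrative helper, not a vLLM API.
    return max(1, tp_batch_size // pp_size)

# Example: if a TP setup used, say, --max-num-seqs 1536, then with
# --pipeline-parallel-size 4 a starting point would be 1536 // 4 = 384,
# matching the serve command above.
print(suggested_max_num_seqs(1536, 4))  # -> 384
```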
michalkuligowski added a commit that referenced this pull request on Apr 10, 2025
This PR implements HPU support for pipeline parallelism. Tested accuracy and it matches TP accuracy on:
- Llama3.1-70b-Instruct
- Llama3.2-3b-Instruct
- Mixtral-8x7b

To serve with PP:

`VLLM_DECODE_BS_BUCKET_MIN=384 VLLM_DECODE_BLOCK_BUCKET_MAX=896 vllm serve /mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-70B-Instruct/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-num-seqs 384 --disable-log-requests --dtype bfloat16 --gpu-memory-util 0.9 --disable-log-stats --num_scheduler_steps 1 --max-num-batched-tokens 2048 --max-model-len 256 --block-size 128`

Known issues:
* Since with pipeline parallelism max_num_seqs acts as a microbatch for a single virtual_engine, a larger batch_size hits a very specific corner case and produces a flat_pa error -> set batch_size to approximately the batch size you would use with TP, divided by pp_size.
* Delayed sampling is not yet compatible with pipeline parallelism.
* The virtual_engine ID is passed to HPUGraph, which results in pp_size times the number of graphs.

Signed-off-by: jmaksymczuk <jmaksymczuk@habana.ai>
Co-authored-by: Rafal Litka <rlitka@habana.ai>
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
czhu15 pushed a commit that referenced this pull request on May 15, 2025
- Enable PP solution with full support for DeepSeek R1 execution with PP > 1.
- Requires 1.21.0 or newer. Does not support 1.20.1 or older.
- Implementation mirrors #1000 as closely as possible while ensuring DeepSeek R1 functions fully.
- Adds a benchmark script for sweeping various configs automatically. This can be removed if you feel it shouldn't be merged to the deepseek_r1 branch.

Additional validation is being done by yabai.hu@intel.com. @czhu15 youlei.yang@intel.com please help start the review in the meantime.

Signed-off-by: Voas, Tanner <tanner.voas@intel.com>
Co-authored-by: Hu, Yabai <yabai.hu@intel.com>
Co-authored-by: Ji, Kunshang <kunshang.ji@intel.com>
Co-authored-by: Sheng, Yi <yi.sheng@intel.com>
Co-authored-by: Chen, Xinyu <xinyu1.chen@intel.com>
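The benchmark sweep script itself is not shown in this conversation; below is a rough sketch of what such a config sweep could look like, assuming it launches `vllm serve` for each configuration and probes the OpenAI-compatible endpoint. The model path, sweep grid, polling logic, and flag values are all illustrative assumptions, not the script from that commit.

```python
# Hypothetical sweep runner: start `vllm serve` for each pipeline-parallel
# size / max_num_seqs pair, wait for the OpenAI-compatible endpoint, then
# time a short completion request.
import itertools
import subprocess
import time

import requests

MODEL = "/path/to/DeepSeek-R1"   # placeholder path, adjust to your setup
PP_SIZES = [2, 4, 8]             # example sweep grid
MAX_NUM_SEQS = [64, 128]

for pp, seqs in itertools.product(PP_SIZES, MAX_NUM_SEQS):
    server = subprocess.Popen([
        "vllm", "serve", MODEL,
        "--pipeline-parallel-size", str(pp),
        "--max-num-seqs", str(seqs),
        "--dtype", "bfloat16",
    ])
    try:
        # Poll until the server answers (simplified readiness check).
        for _ in range(180):
            try:
                if requests.get("http://localhost:8000/v1/models", timeout=2).ok:
                    break
            except requests.RequestException:
                pass
            time.sleep(5)
        t0 = time.time()
        r = requests.post(
            "http://localhost:8000/v1/completions",
            json={"model": MODEL, "prompt": "Hello", "max_tokens": 32},
            timeout=120,
        )
        print(f"pp={pp} max_num_seqs={seqs} status={r.status_code} "
              f"latency={time.time() - t0:.1f}s")
    finally:
        server.terminate()
        server.wait()
```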
This PR implements HPU support for pipeline parallelism.

Tested accuracy and it matches TP accuracy on:
- Llama3.1-70b-Instruct
- Llama3.2-3b-Instruct
- Mixtral-8x7b

To serve with PP:

`VLLM_DECODE_BS_BUCKET_MIN=384 VLLM_DECODE_BLOCK_BUCKET_MAX=896 vllm serve /mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-70B-Instruct/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-num-seqs 384 --disable-log-requests --dtype bfloat16 --gpu-memory-util 0.9 --disable-log-stats --num_scheduler_steps 1 --max-num-batched-tokens 2048 --max-model-len 256 --block-size 128`

Known issues:
* Since with pipeline parallelism max_num_seqs acts as a microbatch for a single virtual_engine, a larger batch_size hits a very specific corner case and produces a flat_pa error -> set batch_size to approximately the batch size you would use with TP, divided by pp_size.
* Delayed sampling is not yet compatible with pipeline parallelism.
* The virtual_engine ID is passed to HPUGraph, which results in pp_size times the number of graphs.
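As a quick smoke test once the server above is up, a generic completion request against vLLM's OpenAI-compatible endpoint could look like the snippet below; the port is vLLM's default and the prompt and max_tokens values are arbitrary, not something prescribed by this PR.

```python
# Minimal client check against the PP-served endpoint started above.
# Assumes the default vLLM port (8000) and the model path from the
# serve command; max_tokens stays well under --max-model-len 256.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "/mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-70B-Instruct/",
        "prompt": "Pipeline parallelism on HPU",
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```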