forked from vllm-project/vllm
Implement Pipeline Parallelism support for HPU. #1000
Merged
Conversation
kwisniewski98 approved these changes on Apr 3, 2025
LGTM
/run-gaudi-tests
michalkuligowski approved these changes on Apr 7, 2025
/run-gaudi-tests
…lism. Signed-off-by: jmaksymczuk <jmaksymczuk@habana.ai>
/run-gaudi-tests
jmaksymczuk pushed a commit that referenced this pull request on Apr 9, 2025
This PR implements HPU support for pipeline parallelism. Tested accuracy and it matches TP accuracy on:
- Llama3.1-70b-Instruct
- Llama3.2-3b-Instruct
- Mixtral-8x7b

To serve with PP:

`VLLM_DECODE_BS_BUCKET_MIN=384 VLLM_DECODE_BLOCK_BUCKET_MAX=896 vllm serve /mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-70B-Instruct/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-num-seqs 384 --disable-log-requests --dtype bfloat16 --gpu-memory-util 0.9 --disable-log-stats --num_scheduler_steps 1 --max-num-batched-tokens 2048 --max-model-len 256 --block-size 128`

Known issues:
* Since with pipeline parallelism max_num_seqs acts as a microbatch for a single virtual_engine, a larger batch_size hits a very specific corner case and produces a flat_pa error -> set batch_size to approximately the batch size you would use with TP, divided by pp_size (see the sketch after this message).
* Delayed sampling is not yet compatible with pipeline parallelism.
* The virtual_engine ID is passed to HPUGraph, which results in pp_size times the number of graphs.

---------

Signed-off-by: jmaksymczuk <jmaksymczuk@habana.ai>
Co-authored-by: Rafal Litka <rlitka@habana.ai>
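A minimal sketch of the batch-size rule of thumb from the known issues above; the helper function and the example numbers are illustrative assumptions, not part of this PR or of vLLM's API.

```python
# Rule of thumb from the known issues above: with pipeline parallelism,
# max_num_seqs is the microbatch per virtual_engine, so start from the
# batch size you would use with TP and divide it by pp_size.
def suggested_max_num_seqs(tp_batch_size: int, pp_size: int) -> int:
    # Illustrative helper, not a vLLM API.
    return max(1, tp_batch_size // pp_size)

# Example: if a TP setup used, say, --max-num-seqs 1536, then with
# --pipeline-parallel-size 4 a starting point would be 1536 // 4 = 384,
# matching the serve command above.
print(suggested_max_num_seqs(1536, 4))  # -> 384
```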
michalkuligowski added a commit that referenced this pull request on Apr 10, 2025
This PR implements HPU support for pipeline parallelism. Tested accuracy and it matches TP accuracy on:
- Llama3.1-70b-Instruct
- Llama3.2-3b-Instruct
- Mixtral-8x7b

To serve with PP:

`VLLM_DECODE_BS_BUCKET_MIN=384 VLLM_DECODE_BLOCK_BUCKET_MAX=896 vllm serve /mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-70B-Instruct/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-num-seqs 384 --disable-log-requests --dtype bfloat16 --gpu-memory-util 0.9 --disable-log-stats --num_scheduler_steps 1 --max-num-batched-tokens 2048 --max-model-len 256 --block-size 128`

Known issues:
* Since with pipeline parallelism max_num_seqs acts as a microbatch for a single virtual_engine, a larger batch_size hits a very specific corner case and produces a flat_pa error -> set batch_size to approximately the batch size you would use with TP, divided by pp_size.
* Delayed sampling is not yet compatible with pipeline parallelism.
* The virtual_engine ID is passed to HPUGraph, which results in pp_size times the number of graphs.

Signed-off-by: jmaksymczuk <jmaksymczuk@habana.ai>
Co-authored-by: Rafal Litka <rlitka@habana.ai>
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
czhu15 pushed a commit that referenced this pull request on May 15, 2025
- Enable PP solution with full support for DeepSeek R1 execution with PP > 1.
- Requires 1.21.0 or newer. Does not support 1.20.1 or older.
- Implementation mirrors #1000 as closely as possible while ensuring DeepSeek R1 functions fully.
- Adds a benchmark script for sweeping various configs automatically. This can be removed if you feel it shouldn't be merged to the deepseek_r1 branch.

Additional validation is being done by yabai.hu@intel.com. @czhu15 youlei.yang@intel.com please help start the review in the meantime.

Signed-off-by: Voas, Tanner <tanner.voas@intel.com>
Co-authored-by: Hu, Yabai <yabai.hu@intel.com>
Co-authored-by: Ji, Kunshang <kunshang.ji@intel.com>
Co-authored-by: Sheng, Yi <yi.sheng@intel.com>
Co-authored-by: Chen, Xinyu <xinyu1.chen@intel.com>
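The benchmark sweep script itself is not shown in this conversation; below is a rough sketch of what such a config sweep could look like, assuming it launches `vllm serve` for each configuration and probes the OpenAI-compatible endpoint. The model path, sweep grid, polling logic, and flag values are all illustrative assumptions, not the script from that commit.

```python
# Hypothetical sweep runner: start `vllm serve` for each pipeline-parallel
# size / max_num_seqs pair, wait for the OpenAI-compatible endpoint, then
# time a short completion request.
import itertools
import subprocess
import time

import requests

MODEL = "/path/to/DeepSeek-R1"   # placeholder path, adjust to your setup
PP_SIZES = [2, 4, 8]             # example sweep grid
MAX_NUM_SEQS = [64, 128]

for pp, seqs in itertools.product(PP_SIZES, MAX_NUM_SEQS):
    server = subprocess.Popen([
        "vllm", "serve", MODEL,
        "--pipeline-parallel-size", str(pp),
        "--max-num-seqs", str(seqs),
        "--dtype", "bfloat16",
    ])
    try:
        # Poll until the server answers (simplified readiness check).
        for _ in range(180):
            try:
                if requests.get("http://localhost:8000/v1/models", timeout=2).ok:
                    break
            except requests.RequestException:
                pass
            time.sleep(5)
        t0 = time.time()
        r = requests.post(
            "http://localhost:8000/v1/completions",
            json={"model": MODEL, "prompt": "Hello", "max_tokens": 32},
            timeout=120,
        )
        print(f"pp={pp} max_num_seqs={seqs} status={r.status_code} "
              f"latency={time.time() - t0:.1f}s")
    finally:
        server.terminate()
        server.wait()
```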
This PR implements HPU support for pipeline parallelism.

Tested accuracy and it matches TP accuracy on:
- Llama3.1-70b-Instruct
- Llama3.2-3b-Instruct
- Mixtral-8x7b

To serve with PP:

`VLLM_DECODE_BS_BUCKET_MIN=384 VLLM_DECODE_BLOCK_BUCKET_MAX=896 vllm serve /mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-70B-Instruct/ --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-num-seqs 384 --disable-log-requests --dtype bfloat16 --gpu-memory-util 0.9 --disable-log-stats --num_scheduler_steps 1 --max-num-batched-tokens 2048 --max-model-len 256 --block-size 128`

Known issues:
* Since with pipeline parallelism max_num_seqs acts as a microbatch for a single virtual_engine, a larger batch_size hits a very specific corner case and produces a flat_pa error -> set batch_size to approximately the batch size you would use with TP, divided by pp_size.
* Delayed sampling is not yet compatible with pipeline parallelism.
* The virtual_engine ID is passed to HPUGraph, which results in pp_size times the number of graphs.
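As a quick smoke test once the server above is up, a generic completion request against vLLM's OpenAI-compatible endpoint could look like the snippet below; the port is vLLM's default and the prompt and max_tokens values are arbitrary, not something prescribed by this PR.

```python
# Minimal client check against the PP-served endpoint started above.
# Assumes the default vLLM port (8000) and the model path from the
# serve command; max_tokens stays well under --max-model-len 256.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "/mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-70B-Instruct/",
        "prompt": "Pipeline parallelism on HPU",
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```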