[MISC] Introduce pipeline parallelism partition strategies #6920
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI, as it is required to merge (or just use auto-merge).
@comaniac very nice!! Can we make a more balanced strategy the default?
I'm not against that, but I didn't perform very comprehensive testing, so I'm not sure whether this is consistently beneficial or whether it will cause OOM or other side effects for other models. Bringing it in first and making it the default after a while might be safer.
Thanks for the change! To confirm, was the benchmark PP=8 on one node? Also, it would be nice if there were an easy way for users to register their own partition strategy, or a manual strategy. We don't need to advertise this widely, but an easy way to manually specify splits would be great for power users.
Yes, this is PP=8 on one node. For the registration, I currently put the registry in a standalone file. All you need to do is add a function to that file and use the env variable to enable it. I feel this should be sufficient for power users to configure and experiment with. Please let me know if you have better suggestions.
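A minimal sketch of what such a standalone registry file could look like (every identifier below is an assumption for illustration, not the PR's actual code):

```python
# Hypothetical sketch of a partition-strategy registry file; all names
# here are illustrative assumptions, not the PR's actual identifiers.
from typing import Callable, Dict, List

# Maps a strategy name (selected via the env variable) to a function
# that, given (num_layers, pp_size), returns per-stage layer counts.
_PARTITION_STRATEGIES: Dict[str, Callable[[int, int], List[int]]] = {}


def register_partition_strategy(name: str):
    """Power users add a function below and enable it by name."""
    def wrap(fn: Callable[[int, int], List[int]]):
        _PARTITION_STRATEGIES[name] = fn
        return fn
    return wrap


@register_partition_strategy("even")
def even_partition(num_layers: int, pp_size: int) -> List[int]:
    # Distribute layers as evenly as possible; earlier stages absorb
    # the remainder one layer at a time.
    base, extra = divmod(num_layers, pp_size)
    return [base + (1 if i < extra else 0) for i in range(pp_size)]
```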
The most complicated part of this feature is how to allow users to register the partition strategy. Instead of adding more automatic options, how about directly adding an env var VLLM_PP_LAYER_PARTITION with pp_size items that sum to the number of layers?
Can you provide an example of what it would look like?
e.g.
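(A hypothetical illustration, assuming an 80-layer model and pp_size=4: four items summing to 80, with fewer layers on the first and last stages.)

```
VLLM_PP_LAYER_PARTITION="18,22,22,18"
```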
That's a good idea to allow custom partitions. I feel we could support both, with the built-in options serving as effective partition algorithms. WDYT?
My opinion is that this is very difficult to optimize automatically. It depends on the layer size and the lm head / word embedding size. Say there is a crazy model with a large word embedding layer and a small lm head layer... I prefer just adding VLLM_PP_LAYER_PARTITION.
I don't think the built-in options would cause this issue, though, because users still need to explicitly choose them. If none of the built-in options works, then users can customize the partition. In other words, the built-in options are more like syntactic sugar. Otherwise it would be tedious if a user wants to apply a particular algorithm to all models and 90% of them are actually applicable.
Sorry, I didn't get time to read the code in detail. Can you briefly mention which strategy works for which scenario, and how users can choose one?
I had a similar idea as @youkaichao: just specifying the partition using an env var. Possibly we could repurpose the same env var and use [] to parse it as a list?
This is definitely doable, but let me clarify: are you supporting the idea that we only allow users to specify a partition list without providing other built-in methods, or are you supporting both? This is not a huge design change and it's not facing most users, so I don't want to spend too much time on it. If most people feel it's better not to provide partitions beyond the even partition and to have users specify a partition list instead, then I'm fine with that.
Sorry, I should clarify: I think we should do both. We can just overload the same environment variable to do so. I.e., a less advanced user can specify a built-in strategy by name, whereas a more advanced user might want to play around and could specify an explicit partition list.
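A hypothetical illustration of this overloading (the strategy name reuses the PR's own example, and the numbers assume an 80-layer model at pp_size=8):

```
# Less advanced user: pick a built-in strategy by name
VLLM_PP_LAYER_PARTITION=less_layer_in_first_last_partition

# Power user: give explicit per-stage layer counts, parsed as a list
VLLM_PP_LAYER_PARTITION="[8,11,11,11,11,11,11,6]"
```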
I like the original idea of being able to register partitions using the registry file.
I don't support the registry. This kind of custom registration happened before, when we supported out-of-tree model registration; see #5657 for more details. That's why I don't want this functionality to be too complicated. Users should do the calculation themselves and only give vLLM the partition result via VLLM_PP_LAYER_PARTITION.
This PR doesn't pass a function via the environment variable, just its name. Does that also not work? To me it works the same way as passing a list.
Ok, I took a deep look at #5667 and got your point about not using a registry. On the other hand, this is just an implementation detail. We can still provide built-in methods with a static dict. So the design choice we should discuss is whether we should provide some built-in methods so that users can configure the partition by name, in addition to configuring it by a partition list.
Yes, you are right. In my opinion, configuring it by a partition list is straightforward and feature-complete. But I don't have time to figure out the benefit of the strategies provided in this PR and when to use them. We can also chat off-GitHub for more details.
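Whichever surface is chosen, the partition list eventually has to be parsed into a per-rank layer range. Here is a minimal sketch of that parsing, assuming the env var and validation discussed in this thread (illustrative, not the PR's final code):

```python
# Hypothetical sketch of turning the partition-list env var into a
# per-rank layer range; identifiers are assumptions, not the final code.
import os
from typing import Tuple


def get_pp_layer_range(pp_rank: int, pp_size: int,
                       num_hidden_layers: int) -> Tuple[int, int]:
    partition_list_str = os.getenv("VLLM_PP_LAYER_PARTITION")
    if partition_list_str is None:
        # Fall back to an even split; the last rank absorbs any remainder.
        per_rank = num_hidden_layers // pp_size
        start_layer = pp_rank * per_rank
        end_layer = start_layer + per_rank
        if pp_rank == pp_size - 1:
            end_layer = num_hidden_layers
        return start_layer, end_layer
    partitions = [int(s) for s in partition_list_str.split(",")]
    if len(partitions) != pp_size:
        raise ValueError("Number of partitions does not match the pp size.")
    if sum(partitions) != num_hidden_layers:
        raise ValueError("Partition sizes must sum to the number of layers.")
    start_layer = sum(partitions[:pp_rank])
    return start_layer, start_layer + partitions[pp_rank]
```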
It would be great if we could at least make it easy for users to pick an optimal default for key model + topology combinations, for example tp=8 x pp=2 for 16-bit llama-3.1-405b.
Fell back to removing the built-in partition strategies per offline discussion. PTAL @youkaichao
vllm/distributed/utils.py
```python
        partition_list_str)) from err
    if len(partitions) != pp_size:
        raise ValueError(
            "Number of partitions does not match the number of workers.")
```
nit
"Number of partitions does not match the number of workers.") | |
"Number of partitions does not match the pp size.") |
```python
    if pp_rank == pp_size - 1:
        end_layer = num_hidden_layers
```
Can we add a logging.info here to tell the user how many layers the model has in total and how many layers this pp rank takes? This helps users verify that their env var works. It is especially helpful in multi-node settings, where setting env vars consistently is difficult.
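A minimal sketch of the requested log line, assuming the variable names from the snippet above (num_hidden_layers, pp_rank, start_layer, end_layer):

```python
import logging

logger = logging.getLogger(__name__)

# Log the resolved split so users can confirm their env var took effect;
# this matters most when env vars must be set consistently across nodes.
logger.info(
    "Model has %d hidden layers in total; pp_rank %d takes layers [%d, %d)",
    num_hidden_layers, pp_rank, start_layer, end_layer)
```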
Thanks for addressing my comments! Left some nit comments.
@comaniac I directly fixed the nit comments. Thanks for the contribution!
This is a simple PR that introduces PP partition strategies to allow more interesting partition algorithms. The motivation is that I found the current PP kv-cache capacity is limited by the smallest capacity across all PP stages, and this is usually the first or the last stage because of the embedding and LM head. Thus, simply letting the first and last stages take fewer layers can improve overall throughput. Here is a simple benchmark result of Llama-3-70B-FP8 with 0.02% of the CNN/DailyMail summarization task on 8xL4 GPUs.

In this example we got a 1.46x throughput improvement with this partition strategy.

For simplicity, this PR uses vLLM env vars to control the partition strategy. This can be improved in the future.
For example:
VLLM_PIPELINE_PARTITION_STRATEGY=less_layer_in_first_last_partition
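For reference, a strategy with this name might be implemented along the following lines. This is a hypothetical sketch based on the motivation above (the first and last stages also hold the embedding and LM head, so they get fewer transformer layers); the PR's actual heuristic may differ:

```python
# Hypothetical sketch of a "fewer layers on first/last stage" strategy;
# the shift amount is an illustrative assumption, not the PR's heuristic.
from typing import List


def less_layer_in_first_last_partition(num_layers: int,
                                       pp_size: int) -> List[int]:
    assert pp_size >= 3 and num_layers >= 2 * pp_size
    base, rem = divmod(num_layers, pp_size)
    sizes = [base] * pp_size
    sizes[-1] += rem
    # Move two layers each off the first and last stages (which also
    # hold the embedding and LM head) onto the middle stages.
    shift = 2
    sizes[0] -= shift
    sizes[-1] -= shift
    for i in range(2 * shift):
        sizes[1 + i % (pp_size - 2)] += 1
    assert sum(sizes) == num_layers
    return sizes
```

For an 80-layer model at PP=8 this sketch yields [8, 11, 11, 11, 11, 10, 10, 8], moving four layers from the boundary stages into the middle.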
cc @andoorve @Yard1 @SolitaryThinker