
[MISC] Introduce pipeline parallelism partition strategies #6920

Merged — 6 commits into vllm-project:main on Jul 31, 2024

Conversation

@comaniac (Collaborator) commented Jul 29, 2024

This is a simple PR that introduces PP partition strategies to allow more interesting partition algorithms. The motivation: I found that the current PP kv-cache capacity is limited by the smallest capacity among all PP stages, and this is usually the first or the last stage because of the embedding and LM head. Thus, simply letting the first and last stages take fewer layers can improve overall throughput. Here is a simple benchmark result for Llama-3-70B-FP8 on 0.02% of the CNN/DailyMail summarization task on 8xL4 GPUs.

| Strategy | kv-cache blocks | E2E time (s) |
|---|---|---|
| Baseline (even partition) | 1476 | 244.53 |
| Balanced | 3078 | 167.68 |

In this example we got a 1.46x throughput improvement with this partition strategy.

For simplicity, this PR uses vLLM env vars to control the partition strategy. This can be improved in the future.
For example: VLLM_PIPELINE_PARTITION_STRATEGY=less_layer_in_first_last_partition
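As a rough sketch of what such a strategy could compute (hypothetical code, not the PR's actual implementation; function and parameter names are illustrative):

```python
# Hypothetical sketch of a "fewer layers in first/last stage" partition.
# Returns a per-stage layer-count list summing to num_layers.

def even_partition(num_layers: int, pp_size: int) -> list:
    parts = [num_layers // pp_size] * pp_size
    for i in range(num_layers % pp_size):
        parts[i] += 1  # spread any remainder over the first stages
    return parts

def less_layers_in_first_last(num_layers: int, pp_size: int,
                              shave: int = 1) -> list:
    """Shift `shave` layers off the first and last stages, whose memory
    budget is already eaten into by the embedding and LM head."""
    parts = even_partition(num_layers, pp_size)
    if pp_size >= 3:
        moved = min(shave, parts[0] - 1)
        parts[0] -= moved
        parts[1] += moved
        moved = min(shave, parts[-1] - 1)
        parts[-1] -= moved
        parts[-2] += moved
    assert sum(parts) == num_layers
    return parts
```

Under these assumptions, an 80-layer model at PP=8 would go from [10] * 8 to [9, 11, 10, 10, 10, 10, 11, 9], freeing kv-cache blocks on the two edge stages.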

cc @andoorve @Yard1 @SolitaryThinker


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small but essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@comaniac comaniac requested a review from youkaichao July 29, 2024 23:03
@njhill (Member) commented Jul 29, 2024

@comaniac very nice!! Can we make a more balanced strategy the default?

@comaniac (Collaborator, Author)

> @comaniac very nice!! Can we make a more balanced strategy the default?

I'm not against that, but I didn't perform very comprehensive testing, so I'm not sure whether this is consistently beneficial or whether it will cause OOM or other side effects for other models. Bringing it in first and making it the default after a while might be safer.

@andoorve (Collaborator)

Thanks for the change! To confirm, was the benchmark PP=8 on one node?

Also, it would be nice if there were an easy way for users to register their own partition strategy, or to specify one manually. We don't need to advertise this widely, but an easy way to manually specify splits would be great for power users.

@comaniac (Collaborator, Author)

Yes, this is PP=8 on one node.

For the registration, I currently put the registry in a standalone file. All you need to do is add a function to that file and use the env variable to enable it. I feel this should be sufficient for power users to configure and experiment with. Please let me know if you have better suggestions.

@youkaichao (Member) left a comment

The most complicated part of this feature is how to allow users to register the partition strategy. Instead of adding more automatic options, how about directly adding an env var VLLM_PP_LAYER_PARTITION with pp_size items that sum to the number of layers?

@comaniac (Collaborator, Author)

> The most complicated part of this feature is how to allow users to register the partition strategy. Instead of adding more automatic options, how about directly adding an env var VLLM_PP_LAYER_PARTITION with pp_size items that sum to the number of layers?

Can you provide an example of what it would look like?

@youkaichao (Member)

e.g. export VLLM_PP_LAYER_PARTITION=7,9 when you have pp size = 2 and 16 layers.
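A minimal sketch of how such an env var might be read and validated (names and error messages are illustrative, not vLLM's actual code):

```python
import os

def get_pp_partition(pp_size: int, num_hidden_layers: int) -> list:
    """Read VLLM_PP_LAYER_PARTITION, falling back to an even split."""
    raw = os.environ.get("VLLM_PP_LAYER_PARTITION")
    if raw is None:
        # No override: partition the layers as evenly as possible.
        parts = [num_hidden_layers // pp_size] * pp_size
        for i in range(num_hidden_layers % pp_size):
            parts[i] += 1
        return parts
    try:
        parts = [int(x) for x in raw.split(",")]
    except ValueError as err:
        raise ValueError(f"Invalid partition string: {raw!r}") from err
    if len(parts) != pp_size:
        raise ValueError("Number of partitions does not match the pp size.")
    if sum(parts) != num_hidden_layers:
        raise ValueError("Partition must sum to the number of layers.")
    return parts
```

With `export VLLM_PP_LAYER_PARTITION=7,9`, `get_pp_partition(2, 16)` would return `[7, 9]`; unset, it falls back to `[8, 8]`.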

@comaniac (Collaborator, Author)

Allowing a custom partition is a good idea. I feel we could support both, with the built-in options serving as effective partition algorithms. WDYT?

@youkaichao (Member)

My opinion is that this is very difficult to optimize automatically. It depends on the layer size and the lm head / word embedding size. Say there is a crazy model with a large word embedding layer and a small lm head layer...

I prefer just adding VLLM_PP_LAYER_PARTITION for advanced users to enforce the partition.

@comaniac (Collaborator, Author)

I don't think the built-in options would cause this issue though, because users still need to explicitly choose them. If none of the built-in options works, users can customize the partition. In other words, the built-in options are more like syntactic sugar. Otherwise it would be tedious if a user wants to apply a particular algorithm to all models and 90% of them are actually applicable.

@youkaichao (Member)

Sorry, I didn't get time to read the code in detail. Can you briefly mention which strategy works for which scenario, and how users can choose one?

@andoorve (Collaborator)

I had a similar idea as @youkaichao: just specify the partition using an env var. Possibly we could repurpose the same env var and use [] to parse it as a list?

@comaniac (Collaborator, Author)

> I had a similar idea as @youkaichao, just specifying the partition using env var. Possibly we could repurpose the same env var and use [] to parse as a list?

This is definitely doable, but let me clarify: are you supporting the idea that we only allow users to specify a partition list without providing other built-in methods, or are you supporting both? This is not a huge design change and it isn't something most users will face, so I don't want to spend too much time on it. If most people feel it's better not to provide partitions other than the even partition, and to require users to specify a partition list instead, then I'm fine with that.

@andoorve (Collaborator)

Sorry, I should clarify - I think we should do both. We can just overload the same environment variable to do so. I.e., a less advanced user can specify:

VLLM_PIPELINE_PARTITION_STRATEGY=less_layer_in_first_last_partition

whereas a more advanced user might want to play around and could specify using []. For example:

VLLM_PIPELINE_PARTITION_STRATEGY=[2,3,3,2]

I like the original idea of being able to register partitions using register_partition_strategy, as it allows us in the future to extend this ourselves with more partition policies, or maybe even provide plugin functionality. It would just be good to have the manual option as well.
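The overloading idea above could be parsed roughly like this (hypothetical sketch; the function name and dispatch are not from the PR):

```python
from typing import List, Union

def parse_partition_setting(value: str) -> Union[str, List[int]]:
    """Return a list of ints for "[2,3,3,2]"-style values; otherwise
    treat the value as the name of a built-in strategy."""
    value = value.strip()
    if value.startswith("[") and value.endswith("]"):
        # Explicit per-stage split for power users.
        return [int(x) for x in value[1:-1].split(",")]
    # Named built-in strategy for everyone else.
    return value
```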

@youkaichao (Member)

I don't support register_partition_strategy, because it does not work with Ray (a major use case for multi-node PP). If users use this functionality to register their own partition, only the driver worker knows about the partition, while the rest of the workers don't. Ray workers are forked from the raylet and execute vLLM code only.

This custom-registration problem came up before, when we added out-of-the-tree model registration. See #5657 for more details.

That's why I don't want this functionality to be too complicated. Users should do the calculation and only give vLLM the partition result via export VLLM_PP_LAYER_PARTITION=7,9. Using an environment variable to pass the partition result works in all cases, while using an environment variable to pass a custom function is much more complicated.

@comaniac (Collaborator, Author) commented Jul 30, 2024

This PR doesn't pass a function via the environment variable, just its name. Does that also not work? To me it works the same way as passing a list.

@comaniac (Collaborator, Author)

OK, I took a deep look at #5667 and got your point about not using a registry. On the other hand, this is just an implementation detail; we could still provide built-in methods with a static dict.

So the design choice we should discuss is "whether we should provide some built-in methods so that users can configure it by name, in addition to configuring it by a partition list".

@youkaichao (Member)

> So the design choice we should discuss is "whether we could provide some built in methods so that users could configure it by name, in addition to configure it by a partition list".

Yes, you are right. In my opinion, configuring it by a partition list is straightforward and feature-complete. But I don't have time to figure out the benefit of the strategies provided in this PR and when to use them. We can also chat off-GitHub for more details.

@njhill (Member) commented Jul 30, 2024

It would be great if we could at least make it easy for users to pick up an optimal default for key model + topology combinations, for example tp=8 x pp=2 for 16-bit llama-3.1-405b.

@comaniac (Collaborator, Author)

Fell back to removing the built-in partition strategies per offline discussion. PTAL @youkaichao

    partition_list_str)) from err
    if len(partitions) != pp_size:
        raise ValueError(
            "Number of partitions does not match the number of workers.")
Member review comment:

nit — suggested change:

-    "Number of partitions does not match the number of workers.")
+    "Number of partitions does not match the pp size.")

    if pp_rank == pp_size - 1:
        end_layer = num_hidden_layers

Member review comment:

Can we add a logging.info here to tell the user how many layers there are in total and how many layers this PP rank has? This helps users verify that their env var works. It is especially helpful when it is difficult to set env vars consistently across multiple nodes.
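A minimal sketch of the kind of log line being asked for (assumed names and message format; the merged PR's actual code may differ):

```python
import logging

logger = logging.getLogger(__name__)

def log_partition(pp_rank: int, start_layer: int, end_layer: int,
                  num_hidden_layers: int) -> None:
    # One line per rank lets users confirm VLLM_PP_LAYER_PARTITION
    # took effect consistently on every node.
    logger.info("PP rank %d gets layers %d-%d (%d of %d total layers)",
                pp_rank, start_layer, end_layer - 1,
                end_layer - start_layer, num_hidden_layers)
```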

@youkaichao (Member) left a comment

Thanks for addressing my comments! Left some nit comments.

@youkaichao (Member)

@comaniac I directly fixed the nit comments. Thanks for the contribution!

@youkaichao youkaichao merged commit bd70013 into vllm-project:main Jul 31, 2024
14 of 16 checks passed
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
@comaniac comaniac deleted the partition branch September 16, 2024 23:12
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024