[MISC] Introduce pipeline parallelism partition strategies #6920
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI, as it is required to merge (or just use auto-merge).
@comaniac very nice!! Can we make a more balanced strategy the default?
I'm not against that, but I didn't perform very comprehensive testing, so I'm not sure whether this is consistently beneficial or whether it will cause OOM or other side effects for other models. Bringing it in first and making it the default after a while might be safer.
Thanks for the change! To confirm, was the benchmark PP=8 on one node? Also, it would be nice if there were an easy way for users to register their own partition strategy, or a manual strategy. We don't need to advertise this widely, but an easy way to manually specify splits would be great for power users.
Yes, this is PP=8 on one node. For the registration, I currently put the registry in a standalone file. All you need to do is add a function to that file and use the env variable to enable it. I feel this should be sufficient for power users to configure and experiment with. Please let me know if you have better suggestions.
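A minimal sketch of what such a standalone registry file could look like (every identifier below is an assumption for illustration, not the PR's actual code):

```python
# Hypothetical sketch of a partition-strategy registry file; all names
# here are illustrative assumptions, not the PR's actual identifiers.
from typing import Callable, Dict, List

# Maps a strategy name (selected via the env variable) to a function
# that, given (num_layers, pp_size), returns per-stage layer counts.
_PARTITION_STRATEGIES: Dict[str, Callable[[int, int], List[int]]] = {}


def register_partition_strategy(name: str):
    """Power users add a function below and enable it by name."""
    def wrap(fn: Callable[[int, int], List[int]]):
        _PARTITION_STRATEGIES[name] = fn
        return fn
    return wrap


@register_partition_strategy("even")
def even_partition(num_layers: int, pp_size: int) -> List[int]:
    # Distribute layers as evenly as possible; earlier stages absorb
    # the remainder one layer at a time.
    base, extra = divmod(num_layers, pp_size)
    return [base + (1 if i < extra else 0) for i in range(pp_size)]
```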
The most complicated part of this feature is how to allow users to register the partition strategy. Instead of adding more automatic options, how about directly adding an env var VLLM_PP_LAYER_PARTITION with pp_size items that sum to the number of layers?
Can you provide an example of what it would look like?
e.g.
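(A hypothetical illustration, assuming an 80-layer model and pp_size=4: four items summing to 80, with fewer layers on the first and last stages.)

```
VLLM_PP_LAYER_PARTITION="18,22,22,18"
```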
That's a good idea to allow custom partitions. I feel we could support both, with the built-in options serving as effective partition algorithms. WDYT?
My opinion is that this is very difficult to optimize automatically. It depends on the layer size and the lm head / word embedding size. Say there is a crazy model with a large word embedding layer and a small lm head layer... I prefer just adding VLLM_PP_LAYER_PARTITION.
I don't think the built-in options would cause this issue, though, because users still need to explicitly choose them. If none of the built-in options works, then users can customize the partition. In other words, the built-in options are more like syntactic sugar. Otherwise it would be tedious if a user wants to apply a particular algorithm to all models and 90% of them are actually applicable.
Sorry, I didn't get time to read the code in detail. Can you briefly mention which strategy works for which scenario, and how users can choose one?
I had a similar idea as @youkaichao: just specifying the partition using an env var. Possibly we could repurpose the same env var and use [] to parse it as a list?
This is definitely doable, but let me clarify: are you supporting the idea that we only allow users to specify a partition list without providing other built-in methods, or are you supporting both? This is not a huge design change and it's not facing most users, so I don't want to spend too much time on it. If most people feel it's better not to provide partitions beyond the even partition and to have users specify a partition list instead, then I'm fine with that.
Sorry, I should clarify: I think we should do both. We can just overload the same environment variable to do so. I.e., a less advanced user can specify a built-in strategy by name, whereas a more advanced user might want to play around and could specify an explicit partition list.
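A hypothetical illustration of this overloading (the strategy name reuses the PR's own example, and the numbers assume an 80-layer model at pp_size=8):

```
# Less advanced user: pick a built-in strategy by name
VLLM_PP_LAYER_PARTITION=less_layer_in_first_last_partition

# Power user: give explicit per-stage layer counts, parsed as a list
VLLM_PP_LAYER_PARTITION="[8,11,11,11,11,11,11,6]"
```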
I like the original idea of being able to register partitions using the registry file.
I don't support the registry. This kind of custom registration happened before, when we supported out-of-tree model registration; see #5657 for more details. That's why I don't want this functionality to be too complicated. Users should do the calculation themselves and only give vLLM the partition result via VLLM_PP_LAYER_PARTITION.
This PR doesn't pass a function via the environment variable, just its name. Does that also not work? To me it works the same way as passing a list.
Ok, I took a deep look at #5667 and got your point about not using a registry. On the other hand, this is just an implementation detail. We can still provide built-in methods with a static dict. So the design choice we should discuss is whether we should provide some built-in methods so that users can configure the partition by name, in addition to configuring it by a partition list.
Yes, you are right. In my opinion, configuring it by a partition list is straightforward and feature-complete. But I don't have time to figure out the benefit of the strategies provided in this PR and when to use them. We can also chat off-GitHub for more details.
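Whichever surface is chosen, the partition list eventually has to be parsed into a per-rank layer range. Here is a minimal sketch of that parsing, assuming the env var and validation discussed in this thread (illustrative, not the PR's final code):

```python
# Hypothetical sketch of turning the partition-list env var into a
# per-rank layer range; identifiers are assumptions, not the final code.
import os
from typing import Tuple


def get_pp_layer_range(pp_rank: int, pp_size: int,
                       num_hidden_layers: int) -> Tuple[int, int]:
    partition_list_str = os.getenv("VLLM_PP_LAYER_PARTITION")
    if partition_list_str is None:
        # Fall back to an even split; the last rank absorbs any remainder.
        per_rank = num_hidden_layers // pp_size
        start_layer = pp_rank * per_rank
        end_layer = start_layer + per_rank
        if pp_rank == pp_size - 1:
            end_layer = num_hidden_layers
        return start_layer, end_layer
    partitions = [int(s) for s in partition_list_str.split(",")]
    if len(partitions) != pp_size:
        raise ValueError("Number of partitions does not match the pp size.")
    if sum(partitions) != num_hidden_layers:
        raise ValueError("Partition sizes must sum to the number of layers.")
    start_layer = sum(partitions[:pp_rank])
    return start_layer, start_layer + partitions[pp_rank]
```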
It would be great if we could at least make it easy for users to pick an optimal default for key model + topology combinations, for example tp=8 x pp=2 for 16-bit llama-3.1-405b.
Fell back to removing the built-in partition strategies per offline discussion. PTAL @youkaichao
vllm/distributed/utils.py
```python
        partition_list_str)) from err
    if len(partitions) != pp_size:
        raise ValueError(
            "Number of partitions does not match the number of workers.")
```
nit
"Number of partitions does not match the number of workers.") | |
"Number of partitions does not match the pp size.") |
```python
    if pp_rank == pp_size - 1:
        end_layer = num_hidden_layers
```
Can we add a logging.info here to tell the user how many layers the model has in total and how many layers this pp rank takes? This helps users verify that their env var works. It is especially helpful in multi-node settings, where setting env vars consistently is difficult.
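A minimal sketch of the requested log line, assuming the variable names from the snippet above (num_hidden_layers, pp_rank, start_layer, end_layer):

```python
import logging

logger = logging.getLogger(__name__)

# Log the resolved split so users can confirm their env var took effect;
# this matters most when env vars must be set consistently across nodes.
logger.info(
    "Model has %d hidden layers in total; pp_rank %d takes layers [%d, %d)",
    num_hidden_layers, pp_rank, start_layer, end_layer)
```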
Thanks for addressing my comments! Left some nit comments.
@comaniac I directly fixed the nit comments. Thanks for the contribution!
This is a simple PR that introduces PP partition strategies to allow more interesting partition algorithms. The motivation is that I found the current PP kv-cache capacity is limited by the smallest capacity across all PP stages, and this is usually the first or the last stage because of the embedding and LM head. Thus, simply letting the first and last stages take fewer layers can improve overall throughput. Here is a simple benchmark result of Llama-3-70B-FP8 with 0.02% of the CNN/DailyMail summarization task on 8xL4 GPUs.

In this example we got a 1.46x throughput improvement with this partition strategy.

For simplicity, this PR uses vLLM env vars to control the partition strategy. This can be improved in the future.
For example:
VLLM_PIPELINE_PARTITION_STRATEGY=less_layer_in_first_last_partition
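For reference, a strategy with this name might be implemented along the following lines. This is a hypothetical sketch based on the motivation above (the first and last stages also hold the embedding and LM head, so they get fewer transformer layers); the PR's actual heuristic may differ:

```python
# Hypothetical sketch of a "fewer layers on first/last stage" strategy;
# the shift amount is an illustrative assumption, not the PR's heuristic.
from typing import List


def less_layer_in_first_last_partition(num_layers: int,
                                       pp_size: int) -> List[int]:
    assert pp_size >= 3 and num_layers >= 2 * pp_size
    base, rem = divmod(num_layers, pp_size)
    sizes = [base] * pp_size
    sizes[-1] += rem
    # Move two layers each off the first and last stages (which also
    # hold the embedding and LM head) onto the middle stages.
    shift = 2
    sizes[0] -= shift
    sizes[-1] -= shift
    for i in range(2 * shift):
        sizes[1 + i % (pp_size - 2)] += 1
    assert sum(sizes) == num_layers
    return sizes
```

For an 80-layer model at PP=8 this sketch yields [8, 11, 11, 11, 11, 10, 10, 8], moving four layers from the boundary stages into the middle.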
cc @andoorve @Yard1 @SolitaryThinker