[WIP][Core] Support tensor parallel division with remainder of attention heads #5367
Conversation
Looks amazing
Some builds are failing?
@NadavShmayo please run the formatting script.
I've run the formatting script before pushing and fixed all the problems; is there something I'm missing regarding this? EDIT: There seems to be a problem with either the formatting script or the CI, as the formatting script formats the imports in a way the CI rejects. I also fixed some of the failing tests for now and will hopefully fix the rest later today.
@simon-mo @mgoin I'll also start working on a separate pull request to improve the performance of this implementation.
Could you add a TP=3 case to a current distributed test so we can have this actively tested?
vllm/model_executor/layers/linear.py (outdated)

```diff
 tp_size = get_tensor_model_parallel_world_size()
 assert self.quant_method is not None
-self.output_size_per_partition = divide(self.output_size, tp_size)
+self.output_size_per_partition = output_size // tp_size + (
+    output_size % tp_size > tp_rank)
```
Could you separate this into its own variable as the remainder for clarity, and/or add a comment describing the intended behavior? The condition makes it a bit unclear.
Nice catch, this should use the util function I added for this logic, which should make it more readable.
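For reference, a minimal sketch of what such a helper could look like; the name `split_size_with_remainder` and its signature are hypothetical and not necessarily the utility added in this PR:

```python
def split_size_with_remainder(total_size: int, tp_size: int, tp_rank: int) -> int:
    """Size of this rank's shard when total_size does not divide evenly:
    the first (total_size % tp_size) ranks each get one extra element."""
    base, remainder = divmod(total_size, tp_size)
    return base + (1 if tp_rank < remainder else 0)


# Splitting 4096 output features across 3 ranks -> 1366, 1365, 1365
assert [split_size_with_remainder(4096, 3, r) for r in range(3)] == [1366, 1365, 1365]
```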
```diff
@@ -49,7 +51,7 @@
 from vllm.sequence import IntermediateTensors, SamplerOutput


-@torch.compile
+#@torch.compile
```
Please restore the `@torch.compile` decorator.
FYI, I think the complexity of needing to edit each model definition makes this PR difficult to accept and commit to. Supporting this uneven TP scenario will need careful consideration, so I believe getting it into the next release might be optimistic.
@mgoin I've fixed the code according to your comments and added some test cases for this PR, thanks! Regarding your last comment, I understand it does seem like a major change, but wouldn't it be possible to do this gradually? Anyway, for most models the changes required to support this should be minor.
Thanks @NadavShmayo, I think there may be quite a bit of interest in this being supported. I agree with @mgoin though that we'd want to minimize the invasiveness of the changes as much as possible, and it is unlikely to make it into the imminent release.
@njhill Would it be better if I add a list of models that support this feature, plus a validation during setup to make sure the model supports it when it's used? That way we can avoid changing all the model architectures at once. Anyway, I'll make sure to be attentive to comments on this PR so it can be merged as soon as possible, while being careful not to impair existing vLLM functionality.
Frontier models such as Qwen2 cannot be run until this is merged. Many frontier agent models use prime numbers for the attention head count, so the attention head count or vocabulary size is not divisible across the usual 1/2/4/8 GPU counts, and other frontier models divide evenly only across 1 or 20 GPUs. Frontier sub-7B models are also affected. I think the change should be merged soon, or it will delay testing and evaluation for frontier agent model research. Since not all models can currently be loaded by vLLM, uniform evaluation and performance benchmarking of frontier agent models is impossible.
Which Qwen2 models cannot be run?
Given that this change is too intrusive, I've noted in the docs that we should use pipeline parallelism in this case.
@youkaichao I do understand the concerns about how intrusive this change is.

Pipeline parallel performance

Tensor parallel performance

Alternative suggestion

I suggest the following alternative, which avoids making intrusive changes in the code while still allowing this feature to be implemented, at the cost of some code duplication: we could add a flag to explicitly enable this feature (uneven tensor parallel distribution), and when this flag is enabled we initialize a special, dedicated model class in place of the regular one. This way we don't change the implementation of any existing model or layer, while still getting this pull request's significant performance improvements in this scenario. What do you think?
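A rough sketch of how such an opt-in path could be wired up; every name below is hypothetical and not part of vLLM's API:

```python
# Map each supported architecture to a duplicated class that implements uneven
# tensor parallel sharding; everything else keeps the existing code path.
UNEVEN_TP_MODEL_REGISTRY = {
    "LlamaForCausalLM": "UnevenTPLlamaForCausalLM",  # hypothetical duplicated class
}


def resolve_model_class_name(architecture: str, enable_uneven_tp: bool) -> str:
    """Return the model class name to instantiate, honoring the opt-in flag."""
    if not enable_uneven_tp:
        return architecture
    if architecture not in UNEVEN_TP_MODEL_REGISTRY:
        raise ValueError(
            f"Uneven tensor parallel distribution is not supported for {architecture}")
    return UNEVEN_TP_MODEL_REGISTRY[architecture]
```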
It is expected that TP is faster than PP if you can hold the model in a single node. It is more about maintenance burden: we have limited bandwidth and cannot maintain this feature, so we will only support common use cases in the main branch. I have seen thousands of issues with detailed environment reports, and I can confidently say that the 3-GPU use case is very rare. My opinion is that the vLLM codebase will only support the 3-GPU case via PP. Meanwhile, you can maintain another repo to keep this feature; in fact, your approach of a dedicated opt-in model class could live there. I think that's how open source works: you cannot expect to add all the features to the main branch. Let's build the ecosystem and the community together.
I definitely have a need for this feature, and I'm pretty sure many others will also need it to be available in vLLM. I don't see any reason to use more GPUs than necessary to load a given model: if I can load a model on exactly 5 GPUs, why would I need to allocate 8 GPUs for it? Here's my situation and requirements:
I use pipeline_parallel_size = 1 and set tensor_parallel_size to the exact number of GPUs the model needs, based on the vLLM documentation for distributed inference: https://docs.vllm.ai/en/latest/serving/distributed_serving.html. So anything that can be done to move away from the current constraints of 2, 4, 8, 16 will be highly beneficial for a lot of enterprises; this is common feedback I hear from people using vLLM. Everything else about vLLM is absolutely great and awesome, no doubt whatsoever.
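For context, this is roughly the kind of call such deployments boil down to; today the `tensor_parallel_size` below must evenly divide the model's attention heads, which is exactly the constraint this PR relaxes (model name and GPU count are illustrative only):

```python
from vllm import LLM

# Shard the model across exactly the number of GPUs it needs (5 here),
# rather than rounding up to the next supported TP size.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # illustrative model id
    tensor_parallel_size=5,
    pipeline_parallel_size=1,
)
print(llm.generate("Hello, world!")[0].outputs[0].text)
```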
@NadavShmayo Do you still plan to implement this feature via the plugin framework added in #7426?
Solves #1041 and #5003
This pull request adds support for tensor parallel division with a remainder (only for some models; others could easily be changed to support it as well), meaning you can divide any number of attention heads across any number of GPUs in a tensor parallel manner.
For example, I have a use case where I want to deploy a 70B model that doesn't quite fit on 2x A100-80GB GPUs, so I want to deploy it on 3x A100-80GB GPUs, which vLLM currently doesn't support.
Weight sharding
I implemented this by loading the remainder of the weights onto the first GPUs: for example, when sharding a weight matrix of size 4096 across 3 GPUs, we shard it as 1366, 1365, 1365.
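As an illustration, a small sketch of how the shards could be computed under this scheme (the helper name is hypothetical, not code from this PR):

```python
import torch


def shard_sizes(total: int, num_shards: int) -> list[int]:
    """Spread `total` across `num_shards`, giving the remainder to the first shards."""
    base, remainder = divmod(total, num_shards)
    return [base + (1 if i < remainder else 0) for i in range(num_shards)]


weight = torch.randn(4096, 4096)
shards = torch.split(weight, shard_sizes(weight.shape[0], 3), dim=0)
print([s.shape[0] for s in shards])  # [1366, 1365, 1365]
```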
Grouped Query Attention sharding
For attention weights with GQA (Grouped-Query-Attention) this is a bit more complex, as we want each group of attention heads to be loaded on the same GPU and include both the query matrices and key-value matrices. In this case we shard in the same manner as before, but ensure the shard size is a multiple of the attention heads per group:
32 Attention heads with 8 Key-Value heads -> (12 Attention heads with 3 Key-Value Heads, 12 Attention heads with 3 Key-Value Heads, 8 Attention heads with 2 Key-Value Heads).
Perhaps this logic could be optimized to avoid some GPUs loading far fewer heads than others.
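A minimal sketch of the group-aligned split described above (the function name is hypothetical); it also makes the remaining imbalance easy to see:

```python
def shard_gqa_heads(num_heads: int, num_kv_heads: int, tp_size: int) -> list[tuple[int, int]]:
    """Split (query heads, KV heads) across ranks so that every GQA group
    (one KV head plus its query heads) stays on a single rank."""
    group_size = num_heads // num_kv_heads  # query heads per KV head
    base, remainder = divmod(num_kv_heads, tp_size)
    kv_per_rank = [base + (1 if rank < remainder else 0) for rank in range(tp_size)]
    return [(kv * group_size, kv) for kv in kv_per_rank]


# 32 attention heads with 8 KV heads on 3 GPUs -> [(12, 3), (12, 3), (8, 2)]
print(shard_gqa_heads(32, 8, 3))
```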
Further steps