[RFC]: Model architecture plugins #7124
Comments
@NadavShmayo - Is this plugin now available for use with the latest vLLM?
I definitely have a need for this feature, and I'm pretty sure many others do as well. I don't see any reason to use more GPUs than necessary to load a given model. For example, if I can load a model on exactly 5 GPUs, why would I need to allocate 8 GPUs for it? Here's my situation and requirements:
I use pipeline_parallel_size = 1 and set tensor_parallel_size to the exact number of GPUs a model needs, based on the vLLM documentation for distributed inference (https://docs.vllm.ai/en/latest/serving/distributed_serving.html). So anything that can be done to move away from the current constraints of 2, 4, 8, or 16 will be highly beneficial for a lot of enterprises. This is common feedback I hear from people using vLLM. Everything else about vLLM is absolutely great and awesome, no doubt whatsoever.
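For concreteness, my setup looks roughly like the sketch below (the model name and GPU count are illustrative; the TP value is computed dynamically per model in my pipeline):

```python
from vllm import LLM

# Illustrative sketch only: tensor_parallel_size is set to the exact number
# of GPUs the model's weights need; pipeline parallelism is not used.
required_gpus = 4  # e.g. computed from model size and per-GPU memory

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=required_gpus,
    pipeline_parallel_size=1,
    gpu_memory_utilization=0.9,
)
```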
@sekh77 I don't get it. You can just use pipeline parallelism.
@youkaichao - Here's my understanding of pipeline_parallel_size. In my case, if I use pipeline_parallel_size = 3 and force tensor_parallel_size to be the exact number of GPUs in a node (which is 4), the world size in vLLM becomes 4 * 3 = 12. This means that when I attempt to load LLaMA 3.1 70B, which actually requires only 3 GPUs, this configuration will load it across all 12 GPUs. Now, with gpu_memory_utilization=0.9, I have no memory left in any of the 12 GPUs to load any other model, because vLLM loads all weights, intermediate states, gradients, KV cache, etc. on every GPU. I'm unable to reduce gpu_memory_utilization below 0.7, as that runs into OOM due to the size of the KV cache for these models. I'm also helping it a little by specifying cpu_offload_gb=10. If this understanding is incorrect, please let me know. I'm absolutely OK with adjusting my configuration based on whatever guidelines you advise; my objective is to meet the requirements I described in my previous message.
If 3 GPUs are enough to hold the model, you can just use pipeline_parallel_size=3.
Ok. TP is calculated dynamically in my inference service pipeline. Assume I find a way to dynamically override the TP calculation from 3 to 1 for LLaMA 3.1 70B; how will this solve for Mixtral 8x22B and Databricks DBRX, which require exactly 5 and 4 GPUs to hold the models, respectively?
In your script, you just need to use the calculated tensor parallel size as the new pipeline parallel size.
Are you suggesting keeping tp = 1 always, and setting pp to the calculated number of GPUs for a model?
So for LLaMA 3.1 70B, tp = 1, pp = 3. Is that what you are suggesting?
Yes.
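A minimal sketch of that configuration (the model name is just an example; depending on your vLLM version you may need the equivalent `--tensor-parallel-size 1 --pipeline-parallel-size 3` flags on the OpenAI-compatible server instead of the offline entrypoint):

```python
from vllm import LLM

# Sketch: shard LLaMA 3.1 70B across 3 GPUs by pipeline stage instead of
# tensor slices, so the GPU count no longer has to divide the head count.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=1,
    pipeline_parallel_size=3,
)
```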
Ok, got it. Let me try this. Will let you know. Thank you.
@youkaichao - It worked as per my expectations. Thank you very much for suggesting that route. I have an additional question, though: since TP is now 1, I think there is no tensor parallelism anymore in my case. Am I losing anything with respect to inference throughput? Right now, with PP, the latency is in milliseconds for these models over InfiniBand connectivity, but I'm not sure what happens when I scale concurrency.
With pipeline parallelism, you lose some latency, but the throughput should be the same. Be sure to submit enough requests to saturate the pipeline.
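For example, when benchmarking, submit a large batch (or many concurrent API requests) rather than one prompt at a time, so every pipeline stage stays busy. A rough sketch with made-up prompts:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=1,
    pipeline_parallel_size=3,
)

# Hundreds of in-flight requests keep all three pipeline stages busy;
# feeding prompts one at a time would leave two stages idle most of the time.
prompts = [f"Summarize item {i} in one sentence." for i in range(512)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
```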
Motivation.
As a continuation of #5367 - since that merge request was rejected and I have to maintain my own fork to support this scenario, I suggest adding support in vLLM for model architecture plugins.
This will allow vLLM to easily add new model architectures without changing vLLM's core logic, and support scenarios such as uneven GPU tensor parallelism.
We could build an ecosystem of model architecture plugins, which could greatly accelerate new model support without risking existing functionality.
Proposed Change.
Supporting this in its basic form is simple, as we just have to add loaded plugins to the `ModelRegistry`. To support more complex model architectures (such as in the #5367 case), we should also decouple the `Config` class, which provides the number of attention heads, from vLLM's core logic, and allow each model architecture to override these values.
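As a rough sketch of what a plugin package could look like (the package name, module layout, and the `register()` discovery hook are hypothetical; `ModelRegistry.register_model` is the existing out-of-tree registration call this would build on):

```python
# my_model_plugin/__init__.py - hypothetical model architecture plugin package
from vllm import ModelRegistry


def register() -> None:
    """Hook that vLLM could invoke when discovering installed plugins."""
    # Imported lazily so plugin discovery stays cheap and doesn't pull in
    # heavy model code at vLLM import time.
    from my_model_plugin.modeling import UnevenTPLlamaForCausalLM

    # Map the HF config's `architectures` entry to this plugin's implementation.
    ModelRegistry.register_model(
        "UnevenTPLlamaForCausalLM", UnevenTPLlamaForCausalLM
    )
```

vLLM could then scan a dedicated entry-point group at startup and call each plugin's `register()` before resolving the model architecture.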
Feedback Period.
No response
CC List.
@youkaichao
Any Other Things.
Just to make it clear: I'll be happy to implement this, but I want to hear some feedback before I go ahead and implement it.