[RFC]: Model architecture plugins #7124

Open
NadavShmayo opened this issue Aug 4, 2024 · 14 comments

@NadavShmayo
Contributor

Motivation.

As a continuation of #5367: since that merge request was rejected and I now have to maintain my own fork to support this scenario, I suggest adding support for model architecture plugins to vLLM.
This would allow new model architectures to be added without touching vLLM's core logic, and would support scenarios such as uneven GPU tensor parallelism.

We could build an ecosystem of model architecture plugins, which could significantly accelerate new model support without risking existing functionality.

Proposed Change.

Supporting this in its basic form is simple: we just have to add loaded plugins to the ModelRegistry (see the sketch below).
To support more complex model architectures (such as the #5367 case), we should decouple the Config class, which provides the number of attention heads, from vLLM's core logic, and allow each model architecture to override these values.
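
As a rough illustration of what the basic form could look like, a plugin package might expose a registration hook that calls the existing ModelRegistry.register_model API. This is only a sketch: the entry-point group name, package name, and model class below are hypothetical, and the actual discovery mechanism is exactly what this RFC would need to define.

```python
# my_vllm_plugin/__init__.py -- a minimal sketch, not a finished design.
# ModelRegistry.register_model is vLLM's existing hook for out-of-tree models;
# the entry-point group "vllm.model_plugins", the package my_vllm_plugin, and
# MyUnevenTPForCausalLM are hypothetical names used only for illustration.

def register() -> None:
    from vllm import ModelRegistry
    from my_vllm_plugin.modeling import MyUnevenTPForCausalLM  # hypothetical model class

    # Map the architecture string from config.json to the plugin's implementation.
    ModelRegistry.register_model("MyUnevenTPForCausalLM", MyUnevenTPForCausalLM)

# The plugin would advertise the hook in its pyproject.toml, e.g.:
# [project.entry-points."vllm.model_plugins"]
# my_plugin = "my_vllm_plugin:register"
```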

Feedback Period.

No response

CC List.

@youkaichao

Any Other Things.

Just to make it clear: I'll be happy to implement this, but I want to hear some feedback before I go ahead and implement it.

@DarkLight1337
Member

Potentially related: #7067 introduces an easy way to compose vLLM models, with the relevant code being abstracted by #7153.

@sekh77

sekh77 commented Aug 31, 2024

@NadavShmayo - Is this plugin now available for use with latest vLLM?

@sekh77

sekh77 commented Aug 31, 2024

I definitely have a need for this feature, and I'm pretty sure many others also need it to be available in vLLM.

I don't see any need to use more GPUs than necessary to load a given model. For example, if I can load a model on exactly 5 GPUs, why would I need to allocate 8 GPUs for it?

Here's my situation and requirements:

  1. I have 3 nodes in Azure with 12 A100 80GB GPUs (4 GPUs per node) connected through an Infiniband.
  2. In my conversational AI chat application, users can dynamically switch between models in the chat screen at runtime, so they can choose one model over another depending on how it performs on their complex queries.
  3. I want to pre-load my GPUs with LLaMA3.1 70B, Mixtral8x22B, and Databricks DBRX so that my users can choose from any of these three models during chat.
  4. The application automatically calculates the model's parameter count from its config.json, then uses a formula to derive the exact number of GPUs the model will require for loading and inference (a rough sketch of this kind of sizing heuristic follows at the end of this comment).
  5. Based on this formula, LLaMA3.1 70B requires 3 GPUs, Mixtral8x22B requires 5 GPUs, and Databricks DBRX requires 4 GPUs.
  6. Ideally all three models should fit in 12 GPUs. However, with the way vLLM currently calculates parallelism, LLaMA3.1 70B will need 4 GPUs (64 attention heads are not divisible by 3 but are divisible by 4), Mixtral8x22B will need 8 GPUs, and DBRX will need 4 GPUs (no change for DBRX, because 4 matches vLLM's expectation).
  7. This puts me in a situation where I can only load two of the models, which would take up 8 GPUs and leave the remaining 4 GPUs unused. That is not a good use of compute resources, especially given that these are expensive GPUs.

I use pipeline_parallel_size = 1 and set tensor_parallel_size to the exact number of GPUs a model needs, based on the vLLM documentation for distributed inference: https://docs.vllm.ai/en/latest/serving/distributed_serving.html

So anything that can be done to move away from the current constraint of 2, 4, 8, or 16 GPUs will be highly beneficial for a lot of enterprises. This is common feedback that I hear from people using vLLM. Everything else about vLLM is absolutely great and awesome, no doubt whatsoever.
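
A rough sketch of the kind of sizing heuristic described in point 4 above. The dtype, overhead margin, and approximate total parameter counts are assumptions for illustration, not the exact formula used in that pipeline; the margin is small because vLLM carves the KV cache out of whatever memory remains within gpu_memory_utilization.

```python
# Hypothetical GPU-count estimate from a model's parameter count (config.json).
# All constants below are assumptions, not vLLM behavior.
import math

def estimate_num_gpus(num_params_billion: float,
                      bytes_per_param: int = 2,       # fp16/bf16 weights
                      overhead_factor: float = 1.05,  # assumed margin for non-weight memory
                      gpu_mem_gb: float = 80.0,       # A100 80GB
                      gpu_memory_utilization: float = 0.9) -> int:
    weight_gb = num_params_billion * bytes_per_param        # 1B params * 2 bytes ~= 2 GB
    total_gb = weight_gb * overhead_factor
    return math.ceil(total_gb / (gpu_mem_gb * gpu_memory_utilization))

print(estimate_num_gpus(70))   # LLaMA3.1 70B            -> 3
print(estimate_num_gpus(141))  # Mixtral8x22B (~141B)    -> 5
print(estimate_num_gpus(132))  # Databricks DBRX (~132B) -> 4
```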

@youkaichao
Member

@sekh77 I don't get it. You can just use pipeline_parallel_size=3 without any problem.

@sekh77

sekh77 commented Sep 2, 2024

@youkaichao - Here's my understanding of pipeline_parallel_size. In my case, if I use pipeline_parallel_size = 3 and force tensor_parallel_size to be the exact number of GPUs in a node (which is 4), the world size in vLLM becomes 4*3 = 12.

This means that when I attempt to load LLaMA3.1 70B, which actually requires only 3 GPUs, the above configuration will load it across all 12 GPUs. With gpu_memory_utilization=0.9, I then have no memory left on any of the 12 GPUs to load any other model, because vLLM allocates weights, intermediate states, KV cache, etc. on all GPUs. I'm unable to reduce gpu_memory_utilization below 0.7, as it runs into OOM due to the size of the KV cache for these models.

I'm also helping it a little bit by specifying cpu_offload_gb=10.

If this understanding is incorrect, please do let me know. I'm absolutely OK to adjust my configurations based on appropriate guidelines that you advise. My objective is to meet my requirements as I described in my previous message.

@youkaichao
Member

youkaichao commented Sep 3, 2024

if 3 GPUs are enough to hold the model, you can just use -pp 3 -tp 1

@sekh77

sekh77 commented Sep 3, 2024

Ok. TP is calculated dynamically in my inference service pipeline. Assuming I find a way to dynamically override the TP calculation from 3 to 1 for LLaMA 3.1 70B, how will this solve for Mixtral8x22B and Databricks DBRX, which require exactly 5 and 4 GPUs to hold the models, respectively?

@youkaichao
Member

In your script, you just need to change -tp to -pp, and everything should work.

use the tensor parallel size as the new pipeline parallel size

@sekh77

sekh77 commented Sep 3, 2024

Are you suggesting keeping tp = 1 always, and setting pp to the calculated number of GPUs for each model?

@sekh77

sekh77 commented Sep 3, 2024

So for LLaMA3.1 70B, tp = 1, pp = 3
for Mixtral8x22B, tp = 1, pp = 5
for Databricks DBRX, tp = 1, pp = 4

Is that what you are suggesting?

@youkaichao
Member

yes
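
Concretely, each model would then get its own OpenAI-compatible server with tensor_parallel_size=1 and a per-model pipeline_parallel_size. The sketch below is only illustrative: the Hugging Face model IDs, ports, and launch-from-Python style are assumptions, and multi-node placement (e.g. a Ray cluster spanning the three nodes, as described in the distributed-serving docs linked above) is assumed to be set up separately.

```python
# Hedged sketch: one vLLM OpenAI-compatible server per model, tp=1, pp per model.
# Model IDs and ports are assumptions; adjust to the actual checkpoints in use.
import subprocess

deployments = [
    ("meta-llama/Meta-Llama-3.1-70B-Instruct", 3, 8000),
    ("mistralai/Mixtral-8x22B-Instruct-v0.1", 5, 8001),
    ("databricks/dbrx-instruct", 4, 8002),
]

for model, pp, port in deployments:
    subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", "1",
        "--pipeline-parallel-size", str(pp),
        "--gpu-memory-utilization", "0.9",
        "--port", str(port),
    ])
```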

@sekh77

sekh77 commented Sep 3, 2024

Ok, got it. Let me try this. Will let you know. Thank you.

@sekh77

sekh77 commented Sep 3, 2024

@youkaichao - It worked as I expected. Thank you very much for suggesting that route. I have an additional question, though: since TP is now 1, I think there is no tensor parallelism anymore in my case. Am I losing anything with respect to inference throughput? Right now with PP, the latency is in milliseconds with these models on InfiniBand connectivity, but I'm not sure what happens when I scale concurrency.

@youkaichao
Member

Am I losing anything with respect to inference throughput?

With pipeline parallelism, you lose some latency, but the throughput should be the same. Be sure to submit enough requests to saturate the pipeline.
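
For reference, a minimal client-side sketch of what saturating the pipeline can look like: many requests kept in flight against the OpenAI-compatible endpoint rather than sent one at a time. The base URL, model name, and request count are assumptions carried over from the earlier sketch.

```python
# Hedged sketch: keep many concurrent requests in flight so every pipeline
# stage has work to do. Assumes a vLLM OpenAI-compatible server on port 8000.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # assumed model ID
        messages=[{"role": "user", "content": f"Request {i}: explain pipeline parallelism briefly."}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    # With pp=3, far more than 3 requests should be in flight at once.
    results = await asyncio.gather(*(one_request(i) for i in range(64)))
    print(f"{len(results)} completions received")

asyncio.run(main())
```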
