[RFC]: Model architecture plugins #7124

Open
NadavShmayo opened this issue Aug 4, 2024 · 14 comments

@NadavShmayo
Contributor

Motivation.

As a continuation of #5367: since that merge request was rejected and I now have to maintain my own fork to support this scenario, I suggest adding support for model architecture plugins to vLLM.
This would allow new model architectures to be added without touching vLLM's core logic, and would support scenarios such as uneven GPU tensor parallelism.

We could build an ecosystem of model architecture plugins, which could significantly accelerate new model support without risking existing functionality.

Proposed Change.

Supporting this in its basic form is simple: we just have to add loaded plugins to the ModelRegistry (see the sketch below).
To support more complex model architectures (such as the #5367 case), we should decouple the Config class, which provides the number of attention heads, from vLLM's core logic, and allow each model architecture to override these values.
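
As a rough illustration of what the basic form could look like, a plugin package might expose a registration hook that calls the existing ModelRegistry.register_model API. This is only a sketch: the entry-point group name, package name, and model class below are hypothetical, and the actual discovery mechanism is exactly what this RFC would need to define.

```python
# my_vllm_plugin/__init__.py -- a minimal sketch, not a finished design.
# ModelRegistry.register_model is vLLM's existing hook for out-of-tree models;
# the entry-point group "vllm.model_plugins", the package my_vllm_plugin, and
# MyUnevenTPForCausalLM are hypothetical names used only for illustration.

def register() -> None:
    from vllm import ModelRegistry
    from my_vllm_plugin.modeling import MyUnevenTPForCausalLM  # hypothetical model class

    # Map the architecture string from config.json to the plugin's implementation.
    ModelRegistry.register_model("MyUnevenTPForCausalLM", MyUnevenTPForCausalLM)

# The plugin would advertise the hook in its pyproject.toml, e.g.:
# [project.entry-points."vllm.model_plugins"]
# my_plugin = "my_vllm_plugin:register"
```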

Feedback Period.

No response

CC List.

@youkaichao

Any Other Things.

Just to make it clear: I'll be happy to implement this, but I want to hear some feedback before I go ahead and implement it.

@DarkLight1337
Member

Potentially related: #7067 introduces an easy way to compose vLLM models, with the relevant code being abstracted by #7153.

@sekh77

sekh77 commented Aug 31, 2024

@NadavShmayo - Is this plugin now available for use with latest vLLM?

@sekh77

sekh77 commented Aug 31, 2024

I definitely have a need for this feature, and I'm pretty sure many others also need it to be available in vLLM.

I don't see any need to use more GPUs than necessary to load a given model. For example, if I can load a model on exactly 5 GPUs, why would I need to allocate 8 GPUs for it?

Here's my situation and requirements:

  1. I have 3 nodes in Azure with 12 A100 80GB GPUs (4 GPUs per node) connected through an Infiniband.
  2. In my conversational AI chat application, users can dynamically switch between models in the chat screen at runtime, so they can choose one model over another depending on how it performs on their complex queries.
  3. I want to pre-load my GPUs with LLaMA3.1 70B, Mixtral8x22B, and Databricks DBRX so that my users can choose from any of these three models during chat.
  4. The application automatically calculates the model's parameter count from its config.json, then uses a formula to derive the exact number of GPUs the model will require for loading and inference (a rough sketch of this kind of sizing heuristic follows at the end of this comment).
  5. Based on this formula, LLaMA3.1 70B requires 3 GPUs, Mixtral8x22B requires 5 GPUs, and Databricks DBRX requires 4 GPUs.
  6. Ideally all three models should fit in 12 GPUs. However, with the way vLLM currently calculates parallelism, LLaMA3.1 70B will need 4 GPUs (64 attention heads are not divisible by 3 but are divisible by 4), Mixtral8x22B will need 8 GPUs, and DBRX will need 4 GPUs (no change for DBRX, because 4 matches vLLM's expectation).
  7. This puts me in a situation where I can only load two of the models, which would take up 8 GPUs and leave the remaining 4 GPUs unused. That is not a good use of compute resources, especially given that these are expensive GPUs.

I use pipeline_parallel_size = 1 and set tensor_parallel_size to the exact number of GPUs a model needs, based on the vLLM documentation for distributed inference: https://docs.vllm.ai/en/latest/serving/distributed_serving.html

So anything that can be done to move away from the current constraint of 2, 4, 8, or 16 GPUs will be highly beneficial for a lot of enterprises. This is common feedback that I hear from people using vLLM. Everything else about vLLM is absolutely great and awesome, no doubt whatsoever.
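
A rough sketch of the kind of sizing heuristic described in point 4 above. The dtype, overhead margin, and approximate total parameter counts are assumptions for illustration, not the exact formula used in that pipeline; the margin is small because vLLM carves the KV cache out of whatever memory remains within gpu_memory_utilization.

```python
# Hypothetical GPU-count estimate from a model's parameter count (config.json).
# All constants below are assumptions, not vLLM behavior.
import math

def estimate_num_gpus(num_params_billion: float,
                      bytes_per_param: int = 2,       # fp16/bf16 weights
                      overhead_factor: float = 1.05,  # assumed margin for non-weight memory
                      gpu_mem_gb: float = 80.0,       # A100 80GB
                      gpu_memory_utilization: float = 0.9) -> int:
    weight_gb = num_params_billion * bytes_per_param        # 1B params * 2 bytes ~= 2 GB
    total_gb = weight_gb * overhead_factor
    return math.ceil(total_gb / (gpu_mem_gb * gpu_memory_utilization))

print(estimate_num_gpus(70))   # LLaMA3.1 70B            -> 3
print(estimate_num_gpus(141))  # Mixtral8x22B (~141B)    -> 5
print(estimate_num_gpus(132))  # Databricks DBRX (~132B) -> 4
```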

@youkaichao
Member

@sekh77 I don't get it. You can just use pipeline_parallel_size=3 without any problem.

@sekh77

sekh77 commented Sep 2, 2024

@youkaichao - Here's my understanding of pipeline_parallel_size. In my case, if I use pipeline_parallel_size = 3 and force tensor_parallel_size to be the exact number of GPUs in a node (which is 4), the world size in vLLM becomes 4*3 = 12.

This means that when I attempt to load LLaMA3.1 70B, which actually requires only 3 GPUs, the above configuration will load it across all 12 GPUs. With gpu_memory_utilization=0.9, I then have no memory left on any of the 12 GPUs to load any other model, because vLLM allocates weights, intermediate states, KV cache, etc. on all GPUs. I'm unable to reduce gpu_memory_utilization below 0.7, as it runs into OOM due to the size of the KV cache for these models.

I'm also helping it a little bit by specifying cpu_offload_gb=10.

If this understanding is incorrect, please do let me know. I'm absolutely OK to adjust my configurations based on appropriate guidelines that you advise. My objective is to meet my requirements as I described in my previous message.

@youkaichao
Member

youkaichao commented Sep 3, 2024

if 3 GPUs are enough to hold the model, you can just use -pp 3 -tp 1

@sekh77

sekh77 commented Sep 3, 2024

Ok. TP is calculated dynamically in my inference service pipeline. Assuming I find a way to dynamically override the TP calculation from 3 to 1 for LLaMA 3.1 70B, how will this solve for Mixtral8x22B and Databricks DBRX, which require exactly 5 and 4 GPUs to hold the models, respectively?

@youkaichao
Member

In your script, you just need to change -tp to -pp, and everything should work.

use the tensor parallel size as the new pipeline parallel size

@sekh77

sekh77 commented Sep 3, 2024

Are you suggesting keeping tp = 1 always, and setting pp to the calculated number of GPUs for each model?

@sekh77

sekh77 commented Sep 3, 2024

So for LLaMA3.1 70B, tp = 1, pp = 3
for Mixtral8x22B, tp = 1, pp = 5
for Databricks DBRX, tp = 1, pp = 4

Is that what you are suggesting?

@youkaichao
Member

yes
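
Concretely, each model would then get its own OpenAI-compatible server with tensor_parallel_size=1 and a per-model pipeline_parallel_size. The sketch below is only illustrative: the Hugging Face model IDs, ports, and launch-from-Python style are assumptions, and multi-node placement (e.g. a Ray cluster spanning the three nodes, as described in the distributed-serving docs linked above) is assumed to be set up separately.

```python
# Hedged sketch: one vLLM OpenAI-compatible server per model, tp=1, pp per model.
# Model IDs and ports are assumptions; adjust to the actual checkpoints in use.
import subprocess

deployments = [
    ("meta-llama/Meta-Llama-3.1-70B-Instruct", 3, 8000),
    ("mistralai/Mixtral-8x22B-Instruct-v0.1", 5, 8001),
    ("databricks/dbrx-instruct", 4, 8002),
]

for model, pp, port in deployments:
    subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", "1",
        "--pipeline-parallel-size", str(pp),
        "--gpu-memory-utilization", "0.9",
        "--port", str(port),
    ])
```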

@sekh77

sekh77 commented Sep 3, 2024

Ok, got it. Let me try this. Will let you know. Thank you.

@sekh77

sekh77 commented Sep 3, 2024

@youkaichao - It worked as I expected. Thank you very much for suggesting that route. I have an additional question, though: since TP is now 1, I think there is no tensor parallelism anymore in my case. Am I losing anything with respect to inference throughput? Right now with PP, the latency is in milliseconds with these models on InfiniBand connectivity, but I'm not sure what happens when I scale concurrency.

@youkaichao
Member

Am I losing anything with respect to inference throughput?

With pipeline parallelism, you lose some latency, but the throughput should be the same. Be sure to submit enough requests to saturate the pipeline.
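
For reference, a minimal client-side sketch of what saturating the pipeline can look like: many requests kept in flight against the OpenAI-compatible endpoint rather than sent one at a time. The base URL, model name, and request count are assumptions carried over from the earlier sketch.

```python
# Hedged sketch: keep many concurrent requests in flight so every pipeline
# stage has work to do. Assumes a vLLM OpenAI-compatible server on port 8000.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # assumed model ID
        messages=[{"role": "user", "content": f"Request {i}: explain pipeline parallelism briefly."}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    # With pp=3, far more than 3 requests should be in flight at once.
    results = await asyncio.gather(*(one_request(i) for i in range(64)))
    print(f"{len(results)} completions received")

asyncio.run(main())
```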
