[Optimization] Avoid repeated model architecture conversion for pooling models #25261
Conversation
…ng models Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Code Review
This pull request introduces a cache for get_model_architecture to avoid repeated expensive conversions, which is a good optimization. My main feedback is regarding thread safety. The global cache is modified without a lock, which can lead to race conditions and redundant computations in a multi-threaded environment. I've suggested adding a lock using a double-checked locking pattern to make the caching mechanism thread-safe and robust.
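The double-checked locking pattern suggested by the review can be sketched as follows. This is a simplified illustration, not the actual vLLM code; the cache, lock, and function names are hypothetical.

```python
import threading
from typing import Any, Callable

# Hypothetical module-level cache and lock; the names are
# illustrative, not the identifiers used in the PR.
_model_arch_cache: dict[Any, Any] = {}
_model_arch_lock = threading.Lock()

def cached_get_model_architecture(key: Any, compute: Callable[[Any], Any]) -> Any:
    # First check without the lock: fast path for the common case
    # where the value is already cached.
    result = _model_arch_cache.get(key)
    if result is not None:
        return result
    with _model_arch_lock:
        # Second check under the lock: another thread may have
        # populated the cache while we were waiting for the lock.
        result = _model_arch_cache.get(key)
        if result is None:
            result = compute(key)
            _model_arch_cache[key] = result
    return result
```

The first unlocked check keeps the hot path cheap; the second check under the lock prevents two threads from both paying for the expensive conversion.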
Yeah, I noticed too that it was being called repeatedly. LGTM, just a quick question though: would this work with `functools.cache`?
No, since …
But if ModelConfig already has a …
That is true, but …
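The specific objection to `functools.cache` is truncated above, but one general reason it can fail for a call like this is that `functools.cache` requires all arguments to be hashable, and config objects often are not. A small illustration (the `DummyConfig` class is a stand-in, not vLLM's `ModelConfig`):

```python
from dataclasses import dataclass
from functools import cache

# Illustrative stand-in for a config object; not vLLM's ModelConfig.
# Dataclasses with the default eq=True are unhashable unless
# frozen=True or eq=False is set.
@dataclass
class DummyConfig:
    model: str

@cache
def get_arch(config: DummyConfig) -> str:
    return config.model.upper()

try:
    get_arch(DummyConfig("llama"))
except TypeError as exc:
    print(f"functools.cache rejects unhashable args: {exc}")
```

A hand-rolled cache keyed on a hashable subset of the config's fields sidesteps this limitation.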
Hmm... looks like this PR uncovered some hidden problems about EAGLE config validity.
Any idea? @wwl2755 @WoosukKwon
I think it is because … So it makes a problem when … An easy fix is to delete the assertion and make a dummy …
Ok, I just realized that …
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Great catch! Solves the problem once and for all.
…ng models (vllm-project#25261) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Purpose
When running multi-modal pooling models, I found that almost 3% of the time was taken by the `get_model_architecture` call inside `create_processor`. Upon inspection, `get_model_architecture` converts the model class into a pooling model each time it is called, which is quite expensive (it occurs for every single request), so I have decided to cache its output.

cc @maxdebayser @noooop
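The caching described above might look roughly like the following. This is a simplified sketch, not the actual vLLM implementation; the cache key fields and helper names are assumptions.

```python
# Simplified illustration of caching an expensive per-request
# model-class conversion; names and key fields are assumptions,
# not the actual vLLM code.
_arch_cache: dict[tuple, tuple[type, str]] = {}

def expensive_convert(architectures: tuple[str, ...],
                      runner_type: str) -> tuple[type, str]:
    # Stand-in for resolving the model class and converting it
    # into a pooling model (the expensive step in the PR).
    class Model:
        pass
    return Model, architectures[0]

def get_model_architecture_cached(architectures: tuple[str, ...],
                                  runner_type: str) -> tuple[type, str]:
    # Key on the hashable config fields that determine the result,
    # so repeated requests reuse the converted class.
    key = (architectures, runner_type)
    if key not in _arch_cache:
        _arch_cache[key] = expensive_convert(architectures, runner_type)
    return _arch_cache[key]
```

With this in place, the conversion cost is paid once per distinct configuration rather than once per request.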
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.