[Model] Support SigLIP encoder and alternative decoders for LLaVA models #7153
Conversation
I tried this method 3 hours ago, but I got the following error:
Can you show the input to your model? Also note that we don't support multiple image inputs per prompt for now, so you can only test with a single image input.
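For reference, a single-image request looks roughly like this (a minimal sketch; the checkpoint name, prompt template, and image path are illustrative, and the multimodal input format shown is the one from vLLM builds around this time):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Illustrative checkpoint and image path.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")

# One image per prompt; the <image> placeholder marks where the image features go.
outputs = llm.generate(
    {
        "prompt": "USER: <image>\nWhat is shown in this image? ASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```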
OK, I have found an issue with the image placeholder calculation; it is related to
Fixed, can you try again?
I will try it and maybe reply tonight or tomorrow.
(Force-pushed from d20b0a7 to 0a2eb73.)
Can you try this? My conda env has a problem:
…test code for Mantis model
You should delete any compiled binaries from your local vLLM repo, and reinstall vLLM. Alternatively, delete the repo and clone a fresh one.
I have tested your example and the model can run in vLLM, but the output is gibberish (probably something wrong with the input processor). Since this PR isn't aimed at fully implementing Mantis (but rather to enable it), we can create another PR after this one to resolve those issues. LLM output:
I know the mistake. I need to build on CUDA; your branch is based on AMD. LOL
Thanks for your help!!!
Yep, because Mantis makes some modifications to `LlavaForConditionalGeneration` with respect to the tokens, so Mantis needs its own class.
Overall LGTM but I left a few comments - PTAL!
Maybe we should use some function to remove the unexpected tags.
Reference: https://github.com/open-compass/VLMEvalKit/blob/main/vlmeval/vlm/mantis.py
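A rough sketch of that kind of post-processing (the tag list below is purely illustrative and not taken from Mantis or VLMEvalKit):

```python
import re

# Hypothetical helper: the exact tags to strip depend on the model's chat
# template; the list below is illustrative only.
UNEXPECTED_TAGS = [r"<\|im_end\|>", r"</s>", r"<image>"]

def strip_unexpected_tags(text: str) -> str:
    """Remove stray special tokens that leak into the generated text."""
    for tag in UNEXPECTED_TAGS:
        text = re.sub(tag, "", text)
    return text.strip()

print(strip_unexpected_tags("A cat on a mat.<|im_end|>"))  # -> A cat on a mat.
```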
Based on #7067, this PR partially addresses #7143 by enabling SigLIP to be used in LLaVA models. The same code also enables them to load other LLM decoders.
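As a rough illustration of the idea (not the PR's actual code, and vLLM uses its own CLIP/SigLIP implementations rather than the Transformers classes shown here), the vision tower can be picked based on the type of `vision_config` in the HF config; the helper name below is hypothetical:

```python
from transformers import (CLIPVisionConfig, CLIPVisionModel,
                          SiglipVisionConfig, SiglipVisionModel)

def init_vision_tower(hf_config):
    """Pick the vision tower implementation from the HF config (sketch only)."""
    vision_config = hf_config.vision_config
    if isinstance(vision_config, CLIPVisionConfig):
        return CLIPVisionModel(vision_config)    # existing CLIP path
    if isinstance(vision_config, SiglipVisionConfig):
        return SiglipVisionModel(vision_config)  # SigLIP path enabled by this PR
    raise NotImplementedError(
        f"Unsupported vision tower: {type(vision_config).__name__}")
```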
This PR also fixes two hidden bugs in the calculation of the number of placeholder tokens, one of them related to `vision_feature_select_strategy="default"`. These two miscalculations cancelled each other out prior to this PR. They were discovered when trying to support SigLIP in #7149, because the SigLIP encoder, unlike the CLIP encoder, does not output a CLS token, resulting in an off-by-one error when filling in placeholders.
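A back-of-the-envelope sketch of the placeholder-count arithmetic (assumed formulas based on the HF vision configs, not the PR's actual code):

```python
def num_image_tokens(image_size: int, patch_size: int, has_cls_token: bool,
                     strategy: str = "default") -> int:
    n = (image_size // patch_size) ** 2  # patch tokens
    if has_cls_token:
        n += 1                           # CLIP prepends a CLS token
    if strategy == "default" and has_cls_token:
        n -= 1                           # "default" drops the CLS feature
    return n

# CLIP ViT-L/14-336: 24 * 24 = 576 patches + CLS = 577, minus CLS -> 576
print(num_image_tokens(336, 14, has_cls_token=True))   # 576
# SigLIP so400m/14-384: 27 * 27 = 729 patches and no CLS to drop -> 729;
# naively subtracting one here would give the off-by-one error described above.
print(num_image_tokens(384, 14, has_cls_token=False))  # 729
```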
FIX #7149