Enhancement: We need CogVLM support - extremely good image and text analysis, feels like a multi-generational step forward. #4387
Comments
I have known about that model for some time and also mentioned it here, but NO ONE CARES ... why, I do not know .... |
+1 CogVLM is the best open source vision model currently available. Having a super powerful multi-modal LLM that's easy to run locally is a game changer. I know that Ollama is looking to add CogVLM support, but they need llama.cpp to support it first. |
+1 CogVLM/CogAgent is amazing at mobile UI detection and UI object detection. |
It's better than GPT-4V. |
+1 we need it! asap |
MobileVLM might be even better |
There is no demo of it online. It uses the same vision encoder and a similar projection as llava, just a tiny LLM instead of the 7B Vicuna. I didn't test it on llama.cpp, but my guess is that it requires minimal changes to get the language model supported - the projection has small changes as well (normalization). I'm not saying it is not what you claim - just from what I've seen at first glance I find it highly unlikely. It would be a huge development in showcasing what the small CLIP can do when everyone else has not been able to do the same. I believe MobileVLM is worthy of support: it's tiny and appears to be a little bit worse than llava-1.5, but of course much faster. |
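To make the "similar projection, small changes (normalization)" point concrete, here is a rough, hypothetical PyTorch sketch of a llava-1.5-style MLP projector next to the same projector with normalization added. This is not the actual MobileVLM projector, just an illustration of how small the delta on the projection side could be:

```python
# Hypothetical sketch: llava-1.5 uses a two-layer MLP (GELU in between) to map
# vision-encoder features into the LLM embedding space; a "normalization"
# tweak would be an extra LayerNorm on top. Not the exact MobileVLM design.
import torch.nn as nn

def llava_style_projector(vision_dim: int, llm_dim: int) -> nn.Module:
    # llava-1.5: Linear -> GELU -> Linear
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

def projector_with_norm(vision_dim: int, llm_dim: int) -> nn.Module:
    # Same projector plus a LayerNorm, i.e. the kind of small change described above.
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
        nn.LayerNorm(llm_dim),
    )
```

If the delta really is this small, most of the porting effort would be on the conversion/projector side rather than on the language model.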
CogVLM is far better than llava - llava already works for most people - so please let's stick with CogVLM if anyone embarks on that. It takes about 80 GB of VRAM here in fp16, and bitsandbytes isn't cutting it. |
pong - how does that not get any traction ?! |
I started looking into it, but I have a lot on my schedule currently. |
@darkacorn, I'd like to test it - what's your branch called? |
https://github.com/THUDM/CogVLM - no branch, and I'm not affiliated with them either |
understandable - I talked to turboderp (exllama) and casper (AutoAWQ) too .. apparently it's quite a bit of work to get a quant / inference going outside of the regular transformer arch |
Yep, that's also my feeling. To make the deep feature fusion work you have to provide an additional mask as input. That's quite different from the usual stuff. |
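For context on the "additional mask": CogVLM adds an extra set of weights in each transformer layer (the "visual expert"), and a per-token type mask decides whether a token is processed by the original language-model weights or by the vision-specific ones. A minimal, hypothetical PyTorch sketch of that routing (names and shapes are mine, not the upstream code):

```python
# Hypothetical sketch of the "additional mask" idea: route image tokens and
# text tokens through different projection weights inside each layer.
import torch
import torch.nn as nn

class VisualExpertLinear(nn.Module):
    """Apply separate weights to image and text tokens, selected by a
    per-token type mask (1 = image token, 0 = text token)."""

    def __init__(self, dim: int):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)   # original LLM weights
        self.image_proj = nn.Linear(dim, dim)  # extra "visual expert" weights

    def forward(self, x: torch.Tensor, token_type_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); token_type_mask: (batch, seq) of 0/1
        mask = token_type_mask.unsqueeze(-1).to(x.dtype)
        return mask * self.image_proj(x) + (1.0 - mask) * self.text_proj(x)
```

That per-token routing, plus the extra weights per layer, is roughly why this doesn't drop into the existing llava path, which injects image embeddings and then runs a plain transformer graph.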
I'd love to see Cogvlm support as well. |
You are all (including me) welcome to contribute either code or money for a coffee for these hard-working individuals making this stuff work for us. |
I'm also waiting for it .... |
we all wait - we need more eyeballs on this "feature request" .. sadly most people don't seem to care enough about vision yet. According to turbo (exllama), getting a rough version done is about 50h of work initially, and then of course the upkeep of it - but given the little demand it has, it seems to be a wasted effort. we really just need 1 quant .. and then we can adapt it pretty quickly to everything else |
What do you mean it seems to be a wasted effort? |
"given the little demand it has - it seems to be a wasted effort" - I don't know how I could be clearer in that statement. If more people were interested in vision this would move faster .. but apparently most just focus on regular LLMs / multimodality sadly does not have a huge demand |
Alright, I wasn't trying to diminish your point, but thanks for explaining it - I did not realize that. |
trust me, I would love it to be quanted too - it would make my life easier .. 36 GB per fp16 model, and you eventually want all 3 in VRAM, which just ties up my resources. I would love to have it smaller and faster - but if a few experts don't chip in and start .. it's just not the most rewarding work for them, as very few people want it, even though CogVLM is the best vision model we've got. let's see, maybe they chip in and get the ball rolling |
I also don't understand why there is so little interest in CogVLM, because it is far better than llava, which is still in development.... |
After working on it for a bit I found that it is not trivial to convert it to llama.cpp. The implementation of EVA-CLIP is different from the OpenAI CLIP model. There are some subtleties I'm trying to wrap my head around. So progress is relatively slow but interest is there... |
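One generic way to pin down where EVA-CLIP diverges from OpenAI CLIP is to diff the two checkpoints' tensor names and shapes. A small sketch (paths are placeholders; both files are assumed to be plain PyTorch state dicts):

```python
# Diff two vision-tower checkpoints to see which tensors only exist in one
# of them, and which share a name but differ in shape.
import torch

def diff_state_dicts(path_a: str, path_b: str) -> None:
    a = torch.load(path_a, map_location="cpu")
    b = torch.load(path_b, map_location="cpu")
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    shape_mismatch = [k for k in sorted(set(a) & set(b)) if a[k].shape != b[k].shape]
    print("tensors only in A:", only_a[:20])
    print("tensors only in B:", only_b[:20])
    print("same name, different shape:", shape_mismatch[:20])

# Example (hypothetical filenames):
# diff_state_dicts("eva_clip_vit.pth", "openai_clip_vit.pth")
```

The differences surface as extra or renamed tensors and changed shapes, which is exactly the kind of subtlety that makes the GGUF conversion non-trivial.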
@dtiarks if you are up for it, hop on Discord - we are all on TheBloke AI's Discord (the link should be on any Hugging Face repo he has; I don't want to spam it here). Thanks for narrowing the problem set down at least a bit - I'm sure turboderp / casper can help narrow those "subtleties" down even further |
This would be a game changer, since CogVLM is so much better than llava. Using llava after seeing what CogVLM can do feels like asking llama 7B for code after using gpt 4. |
I personally have changed my mind: CogVLM is a huge thing - no one really wanted to invest the work to integrate it. A good part of the work is done, though I have less time for a while and the LoRA integration is not done yet |
If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVA 1.6 working and merged. If interested, contact me at claude @ the domain in my profile. |
Yes, but it seems like the author can't work on it and/or has other priorities |
For me it's a small side project; I have dozens of large (commercial) projects I am working on. A bounty is intriguing - ironically, I once tried the same on Fiverr to advance this project, and not one of the "AI developers" there was actually able to contribute anything. So I've not given up, I just have slow progress atm. Also happy to add a collaborator to my PR branch, of course. |
One pesky bug remains, but it's working quite well already, especially the large model. You'll need to re-create the projector GGUF files; you can keep the LLM GGUF files. You'll notice llava-1.6 is working when it needs a ton of embedding tokens. |
Now that LLaVA 1.6 has been added, is there no longer much interest in adding CogVLM? |
@mjspeck I started implementing it and got pretty far. However, I got stuck at a point where I need some input from experts like @ggerganov. There is a branch at https://github.com/dtiarks/llama.cpp/tree/cog-vlm |
I had put my attention on the dynamic-LoRA-expert approach InternLM implemented (XComposer2), which shows very similar results and spatial awareness to CogVLM but is probably an order of magnitude faster. CogVLM is still interesting imho, I'm just doubting the long-term potential given that much smaller networks show similar powers. Maybe I'm mistaken though; I have a limited view of the differences in their output. |
The performance of CogAgent is what's most interesting. Not sure if LLaVA 1.6 has been tested on similar problems, or if XComposer2 has either. |
Just wanna say we still would have a lot of interest in using CogVLM on llama.cpp |
+1 |
There's v2 now: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B |
wow |
NotImplementedError: Architecture 'CogVLMForCausalLM' not supported! |
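That error comes from llama.cpp's HF-to-GGUF converter, which only handles architectures that have been explicitly registered. Adding CogVLM would mean (among much else) registering a converter class for `CogVLMForCausalLM`, roughly along these lines. This is a hypothetical sketch: it follows the converter's general registration pattern, but the details vary by version and the real work (mapping the visual-expert and EVA-CLIP tensors) is not done here:

```python
# Hypothetical sketch only; the import path and base-class details are
# assumptions about llama.cpp's convert-hf-to-gguf.py, not a working patch.
import gguf
from convert_hf_to_gguf import Model  # assumed import; lives in the llama.cpp repo

@Model.register("CogVLMForCausalLM")
class CogVLMModel(Model):
    model_arch = gguf.MODEL_ARCH.LLAMA  # placeholder: the language half is llama/vicuna-like

    def modify_tensors(self, data_torch, name, bid):
        # CogVLM carries a second set of attention/MLP weights per layer for the
        # visual expert; each of those tensor names needs an explicit GGUF mapping,
        # and the EVA-CLIP vision tower needs its own conversion path.
        raise NotImplementedError("visual-expert tensor mapping goes here")
```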
+1 |
I think that llava-1.6 is the better one; it is heavyweight compared to 1.5 but lighter than cog, and with batching optimization it could be almost as fast as llava 1.5. Batching would not be difficult to add into clip.cpp! It's basically ready for it, it just needs some tuning. One big step missing from our llava 1.6 implementation is the line-based tensor manipulation. The llama.cpp llava 1.6 implementation uses the simpler variant of llava 1.6: because of the lack of 5D tensors I was not able to get that properly implemented, so I had to take a shortcut. That shortcut is noticeable when it comes to OCR, for example. Someone who is very good with ggml tensors (better than me) could add the line-based manipulation into llava 1.6. Then we could add batching into CLIP to run all llava-1.6 image batches at once instead of sequentially, and we'd have a very high quality result - surpassing cogvlm imho. At much less work than implementing the whole cog architecture. |
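To make the batching point concrete, here is a rough PyTorch sketch (not clip.cpp code) of the llava-1.6 "grid of crops" idea and why one batched vision-tower call beats encoding crops sequentially. The `vision_encoder` callable and the 336-pixel tile size are assumptions for illustration:

```python
# Split a padded image into fixed-size tiles and encode all of them in a
# single batched forward pass instead of a Python loop over crops.
import torch

def encode_tiles(image: torch.Tensor, vision_encoder, tile: int = 336) -> torch.Tensor:
    # image: (3, H, W) with H and W already padded to multiples of `tile`
    c, h, w = image.shape
    tiles = (image
             .unfold(1, tile, tile)    # (3, H/tile, W, tile)
             .unfold(2, tile, tile)    # (3, H/tile, W/tile, tile, tile)
             .permute(1, 2, 0, 3, 4)   # (H/tile, W/tile, 3, tile, tile)
             .reshape(-1, c, tile, tile))
    # One batched pass over all crops instead of num_tiles sequential passes:
    return vision_encoder(tiles)       # e.g. (num_tiles, num_patches, dim)
```

In clip.cpp this would correspond to building the graph with a batch dimension over the crops rather than looping, which is the batching referred to above.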
in fact Ollama supports it, see https://ollama.com/library/llava:13b-v1.6-vicuna-q5_K_M |
Doesn't Ollama use llama.cpp? |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Still not supported. |
1 similar comment
Still not supported. |
Discussed in #4350
Originally posted by cmp-nct December 7, 2023
I've just seen CogVLM, which is a Vicuna 7B language model behind a 9B vision tower (laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) under an open-source license.
I've compared it with llava-1.5 (not even comparable) and Qwen-VL, and it beats Qwen-VL by a margin in OCR abilities and detection of details, with no or almost no hallucinations.
It understands handwritten as well as typed letters, context, fine details, background graphics.
It can also locate tiny visual targets with pixel coordinates.
I'm quite blown away that I didn't know about it before..
I believe that this is what we need. It has similarities to llava but adds an additional expert model, so it's not super quick to implement.
In addition, the ViT needs K-type quantization support.
Definitely worth a close look
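Regarding the "pixel coordinates" point above: as far as I recall, the upstream CogVLM grounding demo answers with boxes like `[[x0,y0,x1,y1]]` normalized to a 000-999 grid - treat that format as an assumption rather than a confirmed spec. A small sketch of parsing such an answer and scaling it to real pixels:

```python
# Parse hypothetical CogVLM-style grounding boxes ("[[x0,y0,x1,y1]]",
# coordinates normalized to 0-999) and scale them to image pixels.
import re

def parse_boxes(answer: str, width: int, height: int) -> list[tuple[int, int, int, int]]:
    boxes = []
    for x0, y0, x1, y1 in re.findall(r"\[\[(\d{3}),(\d{3}),(\d{3}),(\d{3})\]\]", answer):
        boxes.append((int(x0) * width // 1000, int(y0) * height // 1000,
                      int(x1) * width // 1000, int(y1) * height // 1000))
    return boxes

# Example with made-up coordinates:
# parse_boxes("The cat is at [[102,334,407,912]].", width=1280, height=960)
```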
URL: https://github.com/THUDM/CogVLM
Webdemo: http://36.103.203.44:7861/
Paper: https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf
Look at this example: I asked for a JSON representation (not cherry-picked) - it can actually extract all of the content with minimal errors. [Screenshots of the CogVLM output, what Qwen-VL does, and llava-1.5-13B followed here.]
I've not yet looked into architectural challenges, but this is literally a game changer..
That's seriously good OCR, and its image detection abilities are beyond anything I've seen from llava 1.5 / ShareGPT4V
@monatis @FSSRepo