Enhancement: We need CogVLM support - extremely good image and text analysis, feels like a multi-generational step forward. #4387
Comments
I have known about that model for some time and also mentioned it here, but NO ONE CARES ... why, I do not know .... |
+1 CogVLM is the best open source vision model currently available. Having a super powerful multi-modal LLM that's easy to run locally is a game changer. I know that Ollama is looking to add CogVLM support, but they need llama.cpp to support it first. |
+1 CogVLM/CogAgent is amazing at mobile UI detection and UI object detection. |
It's better than GPT-4V. |
+1 we need it! asap |
MobileVLM might be even better |
There is no demo of it online. It uses the same vision encoder and a similar projection as llava, just a tiny LLM instead of the 7B Vicuna. I didn't test it on llama.cpp, but my guess is that it requires minimal changes to get the language model supported - the projection has small changes as well (normalization). I'm not saying it is not what you claim - just from what I've seen at first glance I find it highly unlikely. It would be a huge development in showcasing what the small CLIP can do when everyone else has not been able to do the same. I believe MobileVLM is worthy of support: it's tiny and appears to be a little bit worse than llava-1.5, but of course much faster. |
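To make the "similar projection, small changes (normalization)" point concrete, here is a rough, hypothetical PyTorch sketch of a llava-1.5-style MLP projector next to the same projector with normalization added. This is not the actual MobileVLM projector, just an illustration of how small the delta on the projection side could be:

```python
# Hypothetical sketch: llava-1.5 uses a two-layer MLP (GELU in between) to map
# vision-encoder features into the LLM embedding space; a "normalization"
# tweak would be an extra LayerNorm on top. Not the exact MobileVLM design.
import torch.nn as nn

def llava_style_projector(vision_dim: int, llm_dim: int) -> nn.Module:
    # llava-1.5: Linear -> GELU -> Linear
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

def projector_with_norm(vision_dim: int, llm_dim: int) -> nn.Module:
    # Same projector plus a LayerNorm, i.e. the kind of small change described above.
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
        nn.LayerNorm(llm_dim),
    )
```

If the delta really is this small, most of the porting effort would be on the conversion/projector side rather than on the language model.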
CogVLM is far better than llava - llava already works for most people - so please let's stick with CogVLM if anyone embarks on that. It takes about 80 GB of VRAM here in fp16, and bitsandbytes isn't cutting it. |
pong - how does that not get any traction ?! |
I started looking into it, but I have a lot on my schedule currently. |
@darkacorn, I'd like to test it - what's your branch called? |
https://github.com/THUDM/CogVLM - no branch, and I'm not affiliated with them either |
understandable - I talked to turboderp (exllama) and casper (AutoAWQ) too .. apparently it's quite a bit of work to get a quant / inference going outside of the regular transformer arch |
Yep, that's also my feeling. To make the deep feature fusion work you have to provide an additional mask as input. That's quite different from the usual stuff. |
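For context on the "additional mask": CogVLM adds an extra set of weights in each transformer layer (the "visual expert"), and a per-token type mask decides whether a token is processed by the original language-model weights or by the vision-specific ones. A minimal, hypothetical PyTorch sketch of that routing (names and shapes are mine, not the upstream code):

```python
# Hypothetical sketch of the "additional mask" idea: route image tokens and
# text tokens through different projection weights inside each layer.
import torch
import torch.nn as nn

class VisualExpertLinear(nn.Module):
    """Apply separate weights to image and text tokens, selected by a
    per-token type mask (1 = image token, 0 = text token)."""

    def __init__(self, dim: int):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)   # original LLM weights
        self.image_proj = nn.Linear(dim, dim)  # extra "visual expert" weights

    def forward(self, x: torch.Tensor, token_type_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); token_type_mask: (batch, seq) of 0/1
        mask = token_type_mask.unsqueeze(-1).to(x.dtype)
        return mask * self.image_proj(x) + (1.0 - mask) * self.text_proj(x)
```

That per-token routing, plus the extra weights per layer, is roughly why this doesn't drop into the existing llava path, which injects image embeddings and then runs a plain transformer graph.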
I'd love to see Cogvlm support as well. |
You are all (including me) welcome to contribute either code or money for a coffee for these hard-working individuals making this stuff work for us. |
I'm also waiting for it .... |
we all wait - we need more eyeballs on this "feature request" .. sadly most people don't seem to care enough about vision yet. According to turbo (exllama), getting a rough version done is about 50h of work initially, and then of course the upkeep of it - but given the little demand it has, it seems to be a wasted effort. we really just need 1 quant .. and then we can adapt it pretty quickly to everything else |
What do you mean it seems to be a wasted effort? |
"given the little demand it has - it seems to be a wasted effort" - I don't know how I could be clearer in that statement. If more people were interested in vision this would move faster .. but apparently most just focus on regular LLMs / multimodality sadly does not have a huge demand |
Alright, I wasn't trying to diminish your point, but thanks for explaining it - I did not realize that. |
trust me, I would love it to be quanted too - it would make my life easier .. 36 GB per fp16 model, and you eventually want all 3 in VRAM, which just ties up my resources. I would love to have it smaller and faster - but if a few experts don't chip in and start .. it's just not the most rewarding work for them, as very few people want it, even though CogVLM is the best vision model we've got. let's see, maybe they chip in and get the ball rolling |
I also don't understand why there is so little interest in CogVLM, because it is far better than llava, which is still in development.... |
After working on it for a bit I found that it is not trivial to convert it to llama.cpp. The implementation of EVA-CLIP is different from the OpenAI CLIP model. There are some subtleties I'm trying to wrap my head around. So progress is relatively slow but interest is there... |
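One generic way to pin down where EVA-CLIP diverges from OpenAI CLIP is to diff the two checkpoints' tensor names and shapes. A small sketch (paths are placeholders; both files are assumed to be plain PyTorch state dicts):

```python
# Diff two vision-tower checkpoints to see which tensors only exist in one
# of them, and which share a name but differ in shape.
import torch

def diff_state_dicts(path_a: str, path_b: str) -> None:
    a = torch.load(path_a, map_location="cpu")
    b = torch.load(path_b, map_location="cpu")
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    shape_mismatch = [k for k in sorted(set(a) & set(b)) if a[k].shape != b[k].shape]
    print("tensors only in A:", only_a[:20])
    print("tensors only in B:", only_b[:20])
    print("same name, different shape:", shape_mismatch[:20])

# Example (hypothetical filenames):
# diff_state_dicts("eva_clip_vit.pth", "openai_clip_vit.pth")
```

The differences surface as extra or renamed tensors and changed shapes, which is exactly the kind of subtlety that makes the GGUF conversion non-trivial.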
@dtiarks if you are up for it, hop on Discord - we are all on TheBloke AI's Discord (the link should be on any Hugging Face repo he has; I don't want to spam it here). Thanks for narrowing the problem set down at least a bit - I'm sure turboderp / casper can help narrow those "subtleties" down even further |
This would be a game changer, since CogVLM is so much better than llava. Using llava after seeing what CogVLM can do feels like asking llama 7B for code after using gpt 4. |
I personally have changed my mind: CogVLM is a huge thing - no one really wanted to invest the work to integrate it. A good part of the work is done, though I have less time for a while and the LoRA integration is not done yet |
If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVA 1.6 working and merged. If interested, contact me at claude @ the domain in my profile. |
Yes, but it seems like the author can't work on it and/or has other priorities |
For me it's a small side project; I have dozens of large (commercial) projects I am working on. A bounty is intriguing - ironically, I once tried the same on Fiverr to advance this project, and not one of the "AI developers" there was actually able to contribute anything. So I've not given up, I just have slow progress atm. Also happy to add a collaborator to my PR branch, of course. |
One pesky bug remains, but it's working quite well already, especially the large model. You'll need to re-create the projector GGUF files; you can keep the LLM GGUF files. You'll notice llava-1.6 is working when it needs a ton of embedding tokens. |
Now that LLaVA 1.6 has been added, is there no longer much interest in adding CogVLM? |
@mjspeck I started implementing it and got pretty far. However, I got stuck at a point where I need some input from experts like @ggerganov. There is a branch at https://github.com/dtiarks/llama.cpp/tree/cog-vlm |
I had put my attention on the dynamic-LoRA-expert approach InternLM implemented (XComposer2), which shows very similar results and spatial awareness to CogVLM but is probably an order of magnitude faster. CogVLM is still interesting imho, I'm just doubting the long-term potential given that much smaller networks show similar powers. Maybe I'm mistaken though; I have a limited view of the differences in their output. |
The performance of CogAgent is what's most interesting. Not sure if LLaVA 1.6 has been tested on similar problems, or if XComposer2 has either. |
Just wanna say we still would have a lot of interest in using CogVLM on llama.cpp |
+1 |
There's v2 now: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B |
wow |
NotImplementedError: Architecture 'CogVLMForCausalLM' not supported! |
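That error comes from llama.cpp's HF-to-GGUF converter, which only handles architectures that have been explicitly registered. Adding CogVLM would mean (among much else) registering a converter class for `CogVLMForCausalLM`, roughly along these lines. This is a hypothetical sketch: it follows the converter's general registration pattern, but the details vary by version and the real work (mapping the visual-expert and EVA-CLIP tensors) is not done here:

```python
# Hypothetical sketch only; the import path and base-class details are
# assumptions about llama.cpp's convert-hf-to-gguf.py, not a working patch.
import gguf
from convert_hf_to_gguf import Model  # assumed import; lives in the llama.cpp repo

@Model.register("CogVLMForCausalLM")
class CogVLMModel(Model):
    model_arch = gguf.MODEL_ARCH.LLAMA  # placeholder: the language half is llama/vicuna-like

    def modify_tensors(self, data_torch, name, bid):
        # CogVLM carries a second set of attention/MLP weights per layer for the
        # visual expert; each of those tensor names needs an explicit GGUF mapping,
        # and the EVA-CLIP vision tower needs its own conversion path.
        raise NotImplementedError("visual-expert tensor mapping goes here")
```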
+1 |
I think that llava-1.6 is the better one; it is heavyweight compared to 1.5 but lighter than cog, and with batching optimization it could be almost as fast as llava 1.5. Batching would not be difficult to add into clip.cpp! It's basically ready for it, it just needs some tuning. One big step missing from our llava 1.6 implementation is the line-based tensor manipulation. The llama.cpp llava 1.6 implementation uses the simpler variant of llava 1.6: because of the lack of 5D tensors I was not able to get that properly implemented, so I had to take a shortcut. That shortcut is noticeable when it comes to OCR, for example. Someone who is very good with ggml tensors (better than me) could add the line-based manipulation into llava 1.6. Then we could add batching into CLIP to run all llava-1.6 image batches at once instead of sequentially, and we'd have a very high quality result - surpassing cogvlm imho. At much less work than implementing the whole cog architecture. |
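To make the batching point concrete, here is a rough PyTorch sketch (not clip.cpp code) of the llava-1.6 "grid of crops" idea and why one batched vision-tower call beats encoding crops sequentially. The `vision_encoder` callable and the 336-pixel tile size are assumptions for illustration:

```python
# Split a padded image into fixed-size tiles and encode all of them in a
# single batched forward pass instead of a Python loop over crops.
import torch

def encode_tiles(image: torch.Tensor, vision_encoder, tile: int = 336) -> torch.Tensor:
    # image: (3, H, W) with H and W already padded to multiples of `tile`
    c, h, w = image.shape
    tiles = (image
             .unfold(1, tile, tile)    # (3, H/tile, W, tile)
             .unfold(2, tile, tile)    # (3, H/tile, W/tile, tile, tile)
             .permute(1, 2, 0, 3, 4)   # (H/tile, W/tile, 3, tile, tile)
             .reshape(-1, c, tile, tile))
    # One batched pass over all crops instead of num_tiles sequential passes:
    return vision_encoder(tiles)       # e.g. (num_tiles, num_patches, dim)
```

In clip.cpp this would correspond to building the graph with a batch dimension over the crops rather than looping, which is the batching referred to above.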
in fact Ollama supports it, see https://ollama.com/library/llava:13b-v1.6-vicuna-q5_K_M |
Doesn't Ollama use llama.cpp? |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Still not supported. |
1 similar comment
Still not supported. |
Discussed in #4350
Originally posted by cmp-nct December 7, 2023
I've just seen CogVLM, which is a Vicuna 7B language model behind a 9B vision tower (laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) under an open-source license.
I've compared it with llava-1.5 (not even comparable) and Qwen-VL, and it beats Qwen-VL by a margin in OCR abilities and detection of details, with no or almost no hallucinations.
It understands handwritten as well as typed letters, context, fine details, background graphics.
It can also locate tiny visual targets with pixel coordinates.
I'm quite blown away that I didn't know about it before..
I believe that this is what we need. It has similarities to llava but adds an additional expert model, so it's not super quick to implement.
In addition, the ViT needs K-type quantization support.
Definitely worth a close look
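Regarding the "pixel coordinates" point above: as far as I recall, the upstream CogVLM grounding demo answers with boxes like `[[x0,y0,x1,y1]]` normalized to a 000-999 grid - treat that format as an assumption rather than a confirmed spec. A small sketch of parsing such an answer and scaling it to real pixels:

```python
# Parse hypothetical CogVLM-style grounding boxes ("[[x0,y0,x1,y1]]",
# coordinates normalized to 0-999) and scale them to image pixels.
import re

def parse_boxes(answer: str, width: int, height: int) -> list[tuple[int, int, int, int]]:
    boxes = []
    for x0, y0, x1, y1 in re.findall(r"\[\[(\d{3}),(\d{3}),(\d{3}),(\d{3})\]\]", answer):
        boxes.append((int(x0) * width // 1000, int(y0) * height // 1000,
                      int(x1) * width // 1000, int(y1) * height // 1000))
    return boxes

# Example with made-up coordinates:
# parse_boxes("The cat is at [[102,334,407,912]].", width=1280, height=960)
```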
URL: https://github.com/THUDM/CogVLM
Webdemo: http://36.103.203.44:7861/
Paper: https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf
Look at this example: I asked for a JSON representation (not cherry-picked) - it can actually extract all of the content with minimal errors. [Screenshots of the CogVLM output, what Qwen-VL does, and llava-1.5-13B followed here.]
I've not yet looked into architectural challenges, but this is literally a game changer..
That's seriously good OCR, and its image detection abilities are beyond anything I've seen from llava 1.5 / ShareGPT4V
@monatis @FSSRepo