Enhancement: We need CogVLM support - extremely good image and text analysis, feels like a multi generational step forward. #4387

Closed
cmp-nct opened this issue Dec 9, 2023 Discussed in #4350 · 50 comments

@cmp-nct
Contributor

cmp-nct commented Dec 9, 2023

Discussed in #4350

Originally posted by cmp-nct December 7, 2023
I've just seen CogVLM, which is a Vicuna 7B language model behind a 9B vision tower (laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) under an open-source license.
I've compared it with llava-1.5 (not even comparable) and Qwen-VL, and it beats Qwen-VL by a margin in OCR abilities and detection of details, with no or almost no hallucinations.
It understands handwritten as well as typed letters, context, fine details, background graphics
It can also locate tiny visual targets with pixel coordinates
I'm quite blown away that I didn't know it before..

I believe that this is what we need, it has similarities to llava but adds an additional expert model, so that's not super quick to implement.
In addition the ViT needs K-type quantization support.
Definitely worth a close look

URL: https://github.com/THUDM/CogVLM
Webdemo: http://36.103.203.44:7861/
Paper: https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf

Look at this example, I asked for a JSON representation - not cherry picked, it can actually extract all of the content with minimal errors:
image

The image appears to be a driver's license from California. Here's a JSON representation:
{
  "License_Number": "DL 11234568",
  "Expiration_Date": "08/31/2014",
  "Class": "C",
  "License_Holder": "Iva Cardholder",
  "Address": "2570 24th Street, Anytown, CA 95818",
  "Veteran": "YES",
  "Sex": "F",
  "Hair_Color": "Brown",
  "Eyes": "Brown",
  "Height": "125 lb",
  "Weight": "125 lb",
  "Issue_Date": "08/31/2009",
  "Birth_Date": "08/31/1977",
  "VISUALS": {
      "left": "Iva Cardholder",
      "right": "Iva Cardholder",
      "top": "California Bear",
      "bottom": "Gold Rush Miner"
    }
}

Here is what QWEN-VL does:

{
  "id": "123456",
  "issue_date": "08/31/2014",
  "expiration_date": "08/31/2024",
  "cardholder": {
    "name": "Ina Cordero",
    "gender": "F",
    "race": "A",
    "eye_color": "B",
    "hair_color": "R",
    "eyebrow_color": "N",
    "height": "5'6"",
    "weight": "126 lb"
  },
  "class": "C",
  "type": "DRIVER LICENSE",
  "state": "CA",
  "country": "USA"
}

Here is llava1.5-13B:

{
"image": "https://i.imgur.com/39vZv.jpg",
"description": "A California driver's license with a woman's picture on it. The license is blue and white and has a picture of a bear on it. The license number is 11324567890."
}

I've not yet looked into architectural challenges. But this is literally a game changer..
That's seriously good OCR and its image detection abilities are beyond anything I've remotely seen from llava 1.5/ShareGPT4V

@monatis @FSSRepo

@cmp-nct cmp-nct changed the title We need CogVLM support - extremely good image and text analysis, feels like a multi generational step forward. Enhancement: We need CogVLM support - extremely good image and text analysis, feels like a multi generational step forward. Dec 9, 2023
@mirek190

mirek190 commented Dec 9, 2023

I have known about that model for some time and also mentioned it here, but NO ONE CARES ... why, I do not know ....

@ericruleman

+1 CogVLM is the best open source vision model currently available. Having a super powerful multi-modal LLM that's easy to run locally is a game changer.

I know that Ollama is looking to add CogVLM support, but they need llama.cpp to support it first.

@truebit

truebit commented Dec 24, 2023

+1 CogVLM/CogAgent is amazing at mobile UI detection and UI object detection.

@mirek190

It's better than GPT-4V

@darkacorn

+1 we need it ! asap

@Foul-Tarnished
Contributor

MobileVLM might be even better

@cmp-nct
Contributor Author

cmp-nct commented Jan 7, 2024

MobileVLM might be even better

There is no demo of it online; it uses the same vision encoder and a similar projection as llava, just with a tiny LLM instead of the 7B Vicuna.
It is trained on llava-1.5 data (so it lacks a lot of ShareGPT4V knowledge), but claiming that this tiny thing is on par with GPT4V needs more evidence than 5 words. The benchmarks do not support it.

I didn't test it on llama.cpp but my guess is that it requires minimal changes to get the language model supported - the projection has small changes as well (normalization)
Regarding support: the authors actually already had it working in llama.cpp according to the paper (mentioned using Q4 on the llm) but didn't release the changes as a fork or PR for some reason ?

I'm not saying it is not what you claim - just from what I've seen at first glance I find it highly unlikely. It would be a huge development, showcasing what the small CLIP can do when no one else has been able to do the same.

I believe MobileVLM is worthy of support, it's tiny and appears to be a little bit worse than llava-1.5 but of course much faster.
Should not distract from CogVLM being the best open source one

@darkacorn

cogvlm is far better than llava - llava already works on most - so please let's stick with cogvlm if anyone embarks on that - as it takes about 80g vram here in fp16 .. and bnb isn't cutting it

@darkacorn

darkacorn commented Jan 10, 2024

pong - how does that not get any traction ?!
the main example given is worthless as that's simple ocr - but cogvlm is so much better than llava on every vertical

@dtiarks

dtiarks commented Jan 10, 2024

I started looking into it, but I have a lot on my schedule currently.

@husnoo

husnoo commented Jan 10, 2024 via email

@darkacorn

darkacorn commented Jan 10, 2024

https://github.com/THUDM/CogVLM no branch not affiliated with them either

@darkacorn

I started looking into it, but I have a lot on my schedule currently.

understandable - i talked to turboderp (exllama) and casper (autoawq) too .. apparently it's quite a bit of work to get a quant / inference going outside of the regular transformer arch

@dtiarks

dtiarks commented Jan 10, 2024

Yep, that's also my feeling. To make the deep feature fusion work you have to provide an additional mask as input. That's quite different from the usual stuff.
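
As a rough illustration of why an extra mask is needed (a toy sketch, not the actual CogVLM or llama.cpp code): the mask marks each position as an image token or a text token, and each layer then picks a different weight matrix per position. The names `VISION`, `LANGUAGE`, `matvec` and `route_by_token_type` are hypothetical, and the 2x2 weights are made up for the example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical token-type ids; the real model may encode the mask differently.
LANGUAGE, VISION = 0, 1


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def route_by_token_type(hidden: List[Vector], token_types: List[int],
                        w_language: Matrix, w_vision: Matrix) -> List[Vector]:
    """Apply the vision weights to image tokens and the language weights to text tokens."""
    out = []
    for x, t in zip(hidden, token_types):
        w = w_vision if t == VISION else w_language
        out.append(matvec(w, x))
    return out


# Two image tokens followed by one text token.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
token_types = [VISION, VISION, LANGUAGE]
w_language = [[1.0, 0.0], [0.0, 1.0]]  # identity, stands in for the original LLM weights
w_vision = [[2.0, 0.0], [0.0, 2.0]]    # doubles, stands in for the extra expert weights

routed = route_by_token_type(hidden, token_types, w_language, w_vision)
print(routed)  # [[2.0, 4.0], [6.0, 8.0], [5.0, 6.0]]
```

A plain transformer graph applies one weight matrix to the whole sequence, so this per-position choice is the part that has no direct counterpart in the usual llama.cpp model graph.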

@chigkim

chigkim commented Jan 24, 2024

I'd love to see Cogvlm support as well.

@glarsson

You are all (including me) welcome to contribute either code or money for a coffee for these hard-working individuals making this stuff work for us.

@mirek190

I'm also waiting for it ....

@darkacorn

darkacorn commented Jan 24, 2024

we all wait - we need more eyeballs on that "feature request" .. sadly most people dont seem to care enough about vision yet

information from turbo (exllama): getting a rough version done is about 50h of work initially, and then ofc the upkeep of it - but given the little demand it has - it seems to be a wasted effort

we really just need 1 quant .. and then we can adapt it pretty quickly to everything else

@glarsson

What do you mean it seems to be a wasted effort?

@darkacorn

"given the little demand it has - it seems to be a wasted effort" i don't know how i could be clearer in that statement - if more people were interested in vision, that would turn faster .. but apparently most just focus on regular llm's / multimodality sadly does not have a huge demand

@glarsson

Alright, I wasn't trying to diminish your text, but thanks for explaining it. I did not realize that.

@darkacorn

darkacorn commented Jan 24, 2024

trust me i would love it to be quanted too - makes my life easier .. 36g per fp16 model and you eventually want all 3 in vram just blocks my resources up - i would love to have it smaller and faster - but unless a few experts chip in and start .. - it's just not the most rewarding work for them as very few people want it even if cogvlm is the best vision model we got

THUDM/CogVLM#346

lets see maybe they chip in and get the ball rolling
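
For a rough sense of the memory figures in this comment, the weight footprint can be estimated as parameter count times bytes per weight. The sketch below assumes about 17 billion parameters (the size advertised for CogVLM-17B, not stated in this thread) and ignores activations, KV cache, and the per-block scales that real quantization formats store; `weight_gb` is a hypothetical helper.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 17e9  # assumption: roughly 17B parameters in total

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions fp16 lands in the mid-30 GB range, in the same ballpark as the 36g quoted above, while a 4-bit quant would need roughly a quarter of that for the weights.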

@mirek190

I also don't understand why there is so little interest in CogVLM, because it is far better than llava, which is still in development....

@dtiarks

dtiarks commented Jan 24, 2024

After working on it for a bit I found that it is not trivial to convert it to llama.cpp. The implementation of EVA-CLIP is different from the OpenAI CLIP model. There are some subtleties I'm trying to wrap my head around. So progress is relatively slow, but the interest is there...

@darkacorn

@dtiarks if you are up for it, hop on discord - we are all on TheBloke AI's discord (link should be on any huggingface repo he has / don't want to spam it here)

thanks for narrowing the problem set down at least a bit

i'm sure turboderp / casper can help narrow those "subtleties" down even further

@longregen
Contributor

This would be a game changer, since CogVLM is so much better than llava. Using llava after seeing what CogVLM can do feels like asking llama 7B for code after using gpt 4.

@cmp-nct
Contributor Author

cmp-nct commented Jan 31, 2024

I personally have changed my mind, CogVLM is a huge thing - no one really wanted to invest the work integrating it.
Here we have xcomposer2, which is almost as fast as llava-1.5, higher resolution than CogVLM and quite possibly better as well.
#5232

A good part of the work is done, though I'll have less time for a while and the lora integration is not done yet

@longregen
Contributor

longregen commented Feb 3, 2024

If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVa 1.6 working and merged. If interested, contact me at claude @ the domain in my profile.

@husnoo

husnoo commented Feb 3, 2024 via email

@chigkim

chigkim commented Feb 3, 2024

If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVa 1.6 working and merged. If interested, contact me at claude @ the domain in my profile.

@cmp-nct is already working on Llava 1.6 in #5267

@longregen
Contributor

longregen commented Feb 4, 2024

If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVa 1.6 working and merged. If interested, contact me at claude @ the domain in my profile.

@cmp-nct is already working on Llava 1.6 in #5267

Yes, but it seems like the author can't work on it and/or has other priorities

@cmp-nct
Contributor Author

cmp-nct commented Feb 4, 2024

If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVa 1.6 working and merged. If interested, contact me at claude @ the domain in my profile.

@cmp-nct is already working on Llava 1.6 in #5267

Yes, but it seems like the author can't work on it and/or has other priorities

For me it's a small side project, I have dozens of large (commercial) projects I am working on.
I really want to see full llava-1.6 (as well as cogvlm and xcomposer2) support; llava-1.6 is the best candidate, followed by xcomposer2, then cogvlm.

A bounty is intriguing; ironically, I once tried the same on fiverr to advance this project, and not one of the "AI developers" there was actually able to contribute anything.
Though the requirement to have it merged is maybe too much, as you only have limited influence in actually getting something merged even if a PR is fully functional.

So I've not given it up, I just have slow progress atm. Also happy to add a collaborator into my PR-branch of course.

  • 90% done

@cmp-nct
Contributor Author

cmp-nct commented Feb 8, 2024

One pesky bug is remaining but it's working quite great already, especially the large model.
#5267

You'll need to re-create the projector gguf files, you can keep the llm gguf files.
For the projector you need to add the variables into the config.json (as described), otherwise it will be detected as llava-1.5.
I've uploaded a config.json to my HF, uploading the projectors as well.

You'll notice llava-1.6 is working if you need a ton of embedding tokens.
llava-1.5 has 576 tokens, llava-1.6 up to 5 times that.
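
The 576 figure is consistent with the usual llava-1.5 vision setup of a 336x336 input split into 14x14 pixel patches; those two numbers are assumptions here, not stated in this comment. A small sketch of the arithmetic, with the factor of 5 taken from the line above (`patch_tokens` is a hypothetical helper):

```python
def patch_tokens(image_size: int = 336, patch_size: int = 14) -> int:
    """Number of patch embeddings for one square image (assumed ViT-style patching)."""
    per_side = image_size // patch_size
    return per_side * per_side


LLAVA_15_TOKENS = patch_tokens()          # one image, one grid of patches
LLAVA_16_MAX_TOKENS = 5 * LLAVA_15_TOKENS  # "up to 5 times that", per the comment

print(LLAVA_15_TOKENS)      # 576
print(LLAVA_16_MAX_TOKENS)  # 2880
```

So a prompt with a single llava-1.6 image can consume close to 3000 context tokens before any text is added, which is an easy way to tell the two projector types apart.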

Update
The PR is ready for use, the tensor bug is handled, it's not merged into main llama.cpp yet. So you need to manually check out the branch/PR

@mjspeck

mjspeck commented Mar 1, 2024

Now that LLaVA 1.6 has been added, is there no longer much interest in adding CogVLM?

@dtiarks

dtiarks commented Mar 1, 2024

@mjspeck I started implementing it and got pretty far. However I got stuck at a point where I need some input from experts like @ggerganov

There is a branch at https://github.com/dtiarks/llama.cpp/tree/cog-vlm
The code is under examples/cog-vlm. The problem is that the language model's ("deepfeaturefusion") graph seems to be broken when selecting the correct expert. This is somewhat similar to the MoE implementation. Maybe @ggerganov or someone else can help.

@cmp-nct
Contributor Author

cmp-nct commented Mar 1, 2024

I had put my attention on the dynamic-lora-expert that internlm implemented (xcomposer2), which shows very similar results and spatial awareness as cogvlm but is probably an order of magnitude faster.
However, I got stuck on tensor differences, potentially reaching into the current CLIP implementation, or just an error in how I am managing the attention lora (the attention calculations and permutations in pytorch differ significantly from llama.cpp). Debugging those differences is super time-intensive.
So I got stuck there and currently look into other areas.

cogVLM is still interesting imho, I'm just doubting the longterm potential given much smaller networks show similar powers. Maybe I'm mistaken though, I've limited view on the difference in their output.

@mjspeck

mjspeck commented Mar 4, 2024

The performance of the Cog agent is what's most interesting. Not sure if LLaVA 1.6 has been tested on similar problems or if xcomposer2 has either.

@github-actions github-actions bot added stale and removed stale labels Apr 5, 2024
@mjspeck

mjspeck commented May 1, 2024

Just wanna say we still would have a lot of interest in using CogVLM on llama.cpp

@mirek190

mirek190 commented May 1, 2024

+1

@daaain

daaain commented May 20, 2024

There's v2 now: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B

@mirek190

wow

@xiaoyuediandao

NotImplementedError: Architecture 'CogVLMForCausalLM' not supported!

@marksalpeter

+1

@cmp-nct
Contributor Author

cmp-nct commented May 23, 2024

I think that llava-1.6 is the better one, it is heavyweight compared to 1.5 but lighter than cog and with batching optimization it could be almost as fast as llava 1.5. Batching would not be difficult to add into clip.cpp! It's basically ready for it, just needs some tunes.

One big step missing for our llava 1.6 implementation is the line based tensor manipulation.
The llama.cpp llava 1.6 implementation uses the simpler variant of llava 1.6: because of the lack of 5d tensors I was not able to get that properly implemented, so I had to take a shortcut.
That shortcut is noticeable when it comes to OCR for example.

Someone who is very good in ggml tensors (better than me) could add the line based manipulation into llava 1.6.
Then we could add the batching into CLIP to run all llava-1.6 image batches at once instead of sequential and we'd have a very high quality result. Surpassing cogvlm imho.
At much less work than implementing the whole cog architecture.

@chaoqunxie

I think that llava-1.6 is the better one, it is heavyweight compared to 1.5 but lighter than cog and with batching optimization it could be almost as fast as llava 1.5. Batching would not be difficult to add into clip.cpp! It's basically ready for it, just needs some tunes.

One big step missing for our llava 1.6 implementation is the line based tensor manipulation. The llama.cpp llava 1.6 implementation uses the simpler variant of llava 1.6: because of the lack of 5d tensors I was not able to get that properly implemented, so I had to take a shortcut. That shortcut is noticeable when it comes to OCR for example.

Someone who is very good in ggml tensors (better than me) could add the line based manipulation into llava 1.6. Then we could add the batching into CLIP to run all llava-1.6 image batches at once instead of sequential and we'd have a very high quality result. Surpassing cogvlm imho. At much less work than implementing the whole cog architecture.

In fact, ollama already supports it, see https://ollama.com/library/llava:13b-v1.6-vicuna-q5_K_M

@husnoo

husnoo commented Jun 3, 2024 via email

@geroldmeisinger

int4 version: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B-int4

Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

@chigkim

chigkim commented Jul 19, 2024

Still not supported.

@jiaolongxue

Still not supported.
