
how to use the inference_demo.py #12

Open
aoji0606 opened this issue Nov 28, 2024 · 17 comments

Comments

@aoji0606

Thanks for your work!

But how do I use inference_demo.py?

@XuGW-Kevin
Collaborator

> Thanks for your work!
>
> But how do I use inference_demo.py?

This demo can be used with VLMEvalKit. I'll upload a Gradio App demo later.
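
In the meantime, here is a minimal single-image sketch (not the official inference_demo.py), assuming the Xkev/Llama-3.2V-11B-cot checkpoint is loaded through the standard Llama-3.2-Vision classes in recent transformers; the image path and question are placeholders:

```python
# Minimal sketch: single-image, single-turn generation, assuming the
# checkpoint follows the standard Llama-3.2-Vision interface (transformers >= 4.45).
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "Xkev/Llama-3.2V-11B-cot"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder path
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Subtract all tiny shiny balls. Subtract all "
         "purple objects. How many objects are left? "
         "Options: A. 4, B. 8, C. 2, D. 6"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False,
                   return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```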

@ramkumarkoppu

The HF demo app is not available for me. I get this:
[screenshot of the error]

@XuGW-Kevin
Collaborator

> The HF demo app is not available for me. I get this: [screenshot of the error]

That's very strange, because I just tried it and the Gradio app works for me.
Can you run other Gradio apps?

@ramkumarkoppu

ramkumarkoppu commented Dec 2, 2024

This time the HF demo is available, but when I run the example image [the "reasoning" demo image] with the same prompt as at https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot, "Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6", I get a different answer:

(Here begins the SUMMARY stage) To solve the problem, I will identify and count the objects in the image, excluding the tiny shiny balls and purple objects, and then determine how many objects remain. (Here ends the SUMMARY stage)
(Here begins the CAPTION stage) The image shows a collection of variously shaped objects on a flat surface. These include spheres, cubes, and cylinders in different colors, including blue, green, yellow, red, and purple. (Here ends the CAPTION stage)
(Here begins the REASONING stage) First, I will identify all the objects in the image: there are spheres, cubes, and cylinders. Next, I will exclude the tiny shiny balls, which are the spheres, and the purple objects. The purple objects are a cylinder and a sphere. After removing these, I will count the remaining objects. The remaining objects are a blue cylinder, a green sphere, a yellow cylinder, a red cube, and a yellow sphere. This totals to five objects. (Here ends the REASONING stage)
(Here begins the CONCLUSION stage) 5 (Here ends the CONCLUSION stage)

@XuGW-Kevin
Collaborator

> This time the HF demo is available, but when I run the example image with the same prompt as at https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot, "Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6", I get a different answer:
>
> (Here begins the SUMMARY stage) To solve the problem, I will identify and count the objects in the image, excluding the tiny shiny balls and purple objects, and then determine how many objects remain. (Here ends the SUMMARY stage) (Here begins the CAPTION stage) The image shows a collection of variously shaped objects on a flat surface. These include spheres, cubes, and cylinders in different colors, including blue, green, yellow, red, and purple. (Here ends the CAPTION stage) (Here begins the REASONING stage) First, I will identify all the objects in the image: there are spheres, cubes, and cylinders. Next, I will exclude the tiny shiny balls, which are the spheres, and the purple objects. The purple objects are a cylinder and a sphere. After removing these, I will count the remaining objects. The remaining objects are a blue cylinder, a green sphere, a yellow cylinder, a red cube, and a yellow sphere. This totals to five objects. (Here ends the REASONING stage) (Here begins the CONCLUSION stage) 5 (Here ends the CONCLUSION stage)

Exactly. This is expected, because the model in this Gradio app does not use inference-time scaling.
The examples were generated using BS=2 with inference-time scaling, which is 4~5 times slower than the current demo.
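
Roughly, stage-level scaling looks like this (a sketch only; `generate_stage` and `pick_better` are hypothetical stand-ins for the candidate sampling and model-judged comparison done in inference_demo.py):

```python
# Sketch of stage-level inference-time scaling with beam size 2 (BS=2).
# generate_stage() samples one candidate continuation for a stage;
# pick_better() asks the model which of two candidates is better.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def run_with_scaling(generate_stage, pick_better, question, beam_size=2):
    context = question
    for stage in STAGES:
        # Sample beam_size candidates, keep the model-preferred one; the extra
        # generations per stage are why this runs ~4-5x slower than the demo.
        candidates = [generate_stage(context, stage) for _ in range(beam_size)]
        best = candidates[0]
        for other in candidates[1:]:
            best = pick_better(context, stage, best, other)
        context += best  # commit the winning stage and continue
    return context
```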

@XuGW-Kevin
Collaborator

But I'll definitely upload the model with inference-time scaling later, hopefully in 3~4 days. Thanks for your interest!

@ramkumarkoppu

I am surprised to see the wrong answer 5 from the model. Will the model with inference-time scaling be available to download from HF?

@XuGW-Kevin
Collaborator

XuGW-Kevin commented Dec 2, 2024

> I am surprised to see the wrong answer 5 from the model. Will the model with inference-time scaling be available to download from HF?

The model with inference-time scaling is identical to the current model. Actually, if you set a non-zero temperature (temperature=0.6, top_p=0.9), you may observe that the model can answer the question correctly with some probability. Inference-time scaling just improves that probability.

You may also try the base Llama-3.2V model. That model can hardly generate a correct answer even with multiple tries.
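
In a transformers-based script like the sketch above, those settings correspond to enabling sampling in `generate` (these are standard `generate` arguments):

```python
# Stochastic decoding with the settings above; without do_sample=True the
# model decodes greedily, so a wrong answer repeats identically every time.
output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
```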

@ramkumarkoppu

I don't see an option to set temperature and top_p in the HF demo to try.

@XuGW-Kevin
Collaborator

XuGW-Kevin commented Dec 2, 2024

> I don't see an option to set temperature and top_p in the HF demo to try.

Hi, thank you for raising the issue! Now the Gradio App is set to {temperature=0.6, top_p=0.9}.

I've reviewed the results we tested earlier. The two demos in our paper are both picked from the MMStar benchmark, and we used VLMEvalKit to generate and test the results.

Today, I replicated the results without using inference-time scaling.
For the second question (science), the model has about a 30-40% chance of answering correctly.
For the first question, the model rarely gives the correct answer of 10-2=8 (I tried dozens of times before getting the correct reasoning once). Instead, it often provides incorrect responses such as 8-2=6 or 10-3=7. However, it is worth mentioning that the model at least gets one part (either the 10 or the 2) correct in these cases.

I understand that the probability of correctly answering these two demos is not very high. This is mainly because I deliberately selected the most difficult questions from the MMStar benchmark among those the model had answered correctly in a single generation. As a result, the model won't get them right every time. However, it's worth mentioning that our model's performance on these two questions is as good as GPT-4o's; in fact, even GPT-4o can hardly solve them.

If you attempt to compare the performance of Llama-3.2-11B-Vision-Instruct, you will find that it has almost no chance of providing the correct logic for these questions.
For the first question, I tested Llama-3.2-11B-Vision-Instruct on poe.com 10 times, and it never got any intermediate steps (10 or 2) correct.
For the second question, I tested Llama-3.2-11B-Vision-Instruct on poe.com and had to try 15 times before it answered correctly once.
This highlights that the progress made by LLaVA-CoT compared to Llama-3.2-11B-Vision-Instruct is still significant.

I apologize for any confusion this may have caused and hope this addresses your concerns. Feel free to reach out with any follow-up questions!

@ramkumarkoppu

I tried the same question again with the updated demo app, and it still gives me a wrong answer: C.

@XuGW-Kevin
Collaborator

> I tried the same question again with the updated demo app, and it still gives me a wrong answer: C.

Could you try a few more times? I think most of the time the model will give answer D, because either 10-3 or 8-2 will lead to answer D.

@ramkumarkoppu

This time it answered with the object count instead of choosing one of the multiple-choice options:
This totals to six objects. (Here ends the REASONING stage)

(Here begins the CONCLUSION stage) 6 (Here ends the CONCLUSION stage)

@XuGW-Kevin
Collaborator

Yes, this is also possible, and our base model has this issue as well.
I just found out that in the MMStar benchmark, every problem starts with "Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end." Adding this prompt alleviates the issue (but does not eliminate it).
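
For example, a prompt built with that hint would look like this (the hint text is verbatim from MMStar; the question is the one from the demo above):

```python
# Prepending the MMStar hint nudges the model to end with an option letter
# rather than a bare count (alleviates the issue, doesn't eliminate it).
HINT = ("Hint: Please answer the question and provide the correct option "
        "letter, e.g., A, B, C, D, at the end.")
question = ("Subtract all tiny shiny balls. Subtract all purple objects. "
            "How many objects are left? Options: A. 4, B. 8, C. 2, D. 6")
prompt_text = f"{HINT}\n{question}"
```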

I'll provide you with some more examples generated by LLaVA-CoT:

Question 1:
Correct answers (8): [two screenshots]

Partly correct answers that can get 10: [two screenshots]

Partly correct answers that can get 2: [one screenshot]

Question 2: [two screenshots]

@XuGW-Kevin
Collaborator

> This time the HF demo is available, but when I run the example image with the same prompt as at https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot, "Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6", I get a different answer:
>
> (Here begins the SUMMARY stage) To solve the problem, I will identify and count the objects in the image, excluding the tiny shiny balls and purple objects, and then determine how many objects remain. (Here ends the SUMMARY stage) (Here begins the CAPTION stage) The image shows a collection of variously shaped objects on a flat surface. These include spheres, cubes, and cylinders in different colors, including blue, green, yellow, red, and purple. (Here ends the CAPTION stage) (Here begins the REASONING stage) First, I will identify all the objects in the image: there are spheres, cubes, and cylinders. Next, I will exclude the tiny shiny balls, which are the spheres, and the purple objects. The purple objects are a cylinder and a sphere. After removing these, I will count the remaining objects. The remaining objects are a blue cylinder, a green sphere, a yellow cylinder, a red cube, and a yellow sphere. This totals to five objects. (Here ends the REASONING stage) (Here begins the CONCLUSION stage) 5 (Here ends the CONCLUSION stage)

I also found another model developed by the community (they used our dataset to train it):
https://huggingface.co/BarraHome/Mistroll-3.0-CoT-Llama-3.2-11B-Vision-Instruct

They also display a similar demo there: [screenshot]

@ramkumarkoppu

Hi Guowei,

How can I reproduce the original results, like the exact answer B. 8? Any tuning parameters?

@XuGW-Kevin
Collaborator

XuGW-Kevin commented Dec 6, 2024

> Hi Guowei,
>
> How can I reproduce the original results, like the exact answer B. 8? Any tuning parameters?

Hi Ramkumar,

It’s not possible to reproduce the exact original answers, as multiple candidate answers are generated randomly at different stages, making exact replication infeasible. However, the statistical results are fully reproducible. This specific example comes from CLEVR-MATH, a task in the MMStar benchmark, and you can reproduce the statistical results on MMStar using VLMEvalKit.

MMStar includes many similar types of questions. You can explore the ones LLaVA-CoT successfully answers. Due to randomness, the specific questions LLaVA-CoT answers correctly may vary between runs, but the statistical results remain nearly identical.

You can find the guide for reproducing results without inference-time scaling at https://huggingface.co/Xkev/Llama-3.2V-11B-cot. For results with inference-time scaling, you only need to replace the original Llama-3.2V inference with the script we provide at https://github.com/PKU-YuanGroup/LLaVA-CoT/blob/main/inference_demo/inference_demo.py.
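
Since single generations are random but the statistics are stable, one way to sanity-check a local setup is to sample the same question repeatedly and look at the success rate. A sketch (`generate_once` and `is_correct` are hypothetical stand-ins for your generation call and answer checker):

```python
# Sketch: estimate per-question accuracy by repeated sampling. Individual
# runs will differ, but the estimated rate should be stable across runs.
def estimate_accuracy(generate_once, is_correct, n_trials=20):
    hits = sum(is_correct(generate_once()) for _ in range(n_trials))
    return hits / n_trials
```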

Let me know if you have further questions!
