How to use inference_demo.py #12
This demo can be used with VLMEvalKit. I'll upload a Gradio App demo later.
This time the HF demo is available, but when I run the example image with the same prompt at https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot, it gives: (Here begins the SUMMARY stage) To solve the problem, I will identify and count the objects in the image, excluding the tiny shiny balls and purple objects, and then determine how many objects remain. (Here ends the SUMMARY stage)
Exactly. This is expected because the model in this Gradio App does not use inference-time scaling.
But I'll definitely upload the model with inference-time scaling later, hopefully in 3–4 days. Thanks for your interest!
I am surprised to see the wrong answer 5 from the model. Will the model with inference-time scaling be available for download from HF?
The model with inference-time scaling is identical to the current model. Actually, if you set a non-zero temperature (temperature=0.6, top_p=0.9), you may observe that the model answers the question correctly with some probability; inference-time scaling just improves that probability. You may also try the base Llama-3.2V model, which can hardly generate a correct answer even with multiple tries.
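For anyone who wants to try this locally rather than on the demo, below is a minimal sketch of sampling with those parameters through the Hugging Face transformers API. The model/processor classes follow the standard Llama-3.2-Vision usage, and the image path and question text are placeholders approximating the demo example, so treat this as an assumption-laden sketch rather than code from this repository.

```python
# Minimal sketch: sample from Xkev/Llama-3.2V-11B-cot with temperature=0.6, top_p=0.9.
# Assumes the checkpoint loads through the standard Llama-3.2-Vision classes in transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "Xkev/Llama-3.2V-11B-cot"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("clevr_example.png")  # placeholder: local copy of the demo image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        # Placeholder question approximating the CLEVR-style demo prompt.
        {"type": "text", "text": "Subtract all tiny shiny balls and purple objects. How many objects are left?"},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

# A non-zero temperature means repeated runs can reach the correct answer with some probability.
output = model.generate(
    **inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.9
)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```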
I don't see an option to set temperature and top_p on the HF demo to try.
Hi, thank you for raising the issue! The Gradio App is now set to {temperature=0.6, top_p=0.9}. I've reviewed the results we tested earlier. The two demos in our paper are both taken from the MMStar benchmark, and we used VLMEvalKit to generate and test the results. Today, I replicated the results without using inference-time scaling. I understand that the probability of correctly answering these two demos is not very high. This is mainly because I deliberately selected the most difficult questions from the MMStar benchmark among those the model answered correctly in a single generation, so the model won't get them right every time. However, it's worth mentioning that our model's performance on these two questions is as good as GPT-4o's; in fact, even GPT-4o can hardly solve them. If you compare against Llama-3.2-11B-Vision-Instruct, you will find that it has almost no chance of producing the correct logic for these questions. I apologize for any confusion this may have caused and hope this addresses your concerns. Feel free to reach out with any follow-up questions!
I tried the same question again with the updated demo app, and it still gives me the wrong answer: C
Could you try a few more times? I think most of the time the model will give answer D, because either 10-3 or 8-2 leads to answer D.
This time it answered with the number of objects instead of choosing one of the multiple choices: (Here begins the CONCLUSION stage) 6 (Here ends the CONCLUSION stage)
I also found another model developed by the community (they trained it on our dataset):
Hi Guowei, how can I reproduce the original results, like the exact answer of B. 8? Are there any tuning parameters?
Hi Ramkumar, it's not possible to reproduce the exact original answers, as multiple candidate answers are generated randomly at different stages, which makes exact replication infeasible. However, the statistical results are fully reproducible. This specific example comes from CLEVR-MATH, a task in the MMStar benchmark, and you can reproduce the statistical results on MMStar using VLMEvalKit. MMStar includes many similar types of questions, and you can explore the ones LLaVA-CoT answers correctly. Due to randomness, the specific questions LLaVA-CoT gets right may vary between runs, but the statistical results remain nearly identical. You can find the guide for reproducing results without inference-time scaling at https://huggingface.co/Xkev/Llama-3.2V-11B-cot. For results with inference-time scaling, you only need to replace the original Llama-3.2V with the script we provide at https://github.com/PKU-YuanGroup/LLaVA-CoT/blob/main/inference_demo/inference_demo.py. Let me know if you have further questions!
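As a rough illustration of what inference-time scaling means here (the actual logic lives in inference_demo/inference_demo.py), the sketch below samples several candidates per reasoning stage and conditions the next stage on the chosen one. The stage list beyond the SUMMARY and CONCLUSION tags seen above, the candidate count, and the scoring function are placeholders rather than the script's real behavior; for the MMStar statistics themselves, VLMEvalKit's run.py (with its --data and --model arguments) drives the evaluation, and the name under which LLaVA-CoT is registered there depends on your VLMEvalKit version.

```python
# Sketch of stage-level inference-time scaling: sample several candidates for each
# reasoning stage, keep the best one, and condition the next stage on it.
# Stage names beyond SUMMARY/CONCLUSION, the candidate count, and score_candidate()
# are illustrative placeholders; see inference_demo/inference_demo.py for the real logic.
import random

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]  # assumed stage tags

def generate_stage(context: str, stage: str) -> str:
    """Placeholder for one sampled continuation of `stage` given `context`
    (in practice: model.generate with do_sample=True, temperature=0.6, top_p=0.9)."""
    return f"(Here begins the {stage} stage) ... (Here ends the {stage} stage)"

def score_candidate(context: str, candidate: str) -> float:
    """Placeholder scorer; the real script selects candidates differently.
    Here we just pick at random to keep the sketch self-contained."""
    return random.random()

def scaled_inference(question: str, n_candidates: int = 4) -> str:
    context = question
    for stage in STAGES:
        candidates = [generate_stage(context, stage) for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: score_candidate(context, c))
        context += "\n" + best  # later stages are conditioned on the chosen candidate
    return context

print(scaled_inference("Subtract all tiny shiny balls and purple objects. How many objects are left?"))
```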
Thanks for your work, but how do I use inference_demo.py?