
Compare with Qwen2-VL #2

Open

SushantGautam opened this issue Nov 19, 2024 · 6 comments

Comments

@SushantGautam

No description provided.

@XuGW-Kevin (Collaborator) commented Nov 19, 2024

Great issue! Sorry for overlooking Qwen2-VL.
Qwen2-VL is strong; there is no reason not to compare against it.
We're going to release a model trained on Qwen2-VL in 2-3 weeks and will update our paper by then.

However, if you really want to know the performance of Qwen2-VL-7B on our reasoning benchmark now, the answer is 65.85, while our LLaVA-o1 (Llama-3.2-11B-Vision) scores 65.8.
So yes, our LLaVA-o1 (Llama-3.2-11B-Vision) is slightly worse than Qwen2-VL-7B.
However, we must point out that Qwen2-VL-7B was trained on a huge amount of data (at least millions of samples, we believe), while we used only 100k. The most important thing is that our model shows significant improvements over Llama-3.2-Vision-Instruct, because one can always switch to a stronger base model (such as Qwen2-VL) and scale up the dataset.

To prove this, we will publish a model trained on Qwen2-VL soon.
Stay tuned!
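
(For anyone who wants to reproduce this kind of number themselves, here is a minimal sketch of querying Qwen2-VL-7B-Instruct on a single VQA-style question with Hugging Face transformers. The image path, question, and decoding settings are placeholders, not the exact evaluation setup behind the 65.85 figure.)

```python
# Minimal sketch: query Qwen2-VL-7B-Instruct on one benchmark-style question.
# Requires transformers >= 4.45; image path and question are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example_question.png")            # placeholder benchmark image
question = "Which option best answers the question?"  # placeholder benchmark question

messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": question}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens (drop the prompt).
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Scoring against the benchmark's ground-truth answers would then be a matter of looping this over the question set and matching the decoded answers.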

@Elenore1997

> Great issue! Sorry for overlooking Qwen2-VL. Qwen2-VL is strong; there is no reason not to compare against it. We're going to release a model trained on Qwen2-VL in 2-3 weeks and will update our paper by then.
>
> However, if you really want to know the performance of Qwen2-VL-7B on our reasoning benchmark now, the answer is 65.85, while our LLaVA-o1 (Llama-3.2-11B-Vision) scores 65.8. So yes, our LLaVA-o1 (Llama-3.2-11B-Vision) is slightly worse than Qwen2-VL-7B. However, we must point out that Qwen2-VL-7B was trained on a huge amount of data (at least millions of samples, we believe), while we used only 100k. The most important thing is that our model shows significant improvements over Llama-3.2-Vision-Instruct, because one can always switch to a stronger base model (such as Qwen2-VL) and scale up the dataset.
>
> To prove this, we will publish a model trained on Qwen2-VL soon. Stay tuned!

Great job! I would like to ask how the LLaVA-o1 model and the o1-style model based on Qwen2-VL perform on Chinese image-text reasoning/understanding tasks? Thanks in advance!

@zhangfaen

+1 for comparison with Qwen2-VL

@XuGW-Kevin (Collaborator)

> +1 for comparison with Qwen2-VL

Thanks for your interest! @zhangfaen
We're making progress on this. We'll release the comparison within a week.

@ankitdotpy

> Great issue! Sorry for overlooking Qwen2-VL. Qwen2-VL is strong; there is no reason not to compare against it. We're going to release a model trained on Qwen2-VL in 2-3 weeks and will update our paper by then.

Hi @XuGW-Kevin
I was wondering if there have been any updates on the model release that you mentioned. I would love to contribute to the project. Please let me know if there are opportunities to contribute or engage further.

@XuGW-Kevin (Collaborator)

>> Great issue! Sorry for overlooking Qwen2-VL. Qwen2-VL is strong; there is no reason not to compare against it. We're going to release a model trained on Qwen2-VL in 2-3 weeks and will update our paper by then.
>
> Hi @XuGW-Kevin
>
> I was wondering if there have been any updates on the model release that you mentioned. I would love to contribute to the project. Please let me know if there are opportunities to contribute or engage further.

Yes, we've trained such a model on Qwen2-VL-7B-Instruct and tested it on a few benchmarks. We find that while performance on some benchmarks improves, performance on others gets worse, and the overall performance doesn't improve much. I suspect Qwen actually used the training questions from AI2D, ScienceQA, etc., so if we further finetune Qwen, the model overfits to those questions. I'm not sure whether this suspicion is reasonable, and perhaps finetuning the base Qwen model instead would help with this issue. I'm also happy to hear about any ideas you have!
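
(To make the "finetune the base model instead of the instruct model" idea concrete, here is a minimal sketch of attaching LoRA adapters to either checkpoint with peft. LoRA, the target modules, and the checkpoint choice are illustrative assumptions, not necessarily the training setup used for LLaVA-o1.)

```python
# Minimal sketch: prepare a Qwen2-VL checkpoint for further finetuning.
# LoRA is used here purely for illustration; full SFT would skip the peft step.
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2VLForConditionalGeneration

# Swap this for a base (non-instruct) checkpoint, if one is available to you,
# to test the overfitting hypothesis.
model_id = "Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLM attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then run the usual supervised finetuning loop on the structured reasoning data.
```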
