add vinoground #326

Merged
merged 1 commit into from
Oct 16, 2024

HanSolo9682
Contributor

Hi, I would like to add our video benchmark, Vinoground, to the lmms-eval database. This temporal counterfactual benchmark contains 1000 short, natural video-caption pairs. The best model, GPT-4o, reaches only 35% on one of our metrics, while humans achieve ~90% with ease. I have been able to reproduce our results on LLaVA-Video-7B-Qwen2 with the code provided here. I believe more models should be evaluated on Vinoground to truly test their dense temporal reasoning capabilities, and I find lmms-eval a great platform for doing so.

@Luodian
Contributor

Luodian commented Oct 16, 2024

Hi, thanks for this PR. Can you also post a result screenshot for a random model?

Also, there are some linting issues; you may need to use pre-commit to resolve them.

@HanSolo9682
Contributor Author

Screenshot 2024-10-16 at 01 45 43

@HanSolo9682
Contributor Author

I have just run pre-commit and fixed the linting issues.

@Luodian Luodian merged commit a72a9c0 into EvolvingLMMs-Lab:main Oct 16, 2024
1 check passed
KairuiHu pushed a commit that referenced this pull request Oct 24, 2024
Co-authored-by: jzhang2427 <jzhang2427@wisc.edu>
ZhaoCinyu pushed a commit to ZhaoCinyu/lmms-eval that referenced this pull request Dec 9, 2024
Co-authored-by: jzhang2427 <jzhang2427@wisc.edu>