
(CVPR 2024) A benchmark for evaluating Multimodal LLMs using multiple-choice questions.


AILab-CVC/SEED-Bench


SEED-Bench: Benchmarking Multimodal Large Language Models

SEED-Bench-H

SEED-Bench-2-Plus Arxiv

SEED-Bench-2 Arxiv

SEED-Bench-1 Arxiv


SEED-Bench-H is a comprehensive integration of the previous SEED-Bench series (SEED-Bench, SEED-Bench-2, SEED-Bench-2-Plus) with additional evaluation dimensions. It consists of 28K multiple-choice questions with precise human annotations, spanning 34 dimensions, including the evaluation of both text and image generation.

SEED-Bench-2-Plus comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of text-rich scenarios in the real world.

SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions, including the evaluation of both text and image generation.

SEED-Bench-1 consists of 19K multiple-choice questions with accurate human annotations, covering 12 evaluation dimensions, including both spatial and temporal understanding.

News

[2024.7.11] SEED-Bench-H, SEED-Bench-2-Plus, SEED-Bench-2, and SEED-Bench-1 data are released on ModelScope, thanks to the ModelScope Community.

[2024.6.18] SEED-Bench-2 can be evaluated on VLMEvalKit, thanks to kennymckormick.

[2024.5.30] We released SEED-Bench-H, a comprehensive integration of the previous SEED-Bench series (SEED-Bench, SEED-Bench-2, SEED-Bench-2-Plus) with additional evaluation dimensions. The additional evaluation dimensions include Image to Latex, Visual Story Comprehension, Few-shot Segmentation, Few-shot Keypoint, Few-shot Depth, and Few-shot Object Detection. Please refer to SEED-Bench-H for details. The corresponding dataset is released on SEED-Bench-H.

[2024.5.25] SEED-Bench-2-Plus can be evaluated on VLMEvalKit, thanks to kennymckormick.

[2024.4.26] We are excited to announce the release of SEED-Bench-2-Plus, a benchmark specifically designed for text-rich visual comprehension. The accompanying dataset is released on SEED-Bench-2-Plus.

[2024.4.23] We are pleased to share the comprehensive evaluation results for Gemini-Vision-Pro and Claude-3-Opus on SEED-Bench-1 and SEED-Bench-2. You can access detailed performance on the SEED-Bench Leaderboard. Please note that for Gemini-Vision-Pro we only report task performance when the model responds with at least 50% valid data in the task.

[2024.2.27] SEED-Bench is accepted by CVPR 2024.

[2023.12.18] We have posted the comprehensive evaluation results for GPT-4V on SEED-Bench-1 and SEED-Bench-2. These can be accessed at GPT-4V for SEED-Bench-1 and GPT-4V for SEED-Bench-2. If you're interested, please feel free to take a look.

[2023.12.4] We have updated the SEED-Bench Leaderboard for SEED-Bench-2. Additionally, we have updated the evaluation results for GPT-4V on both SEED-Bench-1 and SEED-Bench-2. If you are interested, please visit the SEED-Bench Leaderboard for more details.

[2023.11.30] We have updated the SEED-Bench-v1 JSON (manually screening the multiple-choice questions for videos) and provided corresponding video frames for easier testing. Please refer to SEED-Bench for more information.

[2023.11.27] SEED-Bench-2 is released! Data and evaluation code are available now.

[2023.9.9] We are actively looking for self-motivated interns. Please feel free to reach out if you are interested.

[2023.8.16] SEED-Bench Leaderboard is released! You can upload your model's results now.

[2023.7.30] SEED-Bench is released! Data and evaluation code are available now.

Leaderboard

Welcome to SEED-Bench Leaderboard!
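As noted in the 2024.4.23 news item, Gemini-Vision-Pro's performance on a task is reported only when at least 50% of its responses in that task are valid. The snippet below is a minimal sketch of that reporting rule, not the repository's actual evaluation code. The record layout `(task, prediction, answer)`, the choice set A-D, the function name `per_task_accuracy`, and the decision to compute accuracy over valid responses only are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumption: each question has four options labelled A-D.
VALID_CHOICES = {"A", "B", "C", "D"}
# Threshold taken from the 2024.4.23 note (at least 50% valid responses).
MIN_VALID_RATIO = 0.5


def per_task_accuracy(records, min_valid_ratio=MIN_VALID_RATIO):
    """Return {task: accuracy or None} from (task, prediction, answer) tuples.

    Hypothetical helper: a task is reported only if the share of valid
    predictions reaches min_valid_ratio; otherwise its value is None.
    Accuracy is computed over valid predictions only (an assumption).
    """
    total = defaultdict(int)
    valid = defaultdict(int)
    correct = defaultdict(int)

    for task, prediction, answer in records:
        total[task] += 1
        if prediction in VALID_CHOICES:
            valid[task] += 1
            if prediction == answer:
                correct[task] += 1

    report = {}
    for task in total:
        enough_valid = valid[task] > 0 and valid[task] / total[task] >= min_valid_ratio
        report[task] = correct[task] / valid[task] if enough_valid else None
    return report


# Illustrative records with made-up task names.
records = [
    ("task_a", "A", "A"),
    ("task_a", "B", "A"),
    ("task_a", None, "C"),   # invalid response
    ("task_b", None, "B"),   # invalid response
    ("task_b", "", "D"),     # invalid response
    ("task_b", "C", "C"),
]
print(per_task_accuracy(records))
```

Here `task_a` has two valid responses out of three and is reported, while `task_b` has only one valid response out of three and is left unreported (`None`).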

Leaderboard Submission

You can now submit your model's results to the SEED-Bench Leaderboard. Use our evaluation code to generate 'results.json' in the 'results' folder, as shown below.

python eval.py --model instruct_blip --anno_path SEED-Bench.json --output-dir results --task all

Then upload 'results.json' to the SEED-Bench Leaderboard.

After submitting, press the refresh button to see the latest results.

Data Preparation

You can download the SEED-Bench data from the HuggingFace repos SEED-Bench, SEED-Bench-2, SEED-Bench-2-Plus, and SEED-Bench-H. The data is also available on ModelScope. Please refer to DATASET.md for data preparation.

Installation

Please refer to INSTALL.md.

Run Evaluation

Please refer to EVALUATION.md.

License

SEED-Bench is released under Apache License Version 2.0.

Declaration

SEED-Bench-2-Plus

Data Sources: Data from the internet under CC-BY licenses.

Please contact us if you believe any data infringes upon your rights, and we will remove it.

SEED-Bench-2

Data Sources:

Please contact us if you believe any data infringes upon your rights, and we will remove it.

SEED-Bench-1

For the images of SEED-Bench-1, we use the data from Conceptual Captions Dataset (https://ai.google.com/research/ConceptualCaptions/) following its license (https://github.com/google-research-datasets/conceptual-captions/blob/master/LICENSE). Tencent does not hold the copyright for these images and the copyright belongs to the original owner of Conceptual Captions Dataset.

For the videos of SEED-Bench-1, we use the data from Something-Something v2 (https://developer.qualcomm.com/software/ai-datasets/something-something), Epic-kitchen 100 (https://epic-kitchens.github.io/2023) and Breakfast (https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset/). We only provide the video names. Please download the videos from their official websites.

Citing

If you find this repository helpful, please consider citing it:

@article{li2024seed2plus,
  title={SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension},
  author={Li, Bohao and Ge, Yuying and Chen, Yi and Ge, Yixiao and Zhang, Ruimao and Shan, Ying},
  journal={arXiv preprint arXiv:2404.16790},
  year={2024}
}

@article{li2023seed2,
  title={SEED-Bench-2: Benchmarking Multimodal Large Language Models},
  author={Li, Bohao and Ge, Yuying and Ge, Yixiao and Wang, Guangzhi and Wang, Rui and Zhang, Ruimao and Shan, Ying},
  journal={arXiv preprint arXiv:2311.17092},
  year={2023}
}

@article{li2023seed,
  title={Seed-bench: Benchmarking multimodal llms with generative comprehension},
  author={Li, Bohao and Wang, Rui and Wang, Guangzhi and Ge, Yuying and Ge, Yixiao and Shan, Ying},
  journal={arXiv preprint arXiv:2307.16125},
  year={2023}
}
