[Result] Update Evaluation Results (#60)
* update MME, SEEDBench

* update results

* update LLaVABench

* fix

* update AI2D accuracy

* update LLaVABench

* update README

* update teaser link
kennymckormick authored Jan 22, 2024
1 parent e992046 commit 493a7e8
Showing 12 changed files with 363 additions and 244 deletions.
11 changes: 5 additions & 6 deletions README.md
@@ -1,4 +1,4 @@
![LOGO](https://github-production-user-asset-6210df.s3.amazonaws.com/34324155/295443340-a300f073-4995-48a5-af94-495141606cf7.jpg)
![LOGO](http://opencompass.openxlab.space/utils/MMLB.jpg)
<div align="center"><b>A Toolkit for Evaluating Large Vision-Language Models. </b></div>
<div align="center"><br>
<a href="https://opencompass.org.cn/leaderboard-multimodal">🏆 Learderboard </a> •
@@ -9,16 +9,15 @@
<a href="#%EF%B8%8F-citation">🖊️Citation </a>
<br><br>
</div>

**VLMEvalKit** (the Python package name is **vlmeval**) is an **open-source evaluation toolkit** for **large vision-language models (LVLMs)**. It enables **one-command evaluation** of LVLMs on various benchmarks, without the heavy workload of data preparation across multiple repositories. In VLMEvalKit, we adopt **generation-based evaluation** for all LVLMs (answers are obtained via the `generate` / `chat` interface), and provide evaluation results obtained with both **exact matching** and **LLM (ChatGPT)-based answer extraction**.

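The generation-based protocol can be pictured with a short sketch. This is **not** the vlmeval API: `model`, `dataset`, and `llm_extractor` below are hypothetical stand-ins used only to illustrate the flow of obtaining an answer via `generate`, scoring it by exact matching, and falling back to LLM-based answer extraction.

```python
# A minimal sketch of generation-based evaluation, assuming hypothetical
# `model`, `dataset`, and `llm_extractor` objects (not vlmeval's actual
# interfaces).

def exact_match(prediction: str, answer: str) -> bool:
    """Exact matching: normalise both strings and compare them verbatim."""
    return prediction.strip().lower() == answer.strip().lower()

def evaluate(model, dataset, llm_extractor=None) -> float:
    hits = 0
    for sample in dataset:  # each sample carries an image, a question, an answer
        prediction = model.generate(sample["image"], sample["question"])
        if exact_match(prediction, sample["answer"]):
            hits += 1
        elif llm_extractor is not None:
            # Fallback: ask an LLM (ChatGPT) to extract the intended answer
            # from the free-form generation before scoring it again.
            extracted = llm_extractor(sample["question"], prediction)
            hits += int(exact_match(extracted, sample["answer"]))
    return hits / len(dataset)
```

The fallback matters because raw exact matching alone can understate a model's ability when it answers correctly but verbosely, which is why both sets of numbers are reported.
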
## 🆕 News

- **[2024-01-14]** We have supported [**LLaVABench (in-the-wild)**](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild).
- **[2024-01-21]** We have updated results for [**LLaVABench (in-the-wild)**](/results/LLaVABench.md) and [**AI2D**](/results/AI2D.md).
- **[2024-01-14]** We have supported [**AI2D**](https://allenai.org/data/diagrams) and provided the [**script**](/scripts/AI2D_preproc.ipynb) for data pre-processing. 🔥🔥🔥
- **[2024-01-13]** We have supported [**EMU2 / EMU2-Chat**](https://github.com/baaivision/Emu) and [**DocVQA**](https://www.docvqa.org). 🔥🔥🔥
- **[2024-01-11]** We have supported [**Monkey**](https://github.com/Yuliang-Liu/Monkey). 🔥🔥🔥
- **[2024-01-09]** The performance numbers on our official multi-modal leaderboards can be downloaded in json files: [**MMBench Leaderboard**](http://opencompass.openxlab.space/utils/MMBench.json), [**OpenCompass Multi-Modal Leaderboard**](http://opencompass.openxlab.space/utils/MMLB.json). We also add a [notebook](scripts/visualize.ipynb) to visualize these results.🔥🔥🔥
- **[2024-01-09]** The performance numbers on our official multi-modal leaderboards can be downloaded as JSON files: [**MMBench Leaderboard**](http://opencompass.openxlab.space/utils/MMBench.json), [**OpenCompass Multi-Modal Leaderboard**](http://opencompass.openxlab.space/utils/MMLB.json). We also added a [**notebook**](scripts/visualize.ipynb) to visualize these results; a minimal download sketch follows this news list. 🔥🔥🔥
- **[2024-01-03]** We support **ScienceQA (Img)** (Dataset Name: ScienceQA_[VAL/TEST], [**eval results**](results/ScienceQA.md)), **HallusionBench** (Dataset Name: HallusionBench, [**eval results**](/results/HallusionBench.md)), and **MathVista** (Dataset Name: MathVista_MINI, [**eval results**](/results/MathVista.md)). 🔥🔥🔥
- **[2023-12-31]** We release the [**preliminary results**](/results/VQA.md) of three VQA datasets (**OCRVQA**, **TextVQA**, **ChartQA**). The results are obtained by exact matching and may not faithfully reflect the real performance of VLMs on the corresponding tasks.

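A minimal sketch of pulling one of the leaderboard files listed above, using only the Python standard library. The schema of the JSON is not assumed here; the snippet simply downloads the file and peeks at whatever top-level structure it holds.

```python
import json
import urllib.request

# URL taken from the news item above; the file's layout is not assumed,
# so we only load it and report its top-level shape.
URL = "http://opencompass.openxlab.space/utils/MMBench.json"

with urllib.request.urlopen(URL) as resp:
    leaderboard = json.load(resp)

print(type(leaderboard))
if isinstance(leaderboard, dict):
    print(list(leaderboard.keys())[:10])  # peek at the first few keys
```
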
@@ -46,9 +45,9 @@
| [**OCRVQA**](https://ocr-vqa.github.io) | OCRVQA_[TESTCORE/TEST] ||| [**VQA**](/results/VQA.md) |
| [**TextVQA**](https://textvqa.org) | TextVQA_VAL ||| [**VQA**](/results/VQA.md) |
| [**ChartQA**](https://github.com/vis-nlp/ChartQA) | ChartQA_VALTEST_HUMAN ||| [**VQA**](/results/VQA.md) |
| [**AI2D**](https://allenai.org/data/diagrams) | AI2D ||| [**AI2D**](/results/AI2D.md) |
| [**LLaVABench**](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) | LLaVABench ||| [**LLaVABench**](/results/LLaVABench.md) |
| [**DocVQA**](https://www.docvqa.org) | DocVQA_VAL ||| |
| [**AI2D**](https://allenai.org/data/diagrams) | AI2D ||| |
| [**LLaVABench**](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) | LLaVABench ||| |
| [**Core-MM**](https://github.com/core-mm/core-mm) | CORE_MM || | |

**Supported API Models**
39 changes: 39 additions & 0 deletions results/AI2D.md
@@ -0,0 +1,39 @@
# AI2D Evaluation Results

> During evaluation, we use `GPT-3.5-Turbo-0613` as the choice extractor for all VLMs whenever the choice cannot be extracted via heuristic matching (a sketch of this two-stage extraction follows the table below). **Zero-shot** inference is adopted.

## AI2D Accuracy

| Model | overall |
|:----------------------------|----------:|
| Monkey-Chat | 72.6 |
| GPT-4v (detail: low) | 71.3 |
| Qwen-VL-Chat | 68.5 |
| Monkey | 67.6 |
| GeminiProVision | 66.7 |
| QwenVLPlus | 63.7 |
| Qwen-VL | 63.4 |
| LLaVA-InternLM2-20B (QLoRA) | 61.4 |
| CogVLM-17B-Chat | 60.3 |
| ShareGPT4V-13B | 59.3 |
| TransCore-M | 59.2 |
| LLaVA-v1.5-13B (QLoRA) | 59 |
| LLaVA-v1.5-13B | 57.9 |
| ShareGPT4V-7B | 56.7 |
| InternLM-XComposer-VL | 56.1 |
| LLaVA-InternLM-7B (QLoRA) | 56 |
| LLaVA-v1.5-7B (QLoRA) | 55.2 |
| mPLUG-Owl2 | 55.2 |
| SharedCaptioner | 55.1 |
| IDEFICS-80B-Instruct | 54.4 |
| LLaVA-v1.5-7B | 54.1 |
| PandaGPT-13B | 49.2 |
| LLaVA-v1-7B | 47.8 |
| IDEFICS-9B-Instruct | 42.7 |
| InstructBLIP-7B | 40.2 |
| VisualGLM | 40.2 |
| InstructBLIP-13B | 38.6 |
| MiniGPT-4-v1-13B | 33.4 |
| OpenFlamingo v2 | 30.7 |
| MiniGPT-4-v2 | 29.4 |
| MiniGPT-4-v1-7B | 28.7 |
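The two-stage choice extraction described in the note above can be sketched as follows. The regular expression and the `ask_gpt` callable are illustrative assumptions rather than the toolkit's actual implementation; the point is only that heuristic matching is tried first and `GPT-3.5-Turbo` is consulted as a fallback.

```python
import re
from typing import Callable, Dict, Optional

def heuristic_choice(prediction: str, options: Dict[str, str]) -> Optional[str]:
    """Try to recover an option letter from the raw generation (assumed heuristic)."""
    match = re.search(r"\b([A-D])\b", prediction)
    if match and match.group(1) in options:
        return match.group(1)
    # Otherwise look for an option's text quoted verbatim in the answer.
    for letter, text in options.items():
        if text.lower() in prediction.lower():
            return letter
    return None

def extract_choice(prediction: str,
                   options: Dict[str, str],
                   ask_gpt: Callable[[str, Dict[str, str]], str]) -> str:
    choice = heuristic_choice(prediction, options)
    if choice is None:
        # Only when the heuristics fail is the LLM asked to map the
        # free-form answer onto one of the given options.
        choice = ask_gpt(prediction, options)
    return choice
```
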
58 changes: 30 additions & 28 deletions results/Caption.md
@@ -10,34 +10,36 @@
### Evaluation Results

| Model | BLEU-4 | BLEU-1 | ROUGE-L | CIDEr | Word_cnt mean. | Word_cnt std. |
|:------------------------------|---------:|---------:|----------:|--------:|-----------------:|----------------:|
| Qwen-VL-Chat | 34 | 75.8 | 54.9 | 98.9 | 10 | 1.7 |
| IDEFICS-80B-Instruct | 32.5 | 76.1 | 54.1 | 94.9 | 9.7 | 3.2 |
| IDEFICS-9B-Instruct | 29.4 | 72.7 | 53.4 | 90.4 | 10.5 | 4.4 |
| InstructBLIP-7B | 20.9 | 56.8 | 39.9 | 58.1 | 11.6 | 5.9 |
| InstructBLIP-13B | 16.9 | 50 | 37 | 52.4 | 11.8 | 12.8 |
| InternLM-XComposer-VL | 12.4 | 38.3 | 37.9 | 41 | 26.3 | 22.2 |
| TransCore-M | 8.8 | 30.3 | 36.1 | 34.7 | 39.9 | 27.9 |
| GeminiProVision | 8.4 | 33.2 | 31.2 | 9.7 | 35.2 | 15.7 |
| LLaVA-v1.5-7B (QLoRA, XTuner) | 7.2 | 25 | 36.6 | 43.2 | 48.8 | 42.9 |
| mPLUG-Owl2 | 7.1 | 25.8 | 33.6 | 35 | 45.8 | 32.1 |
| LLaVA-v1-7B | 6.7 | 27.3 | 26.7 | 6.1 | 40.9 | 16.1 |
| VisualGLM | 5.4 | 28.6 | 23.6 | 0.2 | 41.5 | 11.5 |
| LLaVA-v1.5-13B (QLoRA, XTuner) | 5.3 | 19.6 | 25.8 | 17.8 | 72.2 | 39.4 |
| LLaVA-v1.5-13B | 5.1 | 20.7 | 21.2 | 0.3 | 70.6 | 22.3 |
| LLaVA-v1.5-7B | 4.6 | 19.6 | 19.9 | 0.1 | 72.5 | 21.7 |
| PandaGPT-13B | 4.6 | 19.9 | 19.3 | 0.1 | 65.4 | 16.6 |
| MiniGPT-4-v1-13B | 4.4 | 20 | 19.8 | 1.3 | 64.4 | 30.5 |
| MiniGPT-4-v1-7B | 4.3 | 19.6 | 17.5 | 0.8 | 61.9 | 30.6 |
| LLaVA-InternLM-7B (QLoRA) | 4 | 17.3 | 17.2 | 0.1 | 82.3 | 21 |
| CogVLM-17B-Chat | 3.6 | 21.3 | 20 | 0.1 | 56.2 | 13.7 |
| Qwen-VL | 3.5 | 11.6 | 30 | 41.1 | 46.6 | 105.2 |
| GPT-4v (detail: low) | 3.3 | 18 | 18.1 | 0 | 77.8 | 20.4 |
| ShareGPT4V-7B | 1.4 | 9.7 | 10.6 | 0.1 | 147.9 | 45.4 |
| MiniGPT-4-v2 | 1.4 | 12.6 | 13.3 | 0.1 | 83 | 27.1 |
| OpenFlamingo v2 | 1.3 | 6.4 | 15.8 | 14.9 | 60 | 81.9 |
| SharedCaptioner | 1 | 8.8 | 9.2 | 0 | 164.2 | 31.6 |
| Model | BLEU-4 | BLEU-1 | ROUGE-L | CIDEr | Word_cnt mean. | Word_cnt std. |
|:----------------------------|---------:|---------:|----------:|--------:|-----------------:|----------------:|
| EMU2-Chat | 38.7 | 78.2 | 56.9 | 109.2 | 9.6 | 1.1 |
| Qwen-VL-Chat | 34 | 75.8 | 54.9 | 98.9 | 10 | 1.7 |
| IDEFICS-80B-Instruct | 32.5 | 76.1 | 54.1 | 94.9 | 9.7 | 3.2 |
| IDEFICS-9B-Instruct | 29.4 | 72.7 | 53.4 | 90.4 | 10.5 | 4.4 |
| InstructBLIP-7B | 20.9 | 56.8 | 39.9 | 58.1 | 11.6 | 5.9 |
| InstructBLIP-13B | 16.9 | 50 | 37 | 52.4 | 11.8 | 12.8 |
| InternLM-XComposer-VL | 12.4 | 38.3 | 37.9 | 41 | 26.3 | 22.2 |
| GeminiProVision | 8.4 | 33.2 | 31.2 | 9.7 | 35.2 | 15.7 |
| LLaVA-v1.5-7B (QLoRA) | 7.2 | 25 | 36.6 | 43.2 | 48.8 | 42.9 |
| mPLUG-Owl2 | 7.1 | 25.8 | 33.6 | 35 | 45.8 | 32.1 |
| LLaVA-v1-7B | 6.7 | 27.3 | 26.7 | 6.1 | 40.9 | 16.1 |
| VisualGLM | 5.4 | 28.6 | 23.6 | 0.2 | 41.5 | 11.5 |
| LLaVA-v1.5-13B (QLoRA) | 5.3 | 19.6 | 25.8 | 17.8 | 72.2 | 39.4 |
| LLaVA-v1.5-13B | 5.1 | 20.7 | 21.2 | 0.3 | 70.6 | 22.3 |
| LLaVA-v1.5-7B | 4.6 | 19.6 | 19.9 | 0.1 | 72.5 | 21.7 |
| PandaGPT-13B | 4.6 | 19.9 | 19.3 | 0.1 | 65.4 | 16.6 |
| MiniGPT-4-v1-13B | 4.4 | 20 | 19.8 | 1.3 | 64.4 | 30.5 |
| MiniGPT-4-v1-7B | 4.3 | 19.6 | 17.5 | 0.8 | 61.9 | 30.6 |
| LLaVA-InternLM-7B (QLoRA) | 4 | 17.3 | 17.2 | 0.1 | 82.3 | 21 |
| LLaVA-InternLM2-20B (QLoRA) | 4 | 17.9 | 17.3 | 0 | 83.2 | 20.4 |
| CogVLM-17B-Chat | 3.6 | 21.3 | 20 | 0.1 | 56.2 | 13.7 |
| Qwen-VL | 3.5 | 11.6 | 30 | 41.1 | 46.6 | 105.2 |
| GPT-4v (detail: low) | 3.3 | 18 | 18.1 | 0 | 77.8 | 20.4 |
| TransCore-M | 2.1 | 14.2 | 13.8 | 0.2 | 92 | 6.7 |
| ShareGPT4V-7B | 1.4 | 9.7 | 10.6 | 0.1 | 147.9 | 45.4 |
| MiniGPT-4-v2 | 1.4 | 12.6 | 13.3 | 0.1 | 83 | 27.1 |
| OpenFlamingo v2 | 1.3 | 6.4 | 15.8 | 14.9 | 60 | 81.9 |
| SharedCaptioner | 1 | 8.8 | 9.2 | 0 | 164.2 | 31.6 |

We noticed that VLMs generating long image descriptions tend to achieve inferior scores across the different caption metrics.

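The effect can be reproduced with a toy clipped-unigram-precision computation, the building block of BLEU (a simplified stand-in for the metrics above, not the evaluation code itself): reference captions are short, so every generated word the reference does not contain drags precision down.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision over whitespace tokens (toy illustration)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    clipped = sum(min(count, ref[token]) for token, count in cand.items())
    return clipped / sum(cand.values())

reference = "a dog runs on the beach"
short_caption = "a dog running on the beach"
long_caption = ("the photo shows a small brown dog that appears to be running "
                "happily along a sandy beach next to the ocean on a sunny day")

print(round(unigram_precision(short_caption, reference), 2))  # 0.83
print(round(unigram_precision(long_caption, reference), 2))   # 0.2
```

The same pattern shows up in the table above: the models with the largest mean word counts sit near the bottom on BLEU and CIDEr.
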
61 changes: 33 additions & 28 deletions results/HallusionBench.md
@@ -29,31 +29,36 @@
> Models are sorted in descending order of **qAcc**.

| Model | aAcc | fAcc | qAcc |
|:------------------------------|-------:|-------:|-------:|
| GPT-4v (detail: low) | 65.8 | 38.4 | 35.2 |
| GeminiProVision | 63.9 | 37.3 | 34.3 |
| Qwen-VL-Chat | 56.4 | 27.7 | 26.4 |
| MiniGPT-4-v1-7B | 52.4 | 17.3 | 25.9 |
| CogVLM-17B-Chat | 55.1 | 26.3 | 24.8 |
| InternLM-XComposer-VL | 57 | 26.3 | 24.6 |
| MiniGPT-4-v1-13B | 51.3 | 16.2 | 24.6 |
| SharedCaptioner | 55.6 | 22.8 | 24.2 |
| MiniGPT-4-v2 | 52.6 | 16.5 | 21.1 |
| InstructBLIP-7B | 53.6 | 20.2 | 19.8 |
| Qwen-VL | 57.6 | 12.4 | 19.6 |
| OpenFlamingo v2 | 52.7 | 17.6 | 18 |
| mPLUG-Owl2 | 48.9 | 22.5 | 16.7 |
| VisualGLM | 47.2 | 11.3 | 16.5 |
| IDEFICS-9B-Instruct | 50.1 | 16.2 | 15.6 |
| ShareGPT4V-7B | 48.2 | 21.7 | 15.6 |
| LLaVA-InternLM-7B (QLoRA) | 49.1 | 22.3 | 15.4 |
| InstructBLIP-13B | 47.9 | 17.3 | 15.2 |
| LLaVA-v1.5-7B | 48.3 | 19.9 | 14.1 |
| LLaVA-v1.5-13B (QLoRA, XTuner) | 46.9 | 17.6 | 14.1 |
| LLaVA-v1.5-7B (QLoRA, XTuner) | 46.2 | 16.2 | 13.2 |
| LLaVA-v1.5-13B | 46.7 | 17.3 | 13 |
| IDEFICS-80B-Instruct | 46.1 | 13.3 | 11 |
| TransCore-M | 44.7 | 16.5 | 10.1 |
| LLaVA-v1-7B | 44.1 | 13.6 | 9.5 |
| PandaGPT-13B | 43.1 | 9.2 | 7.7 |
| Model | aAcc | fAcc | qAcc |
|:----------------------------|-------:|-------:|-------:|
| GPT-4v (detail: low) | 65.8 | 38.4 | 35.2 |
| GeminiProVision | 63.9 | 37.3 | 34.3 |
| Monkey-Chat | 58.4 | 30.6 | 29 |
| Qwen-VL-Chat | 56.4 | 27.7 | 26.4 |
| MiniGPT-4-v1-7B | 52.4 | 17.3 | 25.9 |
| Monkey | 55.1 | 24 | 25.5 |
| CogVLM-17B-Chat | 55.1 | 26.3 | 24.8 |
| MiniGPT-4-v1-13B | 51.3 | 16.2 | 24.6 |
| InternLM-XComposer-VL | 57 | 26.3 | 24.6 |
| SharedCaptioner | 55.6 | 22.8 | 24.2 |
| MiniGPT-4-v2 | 52.6 | 16.5 | 21.1 |
| InstructBLIP-7B | 53.6 | 20.2 | 19.8 |
| Qwen-VL | 57.6 | 12.4 | 19.6 |
| OpenFlamingo v2 | 52.7 | 17.6 | 18 |
| EMU2-Chat | 49.4 | 22.3 | 16.9 |
| mPLUG-Owl2 | 48.9 | 22.5 | 16.7 |
| ShareGPT4V-13B | 49.8 | 21.7 | 16.7 |
| VisualGLM | 47.2 | 11.3 | 16.5 |
| TransCore-M | 49.7 | 21.4 | 15.8 |
| IDEFICS-9B-Instruct | 50.1 | 16.2 | 15.6 |
| ShareGPT4V-7B | 48.2 | 21.7 | 15.6 |
| LLaVA-InternLM-7B (QLoRA) | 49.1 | 22.3 | 15.4 |
| InstructBLIP-13B | 47.9 | 17.3 | 15.2 |
| LLaVA-InternLM2-20B (QLoRA) | 47.7 | 17.1 | 14.3 |
| LLaVA-v1.5-13B (QLoRA) | 46.9 | 17.6 | 14.1 |
| LLaVA-v1.5-7B | 48.3 | 19.9 | 14.1 |
| LLaVA-v1.5-7B (QLoRA) | 46.2 | 16.2 | 13.2 |
| LLaVA-v1.5-13B | 46.7 | 17.3 | 13 |
| IDEFICS-80B-Instruct | 46.1 | 13.3 | 11 |
| LLaVA-v1-7B | 44.1 | 13.6 | 9.5 |
| PandaGPT-13B | 43.1 | 9.2 | 7.7 |