[Result] Update Evaluation Results (#60)
* update MME, SEEDBench

* update results

* update LLaVABench

* fix

* update AI2D accuracy

* update LLaVABench

* update README

* update teaser link
kennymckormick authored Jan 22, 2024
1 parent e992046 commit 493a7e8
Showing 12 changed files with 363 additions and 244 deletions.
11 changes: 5 additions & 6 deletions README.md
@@ -1,4 +1,4 @@
![LOGO](https://github-production-user-asset-6210df.s3.amazonaws.com/34324155/295443340-a300f073-4995-48a5-af94-495141606cf7.jpg)
![LOGO](http://opencompass.openxlab.space/utils/MMLB.jpg)
<div align="center"><b>A Toolkit for Evaluating Large Vision-Language Models. </b></div>
<div align="center"><br>
<a href="https://opencompass.org.cn/leaderboard-multimodal">🏆 Learderboard </a> •
@@ -9,16 +9,15 @@
<a href="#%EF%B8%8F-citation">🖊️Citation </a>
<br><br>
</div>

**VLMEvalKit** (the Python package name is **vlmeval**) is an **open-source evaluation toolkit** for **large vision-language models (LVLMs)**. It enables **one-command evaluation** of LVLMs on various benchmarks, without the heavy workload of data preparation across multiple repositories. In VLMEvalKit, we adopt **generation-based evaluation** for all LVLMs (answers are obtained via the `generate` / `chat` interface), and provide evaluation results obtained with both **exact matching** and **LLM (ChatGPT)-based answer extraction**.

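The generation-based protocol can be pictured with a short sketch. This is **not** the vlmeval API: `model`, `dataset`, and `llm_extractor` below are hypothetical stand-ins used only to illustrate the flow of obtaining an answer via `generate`, scoring it by exact matching, and falling back to LLM-based answer extraction.

```python
# A minimal sketch of generation-based evaluation, assuming hypothetical
# `model`, `dataset`, and `llm_extractor` objects (not vlmeval's actual
# interfaces).

def exact_match(prediction: str, answer: str) -> bool:
    """Exact matching: normalise both strings and compare them verbatim."""
    return prediction.strip().lower() == answer.strip().lower()

def evaluate(model, dataset, llm_extractor=None) -> float:
    hits = 0
    for sample in dataset:  # each sample carries an image, a question, an answer
        prediction = model.generate(sample["image"], sample["question"])
        if exact_match(prediction, sample["answer"]):
            hits += 1
        elif llm_extractor is not None:
            # Fallback: ask an LLM (ChatGPT) to extract the intended answer
            # from the free-form generation before scoring it again.
            extracted = llm_extractor(sample["question"], prediction)
            hits += int(exact_match(extracted, sample["answer"]))
    return hits / len(dataset)
```

The fallback matters because raw exact matching alone can understate a model's ability when it answers correctly but verbosely, which is why both sets of numbers are reported.
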
## 🆕 News

- **[2024-01-14]** We have supported [**LLaVABench (in-the-wild)**](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild).
- **[2024-01-21]** We have updated results for [**LLaVABench (in-the-wild)**](/results/LLaVABench.md) and [**AI2D**](/results/AI2D.md).
- **[2024-01-14]** We have supported [**AI2D**](https://allenai.org/data/diagrams) and provided the [**script**](/scripts/AI2D_preproc.ipynb) for data pre-processing. 🔥🔥🔥
- **[2024-01-13]** We have supported [**EMU2 / EMU2-Chat**](https://github.com/baaivision/Emu) and [**DocVQA**](https://www.docvqa.org). 🔥🔥🔥
- **[2024-01-11]** We have supported [**Monkey**](https://github.com/Yuliang-Liu/Monkey). 🔥🔥🔥
- **[2024-01-09]** The performance numbers on our official multi-modal leaderboards can be downloaded in json files: [**MMBench Leaderboard**](http://opencompass.openxlab.space/utils/MMBench.json), [**OpenCompass Multi-Modal Leaderboard**](http://opencompass.openxlab.space/utils/MMLB.json). We also add a [notebook](scripts/visualize.ipynb) to visualize these results.🔥🔥🔥
- **[2024-01-09]** The performance numbers on our official multi-modal leaderboards can be downloaded as JSON files: [**MMBench Leaderboard**](http://opencompass.openxlab.space/utils/MMBench.json), [**OpenCompass Multi-Modal Leaderboard**](http://opencompass.openxlab.space/utils/MMLB.json). We also added a [**notebook**](scripts/visualize.ipynb) to visualize these results; a minimal download sketch follows this news list. 🔥🔥🔥
- **[2024-01-03]** We support **ScienceQA (Img)** (Dataset Name: ScienceQA_[VAL/TEST], [**eval results**](results/ScienceQA.md)), **HallusionBench** (Dataset Name: HallusionBench, [**eval results**](/results/HallusionBench.md)), and **MathVista** (Dataset Name: MathVista_MINI, [**eval results**](/results/MathVista.md)). 🔥🔥🔥
- **[2023-12-31]** We release the [**preliminary results**](/results/VQA.md) of three VQA datasets (**OCRVQA**, **TextVQA**, **ChartQA**). The results are obtained by exact matching and may not faithfully reflect the real performance of VLMs on the corresponding tasks.

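A minimal sketch of pulling one of the leaderboard files listed above, using only the Python standard library. The schema of the JSON is not assumed here; the snippet simply downloads the file and peeks at whatever top-level structure it holds.

```python
import json
import urllib.request

# URL taken from the news item above; the file's layout is not assumed,
# so we only load it and report its top-level shape.
URL = "http://opencompass.openxlab.space/utils/MMBench.json"

with urllib.request.urlopen(URL) as resp:
    leaderboard = json.load(resp)

print(type(leaderboard))
if isinstance(leaderboard, dict):
    print(list(leaderboard.keys())[:10])  # peek at the first few keys
```
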
@@ -46,9 +45,9 @@
| [**OCRVQA**](https://ocr-vqa.github.io) | OCRVQA_[TESTCORE/TEST] ||| [**VQA**](/results/VQA.md) |
| [**TextVQA**](https://textvqa.org) | TextVQA_VAL ||| [**VQA**](/results/VQA.md) |
| [**ChartQA**](https://github.com/vis-nlp/ChartQA) | ChartQA_VALTEST_HUMAN ||| [**VQA**](/results/VQA.md) |
| [**AI2D**](https://allenai.org/data/diagrams) | AI2D ||| [**AI2D**](/results/AI2D.md) |
| [**LLaVABench**](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) | LLaVABench ||| [**LLaVABench**](/results/LLaVABench.md) |
| [**DocVQA**](https://www.docvqa.org) | DocVQA_VAL ||| |
| [**AI2D**](https://allenai.org/data/diagrams) | AI2D ||| |
| [**LLaVABench**](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) | LLaVABench ||| |
| [**Core-MM**](https://github.com/core-mm/core-mm) | CORE_MM || | |

**Supported API Models**
39 changes: 39 additions & 0 deletions results/AI2D.md
@@ -0,0 +1,39 @@
# AI2D Evaluation Results

> During evaluation, we use `GPT-3.5-Turbo-0613` as the choice extractor for all VLMs whenever the choice cannot be extracted via heuristic matching (a sketch of this two-stage extraction follows the table below). **Zero-shot** inference is adopted.

## AI2D Accuracy

| Model | overall |
|:----------------------------|----------:|
| Monkey-Chat | 72.6 |
| GPT-4v (detail: low) | 71.3 |
| Qwen-VL-Chat | 68.5 |
| Monkey | 67.6 |
| GeminiProVision | 66.7 |
| QwenVLPlus | 63.7 |
| Qwen-VL | 63.4 |
| LLaVA-InternLM2-20B (QLoRA) | 61.4 |
| CogVLM-17B-Chat | 60.3 |
| ShareGPT4V-13B | 59.3 |
| TransCore-M | 59.2 |
| LLaVA-v1.5-13B (QLoRA) | 59 |
| LLaVA-v1.5-13B | 57.9 |
| ShareGPT4V-7B | 56.7 |
| InternLM-XComposer-VL | 56.1 |
| LLaVA-InternLM-7B (QLoRA) | 56 |
| LLaVA-v1.5-7B (QLoRA) | 55.2 |
| mPLUG-Owl2 | 55.2 |
| SharedCaptioner | 55.1 |
| IDEFICS-80B-Instruct | 54.4 |
| LLaVA-v1.5-7B | 54.1 |
| PandaGPT-13B | 49.2 |
| LLaVA-v1-7B | 47.8 |
| IDEFICS-9B-Instruct | 42.7 |
| InstructBLIP-7B | 40.2 |
| VisualGLM | 40.2 |
| InstructBLIP-13B | 38.6 |
| MiniGPT-4-v1-13B | 33.4 |
| OpenFlamingo v2 | 30.7 |
| MiniGPT-4-v2 | 29.4 |
| MiniGPT-4-v1-7B | 28.7 |
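The two-stage choice extraction described in the note above can be sketched as follows. The regular expression and the `ask_gpt` callable are illustrative assumptions rather than the toolkit's actual implementation; the point is only that heuristic matching is tried first and `GPT-3.5-Turbo` is consulted as a fallback.

```python
import re
from typing import Callable, Dict, Optional

def heuristic_choice(prediction: str, options: Dict[str, str]) -> Optional[str]:
    """Try to recover an option letter from the raw generation (assumed heuristic)."""
    match = re.search(r"\b([A-D])\b", prediction)
    if match and match.group(1) in options:
        return match.group(1)
    # Otherwise look for an option's text quoted verbatim in the answer.
    for letter, text in options.items():
        if text.lower() in prediction.lower():
            return letter
    return None

def extract_choice(prediction: str,
                   options: Dict[str, str],
                   ask_gpt: Callable[[str, Dict[str, str]], str]) -> str:
    choice = heuristic_choice(prediction, options)
    if choice is None:
        # Only when the heuristics fail is the LLM asked to map the
        # free-form answer onto one of the given options.
        choice = ask_gpt(prediction, options)
    return choice
```
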
58 changes: 30 additions & 28 deletions results/Caption.md
@@ -10,34 +10,36 @@
### Evaluation Results

| Model | BLEU-4 | BLEU-1 | ROUGE-L | CIDEr | Word_cnt mean. | Word_cnt std. |
|:------------------------------|---------:|---------:|----------:|--------:|-----------------:|----------------:|
| Qwen-VL-Chat | 34 | 75.8 | 54.9 | 98.9 | 10 | 1.7 |
| IDEFICS-80B-Instruct | 32.5 | 76.1 | 54.1 | 94.9 | 9.7 | 3.2 |
| IDEFICS-9B-Instruct | 29.4 | 72.7 | 53.4 | 90.4 | 10.5 | 4.4 |
| InstructBLIP-7B | 20.9 | 56.8 | 39.9 | 58.1 | 11.6 | 5.9 |
| InstructBLIP-13B | 16.9 | 50 | 37 | 52.4 | 11.8 | 12.8 |
| InternLM-XComposer-VL | 12.4 | 38.3 | 37.9 | 41 | 26.3 | 22.2 |
| TransCore-M | 8.8 | 30.3 | 36.1 | 34.7 | 39.9 | 27.9 |
| GeminiProVision | 8.4 | 33.2 | 31.2 | 9.7 | 35.2 | 15.7 |
| LLaVA-v1.5-7B (QLoRA, XTuner) | 7.2 | 25 | 36.6 | 43.2 | 48.8 | 42.9 |
| mPLUG-Owl2 | 7.1 | 25.8 | 33.6 | 35 | 45.8 | 32.1 |
| LLaVA-v1-7B | 6.7 | 27.3 | 26.7 | 6.1 | 40.9 | 16.1 |
| VisualGLM | 5.4 | 28.6 | 23.6 | 0.2 | 41.5 | 11.5 |
| LLaVA-v1.5-13B (QLoRA, XTuner) | 5.3 | 19.6 | 25.8 | 17.8 | 72.2 | 39.4 |
| LLaVA-v1.5-13B | 5.1 | 20.7 | 21.2 | 0.3 | 70.6 | 22.3 |
| LLaVA-v1.5-7B | 4.6 | 19.6 | 19.9 | 0.1 | 72.5 | 21.7 |
| PandaGPT-13B | 4.6 | 19.9 | 19.3 | 0.1 | 65.4 | 16.6 |
| MiniGPT-4-v1-13B | 4.4 | 20 | 19.8 | 1.3 | 64.4 | 30.5 |
| MiniGPT-4-v1-7B | 4.3 | 19.6 | 17.5 | 0.8 | 61.9 | 30.6 |
| LLaVA-InternLM-7B (QLoRA) | 4 | 17.3 | 17.2 | 0.1 | 82.3 | 21 |
| CogVLM-17B-Chat | 3.6 | 21.3 | 20 | 0.1 | 56.2 | 13.7 |
| Qwen-VL | 3.5 | 11.6 | 30 | 41.1 | 46.6 | 105.2 |
| GPT-4v (detail: low) | 3.3 | 18 | 18.1 | 0 | 77.8 | 20.4 |
| ShareGPT4V-7B | 1.4 | 9.7 | 10.6 | 0.1 | 147.9 | 45.4 |
| MiniGPT-4-v2 | 1.4 | 12.6 | 13.3 | 0.1 | 83 | 27.1 |
| OpenFlamingo v2 | 1.3 | 6.4 | 15.8 | 14.9 | 60 | 81.9 |
| SharedCaptioner | 1 | 8.8 | 9.2 | 0 | 164.2 | 31.6 |
| Model | BLEU-4 | BLEU-1 | ROUGE-L | CIDEr | Word_cnt mean. | Word_cnt std. |
|:----------------------------|---------:|---------:|----------:|--------:|-----------------:|----------------:|
| EMU2-Chat | 38.7 | 78.2 | 56.9 | 109.2 | 9.6 | 1.1 |
| Qwen-VL-Chat | 34 | 75.8 | 54.9 | 98.9 | 10 | 1.7 |
| IDEFICS-80B-Instruct | 32.5 | 76.1 | 54.1 | 94.9 | 9.7 | 3.2 |
| IDEFICS-9B-Instruct | 29.4 | 72.7 | 53.4 | 90.4 | 10.5 | 4.4 |
| InstructBLIP-7B | 20.9 | 56.8 | 39.9 | 58.1 | 11.6 | 5.9 |
| InstructBLIP-13B | 16.9 | 50 | 37 | 52.4 | 11.8 | 12.8 |
| InternLM-XComposer-VL | 12.4 | 38.3 | 37.9 | 41 | 26.3 | 22.2 |
| GeminiProVision | 8.4 | 33.2 | 31.2 | 9.7 | 35.2 | 15.7 |
| LLaVA-v1.5-7B (QLoRA) | 7.2 | 25 | 36.6 | 43.2 | 48.8 | 42.9 |
| mPLUG-Owl2 | 7.1 | 25.8 | 33.6 | 35 | 45.8 | 32.1 |
| LLaVA-v1-7B | 6.7 | 27.3 | 26.7 | 6.1 | 40.9 | 16.1 |
| VisualGLM | 5.4 | 28.6 | 23.6 | 0.2 | 41.5 | 11.5 |
| LLaVA-v1.5-13B (QLoRA) | 5.3 | 19.6 | 25.8 | 17.8 | 72.2 | 39.4 |
| LLaVA-v1.5-13B | 5.1 | 20.7 | 21.2 | 0.3 | 70.6 | 22.3 |
| LLaVA-v1.5-7B | 4.6 | 19.6 | 19.9 | 0.1 | 72.5 | 21.7 |
| PandaGPT-13B | 4.6 | 19.9 | 19.3 | 0.1 | 65.4 | 16.6 |
| MiniGPT-4-v1-13B | 4.4 | 20 | 19.8 | 1.3 | 64.4 | 30.5 |
| MiniGPT-4-v1-7B | 4.3 | 19.6 | 17.5 | 0.8 | 61.9 | 30.6 |
| LLaVA-InternLM-7B (QLoRA) | 4 | 17.3 | 17.2 | 0.1 | 82.3 | 21 |
| LLaVA-InternLM2-20B (QLoRA) | 4 | 17.9 | 17.3 | 0 | 83.2 | 20.4 |
| CogVLM-17B-Chat | 3.6 | 21.3 | 20 | 0.1 | 56.2 | 13.7 |
| Qwen-VL | 3.5 | 11.6 | 30 | 41.1 | 46.6 | 105.2 |
| GPT-4v (detail: low) | 3.3 | 18 | 18.1 | 0 | 77.8 | 20.4 |
| TransCore-M | 2.1 | 14.2 | 13.8 | 0.2 | 92 | 6.7 |
| ShareGPT4V-7B | 1.4 | 9.7 | 10.6 | 0.1 | 147.9 | 45.4 |
| MiniGPT-4-v2 | 1.4 | 12.6 | 13.3 | 0.1 | 83 | 27.1 |
| OpenFlamingo v2 | 1.3 | 6.4 | 15.8 | 14.9 | 60 | 81.9 |
| SharedCaptioner | 1 | 8.8 | 9.2 | 0 | 164.2 | 31.6 |

We noticed that VLMs generating long image descriptions tend to achieve inferior scores across the different caption metrics.

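The effect can be reproduced with a toy clipped-unigram-precision computation, the building block of BLEU (a simplified stand-in for the metrics above, not the evaluation code itself): reference captions are short, so every generated word the reference does not contain drags precision down.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision over whitespace tokens (toy illustration)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    clipped = sum(min(count, ref[token]) for token, count in cand.items())
    return clipped / sum(cand.values())

reference = "a dog runs on the beach"
short_caption = "a dog running on the beach"
long_caption = ("the photo shows a small brown dog that appears to be running "
                "happily along a sandy beach next to the ocean on a sunny day")

print(round(unigram_precision(short_caption, reference), 2))  # 0.83
print(round(unigram_precision(long_caption, reference), 2))   # 0.2
```

The same pattern shows up in the table above: the models with the largest mean word counts sit near the bottom on BLEU and CIDEr.
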
61 changes: 33 additions & 28 deletions results/HallusionBench.md
@@ -29,31 +29,36 @@
> Models are sorted in descending order of **qAcc**.

| Model | aAcc | fAcc | qAcc |
|:------------------------------|-------:|-------:|-------:|
| GPT-4v (detail: low) | 65.8 | 38.4 | 35.2 |
| GeminiProVision | 63.9 | 37.3 | 34.3 |
| Qwen-VL-Chat | 56.4 | 27.7 | 26.4 |
| MiniGPT-4-v1-7B | 52.4 | 17.3 | 25.9 |
| CogVLM-17B-Chat | 55.1 | 26.3 | 24.8 |
| InternLM-XComposer-VL | 57 | 26.3 | 24.6 |
| MiniGPT-4-v1-13B | 51.3 | 16.2 | 24.6 |
| SharedCaptioner | 55.6 | 22.8 | 24.2 |
| MiniGPT-4-v2 | 52.6 | 16.5 | 21.1 |
| InstructBLIP-7B | 53.6 | 20.2 | 19.8 |
| Qwen-VL | 57.6 | 12.4 | 19.6 |
| OpenFlamingo v2 | 52.7 | 17.6 | 18 |
| mPLUG-Owl2 | 48.9 | 22.5 | 16.7 |
| VisualGLM | 47.2 | 11.3 | 16.5 |
| IDEFICS-9B-Instruct | 50.1 | 16.2 | 15.6 |
| ShareGPT4V-7B | 48.2 | 21.7 | 15.6 |
| LLaVA-InternLM-7B (QLoRA) | 49.1 | 22.3 | 15.4 |
| InstructBLIP-13B | 47.9 | 17.3 | 15.2 |
| LLaVA-v1.5-7B | 48.3 | 19.9 | 14.1 |
| LLaVA-v1.5-13B (QLoRA, XTuner) | 46.9 | 17.6 | 14.1 |
| LLaVA-v1.5-7B (QLoRA, XTuner) | 46.2 | 16.2 | 13.2 |
| LLaVA-v1.5-13B | 46.7 | 17.3 | 13 |
| IDEFICS-80B-Instruct | 46.1 | 13.3 | 11 |
| TransCore-M | 44.7 | 16.5 | 10.1 |
| LLaVA-v1-7B | 44.1 | 13.6 | 9.5 |
| PandaGPT-13B | 43.1 | 9.2 | 7.7 |
| Model | aAcc | fAcc | qAcc |
|:----------------------------|-------:|-------:|-------:|
| GPT-4v (detail: low) | 65.8 | 38.4 | 35.2 |
| GeminiProVision | 63.9 | 37.3 | 34.3 |
| Monkey-Chat | 58.4 | 30.6 | 29 |
| Qwen-VL-Chat | 56.4 | 27.7 | 26.4 |
| MiniGPT-4-v1-7B | 52.4 | 17.3 | 25.9 |
| Monkey | 55.1 | 24 | 25.5 |
| CogVLM-17B-Chat | 55.1 | 26.3 | 24.8 |
| MiniGPT-4-v1-13B | 51.3 | 16.2 | 24.6 |
| InternLM-XComposer-VL | 57 | 26.3 | 24.6 |
| SharedCaptioner | 55.6 | 22.8 | 24.2 |
| MiniGPT-4-v2 | 52.6 | 16.5 | 21.1 |
| InstructBLIP-7B | 53.6 | 20.2 | 19.8 |
| Qwen-VL | 57.6 | 12.4 | 19.6 |
| OpenFlamingo v2 | 52.7 | 17.6 | 18 |
| EMU2-Chat | 49.4 | 22.3 | 16.9 |
| mPLUG-Owl2 | 48.9 | 22.5 | 16.7 |
| ShareGPT4V-13B | 49.8 | 21.7 | 16.7 |
| VisualGLM | 47.2 | 11.3 | 16.5 |
| TransCore-M | 49.7 | 21.4 | 15.8 |
| IDEFICS-9B-Instruct | 50.1 | 16.2 | 15.6 |
| ShareGPT4V-7B | 48.2 | 21.7 | 15.6 |
| LLaVA-InternLM-7B (QLoRA) | 49.1 | 22.3 | 15.4 |
| InstructBLIP-13B | 47.9 | 17.3 | 15.2 |
| LLaVA-InternLM2-20B (QLoRA) | 47.7 | 17.1 | 14.3 |
| LLaVA-v1.5-13B (QLoRA) | 46.9 | 17.6 | 14.1 |
| LLaVA-v1.5-7B | 48.3 | 19.9 | 14.1 |
| LLaVA-v1.5-7B (QLoRA) | 46.2 | 16.2 | 13.2 |
| LLaVA-v1.5-13B | 46.7 | 17.3 | 13 |
| IDEFICS-80B-Instruct | 46.1 | 13.3 | 11 |
| LLaVA-v1-7B | 44.1 | 13.6 | 9.5 |
| PandaGPT-13B | 43.1 | 9.2 | 7.7 |