# [Result] Update XTuner Performance (open-compass#31)
* update report_missing

* update results

* update
kennymckormick authored Dec 28, 2023 · 1 parent 3601f1b · commit f1d0ce4
Showing 5 changed files with 110 additions and 100 deletions.
## results/MME.md (27 additions, 25 deletions)
@@ -8,38 +8,40 @@

In each cell, we list `vanilla score / ChatGPT Answer Extraction Score` if the two scores differ; otherwise a single score is listed.

VLMs are sorted in descending order of Total score.

**Before:**

| Model                | Total       | perception  | reasoning |
| :------------------- | ----------: | ----------: | --------: |
| Full | 2800 | 2000 | 800 |
| GeminiProVision | 2131 / 2149 | 1601 / 1609 | 530 / 540 |
| XComposer | 1874 | 1497 | 377 |
| qwen_chat | 1849 / 1860 | 1457 / 1468 | 392 |
| sharegpt4v_7b | 1799 / 1808 | 1491 | 308 / 318 |
| llava_v1.5_13b | 1800 / 1805 | 1485 / 1490 | 315 |
| mPLUG-Owl2 | 1781 / 1786 | 1435 / 1436 | 346 / 350 |
| llava_v1.5_7b | 1775 | 1490 | 285 |
| GPT-4v (detail: low) | 1737 / 1771 | 1300 / 1334 | 437 |
| TransCore_M | 1682 / 1701 | 1427 / 1429 | 254 / 272 |
| instructblip_13b | 1624 / 1646 | 1381 / 1383 | 243 / 263 |
| idefics_80b_instruct | 1507 / 1519 | 1276 / 1285 | 231 / 234 |
| instructblip_7b | 1313 / 1391 | 1084 / 1137 | 229 / 254 |
| idefics_9b_instruct | 1177 | 942 | 235 |
| PandaGPT_13B | 1072 | 826 | 246 |
| MiniGPT-4-v1-13B | 648 / 1067 | 533 / 794 | 115 / 273 |
| MiniGPT-4-v1-7B | 806 / 1048 | 622 / 771 | 184 / 277 |
| llava_v1_7b | 1027 / 1044 | 793 / 807 | 234 / 238 |
| MiniGPT-4-v2 | 968 | 708 | 260 |
| VisualGLM_6b | 738 | 628 | 110 |
| flamingov2 | 607 | 535 | 72 |
| qwen_base | 6 / 483 | 0 / 334 | 6 / 149 |
**After:**

| Model                         | Total       | Perception   | Reasoning   |
|:------------------------------|:------------|:-------------|:------------|
| GeminiProVision | 2131 / 2149 | 1601 / 1609 | 530 / 540 |
| InternLM-XComposer-VL | 1874 | 1497 | 377 |
| Qwen-VL-Chat | 1849 / 1860 | 1457 / 1468 | 392 |
| ShareGPT4V-7B | 1799 / 1808 | 1491 | 308 / 317 |
| LLaVA-v1.5-13B | 1800 / 1805 | 1485 / 1490 | 315 |
| mPLUG-Owl2 | 1781 / 1786 | 1435 / 1436 | 346 / 350 |
| LLaVA-v1.5-7B | 1775 | 1490 | 285 |
| GPT-4v (detail: low) | 1737 / 1771 | 1300 / 1334 | 437 |
| LLaVA-v1.5-13B (LoRA, XTuner) | 1766 | 1475 | 291 |
| LLaVA-v1.5-7B (LoRA, XTuner) | 1716 | 1434 | 282 |
| TransCore-M | 1681 / 1701 | 1427 / 1429 | 254 / 272 |
| InstructBLIP-13B              | 1624 / 1646 | 1381 / 1383  | 243 / 263   |
| LLaVA-InternLM-7B (LoRA) | 1637 | 1393 | 244 |
| IDEFICS-80B-Instruct | 1507 / 1519 | 1276 / 1285 | 231 / 234 |
| InstructBLIP-7B | 1313 / 1391 | 1084 / 1137 | 229 / 254 |
| IDEFICS-9B-Instruct | 1177 | 942 | 235 |
| PandaGPT-13B | 1072 | 826 | 246 |
| MiniGPT-4-v1-13B | 648 / 1067 | 533 / 794 | 115 / 273 |
| MiniGPT-4-v1-7B | 806 / 1048 | 622 / 771 | 184 / 277 |
| LLaVA-v1-7B | 1027 / 1044 | 793 / 807 | 234 / 237 |
| MiniGPT-4-v2 | 968 | 708 | 260 |
| VisualGLM | 738 | 628 | 110 |
| OpenFlamingo v2 | 607 | 535 | 72 |
| Qwen-VL | 6 / 483 | 0 / 334 | 6 / 149 |
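
For context, each MME subtask is scored as accuracy plus accuracy+ (the share of images whose two yes/no questions are both answered correctly), each in percent, so one subtask caps at 200; Perception sums ten such subtasks (max 2000), Reasoning sums four (max 800), and Total is their sum, matching the Full row above. Below is a minimal sketch of that aggregation, using an illustrative `(subtask, image_id, correct)` record format rather than any evaluator's actual schema:

```python
from collections import defaultdict

# Illustrative record format (not VLMEvalKit's actual schema):
# each record is (subtask, image_id, correct); every MME image
# carries two yes/no questions.
def mme_score(records, perception_tasks, reasoning_tasks):
    by_task = defaultdict(list)
    for subtask, image_id, correct in records:
        by_task[subtask].append((image_id, correct))

    def subtask_score(items):
        # accuracy: per-question, in percent
        acc = 100.0 * sum(ok for _, ok in items) / len(items)
        # accuracy+: per-image, both questions must be correct
        per_image = defaultdict(list)
        for image_id, ok in items:
            per_image[image_id].append(ok)
        acc_plus = 100.0 * sum(all(v) for v in per_image.values()) / len(per_image)
        return acc + acc_plus  # each subtask caps at 200

    perception = sum(subtask_score(by_task[t]) for t in perception_tasks)
    reasoning = sum(subtask_score(by_task[t]) for t in reasoning_tasks)
    return perception + reasoning, perception, reasoning
```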

### Comments

For most VLMs, using ChatGPT as the answer extractor does not significantly change the final score. For some, however, including InstructBLIP-7B, the MiniGPT-4-v1 models, and Qwen-VL, the improvement from ChatGPT answer extraction is substantial. The table below groups models by the score gap between the two extraction strategies; a sketch of the extraction logic follows it:

| MME Score Improvement with ChatGPT Answer Extractor | Models |
| ---------------------------------------------------- | ------------------------------------------------------------ |
| **No (0)**            | InternLM-XComposer-VL, LLaVA-v1.5-7B, IDEFICS-9B-Instruct, PandaGPT-13B, MiniGPT-4-v2, <br>VisualGLM, OpenFlamingo v2, LLaVA-XTuner Series |
| **Minor (1~20)**      | Qwen-VL-Chat (11), LLaVA-v1.5-13B (5), mPLUG-Owl2 (5), IDEFICS-80B-Instruct (12), LLaVA-v1-7B (17), <br>ShareGPT4V-7B (9), TransCore-M (20), GeminiProVision (18) |
| **Moderate (21~100)** | InstructBLIP-13B (22), InstructBLIP-7B (78), GPT-4v (34) |
| **Huge (> 100)**      | MiniGPT-4-v1-7B (242), MiniGPT-4-v1-13B (419), Qwen-VL (477) |
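
The gap comes from predictions that simple rules cannot parse: the vanilla score counts only rule-matched answers, while the extraction score first lets ChatGPT map free-form replies onto yes/no. Below is a minimal sketch of such a two-stage extractor, assuming an OpenAI-style client; the prompt wording and the `client.chat.completions.create` call are illustrative, not the evaluator's exact code:

```python
import re

def rule_based_extract(prediction: str):
    """First pass: accept the answer only if exactly one of yes/no appears."""
    text = prediction.strip().lower()
    has_yes = re.search(r"\byes\b", text) is not None
    has_no = re.search(r"\bno\b", text) is not None
    if has_yes != has_no:
        return "yes" if has_yes else "no"
    return None  # ambiguous or unparseable: defer to ChatGPT

def chatgpt_extract(client, question: str, prediction: str) -> str:
    """Fallback pass: ask ChatGPT to reduce a free-form reply to yes/no."""
    prompt = (
        "You are given a yes/no question and a model's free-form answer. "
        "Reply with exactly one word, 'yes' or 'no'.\n"
        f"Question: {question}\nAnswer: {prediction}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()

def extract_answer(client, question: str, prediction: str) -> str:
    return rule_based_extract(prediction) or chatgpt_extract(client, question, prediction)
```

Models that already reply with a bare "yes"/"no" (e.g. the LLaVA-XTuner series) never reach the fallback, which is why their gap is 0.
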
## results/MMMU.md (26 additions, 25 deletions)
@@ -11,28 +11,29 @@
### MMMU Scores

**Before:**

| Model                | Overall<br>(Val) | Art & Design<br>(Val) | Business<br>(Val) | Science<br>(Val) | Health & Medicine<br>(Val) | Humanities & Social Science<br>(Val) | Tech & Engineering<br>(Val) | Overall<br>(Dev) |
|:---------------------|-------------------:|------------------------:|--------------------:|-------------------:|-----------------------------:|---------------------------------------:|------------------------------:|-------------------:|
| GPT-4v | 53.8 | 66.7 | 60 | 46 | 54.7 | 71.7 | 36.7 | 52.7 |
| GeminiProVision | 48.4 | 59.2 | 36 | 42 | 52 | 66.7 | 42.9 | 54 |
| qwen_chat | 37.6 | 49.2 | 36 | 28 | 32.7 | 55.8 | 31.9 | 30 |
| llava_v1.5_13b | 36.8 | 49.2 | 23.3 | 36 | 34 | 51.7 | 33.3 | 42 |
| sharegpt4v_7b | 36.7 | 50 | 27.3 | 26.7 | 37.3 | 50 | 34.8 | 30 |
| TransCore_M | 36.6 | 54.2 | 32 | 27.3 | 32 | 49.2 | 32.4 | 38.7 |
| llava_v1.5_7b | 36.1 | 45.8 | 25.3 | 34 | 32 | 48.3 | 35.7 | 38.7 |
| XComposer | 35.7 | 45.8 | 28.7 | 22.7 | 30.7 | 53.3 | 37.6 | 36.7 |
| mPLUG-Owl2 | 34.6 | 47.5 | 26 | 21.3 | 37.3 | 50 | 31.9 | 40.7 |
| instructblip_13b | 32.9 | 37.5 | 29.3 | 32 | 28.7 | 37.5 | 33.8 | 30 |
| PandaGPT_13B | 32.7 | 42.5 | 35.3 | 30 | 29.3 | 45.8 | 21.9 | 26.7 |
| llava_v1_7b | 32.1 | 31.7 | 24.7 | 31.3 | 32 | 37.5 | 35.2 | 33.3 |
| instructblip_7b | 30.4 | 38.3 | 28 | 22 | 30.7 | 39.2 | 28.6 | 24 |
| VisualGLM_6b | 28.9 | 30 | 24 | 28 | 28 | 40.8 | 26.2 | 28.7 |
| qwen_base | 28.8 | 43.3 | 18.7 | 25.3 | 32.7 | 42.5 | 19.5 | 29.3 |
| flamingov2 | 28.2 | 27.5 | 30 | 28.7 | 28 | 33.3 | 24.3 | 21.3 |
| Frequent Choice | 26.8 | | | | | | | |
| MiniGPT-4-v1-13B | 26.2 | 33.3 | 19.3 | 28.7 | 26 | 34.2 | 21 | 23.3 |
| idefics_80b_instruct | 25.1 | 39.2 | 17.3 | 23.3 | 24 | 48.3 | 11.4 | 23.3 |
| MiniGPT-4-v2 | 24.6 | 27.5 | 22.7 | 21.3 | 28 | 33.3 | 19 | 32 |
| MiniGPT-4-v1-7B | 23 | 32.5 | 27.3 | 18.7 | 17.3 | 15 | 26.2 | 19.3 |
| Random Choice | 22.1 | | | | | | | |
| idefics_9b_instruct | 19.6 | 22.5 | 11.3 | 20.7 | 23.3 | 31.7 | 13.3 | 20 |
**After:**

| Model                         | Overall<br>(Val) | Art & Design<br>(Val) | Business<br>(Val) | Science<br>(Val) | Health & Medicine<br>(Val) | Humanities & Social Science<br>(Val) | Tech & Engineering<br>(Val) | Overall<br>(Dev) |
|:------------------------------|-------------------:|------------------------:|--------------------:|-------------------:|-----------------------------:|---------------------------------------:|------------------------------:|-------------------:|
| GPT-4v (detail: low) | 53.8 | 66.7 | 60 | 46 | 54.7 | 71.7 | 36.7 | 52.7 |
| GeminiProVision | 48.4 | 59.2 | 36 | 42 | 52 | 66.7 | 42.9 | 54 |
| Qwen-VL-Chat | 37.6 | 49.2 | 36 | 28 | 32.7 | 55.8 | 31.9 | 30 |
| LLaVA-InternLM-7B (LoRA) | 37 | 44.2 | 32 | 29.3 | 38.7 | 47.5 | 34.8 | 43.3 |
| LLaVA-v1.5-13B | 36.8 | 49.2 | 23.3 | 36 | 34 | 51.7 | 33.3 | 42 |
| ShareGPT4V-7B | 36.7 | 50 | 27.3 | 26.7 | 37.3 | 50 | 34.8 | 30 |
| TransCore-M | 36.6 | 54.2 | 32 | 27.3 | 32 | 49.2 | 32.4 | 38.7 |
| LLaVA-v1.5-7B | 36.1 | 45.8 | 25.3 | 34 | 32 | 48.3 | 35.7 | 38.7 |
| InternLM-XComposer-VL | 35.7 | 45.8 | 28.7 | 22.7 | 30.7 | 53.3 | 37.6 | 36.7 |
| LLaVA-v1.5-13B (LoRA, XTuner) | 35.1 | 40.8 | 30.7 | 26.7 | 35.3 | 45 | 35.2 | 43.3 |
| mPLUG-Owl2 | 34.6 | 47.5 | 26 | 21.3 | 37.3 | 50 | 31.9 | 40.7 |
| LLaVA-v1.5-7B (LoRA, XTuner) | 33.7 | 48.3 | 23.3 | 30 | 32.7 | 46.7 | 28.6 | 37.3 |
| InstructBLIP-13B              | 32.9 | 37.5 | 29.3 | 32 | 28.7 | 37.5 | 33.8 | 30 |
| PandaGPT-13B | 32.7 | 42.5 | 35.3 | 30 | 29.3 | 45.8 | 21.9 | 26.7 |
| LLaVA-v1-7B | 32.1 | 31.7 | 24.7 | 31.3 | 32 | 37.5 | 35.2 | 33.3 |
| InstructBLIP-7B | 30.4 | 38.3 | 28 | 22 | 30.7 | 39.2 | 28.6 | 24 |
| VisualGLM | 28.9 | 30 | 24 | 28 | 28 | 40.8 | 26.2 | 28.7 |
| Qwen-VL | 28.8 | 43.3 | 18.7 | 25.3 | 32.7 | 42.5 | 19.5 | 29.3 |
| OpenFlamingo v2 | 28.2 | 27.5 | 30 | 28.7 | 28 | 33.3 | 24.3 | 21.3 |
| MiniGPT-4-v1-13B | 26.2 | 33.3 | 19.3 | 28.7 | 26 | 34.2 | 21 | 23.3 |
| IDEFICS-80B-Instruct | 25.1 | 39.2 | 17.3 | 23.3 | 24 | 48.3 | 11.4 | 23.3 |
| MiniGPT-4-v2 | 24.6 | 27.5 | 22.7 | 21.3 | 28 | 33.3 | 19 | 32 |
| MiniGPT-4-v1-7B | 23 | 32.5 | 27.3 | 18.7 | 17.3 | 15 | 26.2 | 19.3 |
| IDEFICS-9B-Instruct | 19.6 | 22.5 | 11.3 | 20.7 | 23.3 | 31.7 | 13.3 | 20 |
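
The Overall and per-discipline columns are plain accuracies over the multiple-choice questions of each split. A minimal sketch of the aggregation, assuming illustrative `(split, category, correct)` records rather than the benchmark's actual file format:

```python
from collections import defaultdict

def mmmu_accuracy(records):
    """records: iterable of (split, category, correct),
    e.g. ("Val", "Art & Design", True)."""
    hits = defaultdict(int)   # (split, column) -> correct answers
    seen = defaultdict(int)   # (split, column) -> questions seen
    for split, category, correct in records:
        # each question counts toward its discipline and the split's Overall
        for column in ("Overall", category):
            hits[(split, column)] += int(correct)
            seen[(split, column)] += 1
    return {key: 100.0 * hits[key] / seen[key] for key in seen}

# usage: scores = mmmu_accuracy(recs); scores[("Val", "Overall")]
```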
(Diffs for the remaining 3 changed files are not shown.)
