- For MMMU, we support the evaluation of the `dev` (150 samples) and `validation` (900 samples) sets. Here we only report results on the `validation` set.
- Answer Inference:
  - For models with an `interleave_generate` interface (which accepts interleaved images & texts as input), all testing samples can be inferred, and `interleave_generate` is adopted for inference.
  - For models without an `interleave_generate` interface, samples with more than one image are skipped (42 out of 1050, directly counted as wrong), and `generate` is adopted for inference. The first sketch after this list illustrates the dispatch.
- Evaluation:
  - MMMU includes two types of questions: multi-choice questions & open-ended QA.
  - For open-ended QA (62/1050), we reformulate the question as a multi-choice one: `{'question': 'QQQ', 'answer': 'AAA'} -> {'question': 'QQQ', 'A': 'AAA', 'B': 'Other Answers', 'answer': 'A'}`, and then adopt the same evaluation paradigm as for multi-choice questions (second sketch below).
  - For multi-choice questions (988/1050), we use GPT-3.5-Turbo-0613 to match the prediction against the options when heuristic matching does not work (third sketch below).
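Below is a minimal sketch of the answer-inference dispatch, assuming each sample is a dict with hypothetical `question` and `image_paths` fields. Only the `interleave_generate` / `generate` method names and the skip rule come from the description above; the exact call signatures and input format are assumptions.

```python
def infer_sample(model, sample):
    """Return the model's raw prediction, or None if the sample is skipped."""
    prompt = sample['question']        # hypothetical field name
    images = sample['image_paths']     # hypothetical field name

    if hasattr(model, 'interleave_generate'):
        # Interleave-capable models accept mixed image & text inputs, so
        # every sample, including multi-image ones, can be inferred.
        return model.interleave_generate([*images, prompt])  # assumed input format

    if len(images) > 1:
        # Single-image models skip the 42/1050 multi-image samples,
        # which are directly counted as wrong.
        return None

    return model.generate(images[0], prompt)  # assumed signature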
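The open-ended reformulation is mechanical; this sketch reproduces exactly the transform shown in the example above.

```python
def to_multi_choice(record):
    """Rewrite an open-ended QA record as a two-option multi-choice record."""
    return {
        'question': record['question'],
        'A': record['answer'],      # the ground-truth answer becomes option A
        'B': 'Other Answers',       # catch-all distractor
        'answer': 'A',
    }

assert to_multi_choice({'question': 'QQQ', 'answer': 'AAA'}) == {
    'question': 'QQQ', 'A': 'AAA', 'B': 'Other Answers', 'answer': 'A'}
```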
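For multi-choice scoring, a two-stage matcher of the following shape is one way to realize the rule above: cheap heuristics first, with GPT-3.5-Turbo-0613 queried only when they fail. The specific regexes and the containment check are illustrative assumptions, not the toolkit's exact heuristics.

```python
import re

def match_choice(prediction, options):
    """Map a free-form prediction onto an option letter; None means the
    heuristics failed and GPT-3.5-Turbo-0613 should be queried instead."""
    # Heuristic 1: a parenthesized or leading option letter, e.g. "(B)" or "B.".
    m = re.search(r'\(([A-Z])\)', prediction) or re.match(r'\s*([A-Z])\b', prediction)
    if m and m.group(1) in options:
        return m.group(1)

    # Heuristic 2: exactly one option's text occurs verbatim in the prediction.
    hits = [c for c, text in options.items() if text.lower() in prediction.lower()]
    if len(hits) == 1:
        return hits[0]

    return None

assert match_choice('The answer is (B).', {'A': 'cat', 'B': 'dog'}) == 'B'
```

The table below reports the overall and per-discipline accuracies (%) on the MMMU `validation` set.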
| Model | Overall | Art & Design | Business | Science | Health & Medicine | Humanities & Social Science | Tech & Engineering |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4v (detail: low) | 53.8 | 67.5 | 59.3 | 46.0 | 54.7 | 70.8 | 37.1 |
| GeminiProVision | 48.9 | 59.2 | 36.7 | 42.7 | 52.0 | 66.7 | 43.8 |
| QwenVLPlus | 40.9 | 56.7 | 32.0 | 33.3 | 36.7 | 59.2 | 36.2 |
| Monkey-Chat | 40.7 | 50.0 | 34.0 | 43.3 | 36.0 | 51.7 | 35.2 |
| LLaVA-InternLM2-20B (QLoRA) | 39.4 | 52.5 | 30.0 | 34.7 | 40.0 | 54.2 | 33.3 |
| Monkey | 38.9 | 55.0 | 31.3 | 35.3 | 37.3 | 45.8 | 34.8 |
| CogVLM-17B-Chat | 37.3 | 51.7 | 34.0 | 36.0 | 35.3 | 41.7 | 31.4 |
| Qwen-VL-Chat | 37.0 | 49.2 | 35.3 | 28.0 | 31.3 | 54.2 | 31.9 |
| LLaVA-v1.5-13B | 36.9 | 49.2 | 24.0 | 37.3 | 33.3 | 50.8 | 33.3 |
| LLaVA-InternLM-7B (QLoRA) | 36.9 | 44.2 | 32.0 | 29.3 | 38.7 | 46.7 | 34.8 |
| ShareGPT4V-7B | 36.6 | 50.0 | 28.7 | 26.0 | 37.3 | 49.2 | 34.3 |
| TransCore-M | 36.4 | 45.8 | 33.3 | 28.7 | 38.0 | 51.7 | 29.0 |
| SharedCaptioner | 36.3 | 44.2 | 28.7 | 29.3 | 37.3 | 45.8 | 36.2 |
| LLaVA-v1.5-7B | 36.2 | 45.8 | 26.0 | 34.0 | 32.7 | 47.5 | 35.7 |
| InternLM-XComposer-VL | 35.6 | 45.8 | 28.7 | 22.7 | 30.7 | 52.5 | 37.6 |
| LLaVA-v1.5-13B (QLoRA) | 35.2 | 40.8 | 30.7 | 27.3 | 35.3 | 44.2 | 35.7 |
| EMU2-Chat | 35.0 | 44.2 | 33.3 | 32.0 | 32.0 | 41.7 | 31.4 |
| ShareGPT4V-13B | 34.8 | 45.8 | 26.0 | 30.7 | 34.7 | 46.7 | 31.0 |
| mPLUG-Owl2 | 34.7 | 47.5 | 26.0 | 21.3 | 38.0 | 50.0 | 31.9 |
| LLaVA-v1.5-7B (QLoRA) | 33.7 | 48.3 | 23.3 | 30.7 | 32.7 | 45.8 | 28.6 |
| InstructBLIP-13B | 33.2 | 37.5 | 30.0 | 32.7 | 30.0 | 36.7 | 33.8 |
| PandaGPT-13B | 32.9 | 42.5 | 36.0 | 30.7 | 30.0 | 43.3 | 22.9 |
| LLaVA-v1-7B | 32.3 | 31.7 | 26.0 | 31.3 | 32.7 | 35.8 | 35.7 |
| InstructBLIP-7B | 30.6 | 38.3 | 28.7 | 22.0 | 30.7 | 39.2 | 28.6 |
| VisualGLM | 29.9 | 30.8 | 27.3 | 28.7 | 29.3 | 40.8 | 26.2 |
| Qwen-VL | 29.6 | 45.0 | 18.7 | 26.7 | 32.7 | 42.5 | 21.0 |
| OpenFlamingo v2 | 28.8 | 27.5 | 30.7 | 29.3 | 28.7 | 33.3 | 25.2 |
| MiniGPT-4-v1-13B | 26.3 | 31.7 | 20.7 | 28.0 | 25.3 | 35.0 | 21.9 |
| Frequent Choice | 25.8 | 26.7 | 28.4 | 24.0 | 24.4 | 25.2 | 26.5 |
| MiniGPT-4-v2 | 25.0 | 27.5 | 23.3 | 22.0 | 27.3 | 32.5 | 21.0 |
| IDEFICS-80B-Instruct | 24.0 | 39.2 | 18.0 | 20.0 | 22.0 | 46.7 | 11.0 |
| MiniGPT-4-v1-7B | 23.6 | 33.3 | 28.7 | 19.3 | 18.0 | 15.0 | 26.2 |
| IDEFICS-9B-Instruct | 18.4 | 22.5 | 11.3 | 17.3 | 21.3 | 30.0 | 13.3 |