- For MMMU, we support the evaluation of the `dev` (150 samples) and `validation` (900 samples) sets. Here we only report results on the `validation` set.
- Answer Inference:
  - For models with an `interleave_generate` interface (which accepts interleaved images & texts as input), all testing samples can be inferred, and `interleave_generate` is adopted for inference.
  - For models without an `interleave_generate` interface, samples with more than one image are skipped (42 out of 1050, directly counted as wrong), and `generate` is adopted for inference. The first sketch after this list illustrates the dispatch.
- Evaluation:
  - MMMU includes two types of questions: multi-choice questions & open-ended QA.
  - For open-ended QA (62/1050), we reformulate the question as a multi-choice one: `{'question': 'QQQ', 'answer': 'AAA'} -> {'question': 'QQQ', 'A': 'AAA', 'B': 'Other Answers', 'answer': 'A'}`, and then adopt the same evaluation paradigm as for multi-choice questions (second sketch below).
  - For multi-choice questions (988/1050), we use GPT-3.5-Turbo-0613 to match the prediction against the options when heuristic matching does not work (third sketch below).
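Below is a minimal sketch of the answer-inference dispatch, assuming each sample is a dict with hypothetical `question` and `image_paths` fields. Only the `interleave_generate` / `generate` method names and the skip rule come from the description above; the exact call signatures and input format are assumptions.

```python
def infer_sample(model, sample):
    """Return the model's raw prediction, or None if the sample is skipped."""
    prompt = sample['question']        # hypothetical field name
    images = sample['image_paths']     # hypothetical field name

    if hasattr(model, 'interleave_generate'):
        # Interleave-capable models accept mixed image & text inputs, so
        # every sample, including multi-image ones, can be inferred.
        return model.interleave_generate([*images, prompt])  # assumed input format

    if len(images) > 1:
        # Single-image models skip the 42/1050 multi-image samples,
        # which are directly counted as wrong.
        return None

    return model.generate(images[0], prompt)  # assumed signature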
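The open-ended reformulation is mechanical; this sketch reproduces exactly the transform shown in the example above.

```python
def to_multi_choice(record):
    """Rewrite an open-ended QA record as a two-option multi-choice record."""
    return {
        'question': record['question'],
        'A': record['answer'],      # the ground-truth answer becomes option A
        'B': 'Other Answers',       # catch-all distractor
        'answer': 'A',
    }

assert to_multi_choice({'question': 'QQQ', 'answer': 'AAA'}) == {
    'question': 'QQQ', 'A': 'AAA', 'B': 'Other Answers', 'answer': 'A'}
```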
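For multi-choice scoring, a two-stage matcher of the following shape is one way to realize the rule above: cheap heuristics first, with GPT-3.5-Turbo-0613 queried only when they fail. The specific regexes and the containment check are illustrative assumptions, not the toolkit's exact heuristics.

```python
import re

def match_choice(prediction, options):
    """Map a free-form prediction onto an option letter; None means the
    heuristics failed and GPT-3.5-Turbo-0613 should be queried instead."""
    # Heuristic 1: a parenthesized or leading option letter, e.g. "(B)" or "B.".
    m = re.search(r'\(([A-Z])\)', prediction) or re.match(r'\s*([A-Z])\b', prediction)
    if m and m.group(1) in options:
        return m.group(1)

    # Heuristic 2: exactly one option's text occurs verbatim in the prediction.
    hits = [c for c, text in options.items() if text.lower() in prediction.lower()]
    if len(hits) == 1:
        return hits[0]

    return None

assert match_choice('The answer is (B).', {'A': 'cat', 'B': 'dog'}) == 'B'
```

The table below reports the overall and per-discipline accuracies (%) on the MMMU `validation` set.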
| Model | Overall | Art & Design | Business | Science | Health & Medicine | Humanities & Social Science | Tech & Engineering |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4v (detail: low) | 53.8 | 67.5 | 59.3 | 46.0 | 54.7 | 70.8 | 37.1 |
| GeminiProVision | 48.9 | 59.2 | 36.7 | 42.7 | 52.0 | 66.7 | 43.8 |
| QwenVLPlus | 40.9 | 56.7 | 32.0 | 33.3 | 36.7 | 59.2 | 36.2 |
| Monkey-Chat | 40.7 | 50.0 | 34.0 | 43.3 | 36.0 | 51.7 | 35.2 |
| LLaVA-InternLM2-20B (QLoRA) | 39.4 | 52.5 | 30.0 | 34.7 | 40.0 | 54.2 | 33.3 |
| Monkey | 38.9 | 55.0 | 31.3 | 35.3 | 37.3 | 45.8 | 34.8 |
| CogVLM-17B-Chat | 37.3 | 51.7 | 34.0 | 36.0 | 35.3 | 41.7 | 31.4 |
| Qwen-VL-Chat | 37.0 | 49.2 | 35.3 | 28.0 | 31.3 | 54.2 | 31.9 |
| LLaVA-v1.5-13B | 36.9 | 49.2 | 24.0 | 37.3 | 33.3 | 50.8 | 33.3 |
| LLaVA-InternLM-7B (QLoRA) | 36.9 | 44.2 | 32.0 | 29.3 | 38.7 | 46.7 | 34.8 |
| ShareGPT4V-7B | 36.6 | 50.0 | 28.7 | 26.0 | 37.3 | 49.2 | 34.3 |
| TransCore-M | 36.4 | 45.8 | 33.3 | 28.7 | 38.0 | 51.7 | 29.0 |
| SharedCaptioner | 36.3 | 44.2 | 28.7 | 29.3 | 37.3 | 45.8 | 36.2 |
| LLaVA-v1.5-7B | 36.2 | 45.8 | 26.0 | 34.0 | 32.7 | 47.5 | 35.7 |
| InternLM-XComposer-VL | 35.6 | 45.8 | 28.7 | 22.7 | 30.7 | 52.5 | 37.6 |
| LLaVA-v1.5-13B (QLoRA) | 35.2 | 40.8 | 30.7 | 27.3 | 35.3 | 44.2 | 35.7 |
| EMU2-Chat | 35.0 | 44.2 | 33.3 | 32.0 | 32.0 | 41.7 | 31.4 |
| ShareGPT4V-13B | 34.8 | 45.8 | 26.0 | 30.7 | 34.7 | 46.7 | 31.0 |
| mPLUG-Owl2 | 34.7 | 47.5 | 26.0 | 21.3 | 38.0 | 50.0 | 31.9 |
| LLaVA-v1.5-7B (QLoRA) | 33.7 | 48.3 | 23.3 | 30.7 | 32.7 | 45.8 | 28.6 |
| InstructBLIP-13B | 33.2 | 37.5 | 30.0 | 32.7 | 30.0 | 36.7 | 33.8 |
| PandaGPT-13B | 32.9 | 42.5 | 36.0 | 30.7 | 30.0 | 43.3 | 22.9 |
| LLaVA-v1-7B | 32.3 | 31.7 | 26.0 | 31.3 | 32.7 | 35.8 | 35.7 |
| InstructBLIP-7B | 30.6 | 38.3 | 28.7 | 22.0 | 30.7 | 39.2 | 28.6 |
| VisualGLM | 29.9 | 30.8 | 27.3 | 28.7 | 29.3 | 40.8 | 26.2 |
| Qwen-VL | 29.6 | 45.0 | 18.7 | 26.7 | 32.7 | 42.5 | 21.0 |
| OpenFlamingo v2 | 28.8 | 27.5 | 30.7 | 29.3 | 28.7 | 33.3 | 25.2 |
| MiniGPT-4-v1-13B | 26.3 | 31.7 | 20.7 | 28.0 | 25.3 | 35.0 | 21.9 |
| Frequent Choice | 25.8 | 26.7 | 28.4 | 24.0 | 24.4 | 25.2 | 26.5 |
| MiniGPT-4-v2 | 25.0 | 27.5 | 23.3 | 22.0 | 27.3 | 32.5 | 21.0 |
| IDEFICS-80B-Instruct | 24.0 | 39.2 | 18.0 | 20.0 | 22.0 | 46.7 | 11.0 |
| MiniGPT-4-v1-7B | 23.6 | 33.3 | 28.7 | 19.3 | 18.0 | 15.0 | 26.2 |
| IDEFICS-9B-Instruct | 18.4 | 22.5 | 11.3 | 17.3 | 21.3 | 30.0 | 13.3 |