# MathVista Results

- We report evaluation results on the MathVista TestMini split, which includes 1,000 test samples.

- We adopt GPT-4-Turbo (1106) as the answer extractor when heuristic matching fails to extract an answer from the model response.

- The performance of Human (High School) and Random Chance is copied from the official leaderboard.
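The heuristic-matching step mentioned above can be sketched as follows. This is an illustrative assumption of how such an extractor might look, not the actual implementation used here; the function name and matching rules are hypothetical. A `None` return is where the pipeline would fall back to the GPT-4-Turbo extractor.

```python
import re


def extract_answer(response: str, choices=None):
    """Heuristically extract a final answer from a model response.

    Returns the matched answer string, or None when no rule applies
    (in the real pipeline, None would trigger the LLM-based fallback).
    NOTE: illustrative sketch only -- rules are assumptions.
    """
    text = response.strip()

    # Rule 1: an explicit "answer is X" statement.
    m = re.search(r"answer is\s*:?\s*\(?([A-Za-z0-9.\-]+)\)?", text, re.IGNORECASE)
    if m:
        return m.group(1).rstrip(".")

    # Rule 2: for multiple-choice questions, a lone option letter like "(B)".
    if choices is not None:
        m = re.fullmatch(r"\(?([A-E])\)?\.?", text)
        if m:
            return m.group(1)

    # Rule 3: a bare number at the very end of the response.
    m = re.search(r"(-?\d+(?:\.\d+)?)\s*$", text)
    if m:
        return m.group(1)

    return None  # heuristics failed -> defer to the LLM extractor
```

For example, `extract_answer("The answer is 42.")` yields `"42"`, while a hedging response with no recoverable answer yields `None` and would be routed to GPT-4-Turbo.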

**Category Definitions:** FQA: figure QA; GPS: geometry problem solving; MWP: math word problem; TQA: textbook QA; VQA: visual QA; ALG: algebraic; ARI: arithmetic; GEO: geometry; LOG: logical; NUM: numeric; SCI: scientific; STA: statistical.

## Evaluation Results

| Model | ALL | SCI | TQA | NUM | ARI | VQA | GEO | ALG | GPS | MWP | LOG | FQA | STA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human (High School) | 60.3 | 64.9 | 63.2 | 53.8 | 59.2 | 55.9 | 51.4 | 50.9 | 48.4 | 73 | 40.7 | 59.7 | 63.9 |
| GPT-4v (detail: low) | 47.8 | 63.9 | 67.1 | 22.9 | 45.9 | 38.5 | 49.8 | 53 | 49.5 | 57.5 | 18.9 | 34.6 | 46.5 |
| GeminiProVision | 45.8 | 58.2 | 60.8 | 27.1 | 41.9 | 40.2 | 39.7 | 42.3 | 38.5 | 45.7 | 10.8 | 46.5 | 52.8 |
| Monkey-Chat | 34.8 | 48.4 | 42.4 | 22.9 | 29.7 | 33.5 | 25.9 | 26.3 | 26.9 | 28.5 | 13.5 | 41.6 | 41.5 |
| CogVLM-17B-Chat | 34.7 | 51.6 | 44.3 | 23.6 | 30.9 | 36.3 | 26.8 | 28.1 | 26.4 | 26.9 | 16.2 | 39.8 | 42.9 |
| Qwen-VL-Chat | 33.8 | 41.8 | 39.2 | 24.3 | 28.3 | 33 | 28.5 | 30.2 | 29.8 | 25.8 | 13.5 | 39.8 | 41.5 |
| Monkey | 32.5 | 38.5 | 36.1 | 21.5 | 28.6 | 35.2 | 26.8 | 27.4 | 27.4 | 22 | 18.9 | 39.8 | 38.9 |
| EMU2-Chat | 30 | 36.9 | 36.7 | 25 | 30.6 | 36.3 | 31.4 | 29.9 | 30.8 | 30.1 | 8.1 | 21.2 | 23.6 |
| InternLM-XComposer-VL | 29.5 | 37.7 | 37.3 | 27.8 | 28.6 | 34.1 | 31.8 | 28.1 | 28.8 | 29.6 | 13.5 | 22.3 | 22.3 |
| SharedCaptioner | 29 | 37.7 | 37.3 | 35.4 | 28.3 | 33 | 25.9 | 23.8 | 22.1 | 36.6 | 16.2 | 21.6 | 20.9 |
| ShareGPT4V-13B | 27.8 | 42.6 | 43.7 | 19.4 | 26.6 | 31.3 | 28.5 | 28.1 | 26.9 | 23.7 | 16.2 | 19.7 | 22.3 |
| TransCore-M | 27.8 | 43.4 | 44.9 | 20.8 | 28.6 | 36.9 | 24.3 | 25.6 | 23.6 | 21.5 | 8.1 | 19.3 | 22.3 |
| LLaVA-v1.5-13B | 26.4 | 37.7 | 38.6 | 22.9 | 24.9 | 32.4 | 22.6 | 24.2 | 22.6 | 18.8 | 21.6 | 23.4 | 23.6 |
| LLaVA-InternLM-7B (QLoRA) | 26.3 | 32 | 34.8 | 20.8 | 22.4 | 30.2 | 27.6 | 28.1 | 27.9 | 21 | 24.3 | 21.2 | 19.6 |
| LLaVA-v1.5-13B (QLoRA) | 26.2 | 44.3 | 39.2 | 20.1 | 24.1 | 32.4 | 21.3 | 22.4 | 22.1 | 18.8 | 18.9 | 22.7 | 22.6 |
| IDEFICS-80B-Instruct | 26.2 | 37.7 | 34.8 | 22.2 | 25.2 | 33 | 23.4 | 22.8 | 23.1 | 21.5 | 18.9 | 22.3 | 21.3 |
| ShareGPT4V-7B | 25.8 | 41 | 38.6 | 19.4 | 25.5 | 36.3 | 19.7 | 21.4 | 20.2 | 16.1 | 13.5 | 22.3 | 21.6 |
| mPLUG-Owl2 | 25.3 | 44.3 | 41.8 | 18.8 | 23.5 | 31.8 | 18.8 | 20.3 | 17.8 | 16.7 | 13.5 | 23 | 23.9 |
| PandaGPT-13B | 24.6 | 36.1 | 30.4 | 17.4 | 21 | 27.4 | 23.8 | 23.8 | 25.5 | 18.8 | 16.2 | 22.7 | 21.9 |
| LLaVA-InternLM2-20B (QLoRA) | 24.6 | 45.1 | 44.3 | 20.8 | 20.7 | 35.8 | 24.3 | 26 | 24 | 9.7 | 16.2 | 16.4 | 15.9 |
| LLaVA-v1.5-7B (QLoRA) | 24.2 | 39.3 | 36.1 | 17.4 | 22.1 | 30.2 | 21.3 | 21.4 | 21.6 | 16.1 | 24.3 | 20.8 | 20.3 |
| LLaVA-v1-7B | 23.7 | 32.8 | 34.2 | 13.9 | 20.7 | 28.5 | 22.2 | 24.6 | 24 | 13.4 | 10.8 | 21.2 | 19.9 |
| InstructBLIP-7B | 23.7 | 33.6 | 31.6 | 13.9 | 23.5 | 29.6 | 19.7 | 20.6 | 20.2 | 15.6 | 13.5 | 23.4 | 21.3 |
| LLaVA-v1.5-7B | 23.6 | 33.6 | 36.7 | 11.1 | 21 | 28.5 | 18.8 | 23.1 | 19.2 | 14.5 | 13.5 | 22.3 | 21.6 |
| MiniGPT-4-v2 | 22.9 | 29.5 | 32.3 | 13.2 | 17 | 25.7 | 22.6 | 26.7 | 24.5 | 10.8 | 16.2 | 22.7 | 20.3 |
| VisualGLM | 21.5 | 36.9 | 29.7 | 15.3 | 18.1 | 30.2 | 22.2 | 22.8 | 24 | 7 | 2.7 | 19 | 18.6 |
| InstructBLIP-13B | 21.5 | 28.7 | 27.8 | 19.4 | 21.5 | 31.8 | 17.6 | 18.5 | 18.3 | 13.4 | 13.5 | 19 | 17.9 |
| IDEFICS-9B-Instruct | 20.4 | 29.5 | 31 | 13.2 | 17.8 | 29.6 | 15.1 | 18.9 | 15.9 | 8.1 | 13.5 | 20.1 | 18.6 |
| MiniGPT-4-v1-13B | 20.4 | 27 | 24.7 | 9 | 18.1 | 27.4 | 20.9 | 22.8 | 22.6 | 9.7 | 10.8 | 19 | 16.9 |
| MiniGPT-4-v1-7B | 20.2 | 27 | 29.1 | 7.6 | 16.7 | 21.8 | 20.9 | 23.1 | 22.1 | 14 | 5.4 | 16.7 | 17.3 |
| OpenFlamingo v2 | 18.6 | 22.1 | 24.7 | 5.6 | 16.4 | 24 | 21.3 | 23.8 | 23.6 | 8.1 | 10.8 | 14.9 | 13.3 |
| Random Chance | 17.9 | 15.8 | 23.4 | 8.8 | 13.8 | 24.3 | 22.7 | 25.8 | 24.1 | 4.5 | 13.4 | 15.5 | 14.3 |
| Qwen-VL | 15.5 | 34.4 | 29.7 | 10.4 | 12.2 | 22.9 | 9.6 | 10.7 | 9.1 | 5.4 | 16.2 | 14.1 | 11.6 |