We report evaluation results on MathVista TestMini, which includes 1,000 test samples. We adopt GPT-4-Turbo (1106) as the answer extractor when heuristic matching fails to extract the answer. The performance of Human (High School) and Random Chance is copied from the official leaderboard.
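The two-stage extraction described above (heuristic matching first, an LLM extractor only as a fallback) can be sketched roughly as follows. This is a minimal illustration, not the evaluation code itself: the function names `heuristic_extract` and `extract_answer`, the specific matching rules, and the `llm_extractor` callable are all assumptions introduced here, and the rules actually used to produce the numbers in this section may differ.

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical type for the fallback: takes the raw response and the
# choices (if any) and returns an answer string or None.
Extractor = Callable[[str, Optional[Sequence[str]]], Optional[str]]


def heuristic_extract(response: str,
                      choices: Optional[Sequence[str]] = None) -> Optional[str]:
    """Return an answer when simple rules match unambiguously, else None."""
    text = response.strip()
    if choices:
        letters = "ABCDEFGH"[: len(choices)]
        # Rule 1 (assumed): the whole response is an option letter, e.g. "(B)" or "B."
        m = re.fullmatch(rf"\(?([{letters}])\)?\.?", text)
        if m:
            return choices[letters.index(m.group(1))]
        # Rule 2 (assumed): exactly one choice appears as a standalone token.
        hits = [c for c in choices
                if re.search(rf"(?<!\w){re.escape(c)}(?!\w)", text, re.IGNORECASE)]
        return hits[0] if len(hits) == 1 else None
    # Free-form question (assumed rule): accept the answer only if the
    # response contains exactly one number.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[0] if len(numbers) == 1 else None


def extract_answer(response: str,
                   choices: Optional[Sequence[str]] = None,
                   llm_extractor: Optional[Extractor] = None) -> Optional[str]:
    """Try heuristics first; call the LLM extractor only if they fail."""
    answer = heuristic_extract(response, choices)
    if answer is None and llm_extractor is not None:
        answer = llm_extractor(response, choices)
    return answer
```

Passing the extractor in as a callable keeps the sketch independent of any particular API client; in the setup described above, that callable would wrap a request to GPT-4-Turbo (1106).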
Category Definitions: FQA: figure QA, GPS: geometry problem solving, MWP: math word problem, TQA: textbook QA, VQA: visual QA, ALG: algebraic, ARI: arithmetic, GEO: geometry, LOG: logical, NUM: numeric, SCI: scientific, STA: statistical.
Model | ALL | SCI | TQA | NUM | ARI | VQA | GEO | ALG | GPS | MWP | LOG | FQA | STA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Human (High School) | 60.3 | 64.9 | 63.2 | 53.8 | 59.2 | 55.9 | 51.4 | 50.9 | 48.4 | 73 | 40.7 | 59.7 | 63.9 |
GPT-4v (detail: low) | 47.8 | 63.9 | 67.1 | 22.9 | 45.9 | 38.5 | 49.8 | 53 | 49.5 | 57.5 | 18.9 | 34.6 | 46.5 |
GeminiProVision | 45.8 | 58.2 | 60.8 | 27.1 | 41.9 | 40.2 | 39.7 | 42.3 | 38.5 | 45.7 | 10.8 | 46.5 | 52.8 |
Monkey-Chat | 34.8 | 48.4 | 42.4 | 22.9 | 29.7 | 33.5 | 25.9 | 26.3 | 26.9 | 28.5 | 13.5 | 41.6 | 41.5 |
CogVLM-17B-Chat | 34.7 | 51.6 | 44.3 | 23.6 | 30.9 | 36.3 | 26.8 | 28.1 | 26.4 | 26.9 | 16.2 | 39.8 | 42.9 |
Qwen-VL-Chat | 33.8 | 41.8 | 39.2 | 24.3 | 28.3 | 33 | 28.5 | 30.2 | 29.8 | 25.8 | 13.5 | 39.8 | 41.5 |
Monkey | 32.5 | 38.5 | 36.1 | 21.5 | 28.6 | 35.2 | 26.8 | 27.4 | 27.4 | 22 | 18.9 | 39.8 | 38.9 |
EMU2-Chat | 30 | 36.9 | 36.7 | 25 | 30.6 | 36.3 | 31.4 | 29.9 | 30.8 | 30.1 | 8.1 | 21.2 | 23.6 |
InternLM-XComposer-VL | 29.5 | 37.7 | 37.3 | 27.8 | 28.6 | 34.1 | 31.8 | 28.1 | 28.8 | 29.6 | 13.5 | 22.3 | 22.3 |
SharedCaptioner | 29 | 37.7 | 37.3 | 35.4 | 28.3 | 33 | 25.9 | 23.8 | 22.1 | 36.6 | 16.2 | 21.6 | 20.9 |
ShareGPT4V-13B | 27.8 | 42.6 | 43.7 | 19.4 | 26.6 | 31.3 | 28.5 | 28.1 | 26.9 | 23.7 | 16.2 | 19.7 | 22.3 |
TransCore-M | 27.8 | 43.4 | 44.9 | 20.8 | 28.6 | 36.9 | 24.3 | 25.6 | 23.6 | 21.5 | 8.1 | 19.3 | 22.3 |
LLaVA-v1.5-13B | 26.4 | 37.7 | 38.6 | 22.9 | 24.9 | 32.4 | 22.6 | 24.2 | 22.6 | 18.8 | 21.6 | 23.4 | 23.6 |
LLaVA-InternLM-7B (QLoRA) | 26.3 | 32 | 34.8 | 20.8 | 22.4 | 30.2 | 27.6 | 28.1 | 27.9 | 21 | 24.3 | 21.2 | 19.6 |
LLaVA-v1.5-13B (QLoRA) | 26.2 | 44.3 | 39.2 | 20.1 | 24.1 | 32.4 | 21.3 | 22.4 | 22.1 | 18.8 | 18.9 | 22.7 | 22.6 |
IDEFICS-80B-Instruct | 26.2 | 37.7 | 34.8 | 22.2 | 25.2 | 33 | 23.4 | 22.8 | 23.1 | 21.5 | 18.9 | 22.3 | 21.3 |
ShareGPT4V-7B | 25.8 | 41 | 38.6 | 19.4 | 25.5 | 36.3 | 19.7 | 21.4 | 20.2 | 16.1 | 13.5 | 22.3 | 21.6 |
mPLUG-Owl2 | 25.3 | 44.3 | 41.8 | 18.8 | 23.5 | 31.8 | 18.8 | 20.3 | 17.8 | 16.7 | 13.5 | 23 | 23.9 |
PandaGPT-13B | 24.6 | 36.1 | 30.4 | 17.4 | 21 | 27.4 | 23.8 | 23.8 | 25.5 | 18.8 | 16.2 | 22.7 | 21.9 |
LLaVA-InternLM2-20B (QLoRA) | 24.6 | 45.1 | 44.3 | 20.8 | 20.7 | 35.8 | 24.3 | 26 | 24 | 9.7 | 16.2 | 16.4 | 15.9 |
LLaVA-v1.5-7B (QLoRA) | 24.2 | 39.3 | 36.1 | 17.4 | 22.1 | 30.2 | 21.3 | 21.4 | 21.6 | 16.1 | 24.3 | 20.8 | 20.3 |
LLaVA-v1-7B | 23.7 | 32.8 | 34.2 | 13.9 | 20.7 | 28.5 | 22.2 | 24.6 | 24 | 13.4 | 10.8 | 21.2 | 19.9 |
InstructBLIP-7B | 23.7 | 33.6 | 31.6 | 13.9 | 23.5 | 29.6 | 19.7 | 20.6 | 20.2 | 15.6 | 13.5 | 23.4 | 21.3 |
LLaVA-v1.5-7B | 23.6 | 33.6 | 36.7 | 11.1 | 21 | 28.5 | 18.8 | 23.1 | 19.2 | 14.5 | 13.5 | 22.3 | 21.6 |
MiniGPT-4-v2 | 22.9 | 29.5 | 32.3 | 13.2 | 17 | 25.7 | 22.6 | 26.7 | 24.5 | 10.8 | 16.2 | 22.7 | 20.3 |
VisualGLM | 21.5 | 36.9 | 29.7 | 15.3 | 18.1 | 30.2 | 22.2 | 22.8 | 24 | 7 | 2.7 | 19 | 18.6 |
InstructBLIP-13B | 21.5 | 28.7 | 27.8 | 19.4 | 21.5 | 31.8 | 17.6 | 18.5 | 18.3 | 13.4 | 13.5 | 19 | 17.9 |
IDEFICS-9B-Instruct | 20.4 | 29.5 | 31 | 13.2 | 17.8 | 29.6 | 15.1 | 18.9 | 15.9 | 8.1 | 13.5 | 20.1 | 18.6 |
MiniGPT-4-v1-13B | 20.4 | 27 | 24.7 | 9 | 18.1 | 27.4 | 20.9 | 22.8 | 22.6 | 9.7 | 10.8 | 19 | 16.9 |
MiniGPT-4-v1-7B | 20.2 | 27 | 29.1 | 7.6 | 16.7 | 21.8 | 20.9 | 23.1 | 22.1 | 14 | 5.4 | 16.7 | 17.3 |
OpenFlamingo v2 | 18.6 | 22.1 | 24.7 | 5.6 | 16.4 | 24 | 21.3 | 23.8 | 23.6 | 8.1 | 10.8 | 14.9 | 13.3 |
Random Chance | 17.9 | 15.8 | 23.4 | 8.8 | 13.8 | 24.3 | 22.7 | 25.8 | 24.1 | 4.5 | 13.4 | 15.5 | 14.3 |
Qwen-VL | 15.5 | 34.4 | 29.7 | 10.4 | 12.2 | 22.9 | 9.6 | 10.7 | 9.1 | 5.4 | 16.2 | 14.1 | 11.6 |