# MathVista Results

- We report evaluation results on the MathVista TestMini split, which includes 1,000 test samples.

- We adopt GPT-4-Turbo (1106) as the answer extractor when heuristic matching fails to extract an answer from the model response.

- The performance of Human (High School) and Random Chance is copied from the official leaderboard.
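The heuristic-matching step mentioned above can be sketched as follows. This is an illustrative assumption of how such an extractor might look, not the actual implementation used here; the function name and matching rules are hypothetical. A `None` return is where the pipeline would fall back to the GPT-4-Turbo extractor.

```python
import re


def extract_answer(response: str, choices=None):
    """Heuristically extract a final answer from a model response.

    Returns the matched answer string, or None when no rule applies
    (in the real pipeline, None would trigger the LLM-based fallback).
    NOTE: illustrative sketch only -- rules are assumptions.
    """
    text = response.strip()

    # Rule 1: an explicit "answer is X" statement.
    m = re.search(r"answer is\s*:?\s*\(?([A-Za-z0-9.\-]+)\)?", text, re.IGNORECASE)
    if m:
        return m.group(1).rstrip(".")

    # Rule 2: for multiple-choice questions, a lone option letter like "(B)".
    if choices is not None:
        m = re.fullmatch(r"\(?([A-E])\)?\.?", text)
        if m:
            return m.group(1)

    # Rule 3: a bare number at the very end of the response.
    m = re.search(r"(-?\d+(?:\.\d+)?)\s*$", text)
    if m:
        return m.group(1)

    return None  # heuristics failed -> defer to the LLM extractor
```

For example, `extract_answer("The answer is 42.")` yields `"42"`, while a hedging response with no recoverable answer yields `None` and would be routed to GPT-4-Turbo.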

**Category Definitions:** FQA: figure QA; GPS: geometry problem solving; MWP: math word problem; TQA: textbook QA; VQA: visual QA; ALG: algebraic; ARI: arithmetic; GEO: geometry; LOG: logical; NUM: numeric; SCI: scientific; STA: statistical.

## Evaluation Results

| Model | ALL | SCI | TQA | NUM | ARI | VQA | GEO | ALG | GPS | MWP | LOG | FQA | STA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human (High School) | 60.3 | 64.9 | 63.2 | 53.8 | 59.2 | 55.9 | 51.4 | 50.9 | 48.4 | 73 | 40.7 | 59.7 | 63.9 |
| GPT-4v (detail: low) | 47.8 | 63.9 | 67.1 | 22.9 | 45.9 | 38.5 | 49.8 | 53 | 49.5 | 57.5 | 18.9 | 34.6 | 46.5 |
| GeminiProVision | 45.8 | 58.2 | 60.8 | 27.1 | 41.9 | 40.2 | 39.7 | 42.3 | 38.5 | 45.7 | 10.8 | 46.5 | 52.8 |
| Monkey-Chat | 34.8 | 48.4 | 42.4 | 22.9 | 29.7 | 33.5 | 25.9 | 26.3 | 26.9 | 28.5 | 13.5 | 41.6 | 41.5 |
| CogVLM-17B-Chat | 34.7 | 51.6 | 44.3 | 23.6 | 30.9 | 36.3 | 26.8 | 28.1 | 26.4 | 26.9 | 16.2 | 39.8 | 42.9 |
| Qwen-VL-Chat | 33.8 | 41.8 | 39.2 | 24.3 | 28.3 | 33 | 28.5 | 30.2 | 29.8 | 25.8 | 13.5 | 39.8 | 41.5 |
| Monkey | 32.5 | 38.5 | 36.1 | 21.5 | 28.6 | 35.2 | 26.8 | 27.4 | 27.4 | 22 | 18.9 | 39.8 | 38.9 |
| EMU2-Chat | 30 | 36.9 | 36.7 | 25 | 30.6 | 36.3 | 31.4 | 29.9 | 30.8 | 30.1 | 8.1 | 21.2 | 23.6 |
| InternLM-XComposer-VL | 29.5 | 37.7 | 37.3 | 27.8 | 28.6 | 34.1 | 31.8 | 28.1 | 28.8 | 29.6 | 13.5 | 22.3 | 22.3 |
| SharedCaptioner | 29 | 37.7 | 37.3 | 35.4 | 28.3 | 33 | 25.9 | 23.8 | 22.1 | 36.6 | 16.2 | 21.6 | 20.9 |
| ShareGPT4V-13B | 27.8 | 42.6 | 43.7 | 19.4 | 26.6 | 31.3 | 28.5 | 28.1 | 26.9 | 23.7 | 16.2 | 19.7 | 22.3 |
| TransCore-M | 27.8 | 43.4 | 44.9 | 20.8 | 28.6 | 36.9 | 24.3 | 25.6 | 23.6 | 21.5 | 8.1 | 19.3 | 22.3 |
| LLaVA-v1.5-13B | 26.4 | 37.7 | 38.6 | 22.9 | 24.9 | 32.4 | 22.6 | 24.2 | 22.6 | 18.8 | 21.6 | 23.4 | 23.6 |
| LLaVA-InternLM-7B (QLoRA) | 26.3 | 32 | 34.8 | 20.8 | 22.4 | 30.2 | 27.6 | 28.1 | 27.9 | 21 | 24.3 | 21.2 | 19.6 |
| LLaVA-v1.5-13B (QLoRA) | 26.2 | 44.3 | 39.2 | 20.1 | 24.1 | 32.4 | 21.3 | 22.4 | 22.1 | 18.8 | 18.9 | 22.7 | 22.6 |
| IDEFICS-80B-Instruct | 26.2 | 37.7 | 34.8 | 22.2 | 25.2 | 33 | 23.4 | 22.8 | 23.1 | 21.5 | 18.9 | 22.3 | 21.3 |
| ShareGPT4V-7B | 25.8 | 41 | 38.6 | 19.4 | 25.5 | 36.3 | 19.7 | 21.4 | 20.2 | 16.1 | 13.5 | 22.3 | 21.6 |
| mPLUG-Owl2 | 25.3 | 44.3 | 41.8 | 18.8 | 23.5 | 31.8 | 18.8 | 20.3 | 17.8 | 16.7 | 13.5 | 23 | 23.9 |
| PandaGPT-13B | 24.6 | 36.1 | 30.4 | 17.4 | 21 | 27.4 | 23.8 | 23.8 | 25.5 | 18.8 | 16.2 | 22.7 | 21.9 |
| LLaVA-InternLM2-20B (QLoRA) | 24.6 | 45.1 | 44.3 | 20.8 | 20.7 | 35.8 | 24.3 | 26 | 24 | 9.7 | 16.2 | 16.4 | 15.9 |
| LLaVA-v1.5-7B (QLoRA) | 24.2 | 39.3 | 36.1 | 17.4 | 22.1 | 30.2 | 21.3 | 21.4 | 21.6 | 16.1 | 24.3 | 20.8 | 20.3 |
| LLaVA-v1-7B | 23.7 | 32.8 | 34.2 | 13.9 | 20.7 | 28.5 | 22.2 | 24.6 | 24 | 13.4 | 10.8 | 21.2 | 19.9 |
| InstructBLIP-7B | 23.7 | 33.6 | 31.6 | 13.9 | 23.5 | 29.6 | 19.7 | 20.6 | 20.2 | 15.6 | 13.5 | 23.4 | 21.3 |
| LLaVA-v1.5-7B | 23.6 | 33.6 | 36.7 | 11.1 | 21 | 28.5 | 18.8 | 23.1 | 19.2 | 14.5 | 13.5 | 22.3 | 21.6 |
| MiniGPT-4-v2 | 22.9 | 29.5 | 32.3 | 13.2 | 17 | 25.7 | 22.6 | 26.7 | 24.5 | 10.8 | 16.2 | 22.7 | 20.3 |
| VisualGLM | 21.5 | 36.9 | 29.7 | 15.3 | 18.1 | 30.2 | 22.2 | 22.8 | 24 | 7 | 2.7 | 19 | 18.6 |
| InstructBLIP-13B | 21.5 | 28.7 | 27.8 | 19.4 | 21.5 | 31.8 | 17.6 | 18.5 | 18.3 | 13.4 | 13.5 | 19 | 17.9 |
| IDEFICS-9B-Instruct | 20.4 | 29.5 | 31 | 13.2 | 17.8 | 29.6 | 15.1 | 18.9 | 15.9 | 8.1 | 13.5 | 20.1 | 18.6 |
| MiniGPT-4-v1-13B | 20.4 | 27 | 24.7 | 9 | 18.1 | 27.4 | 20.9 | 22.8 | 22.6 | 9.7 | 10.8 | 19 | 16.9 |
| MiniGPT-4-v1-7B | 20.2 | 27 | 29.1 | 7.6 | 16.7 | 21.8 | 20.9 | 23.1 | 22.1 | 14 | 5.4 | 16.7 | 17.3 |
| OpenFlamingo v2 | 18.6 | 22.1 | 24.7 | 5.6 | 16.4 | 24 | 21.3 | 23.8 | 23.6 | 8.1 | 10.8 | 14.9 | 13.3 |
| Random Chance | 17.9 | 15.8 | 23.4 | 8.8 | 13.8 | 24.3 | 22.7 | 25.8 | 24.1 | 4.5 | 13.4 | 15.5 | 14.3 |
| Qwen-VL | 15.5 | 34.4 | 29.7 | 10.4 | 12.2 | 22.9 | 9.6 | 10.7 | 9.1 | 5.4 | 16.2 | 14.1 | 11.6 |