ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges
ScratchEval provides a series of challenging questions designed to test the visual code reasoning ability of large multimodal models (LMMs).
Specifically, we designed a set of challenging multiple-choice questions using the block-based visual programming language Scratch and found that even the most advanced LMMs still perform poorly on our benchmark.
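Since each question is multiple-choice, evaluation reduces to comparing a model's predicted choice letter against the gold answer. The sketch below shows one plausible way to score this; it is an illustration under assumed input formats (lists of single-letter strings), not the paper's actual evaluation code.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions where the predicted choice letter matches the gold answer.

    Assumes both lists hold single-letter choices like "A"/"B"/"C"/"D"
    (hypothetical format; adapt to the actual answer fields in the dataset).
    """
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold))
    return correct / len(gold)


# Example: 3 of 4 predictions match the gold answers -> 0.75
print(accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))
```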
All of our data is stored in the `./data` folder.
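As a starting point, here is a minimal sketch for loading the benchmark. The file name and record schema are assumptions (a hypothetical `./data/questions.json` with one record per question); adjust them to match the actual files shipped in `./data`.

```python
import json
from pathlib import Path

# Hypothetical file name; replace with the actual file(s) in ./data.
DATA_FILE = Path("./data/questions.json")

with DATA_FILE.open(encoding="utf-8") as f:
    questions = json.load(f)

for q in questions:
    # Assumed fields: the Scratch program image, the question text,
    # the candidate answers, and the gold choice.
    print(q.get("image"), q.get("question"), q.get("options"), q.get("answer"))
```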
The benchmark was annotated and developed by the authors of this paper, and the dataset is released under the Apache 2.0 license.