📄 [arXiv] · 🕸️ [Project Page] · 🤗 [Data]
- Define the top-view spatial reasoning task for VLMs through 4 carefully designed tasks of increasing complexity, encompassing 9 distinct fine-grained sub-tasks whose questions are structured to probe different model abilities.
- Collect the TopViewRS dataset (Top-View Reasoning in Space), comprising 11,384 multiple-choice questions grounded in either photo-realistic or semantic top-view maps of real-world scenarios.
- Investigate 10 VLMs of different model families and sizes, highlighting the performance gap relative to human annotators.
Part of the benchmark is now available on Hugging Face: https://huggingface.co/datasets/chengzu/topviewrs.
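As a minimal sketch, the released portion can be loaded with the Hugging Face `datasets` library; the split and configuration names are not specified here, so check the dataset card for the exact identifiers.

```python
from datasets import load_dataset

# Load the released portion of TopViewRS from the Hugging Face Hub.
# Split/config names are assumptions; consult the dataset card for
# the exact identifiers and any additional loading arguments.
ds = load_dataset("chengzu/topviewrs")

print(ds)  # inspect the available splits and their features
```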
Coming soon.
If you find TopViewRS useful, please consider citing our work:
```bibtex
@misc{li2024topviewrs,
      title={TopViewRS: Vision-Language Models as Top-View Spatial Reasoners},
      author={Chengzu Li and Caiqi Zhang and Han Zhou and Nigel Collier and Anna Korhonen and Ivan Vulić},
      year={2024},
      eprint={2406.02537},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```