We propose Cantor, a multimodal chain-of-thought (CoT) framework whose perceptual decision architecture integrates visual context and logical reasoning to solve visual reasoning tasks.
1. Installation
Clone the repository and set up the Gemini environment:
git clone https://github.com/ggg0919/cantor
cd cantor
pip install -q -U google-generativeai
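To confirm that the install step worked before running the demo, a quick check along the following lines may help. It is a minimal sketch that uses only the Python standard library; the helper name `sdk_version` is hypothetical and is not part of this repository.

```python
from importlib import metadata


def sdk_version(package: str = "google-generativeai"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


# Print the installed version, or a hint if the package is missing.
print(sdk_version() or "google-generativeai is not installed")
```

If the message reports that the package is missing, re-run the pip command above in the same Python environment that will run demo.py.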
2. Run Cantor Demo
python3 demo.py --query "Which month is the hottest on average in Detroit?" --image_path ./images/image.png --api_key "YOUR_GEMINI_API_KEY"
--query: The question to ask about the image
--image_path: Path to the input image
--api_key: Your Gemini API key
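For readers who want to see how these three flags could be wired to Gemini, here is a minimal sketch. It is not the contents of demo.py and does not reproduce Cantor's decision and execution stages; it only sends one question and one image in a single call. The function names `build_parser`, `ask_gemini`, and `main`, the default model name, and the use of Pillow to load the image are all assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the three documented flags."""
    parser = argparse.ArgumentParser(description="Illustrative Gemini image Q&A CLI")
    parser.add_argument("--query", required=True, help="Question to ask about the image")
    parser.add_argument("--image_path", required=True, type=Path, help="Path to the input image")
    parser.add_argument("--api_key", required=True, help="Your Gemini API key")
    return parser


def ask_gemini(query: str, image_path: Path, api_key: str,
               model_name: str = "gemini-1.5-flash") -> str:
    """Send one question and one image to Gemini and return the text reply.

    Assumption: `model_name` is a placeholder; use whichever vision-capable
    Gemini model your key can access.
    """
    # Imported lazily so the parser works even without the SDK installed.
    import google.generativeai as genai
    from PIL import Image  # Assumption: Pillow is available for image loading.

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel(model_name)
    response = model.generate_content([query, Image.open(image_path)])
    return response.text


def main(argv=None) -> None:
    """Parse the command line and print Gemini's answer."""
    args = build_parser().parse_args(argv)
    print(ask_gemini(args.query, args.image_path, args.api_key))
```

Calling `main()` at the bottom of a script would give it the same command-line shape as the demo command above. Keeping the SDK import inside `ask_gemini` means argument parsing can be checked without network access or an API key.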
- Release the data and evaluation code on ScienceQA.
- Release the data and evaluation code on MathVista.
@article{gao2024cantor,
title={Cantor: Inspiring Multimodal Chain-of-Thought of MLLM},
author={Gao, Timin and Chen, Peixian and Zhang, Mengdan and Fu, Chaoyou and Shen, Yunhang and Zhang, Yan and Zhang, Shengchuan and Zheng, Xiawu and Sun, Xing and Cao, Liujuan and Ji, Rongrong},
journal={arXiv preprint arXiv:2404.16033},
year={2024}
}