InternLM/ARC-VL

Think Visually, Reason Textually: Vision-Language Synergy in ARC

Beichen Zhang · Yuhang Zang · Xiaoyi Dong · Yuhang Cao
Haodong Duan · Dahua Lin · Jiaqi Wang

Corresponding authors.

📢 News

🌈 Overview

We integrate Visual Intelligence into ARC-AGI to leverage the respective advantages of vision and text: vision supports global pattern abstraction and verification, whereas language specializes in precise execution.

We achieve this with two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction.
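
The control flow described above might be sketched as follows. This is only an illustration under stated assumptions, not the repository's implementation: the three callables (`summarize_rule_visually`, `apply_rule_textually`, `verify_visually`), the feedback string, and the `max_rounds` retry limit are all hypothetical names introduced here. In practice each callable would wrap a call to a vision-language model.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, output grid)


def solve_with_synergy(
    examples: List[Example],
    test_input: Grid,
    summarize_rule_visually: Callable[[List[Example]], str],
    apply_rule_textually: Callable[[str, Grid, Optional[str]], Grid],
    verify_visually: Callable[[List[Example], Grid, Grid], Tuple[bool, Optional[str]]],
    max_rounds: int = 3,  # hypothetical retry budget, not taken from the paper
) -> Grid:
    """Sketch of VLSR followed by MSSC with caller-supplied model wrappers."""
    # VLSR, step 1: vision handles global pattern abstraction.
    rule = summarize_rule_visually(examples)

    feedback: Optional[str] = None
    candidate: Grid = test_input
    for _ in range(max_rounds):
        # VLSR, step 2: text handles precise, cell-level execution.
        candidate = apply_rule_textually(rule, test_input, feedback)
        # MSSC: switch back to vision to check the text-based answer.
        ok, feedback = verify_visually(examples, test_input, candidate)
        if ok:
            break
    return candidate
```

Passing the model wrappers in as arguments keeps the sketch independent of any particular API client, so the loop can be exercised with simple stub functions.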

🛠️ Inference

Prepare your environment

git clone https://github.com/InternLM/Arc-VL
cd Arc-VL
conda create -n arcvl python==3.11
conda activate arcvl
pip install -r requirements.txt

Modify setup_api_key.sh and fill in your base_url and API keys. Then load it into your shell by running:

source setup_api_key.sh

Prepare the data. The datasets can be downloaded from the following links:

ARC-AGI: https://github.com/fchollet/ARC-AGI

BARC: https://github.com/xu3kev/BARC

Re-ARC: https://github.com/michaelhodel/re-arc
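
To inspect a downloaded task before running inference, a minimal sketch such as the one below may help. It assumes the ARC-AGI JSON layout, where each task file holds `train` and `test` lists of `input`/`output` grids with integer cells from 0 to 9; BARC and Re-ARC files may be organized differently. The helper names and the colour palette are hypothetical and are not part of this repository.

```python
import json
from pathlib import Path
from typing import List, Tuple

Grid = List[List[int]]
RGB = Tuple[int, int, int]

# Hypothetical palette for the ten ARC colour indices (illustrative values only).
PALETTE: List[RGB] = [
    (0, 0, 0), (0, 116, 217), (255, 65, 54), (46, 204, 64), (255, 220, 0),
    (170, 170, 170), (240, 18, 190), (255, 133, 27), (127, 219, 255), (135, 12, 37),
]


def load_task(path: str) -> Tuple[list, list]:
    """Read one ARC-AGI task file and return its (train, test) pair lists."""
    task = json.loads(Path(path).read_text(encoding="utf-8"))
    return task["train"], task["test"]


def grid_to_text(grid: Grid) -> str:
    """Serialize a grid row by row, the form a text-only prompt would use."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def grid_to_pixels(grid: Grid, cell: int = 2) -> List[List[RGB]]:
    """Upscale a grid to an RGB pixel matrix, the form an image prompt would use."""
    pixels: List[List[RGB]] = []
    for row in grid:
        # Repeat each cell horizontally, then repeat the whole row vertically.
        pixel_row = [PALETTE[value] for value in row for _ in range(cell)]
        pixels.extend([list(pixel_row) for _ in range(cell)])
    return pixels
```

The pixel matrix can be handed to any imaging library to write an actual image file; that step is left out so the sketch needs only the standard library.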

Specify the test dataset, the model, and the dataset path, then run vision-language synergy reasoning with the following command:

python inference.py --dataset_name="arc-agi" --model="gpt-4o" --data_path="Your_data_path" \
  --result_file="result_arcagi_4o.json" \
  --save_root="images/ARC-AGI/"

Finally, score the inference results. Pass the file written via --result_file in the previous step as --input_file:

python score.py --input_file="result.json" --output_file="result_scored.json"
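
ARC tasks are conventionally judged by exact match: a prediction counts only if every cell and the grid shape agree with the target. The sketch below illustrates that criterion. The record layout, with `prediction` and `answer` keys, is a hypothetical stand-in and is not necessarily the format that score.py reads or writes.

```python
from typing import Dict, List

Grid = List[List[int]]


def exact_match(prediction: Grid, answer: Grid) -> bool:
    """True only if the two grids have the same shape and the same cells."""
    # Nested list equality already fails on any shape or value mismatch.
    return prediction == answer


def accuracy(records: List[Dict[str, Grid]]) -> float:
    """Fraction of records whose prediction exactly matches the answer."""
    if not records:
        return 0.0
    correct = sum(exact_match(r["prediction"], r["answer"]) for r in records)
    return correct / len(records)
```

Because partial credit is not given, a single wrong cell or an off-by-one grid size scores the same as a completely wrong output.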

Cases

We analyze in depth the outputs of different models (GPT-4o, Gemini-2.5-Pro-thinking-8192, o4-mini) when they use visual thinking versus textual thinking on ARC-AGI tasks. Visual thinking shows several distinct advantages, including the use of 2D structural information, a global perspective, and long-range perception.

✒️ Citation

If you find this project useful, please kindly cite:

@article{zhang2025think,
  title={Think Visually, Reason Textually: Vision-Language Synergy in ARC},
  author={Zhang, Beichen and Zang, Yuhang and Dong, Xiaoyi and Cao, Yuhang and Duan, Haodong and Lin, Dahua and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2511.15703},
  year={2025}
}

📄 License

Code License

Usage and License Notices: The code is intended and licensed for research use only.
