by Lin Long, Changdae Oh, Seongheon Park, and Sharon Li.
This repository provides tools and scripts to analyze the language prior in Vision-Language Models by examining representation distances across different layers.
1. Prepare Data

First, set the environment variable for your data path:

```bash
export DATA_PATH=/path/to/your/data
```
Create dataset files in JSONL format under the `DATA_PATH` directory. Each dataset should be named `{dataset}.jsonl`, where each line contains:

- `image`: the image path or a base64 string starting with `"data:image/"`
- `instruction`: the instruction text
- `target_tokens`: the target tokens (e.g., `["Yes", "No"]` or `["A", "B", "C", "D"]`)
- other keys: additional keys you want to include
Example JSONL entry:

```json
{"image": "/path/to/image.jpg", "instruction": "What color is the sky?", "target_tokens": ["A", "B", "C", "D"], "answer": "A"}
```
We provide reference data processing scripts for several datasets in the `data_preparation/` folder.
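If your dataset is not covered by those scripts, conversion just means writing one JSON object per line in the schema above. A minimal sketch (the source fields, paths, and the `mydataset` name below are hypothetical):

```python
import json
import os

data_path = os.environ.get("DATA_PATH", "data")
os.makedirs(data_path, exist_ok=True)

# Hypothetical raw examples; replace with however your source dataset is loaded.
raw_examples = [
    {"img": "/path/to/image1.jpg", "question": "Is there a cat in the image?", "label": "Yes"},
    {"img": "/path/to/image2.jpg", "question": "Is the sky green?", "label": "No"},
]

# Write one JSON object per line in the schema described above.
with open(os.path.join(data_path, "mydataset.jsonl"), "w") as f:
    for ex in raw_examples:
        record = {
            "image": ex["img"],
            "instruction": ex["question"],
            "target_tokens": ["Yes", "No"],
            "answer": ex["label"],  # extra keys like this are allowed
        }
        f.write(json.dumps(record) + "\n")
```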
2. Generate Hidden States
Generate hidden states for your model. Using Qwen2.5-VL as an example:

```bash
CUDA_VISIBLE_DEVICES=0 python generation/gen_qwenvl.py --dataset mme
```
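Conceptually, this step runs each example through the model once and records every layer's hidden state. A simplified sketch of that idea with Hugging Face `transformers` (illustrative only, not the repo's `generation/gen_qwenvl.py`; the checkpoint name and file paths are assumptions):

```python
import json
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumed checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda:0"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Read one example from the JSONL described above (path is an assumption).
with open("data/mme.jsonl") as f:
    example = json.loads(f.readline())
image = Image.open(example["image"]).convert("RGB")

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": example["instruction"]},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple with one tensor per layer (embeddings included),
# each of shape (batch, seq_len, hidden_dim); keep the last-token vector per layer.
per_layer = torch.stack([h[0, -1].float().cpu() for h in out.hidden_states])
```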
Multi-GPU Support: This step supports multi-GPU parallel generation. After generation is complete, merge the results:

```bash
python utils/merge.py
```
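As an illustration of what merging involves, the sketch below concatenates per-GPU shard files into a single file; the shard naming scheme and output path are assumptions, not what `utils/merge.py` actually does:

```python
import glob
import torch

# Hypothetical layout: each GPU saved its shard as results/mme_rank{k}.pt,
# where each shard is a list of per-example results.
shards = sorted(glob.glob("results/mme_rank*.pt"))

merged = []
for path in shards:
    merged.extend(torch.load(path, map_location="cpu"))

torch.save(merged, "results/mme.pt")
print(f"Merged {len(shards)} shards into {len(merged)} examples.")
```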
3. Plot Divergences

Use the plotting script to visualize representation distance curves:

```bash
python plot_divergences.py --model qwenvl --dataset mme
```
Available options:

- `--model`: Model name (e.g., `qwenvl`, `llava`, `gemma`)
- `--dataset`: Dataset name (e.g., `mme`, `mmbench`, `vlind`)
- `--data_path`: Path to the data directory (default: `"data"`)
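For intuition only, the sketch below computes and plots a simple per-layer distance curve from saved hidden states; the cosine metric, file format, and paths are stand-ins, not the actual behavior of `plot_divergences.py`:

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Hypothetical input: a list of per-example tensors of shape (num_layers, hidden_dim),
# e.g. the per-layer last-token states collected during generation.
hidden = torch.load("results/mme.pt", map_location="cpu")
stacked = torch.stack(hidden)  # (num_examples, num_layers, hidden_dim)

# Cosine distance between consecutive layers, averaged over examples.
prev, curr = stacked[:, :-1, :], stacked[:, 1:, :]
distance = (1 - F.cosine_similarity(prev, curr, dim=-1)).mean(dim=0)

plt.plot(range(1, distance.numel() + 1), distance.numpy())
plt.xlabel("Layer")
plt.ylabel("Mean cosine distance to previous layer")
plt.savefig("divergence_curve.png", dpi=150)
```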
Install the required dependencies:

```bash
pip install -r requirements.txt
```