A minimal Python scaffold for benchmarking Vision-Language Models (VLMs) on visual reasoning tasks.
Visual Reasoning Bench provides a clean, modular architecture for evaluating VLMs. It includes:
- Datasets: Extensible dataset interface yielding `{id, image_path, question, answer}` samples
- Models: Base model class with a `predict(image_path, question) -> str` interface
- Evaluation: Pipeline for running models on datasets and computing accuracy metrics
- Utilities: I/O and image processing helpers
```
visual-reasoning-bench/
├── bench/
│   ├── datasets/
│   │   ├── base.py          # Base dataset class
│   │   └── pathfinder.py    # Pathfinder visual reasoning dataset
│   ├── models/
│   │   ├── base.py          # Base model interface
│   │   └── llava.py         # LLaVA model wrapper
│   ├── evaluate/
│   │   ├── evaluator.py     # Evaluation pipeline
│   │   └── metrics.py       # Accuracy and other metrics
│   └── utils/
│       ├── io.py            # Config loading, result saving
│       └── images.py        # Image loading and preprocessing
├── scripts/
│   └── run_eval.py          # Main evaluation script
├── configs/
│   └── example.yaml         # Example configuration
└── website/
    └── index.html           # Project landing page
```
```bash
git clone https://github.com/serre-lab/visual-reasoning-bench.git
cd visual-reasoning-bench
```

Run an evaluation:

```bash
python scripts/run_eval.py --config configs/example.yaml --verbose
```

- `--config`: Path to the YAML configuration file (default: `configs/example.yaml`)
- `--output`: Path to save the results JSON (default: `results/evaluation.json`)
- `--verbose`: Show a progress bar during evaluation
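For example, to write results to a custom location (the output filename here is just illustrative):

```bash
python scripts/run_eval.py \
  --config configs/example.yaml \
  --output results/pathfinder_llava.json \
  --verbose
```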
The `VPTDataset` streams directly from Hugging Face (`3D-PC/3D-PC`). Install the `datasets` package (included in `requirements.txt`), export your OpenAI key, and run:

```bash
export OPENAI_API_KEY=sk-your-key
python scripts/run_eval.py --config configs/vpt_chatgpt.yaml --verbose
```

Tweak `configs/vpt_chatgpt.yaml` to choose `hf_config` (`depth`, `vpt-basic`, or `vpt-strategy`), pick a `split` (`train`, `validation`, `test`, `human`), or set `limit` for quick smoke tests. The loader automatically uses the dataset-provided prompt/statement when available, so the VLM sees the exact question used in the benchmark. Images stay in memory as raw bytes, so any model wrapper that accepts `image_bytes` (e.g., `ChatGPTVisionModel`) can benchmark VPT without extra preprocessing.
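For orientation, the dataset options described above might map onto the config roughly like this (a sketch only: `hf_config`, `split`, and `limit` come from the description above, while the exact nesting and the `name` value are assumptions, so check the shipped `configs/vpt_chatgpt.yaml`):

```yaml
dataset:
  name: vpt             # assumed registry name for the VPT dataset
  hf_config: vpt-basic  # one of: depth, vpt-basic, vpt-strategy
  split: validation     # one of: train, validation, test, human
  limit: 20             # optional cap for quick smoke tests
```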
Set `OPENAI_API_KEY` (or pass `api_key` to the class) and run:

```bash
export OPENAI_API_KEY=sk-your-key
python scripts/demo_chatgpt_vlm.py --question "What color is this square?"
```

The script instantiates `ChatGPTVisionModel`, feeds it `assets/demo_red_square.png`, and prints a real response from the ChatGPT VLM. Adjust `--image`, `--openai-model`, and the decoding params to probe other prompts.
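The same wrapper can also be called from Python. A minimal sketch, assuming `ChatGPTVisionModel` is exported from `bench.models` like the other wrappers, follows the `BaseModel.predict()` signature shown below, and reads `OPENAI_API_KEY` from the environment:

```python
from bench.models import ChatGPTVisionModel

model = ChatGPTVisionModel()  # or ChatGPTVisionModel(api_key="sk-...")
answer = model.predict(
    image_path="assets/demo_red_square.png",
    question="What color is this square?",
)
print(answer)
```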
Edit `configs/example.yaml` to customize your evaluation:

```yaml
dataset:
  name: pathfinder
  data_dir: ./data/pathfinder
model:
  name: llava
  model_path: null
  params:
    temperature: 0.0
    max_tokens: 512
```

All datasets inherit from `BaseDataset` and must implement `_load_data()`:
```python
from bench.datasets import BaseDataset

class MyDataset(BaseDataset):
    def _load_data(self):
        self.samples = [
            {
                'id': 'sample_0',
                'image_path': '/path/to/image.png',
                'image_bytes': None,  # Use raw bytes when no local path exists
                'question': 'What do you see?',
                'answer': 'A cat',
            },
            # ... more samples
        ]
```
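A quick sanity check might look like this, assuming the base class accepts `data_dir` (as `PathfinderDataset` does below) and calls `_load_data()` on construction; peeking at `.samples` is purely for illustration, since the evaluator normally consumes the dataset object itself:

```python
dataset = MyDataset(data_dir='./data/my_dataset')  # hypothetical data directory
for sample in dataset.samples[:3]:
    print(sample['id'], sample['question'], '->', sample['answer'])
```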
All models inherit from `BaseModel` and must implement `predict()`:

```python
from bench.models import BaseModel

class MyModel(BaseModel):
    def predict(self, image_path: str | None, question: str, image_bytes: bytes | None = None) -> str:
        # Your inference code here
        if image_bytes is None:
            with open(image_path, 'rb') as f:
                image_bytes = f.read()
        prediction = self.model.generate(image_bytes, question)
        return prediction
```

The evaluator runs a model on a dataset and computes metrics:
```python
from bench.datasets import PathfinderDataset
from bench.models import LLaVAModel
from bench.evaluate import Evaluator

dataset = PathfinderDataset(data_dir='./data/pathfinder')
model = LLaVAModel(model_path='path/to/checkpoint')
evaluator = Evaluator(model=model, dataset=dataset)
results = evaluator.evaluate(verbose=True)
print(f"Accuracy: {results['metrics']['accuracy']:.2%}")
```

To add a new dataset:

- Create a new file in `bench/datasets/`
- Inherit from `BaseDataset`
- Implement the `_load_data()` method
- Register it in `bench/datasets/__init__.py` (see the sketch after these lists)

To add a new model:

- Create a new file in `bench/models/`
- Inherit from `BaseModel`
- Implement the `predict()` method
- Register it in `bench/models/__init__.py`
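The registration step is just an import in the package `__init__.py`. A minimal sketch for the dataset case (the existing file's export style may differ; `my_dataset`/`MyDataset` are the hypothetical names from the example above); the models package works the same way:

```python
# bench/datasets/__init__.py (sketch)
from .base import BaseDataset
from .pathfinder import PathfinderDataset
from .my_dataset import MyDataset  # hypothetical new dataset module

__all__ = ['BaseDataset', 'PathfinderDataset', 'MyDataset']
```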
Add metric functions to `bench/evaluate/metrics.py`:

```python
def compute_f1_score(predictions, ground_truth, positive_label='yes'):
    # Binary F1 over exact string matches; the positive_label default is only illustrative.
    tp = sum(p == positive_label and g == positive_label for p, g in zip(predictions, ground_truth))
    fp = sum(p == positive_label and g != positive_label for p, g in zip(predictions, ground_truth))
    fn = sum(p != positive_label and g == positive_label for p, g in zip(predictions, ground_truth))
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
```

This is a scaffold implementation designed to be extended. Key areas for enhancement:
- Dataset Loading: Add proper data loading from various formats
- Model Integration: Integrate actual VLM implementations
- Image Processing: Add PIL/OpenCV for real image operations
- Metrics: Add more evaluation metrics (F1, BLEU, etc.)
- Visualization: Add result visualization tools
MIT License (or specify your license)
If you use this benchmark in your research, please cite:
```bibtex
@software{visual_reasoning_bench,
  title={Visual Reasoning Bench},
  author={Serre Lab},
  year={2024},
  url={https://github.com/serre-lab/visual-reasoning-bench}
}
```