
Training Vision-Language Process Reward Models (VL-PRMs) for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned

Training VL-PRMs with abstract reasoning problems results in strong generalization and reasoning performance improvements for the Qwen2.5-VL and Gemma 3 model families when used in test-time scaling settings.



Updates:

  • 2025-09: Released VL-PRMs on arXiv

This repository provides an overview of all resources for the paper "Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned".

Artifacts

Structure

  • rollout_data_generation/: Synthetic data creation code
  • train/: Training scripts
  • eval/: Evaluation scripts

Training

To train VL-PRM-3B/VL-PRM-7B, you can use the train/huggingface_trainer/train/sft_qwen.sh script to launch a training job, or you can launch a sweep via train/huggingface_trainer/launch_sft_qwen_sweep_clean.sh if you are on a PBS cluster.

Alternatively, you can use the train/qwen_trainer/sft_7b.sh script, which works for both VL-PRM-3B and VL-PRM-7B, or launch a sweep via train/qwen_trainer/launch_sweep_7b_sft.sh if you are on a PBS cluster. This trainer is adapted from the official Qwen2.5-VL repository. Set the image path in train/qwen_trainer/qwenvl/data/data_qwen.py or train/qwen_trainer/qwenvl/data/data_qwen_packed.py, and set your HF_TOKEN in the .env.pbs file.
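For example, a minimal launch along this second path might look like the sketch below; run it from the repository root, and treat the comments as pointers rather than guarantees, since the exact behavior lives in the scripts themselves:

# Set your HuggingFace token (for PBS sweeps, place it in .env.pbs instead)
export HF_TOKEN=<your_token>

# First edit the image path inside train/qwen_trainer/qwenvl/data/data_qwen.py
# (or data_qwen_packed.py, if you use the packed variant)
cd train/qwen_trainer
bash sft_7b.sh                  # single-node training, works for VL-PRM-3B/VL-PRM-7B
# bash launch_sweep_7b_sft.sh   # sweep launcher for PBS clusters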

The first method uses the HuggingFace Trainer API out of the box; the second also uses the HuggingFace Trainer but adds the input processing recommended by the official Qwen2.5-VL repository to speed up training. We observed similar results with both methods.

To train VL-PRM-3B/VL-PRM-7B, we recommend 8 H100/H200 GPUs (i.e., one node with 8 GPUs). We used 8 H200 GPUs for most of our jobs.

Quick start:

git clone https://github.com/theogbrand/vlprm.git
cd vlprm/train/huggingface_trainer
uv pip install -r requirements.txt
bash train/sft_qwen.sh

Note: If you encounter out-of-memory (OOM) issues, consider reducing the per-device batch size and adjusting the gradient accumulation steps accordingly.
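The per-device batch size is the main memory lever, while gradient accumulation controls the effective batch size. A hedged sketch of the adjustment, using the standard HuggingFace TrainingArguments flag names (confirm how train/sft_qwen.sh actually names and forwards these before editing):

# Effective batch = per_device_train_batch_size x gradient_accumulation_steps x n_gpus.
# Shrinking the per-device batch lowers peak memory; adjust accumulation to
# keep the effective batch size close to the original run.
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \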

Evaluation

We cloned VLMEvalKit at commit db0c9ae2c7c2150b9b730b88326ebfb0bfb91356 and modified it for each base policy model and evaluation task.
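To pin the same baseline, you can check out that commit directly. The upstream URL below is an assumption (VLMEvalKit is commonly hosted under open-compass); substitute your own mirror if needed:

# Clone VLMEvalKit and pin it to the commit referenced above
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
git checkout db0c9ae2c7c2150b9b730b88326ebfb0bfb91356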

We recommend vLLM 0.10.1.1 with the V1 engine, transformers 4.55.2, and flash-attn 2.8.0.post2 for inference. We observed varying results across models when different versions of these key packages were used, and found that fixing these versions gave the best overall performance.
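For instance, the pinned inference stack can be installed in one step (uv shown to match the quick start; flash-attn is installed separately because it may need --no-build-isolation to compile in some environments):

uv pip install "vllm==0.10.1.1" "transformers==4.55.2"
uv pip install "flash-attn==2.8.0.post2" --no-build-isolation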

You can launch the evaluation using the corresponding launch scripts in each evaluation directory. For example, eval/tts_eval/reward_guided_search/vllm_launch_bon_evaluation.sh.
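A best-of-N run, for instance, is launched directly from the repository root:

bash eval/tts_eval/reward_guided_search/vllm_launch_bon_evaluation.sh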

Evaluation result files can be found at: https://huggingface.co/datasets/ob11/VL-PRM-Evaluation-Results/

Data

To recreate VL-PRM-300K, follow the steps in rollout_data_generation/README.md.

Known Issues

  • Reproducing the results in the paper requires using the exact same seed for synthetic data generation and for training the VL-PRMs.

Citation

@misc{ong2025vlprms,
      title={VL-PRMs: Vision-Language Process Reward Models},
      author={Brandon Ong and Tej Deep Pala and Vernon Toh and William Chandra Tjhi and Soujanya Poria},
      year={2025},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={},
}
