MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models via Reinforcement Learning
Run the setup script to configure the environment:
bash setup.sh

This script will:
- Create the conda environment medvlm-r1
- Install the necessary dependencies
- Configure the open-r1-multimodal framework
Use the Jupyter notebook to try the model quickly:

jupyter notebook demo.ipynb

The demo includes:
- Model loading
- Medical image VQA examples
- Inference process demonstration
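If you prefer a plain script over the notebook, the sketch below shows one way to load the checkpoint and run a single VQA query with the Hugging Face Transformers Qwen2-VL classes. It is a minimal illustration, not a copy of demo.ipynb: the model path, image path, and question text are placeholders, and the exact prompt format used in the demo may differ.

# Minimal sketch (not the notebook itself); replace the placeholders before running.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_path = "<MODEL_REPO_OR_DIR>"  # Hugging Face repo ID or local checkpoint directory
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# One multiple-choice VQA turn: an image plus a question with answer options (example question).
question = (
    "What abnormality is shown in this knee MRI? "
    "A) Chondral abnormality B) Meniscal tear C) Normal study D) Fracture. "
    "Reason inside <think></think>, then give the option letter inside <answer></answer>."
)
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example_mri.png")  # placeholder image path
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
generated = output_ids[:, inputs["input_ids"].shape[1]:]  # keep only the newly generated tokens
print(processor.batch_decode(generated, skip_special_tokens=True)[0])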
The model generates a structured reasoning process:
<think>
The image is a magnetic resonance imaging (MRI) scan of a knee joint. The scan shows a chondral abnormality, which is a type of cartilage damage. This is evident from the irregular shape and the presence of a defect in the cartilage.
</think>
<answer>A</answer>
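For downstream use you usually only need the option letter inside the <answer> tag. Below is a small, illustrative helper for pulling both fields out of the generated text; the evaluation code in src/eval/test_qwen2vl_med.py may parse the output differently.

import re

def parse_output(text):
    """Extract the reasoning and the final answer from <think>/<answer> tags.

    Returns (reasoning, answer); either element is None if its tag is missing.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

reasoning, answer = parse_output("<think>The MRI shows a chondral defect.</think><answer>A</answer>")
print(answer)  # -> A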
Download the PubMedVision dataset (the dataset behind HuatuoGPT-Vision) via the Hugging Face CLI:
# 1) Install Hugging Face CLI (if not already)
pip install -U "huggingface_hub[cli]"
# 2) (Optional) Login if the dataset requires auth
# huggingface-cli login
# 3) Download the dataset to a local directory
# Replace <TARGET_DIR> with your local path, e.g., /data/datasets/PubMedVision
hf download FreedomIntelligence/PubMedVision \
--repo-type dataset \
--local-dir <TARGET_DIR> \
--local-dir-use-symlinks False \
--include "*"
# After downloading, set <DATASET_PATH_ROOT>=<TARGET_DIR> in your scripts

The dataset contains:
- MRI, CT, X-ray medical images
- Corresponding visual question-answer pairs
- Multi-modal medical reasoning tasks
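After the download finishes, the annotations are JSON files that pair image paths with question-answer conversations. The snippet below is only a hedged way to inspect what you downloaded; it assumes the annotation files sit at the top level of <TARGET_DIR> as JSON lists, so check the directory listing and adjust if the layout differs.

import json
from pathlib import Path

dataset_root = Path("<TARGET_DIR>")  # the directory passed to `hf download` above

# No file names are hard-coded here on purpose: list what was actually downloaded.
json_files = sorted(dataset_root.glob("*.json"))
print("annotation files:", [p.name for p in json_files])

# Peek at the first record of the first annotation file (these files can be large).
with open(json_files[0], "r", encoding="utf-8") as f:
    records = json.load(f)  # assumed to be a top-level list of records
print(len(records), "records")
print(json.dumps(records[0], indent=2, ensure_ascii=False)[:600])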
Run the training script:
bash train_script.sh

Note: please update the following placeholders in the script:
- <DATASET_NAME>: Dataset name
- <GPU_NUM>: Number of GPUs
- <LOG_PATH>: Log output path
- <HF_CACHE_DIR>: Hugging Face cache directory
- <WANDB_ENTITY>: Weights & Biases entity
- <WANDB_PROJECT>: Weights & Biases project name
- <OUTPUT_DIR_ROOT>: Output directory root path
- <MODEL_REPO_OR_DIR>: Model path (Hugging Face repo ID or local directory)
- <DATASET_PATH_ROOT>: Dataset root path
- <MASTER_ADDR>: Master node address
- <MASTER_PORT>: Master node port
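Training uses GRPO (see grpo.py and trainer/grpo_trainer.py under src/open-r1-multimodal/src/open_r1/), which scores sampled completions with rule-based rewards rather than a learned reward model. As a rough illustration only, and not a copy of the rewards defined in grpo.py, a format reward can check the <think>/<answer> structure while an accuracy reward compares the extracted option letter with the ground truth:

import re

def format_reward(completion):
    """1.0 if the completion follows <think>...</think><answer>...</answer>, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion, ground_truth):
    """1.0 if the option letter inside <answer> matches the ground-truth letter, else 0.0."""
    match = re.search(r"<answer>\s*([A-D])", completion)
    return 1.0 if match and match.group(1) == ground_truth.strip() else 0.0

print(format_reward("<think>Cartilage defect visible.</think><answer>A</answer>"))  # 1.0
print(accuracy_reward("<think>...</think><answer>A</answer>", "A"))                 # 1.0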
Run the testing script:
bash test_script.sh

Note: please update the following placeholders in the script:
- <HF_CACHE_DIR>: Hugging Face cache directory
- <CUDA_DEVICES>: CUDA devices
- <MODEL_REPO_OR_DIR>: Model path (Hugging Face repo ID or local directory)
- <DATASET_PATH_ROOT>: Dataset root path
- <OUTPUT_DIR>: Output directory
The testing script supports the following parameters:
- MODALITY: Modality type (MRI, CT, Ultrasound, Xray, Dermoscopy, Microscopy, Fundus)
- PROMPT_TYPE: Prompt type (simple, complex)
- BSZ: Batch size
- MAX_NEW_TOKENS: Maximum number of new tokens to generate
- DO_SAMPLE: Whether to sample during decoding
- TEMPERATURE: Sampling temperature
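These knobs map onto the usual Hugging Face generation arguments. The sketch below only illustrates a plausible mapping and batching scheme; the authoritative logic is in src/eval/test_qwen2vl_med.py, and build_inputs is a hypothetical helper standing in for the per-batch processor call.

# Plausible mapping from the script parameters to generate() arguments (illustration only).
BSZ = 8                # evaluation batch size
MAX_NEW_TOKENS = 512   # cap on newly generated tokens per sample
DO_SAMPLE = False      # greedy decoding when False
TEMPERATURE = 1.0      # only relevant when DO_SAMPLE is True

gen_kwargs = {"max_new_tokens": MAX_NEW_TOKENS, "do_sample": DO_SAMPLE}
if DO_SAMPLE:
    gen_kwargs["temperature"] = TEMPERATURE

def batched(items, batch_size):
    """Yield successive chunks of batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# for batch in batched(test_samples, BSZ):
#     inputs = build_inputs(batch)                       # hypothetical helper
#     outputs = model.generate(**inputs, **gen_kwargs)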
Project structure:

r1-v-med/
├── demo.ipynb                          # Demo notebook
├── setup.sh                            # Setup script
├── train_script.sh                     # Training script
├── test_script.sh                      # Testing script
├── MRI_CT_XRAY_300each_dataset.json    # Test dataset
├── images/                             # Example images
│   ├── successful_cases/               # Successful cases
│   └── failure_cases/                  # Failure cases
└── src/
    ├── eval/                           # Evaluation code
    │   └── test_qwen2vl_med.py         # Testing script
    ├── distill_r1/                     # R1 distillation related
    └── open-r1-multimodal/             # Base framework
        └── src/open_r1/
            ├── grpo.py                 # GRPO training code
            └── trainer/
                └── grpo_trainer.py     # GRPO trainer
If you find our work helpful, please cite:
@article{pan2025medvlm,
title={MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning},
author={Pan, Jiazhen and Liu, Che and Wu, Junde and Liu, Fenglin and Zhu, Jiayuan and Li, Hongwei Bran and Chen, Chen and Ouyang, Cheng and Rueckert, Daniel},
journal={arXiv preprint arXiv:2502.19634},
year={2025}
}

Our code is based on the following open-source projects:
- open-r1-multimodal: https://github.com/EvolvingLMMs-Lab/
- R1-V: https://github.com/StarsfieldAI/R1-V
Thanks to these excellent open-source projects for providing a solid foundation for our research.
This project is licensed under the Apache 2.0 License.