A GPT-4V annotated preference dataset for large vision language models.
[Project Page] [Datasets] [Silkie Model] [Paper]
The instructions are sampled from various domains to cover different capabilities of LVLMs.
We construct a model pool consisting of 12 LVLMs (a sketch of how responses could be collected from such a pool follows the list):
- GPT-4V
- LLaVA-series
  - LLaVA-v1.5-7B
  - LLaVA-v1.5-13B
  - LLaVA-RLHF-7b-v1.5-224
  - LLaVA-RLHF-13b-v1.5-336
- Qwen-VL-7B
- IDEFICS-9b-Instruct
- Fuyu-8B
- InstructBLIP-series
  - InstructBLIP-Vicuna-7B
  - InstructBLIP-Vicuna-13B
- VisualGLM-6B
- MMICL-Vicuna-13B
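As a minimal sketch of how responses could be collected from such a pool for later preference annotation, consider the snippet below. The helper `generate_response` and the choice of sampling four models per instruction are illustrative assumptions, not the exact pipeline used to build VLFeedback.

```python
import random

# Model pool mirroring the list above.
MODEL_POOL = [
    "GPT-4V",
    "LLaVA-v1.5-7B", "LLaVA-v1.5-13B",
    "LLaVA-RLHF-7b-v1.5-224", "LLaVA-RLHF-13b-v1.5-336",
    "Qwen-VL-7B", "IDEFICS-9b-Instruct", "Fuyu-8B",
    "InstructBLIP-Vicuna-7B", "InstructBLIP-Vicuna-13B",
    "VisualGLM-6B", "MMICL-Vicuna-13B",
]

def generate_response(model_name: str, instruction: str, image_path: str) -> str:
    # Placeholder: in practice this would call the corresponding LVLM's
    # inference code; a dummy string is returned here for illustration.
    return f"[{model_name}] answer to '{instruction}' about {image_path}"

def collect_responses(instruction: str, image_path: str, num_models: int = 4):
    """Sample a few models from the pool and record their responses,
    ready to be ranked (e.g. by GPT-4V) into preference pairs."""
    sampled = random.sample(MODEL_POOL, num_models)
    return [
        {"model": name, "response": generate_response(name, instruction, image_path)}
        for name in sampled
    ]
```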
We select Qwen-VL-Chat as the backbone model and perform DPO on our dataset.
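For readers unfamiliar with DPO, the core objective is sketched below in plain PyTorch. The tensor names and the `beta=0.1` default are assumptions for illustration; the actual training scripts handle this computation internally.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each input is a batch of per-example sequence log-probabilities
    (log-prob of the response summed over its tokens) under the policy
    being trained and under the frozen reference model. beta = 0.1 is a
    common default, not necessarily the value used for Silkie.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage the policy to assign a larger implicit reward to the
    # preferred response than to the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```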
The resulting model, Silkie, achieves comprehensive improvements across various benchmarks.
To run our training scripts, create a virtual environment and install the dependencies first.
```bash
conda create -n silkie python=3.10 && conda activate silkie
pip install -r requirements.txt
```
Our training scripts support both single-node and multi-node training.
We provide a `launch_dpo.py` script that handles both cases. If you want to launch a job locally, you can use:

```bash
python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR
```
If you want to launch a job on a Slurm cluster, specify `GPUS_PER_NODE` in `launch_dpo.py` and run:

```bash
python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR --gpus $NUM_GPUS
```
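To make the configuration step concrete, the sketch below writes a hypothetical DPO config to a YAML file. Every key and value here (model path, beta, learning rate, batch sizes, output directory) is an assumption about typical DPO hyperparameters; the fields actually expected by `launch_dpo.py` are the ones defined in `dpo_config/example.yaml`.

```python
import yaml  # PyYAML

# Hypothetical configuration; the real schema is defined by the repository's
# dpo_config/example.yaml, not by these keys.
example_config = {
    "model_name_or_path": "Qwen/Qwen-VL-Chat",
    "dataset_path": "path/to/vlfeedback",      # placeholder path
    "beta": 0.1,                               # common DPO default
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 1,
    "output_dir": "outputs/silkie-dpo",
}

with open("my_dpo_config.yaml", "w") as f:
    yaml.safe_dump(example_config, f, sort_keys=False)
```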
```bibtex
@article{2023vlfeedback,
  author  = {Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and Yazheng Yang and Benyou Wang and Lingpeng Kong},
  title   = {Silkie: Preference Distillation for Large Visual Language Models},
  journal = {arXiv preprint arXiv:2312.10665},
  year    = {2023}
}
```
We would like to thank the authors of trl and Qwen-VL for their great work.