1. Prepare environment
Clone our repository, then create a Python environment and activate it with the following commands:
```
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigptv
```
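Optionally, run a quick sanity check to confirm the environment is usable; this assumes environment.yml installs PyTorch with CUDA support, which the training step below requires.
```
# Should print the PyTorch version and a non-zero GPU count on a training node
python -c "import torch; print(torch.__version__, torch.cuda.device_count())"
```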
2. Prepare the pretrained LLM weights
MiniGPT-v2 is based on Llama-2-chat-7B. Download the corresponding LLM weights from the following Hugging Face space, for example with the Hugging Face download tooling (a sketch is shown below).
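One possible way to fetch the weights is the huggingface_hub CLI sketched here; the repository id and local path are illustrative, and the gated Llama-2 repositories require a Hugging Face token with the license accepted.
```
# Assumes `pip install -U huggingface_hub` and an HF token with Llama-2 access granted
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
  --local-dir ./llama-2-7b-chat-hf \
  --token $HF_TOKEN
```
After downloading, point the LLM weight path in the training and evaluation configs to this folder (check the config files for the exact key name).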
3. Prepare the pretrained model checkpoints
Download the stage-2 pretrained MiniGPT4-v2 checkpoint from here and put it at MLLM-DataEngine-v2/MiniGPT-4/checkpoint_stage2.pth
4. Download the training datasets
| Image source | Download path |
| --- | --- |
| COCO 2014 images | images, captions |
| COCO VQA | vqa train, vqa val |
| Visual Genome | images part1, images part2, image meta data |
| TextCaps | images, annotations |
| RefCOCO | annotations |
| RefCOCO+ | annotations |
| RefCOCOg | annotations |
| OKVQA | annotations |
| AOK-VQA | annotations |
| OCR-VQA | annotations |
| GQA | images, annotations |
| Filtered Flickr-30k | annotations |
| Multi-task conversation | annotations |
| Filtered unnatural instruction | annotations |
| LLaVA | Complex reasoning, Detailed description, Conversation |
Download the MLLM-DataEngine generated data from Hugging Face or OpenDataLab, and put dataengine_minigpt4.json under:
```
train_dataset
└── data_engine
    └── dataengine_minigpt4.json
...
```
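For convenience, the full train_dataset skeleton used in the steps below can be created up front; this is only a sketch, and the folder names simply mirror the directory trees listed in this section.
```
# Create the folders referenced by the dataset layouts below
mkdir -p train_dataset/{data_engine,COCO2014,vqav2,vg,textcaps,okvqa,aokvqa,ocrvqa,gqa,filtered_flickr,multitask_conversation,unnatural_instructions,llava}
mkdir -p train_dataset/refcoco/{refcoco,refcoco+,refcocog}
# Assuming dataengine_minigpt4.json was downloaded to the current directory
mv dataengine_minigpt4.json train_dataset/data_engine/
```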
Download the COCO 2014 images and captions, and put them as follows:
```
train_dataset
└── COCO2014
    ├── train
    └── coco_karpathy_train.json
...
```
Download the VQAv2 train and validation json files
```
train_dataset
├── vqav2
│   ├── vqa_train.json
│   └── vqa_val.json
```
Download the Visual Genome images and annotation files
```
train_dataset
├── vg
│   ├── VG_100K
│   ├── VG_100K_2
│   ├── region_descriptions.json
│   └── image_data.json
...
```
Download the TextCaps images and annotation files
```
train_dataset
├── textcaps
│   ├── train_images
│   └── TextCaps_0.1_train.json
```
Download the RefCOCO, RefCOCO+, RefCOCOg annotation files
```
train_dataset
├── refcoco
│   ├── refcoco
│   │   ├── instances.json
│   │   ├── refs(google).p
│   │   └── refs(unc).p
│   ├── refcoco+
│   │   ├── instances.json
│   │   └── refs(unc).p
│   └── refcocog
│       ├── instances.json
│       ├── refs(google).p
│       └── refs(umd).p
...
```
Download the OKVQA annotation files
```
train_dataset
├── okvqa
│   └── okvqa_train.json
```
Download the AOK-VQA annotation dataset
```
export AOKVQA_DIR=YOUR_DATASET_PATH
mkdir -p ${AOKVQA_DIR}
curl -fsSL https://prior-datasets.s3.us-east-2.amazonaws.com/aokvqa/aokvqa_v1p0.tar.gz | tar xvz -C ${AOKVQA_DIR}
```
```
train_dataset
├── aokvqa
│   └── aokvqa_v1p0_train.json
```
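The tar command above extracts the AOK-VQA annotations into ${AOKVQA_DIR}; a hedged sketch of copying the train split into the layout shown above:
```
mkdir -p train_dataset/aokvqa
# Assumes the tarball extracted aokvqa_v1p0_train.json directly into ${AOKVQA_DIR}
cp ${AOKVQA_DIR}/aokvqa_v1p0_train.json train_dataset/aokvqa/
```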
Download the OCR-VQA annotation files, and download the images with the loadDataset.py script
```
train_dataset
├── ocrvqa
│   ├── images
│   └── dataset.json
```
Download the GQA annotation files and images
```
train_dataset
├── gqa
│   ├── images
│   └── train_balanced_questions.json
```
Download the filtered Flickr-30k images (fill in the form on the official website, or download from Kaggle) and the annotation files
```
train_dataset
├── filtered_flickr
│   ├── images
│   ├── captiontobbox.json
│   ├── groundedcaption.json
│   └── phrasetobbox.json
...
```
Download the multi-task conversation dataset
```
train_dataset
├── multitask_conversation
│   └── multitask_conversation.json
...
```
Download the filtered unnatural instruction annotation files (we removed the very long sentences from the original Unnatural Instructions dataset)
```
train_dataset
├── unnatural_instructions
│   └── filtered_unnatural_instruction.json
```
Download the LLaVA annotation files
```
train_dataset
├── llava
│   ├── conversation_58k.json
│   ├── detail_23k.json
│   └── complex_reasoning_77k.json
```
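Before launching training, a quick check that the main annotation files listed above are in place can save a failed run; the list below just mirrors the directory trees in this section and can be extended as needed.
```
# Report any of the expected annotation files that are still missing
for f in \
  data_engine/dataengine_minigpt4.json \
  COCO2014/coco_karpathy_train.json \
  vqav2/vqa_train.json \
  vg/image_data.json \
  textcaps/TextCaps_0.1_train.json \
  refcoco/refcoco/instances.json \
  okvqa/okvqa_train.json \
  aokvqa/aokvqa_v1p0_train.json \
  ocrvqa/dataset.json \
  gqa/train_balanced_questions.json \
  filtered_flickr/phrasetobbox.json \
  multitask_conversation/multitask_conversation.json \
  unnatural_instructions/filtered_unnatural_instruction.json \
  llava/complex_reasoning_77k.json; do
  [ -f "train_dataset/$f" ] || echo "missing: train_dataset/$f"
done
```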
5. Stage-3 training
We perform stage-3 training on 8 A100 GPUs, which takes 8-10 hours. Run the following command to train the model:
```
torchrun --master-port $RANDOM --nproc_per_node 8 train.py --cfg-path train_configs/minigptv2_finetune_dataengine.yaml
```
6. Evaluation
- For evaluation on downstream datasets, first download the evaluation datasets and put the folder under MLLM-DataEngine-v2/MiniGPT-4.
- Change the ckpt key in eval_configs/minigptv2_benchmark_evaluation.yaml to the model you trained. To reproduce the results in the paper, download the model from here and set ckpt to dataengine_minigpt4v2.pth instead; a command-line sketch of this edit is shown below.
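For example, assuming the checkpoint path is stored under a ckpt: entry in that YAML file (check the exact key and indentation before running), the edit can be scripted as a one-liner; the checkpoint path here is a placeholder.
```
# Point the evaluation config at your trained (or downloaded) checkpoint
sed -i 's|ckpt: .*|ckpt: "/path/to/your/checkpoint.pth"|' \
  eval_configs/minigptv2_benchmark_evaluation.yaml
```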
- Download the SEED-Bench images (not video frames) and put them under evaluation_dataset/SEED-Bench-image
- Inference on SEED-Bench
```
torchrun --master-port $RANDOM --nproc_per_node 1 eval_scripts/eval_vqa.py --cfg-path ./eval_configs/minigptv2_benchmark_evaluation.yaml --dataset seed
```
- Calculate results
```
python eval_scripts/convert_seed_for_submission_minigpt4.py \
    --annotation-file ./evaluation_dataset/seed/SEED-Bench-image.json \
    --result-file ./evaluation_results/seed.jsonl
```
- Inference on MMBench
```
torchrun --master-port $RANDOM --nproc_per_node 1 eval_scripts/eval_vqa.py --cfg-path ./eval_configs/minigptv2_benchmark_evaluation.yaml --dataset mmbench
```
- Convert results to MMBench format
```
python eval_scripts/convert_mmbench_for_submission.py \
    --annotation-file evaluation_dataset/mmbench/mmbench_dev_20230712.tsv \
    --result-file evaluation_results/mmbench.jsonl \
    --output-file evaluation_results/mmbench.xlsx
```
- Submit the results to the MMBench evaluation server
- COCO2014 val: download the COCO2014 validation images and put them under evaluation_dataset/coco2014_val/
- VizWiz: download the VizWiz validation set images from here and put them under evaluation_dataset/vizwiz/vizwiz_images
- VSR: download the VSR images from here and put them under evaluation_dataset/vsr/vsr_images
- Inference on OKVQA, VizWiz, and VSR
```
torchrun --master-port $RANDOM --nproc_per_node 1 eval_scripts/eval_vqa.py --cfg-path ./eval_configs/minigptv2_benchmark_evaluation.yaml --dataset okvqa,vizwiz,vsr
```
| Incremental Dataset | Data Amount | SEED | MMB | OKVQA | VizWiz | VSR |
| --- | --- | --- | --- | --- | --- | --- |
| None (baseline) | - | 49.21 | 38.83 | 56.03 | 53.08 | 61.37 |
| MLLM-DataEngine | 270k | 63.83 | 52.92 | 56.87 | 54.39 | 62.43 |