MLLM-DataEngine for MiniGPT4-v2

Installation

1. Prepare environment

Clone our repository, then create a Python environment and activate it with the following commands:

cd MiniGPT-4
conda env create -f environment.yml
conda activate minigptv

2. Prepare the pretrained LLM weights

MiniGPT-v2 is based on Llama2-chat-7b. Download the corresponding LLM weights from the Hugging Face hub.
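One way to fetch the weights is with the huggingface_hub library. The sketch below is only an illustration; the repo id and local directory are assumptions (the Llama 2 weights also require an accepted license and an access token), so adjust them to your setup.

```python
# Sketch only: download the Llama-2-7b-chat weights with huggingface_hub.
# The repo id and local_dir below are assumptions; adjust to your setup.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",  # requires an accepted license and an HF access token
    local_dir="Llama-2-7b-chat-hf",           # point your model config at this directory
)
```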

3. Prepare the pretrained model checkpoints

Download the stage-2 pretrained MiniGPT4-v2 checkpoint from here and put it at MLLM-DataEngine-v2/MiniGPT-4/checkpoint_stage2.pth

Data Preparation

Download the datasets for finetuning MiniGPT-v2

Download the dataset

| Image source | Download path |
| :--- | :--- |
| COCO 2014 images | images &nbsp;&nbsp; captions |
| COCO VQA | vqa train &nbsp;&nbsp; vqa val |
| Visual Genome | images part1 &nbsp;&nbsp; images part2 &nbsp;&nbsp; image meta data |
| TextCaps | images &nbsp;&nbsp; annotations |
| RefCOCO | annotations |
| RefCOCO+ | annotations |
| RefCOCOg | annotations |
| OKVQA | annotations |
| AOK-VQA | annotations |
| OCR-VQA | annotations |
| GQA | images &nbsp;&nbsp; annotations |
| Filtered Flickr-30k | annotations |
| Multi-task conversation | annotations |
| Filtered unnatural instruction | annotations |
| LLaVA | Complex reasoning &nbsp;&nbsp; Detailed description &nbsp;&nbsp; Conversation |

MLLM-DataEngine generated data

Download the MLLM-DataEngine generated data from Hugging Face or OpenDataLab, and put dataengine_minigpt4.json under:

train_dataset
└── data_engine
    └── dataengine_minigpt4.json
...
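As a quick sanity check that the file is in place, the sketch below loads it and prints the number of top-level entries; it assumes only the layout above and that the file is valid JSON.

```python
# Sanity check: confirm dataengine_minigpt4.json is where the training config expects it.
import json
from pathlib import Path

path = Path("train_dataset/data_engine/dataengine_minigpt4.json")
with path.open() as f:
    data = json.load(f)
print(f"Loaded {len(data)} top-level entries from {path}")
```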

COCO captions

Download the COCO 2014 images and captions, and put them as follows:

train_dataset
└── COCO2014
    ├── train
    └── coco_karpathy_train.json
...

COCO VQA

Download the VQAv2 train and validation JSON files

train_dataset
├── vqav2
│   ├── vqa_train.json
│   └── vqa_val.json

Visual Genome

Download the Visual Genome images and annotation files

train_dataset
├── vg
│   ├── VG_100K
│   ├── VG_100K_2
│   ├── region_descriptions.json
│   └── image_data.json
...

TextCaps

Download the TextCaps images and annotation files

train_dataset
├── textcaps
│   ├── train_images
│   └── TextCaps_0.1_train.json

RefCOCO, RefCOCO+, RefCOCOg

Download the RefCOCO, RefCOCO+, RefCOCOg annotation files

train_dataset
├── refcoco
│   ├── refcoco
│   │   ├── instances.json
│   │   ├── refs(google).p
│   │   └── refs(unc).p
│   ├── refcoco+
│   │   ├── instances.json
│   │   └── refs(unc).p
│   └── refcocog
│       ├── instances.json
│       ├── refs(google).p
│       └── refs(umd).p
...

OKVQA

train_dataset
├── okvqa
    ├── okvqa_train.json

AOK-VQA

Download the AOK-VQA annotation dataset

export AOKVQA_DIR=YOUR_DATASET_PATH
mkdir -p ${AOKVQA_DIR}
curl -fsSL https://prior-datasets.s3.us-east-2.amazonaws.com/aokvqa/aokvqa_v1p0.tar.gz | tar xvz -C ${AOKVQA_DIR}

train_dataset
├── aokvqa
    ├── aokvqa_v1p0_train.json

OCR-VQA

Download the OCR-VQA annotation files and download the images with the loadDataset.py script

train_dataset
├── ocrvqa
    ├── images
    ├── dataset.json

GQA

Download the GQA annotation files and images

train_dataset
├── gqa
    ├── images
    ├── train_balanced_questions.json

Filtered Flickr-30k

Download the filtered Flickr-30k images (fill out this form on the official website, or get them from Kaggle) and the annotation files

train_dataset
├── filtered_flickr
│   ├── images
│   ├── captiontobbox.json
│   ├── groundedcaption.json
│   └── phrasetobbox.json
...

Multi-task conversation

Download the multi-task conversation dataset

train_dataset
├── multitask_conversation
│   └── multitask_conversation.json
...

Unnatural instruction

Download the filtered unnatural instruction annotation files (we removed the very long sentences from the original unnatural instruction dataset)

train_dataset
    ├── unnatural_instructions
        ├── filtered_unnatural_instruction.json

LLaVA

train_dataset
    ├── llava
        ├── conversation_58k.json
        ├── detail_23k.json
        ├── complex_reasoning_77k.json

Training

We perform the stage-3 training on 8×A100 GPUs, which takes 8-10 hours. Run the following command to train the model:

torchrun --master-port $RANDOM --nproc_per_node 8 train.py --cfg-path train_configs/minigptv2_finetune_dataengine.yaml

Evaluation

  1. For evaluation on downstream datasets, first download the evaluation datasets and put the folder under MLLM-DataEngine-v2/MiniGPT-4.

  2. Change the ckpt key in eval_configs/minigptv2_benchmark_evaluation.yaml to the checkpoint you trained. To reproduce the results in the paper, download the model from here and set ckpt to dataengine_minigpt4v2.pth; a scripted alternative is sketched below.
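If you prefer to script the config change, a minimal sketch with PyYAML follows; it assumes the key sits at model.ckpt inside the YAML file (adjust the key path to the actual config), and editing the file by hand works just as well.

```python
# Sketch only: point the evaluation config at a checkpoint.
# Assumes PyYAML is installed and that the key lives at model.ckpt in this config.
import yaml

cfg_path = "eval_configs/minigptv2_benchmark_evaluation.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["model"]["ckpt"] = "dataengine_minigpt4v2.pth"  # or the checkpoint you trained yourself

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```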

SEED-Bench

  1. Download the SEED-Bench images (not video frames) and put them under evaluation_dataset/SEED-Bench-image

  2. Inference on SEED-Bench

torchrun --master-port $RANDOM --nproc_per_node 1 eval_scripts/eval_vqa.py --cfg-path ./eval_configs/minigptv2_benchmark_evaluation.yaml --dataset seed
  3. Calculate results
python eval_scripts/convert_seed_for_submission_minigpt4.py \
    --annotation-file ./evaluation_dataset/seed/SEED-Bench-image.json \
    --result-file ./evaluation_results/seed.jsonl

MMBench

  1. Inference on MMBench
torchrun --master-port $RANDOM --nproc_per_node 1 eval_scripts/eval_vqa.py --cfg-path ./eval_configs/minigptv2_benchmark_evaluation.yaml --dataset mmbench
  2. Convert the results to MMBench format
python eval_scripts/convert_mmbench_for_submission.py \
    --annotation-file evaluation_dataset/mmbench/mmbench_dev_20230712.tsv \
    --result-file evaluation_results/mmbench.jsonl \
    --output-file evaluation_results/mmbench.xlsx
  3. Submit the results to the evaluation server

OKVQA, VizWiz, VSR

COCO2014 val: download the COCO 2014 validation images and put them under evaluation_dataset/coco2014_val/

VizWiz: download the VizWiz validation set images from here and put them under evaluation_dataset/vizwiz/vizwiz_images

VSR: download the VSR images from here and put them under evaluation_dataset/vsr/vsr_images

torchrun --master-port $RANDOM --nproc_per_node 1 eval_scripts/eval_vqa.py --cfg-path ./eval_configs/minigptv2_benchmark_evaluation.yaml --dataset okvqa,vizwiz,vsr

Main Results

| Incremental Dataset | Data Amount | SEED | MMB | OKVQA | VizWiz | VSR |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| None (baseline) | - | 49.21 | 38.83 | 56.03 | 53.08 | 61.37 |
| MLLM-DataEngine | 270k | 63.83 | 52.92 | 56.87 | 54.39 | 62.43 |