Run Luo1,2*, Yunshui Li1,2*, Longze Chen1,2*, Wanwei He1,2, Ting-En Lin5, Ziqiang Liu1,2, Lei Zhang1,2
Zikai Song6, Xiaobo Xia4, Tongliang Liu4, Min Yang1,2🌟, Binyuan Hui3🌟
* Equal contribution 🌟 Corresponding author
1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Alibaba Group
4 The University of Sydney
5 Tsinghua University
6 HUST
- [07/21] 🔥 DEEM is coming! We have released the code, models, and data for DEEM!
- [05/24] 🔥 DEEM is coming! We have released the DEEM paper!
Please follow the instructions below to install the required packages.
- Clone this repository
git clone https://github.com/RainBowLuoCS/DEEM.git
- Install Package
conda create -n deem python=3.10 -y
conda activate deem
cd DEEM
pip install -r requirements.txt
# install `MultiScaleDeformableAttention` module
cd uni_interleaved/models/utils/ops
python setup.py install
- Download all pretrained model components from Hugging Face into the `assets/` directory by running the following command:
python scripts/download_models.py
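With the packages installed, you can verify that the `MultiScaleDeformableAttention` CUDA extension from the build step above is importable. A minimal sanity check, assuming the op installs under the module name `MultiScaleDeformableAttention` (as in Deformable-DETR-style builds):

```python
# Sanity check: confirm the MultiScaleDeformableAttention CUDA extension built correctly.
# Assumption: the op from uni_interleaved/models/utils/ops installs under the module
# name "MultiScaleDeformableAttention" (Deformable-DETR-style setup.py).
try:
    import MultiScaleDeformableAttention as MSDA  # compiled CUDA op
    print("MultiScaleDeformableAttention extension found:", MSDA.__file__)
except ImportError as err:
    print("Extension not built or not installed correctly:", err)
```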
Here are the pretrained weights trained on Stage 1 data only:
| Model | Diffusion Model | Base LLM | Vision Encoder | Pretrain Data | Download |
|---|---|---|---|---|---|
| MM-interleaved-7B (Baseline) | SD 2.1 🔥 | Vicuna-7B-v1.5 | ConvNext-B | MMC4+LAION | ckpt |
| DEEM-7B | SD 2.1 | Vicuna-7B-v1.5 | ConvNext-B 🔥 | MMC4+LAION | ckpt |
| (DEEM+MM-interleaved)-7B | SD 2.1 🔥 | Vicuna-7B-v1.5 | ConvNext-B 🔥 | MMC4+LAION | ckpt |
We provide all our fully finetuned models on Stage 2 and 3 data for DEEM:
| Model | Base LLM | Vision Encoder | Finetuning Data | Download |
|---|---|---|---|---|
| DEEM-VQA 7B | Vicuna-7B-v1.5 | ConvNext-B | LLaVA-665k+VQA+COCO | ckpt |
| DEEM-MASK 7B | Vicuna-7B-v1.5 | ConvNext-B | ReferCOCO+VG+PartData | ckpt |
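Once a checkpoint is downloaded, a quick way to inspect it is to load the state dict on CPU and list a few parameter names. This is a minimal sketch assuming the release is a standard PyTorch state dict at a hypothetical local path; adjust the path and unwrapping logic to the actual release format:

```python
import torch

# Hypothetical local path -- point this at the checkpoint you downloaded.
ckpt_path = "assets/deem_7b/pytorch_model.bin"

# Load on CPU so inspection does not require a GPU.
state = torch.load(ckpt_path, map_location="cpu")

# Some releases wrap the weights under a "state_dict" / "model" key; unwrap if present.
for key in ("state_dict", "model"):
    if isinstance(state, dict) and key in state:
        state = state[key]

print(f"{len(state)} tensors")
for name, tensor in list(state.items())[:10]:
    print(name, tuple(tensor.shape))
```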
Please follow MM-Interleaved, LLaVA, and Osprey to prepare the corresponding images and data.
datasets
├── laion
│ ├── laion_annts
│ │ └── laion_shard_{0...1774}_v1.json
│ └── laion_images
│ │ └── {00000..01174}.tar
├── mmc4
│ ├── mmc4_annts
│ │ └── docs_no_face_shard_{0..23098}_v3.jsonl
│ └── mmc4_images
│ │ └── b9040a0dbb22.jpg
├── aokvqa
│ └── aokvqa_v1p0_train.json
├── image2parag
│ ├── paragraphs_coco.json
│ ├── paragraphs_v1.json
│ ├── test_split.json
│ ├── train_split.json
│ └── val_split.json
├── coco
│ ├── train2014
│ ├── train2017
│ ├── val2014
│ ├── val2017
│ └── annotations
│ │ ├── coco_karpathy_train.json
│ │ ├── coco_karpathy_val.json
│ │ ├── captions_train2017.json
│ │ ├── coco_karpathy_val_gt.json
│ │ ├── coco_karpathy_test.json
│ │ ├── instances_train2017.json
│ │ └── coco_karpathy_test_gt.json
├── lncoco
│ ├── coco_train_captions.jsonl
│ └── coco_val_captions.jsonl
├── flickr30k
│ ├── flickr30k-images
│ ├── captiontobbox.json
│ ├── flickr30k_test1k.json
│ ├── phrasetobbox.json
│ └── groundedcaption.json
├── gqa
│ ├── images
│ ├── test_balanced_questions.json
│ ├── train_balanced_questions.json
│ └── testdev_balanced_questions.json
├── robustvqa
│ ├── imagenet-r
│ ├── imagenet-a
│ ├── imagenetv2
│ └── robustvqa_test.json
├── llava
│ └── llava_v1_5_mix665k.json
├── nocaps
│ ├── val_imgs
│ └── nocaps_val_4500_captions.json
├── ocr_vqa
│ ├── images
│ └── dataset.json
├── okvqa
│ ├── OpenEnded_mscoco_train2014_questions.json
│ ├── OpenEnded_mscoco_val2014_questions.json
│ ├── mscoco_train2014_annotations.json
│ └── mscoco_val2014_annotations.json
├── part_data
│ ├── test.json
│ ├── train.json
│ ├── val.json
│ ├── partImagenet
│ │ ├── train
│ │ └── partImagenet_train_format.json
│ └── pascal_part
│ │ ├── VOCdevkit
│ │ └── pascalpart_train.json
├── pope
│ ├── coco_pope_adversarial.json
│ ├── coco_pope_popular.json
│ └── coco_pope_random.json
├── refcoco
│ ├── refcoco+
│ ├── refcoco
│ ├── refcocog
│ ├── finetune_refcoco+_train_with_mask.json
│ ├── finetune_refcoco_train_with_mask.json
│ └── finetune_refcocog_val_with_mask.json
├── textcaps
│ ├── TextCaps_0.1_train.json
│ └── TextCaps_0.1_val.json
├── textvqa
│ ├── train_images
│ ├── TextVQA_0.5.1_train.json
│ ├── textvqa_val_annotations.json
│ ├── TextVQA_0.5.1_val.json
│ └── textvqa_val_questions.json
├── vcr
│ ├── vcr1images
│ ├── test.jsonl
│ ├── train.jsonl
│ ├── val.jsonl
│ └── textvqa_val_questions.json
├── vg
│ ├── VG_100K
│ ├── VG_100k_2
│ ├── region_descriptions.json
│ ├── image_data.json
│ ├── vg_train_with_mask.json
│ └── question_answers.json
├── visdial
│ ├── VisualDialog_val2018
│ ├── visdial_1.0_val_dense_annotations.json
│ └── visdial_1.0_val.json
├── vizwiz
│ ├── val
│ ├── test.json
│ ├── train.json
│ └── val.json
└── vqav2
│ ├── v2_OpenEnded_mscoco_train2014_questions.json
│ ├── v2_OpenEnded_mscoco_val2014_questions.json
│ ├── v2_mscoco_train2014_annotations.json
│ └── v2_mscoco_val2014_annotations.json
We provide two convenient scripts for downloading the large-scale pre-training data, LAION and MMC4. You can download the pre-training data by running the following scripts. Note that, due to network issues, only about 40% of the data can be obtained in the end.
python ./scripts/download_laion.py --mode=annt
python ./scripts/download_mmc4.py --mode=annt
python ./scripts/download_laion.py --mode=images
python ./scripts/download_mmc4.py --mode=images
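Since some URLs will inevitably fail, it is worth checking how many image shards actually landed on disk before starting pre-training. A minimal sketch that counts the downloaded LAION tar shards (the path and expected shard count follow the directory layout above and are assumptions; adjust them to your setup):

```python
from pathlib import Path

# Assumed download location, matching the directory layout above.
laion_images = Path("datasets/laion/laion_images")

# Expected shard count taken from the listing above ({00000..01174}.tar);
# adjust if your shard range differs.
expected = 1175
found = len(list(laion_images.glob("*.tar")))

print(f"Downloaded {found}/{expected} LAION image shards "
      f"({found / expected:.1%} of the expected total)")
```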
You can find all dataset download and conversion scripts, along with related information, in `scripts/`.
Note that after downloading the MMC4 dataset, you need to run the following conversion script to convert it into the pre-training format. Use the version of the script that corresponds to the download method you chose.
python ./scripts/convert_mmc4_for_pretrain.py
You can either use the robustvqa file we provide in `datasets/` or regenerate it yourself with the following script:
python ./scripts/convert_imagenet_for_robustvqa_test.py
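After preparing the datasets, you can sanity-check that the expected layout is in place before launching training. A minimal sketch that checks a handful of the paths from the layout above (the `datasets/` root location is an assumption; extend the list to the datasets you actually use):

```python
from pathlib import Path

# Assumed dataset root; adjust if your data lives elsewhere.
DATA_ROOT = Path("datasets")

# A subset of the paths from the layout above -- extend as needed.
expected = [
    "laion/laion_annts",
    "mmc4/mmc4_annts",
    "coco/annotations/coco_karpathy_train.json",
    "llava/llava_v1_5_mix665k.json",
    "vqav2/v2_OpenEnded_mscoco_train2014_questions.json",
]

for rel in expected:
    path = DATA_ROOT / rel
    status = "ok" if path.exists() else "MISSING"
    print(f"[{status}] {path}")
```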
DEEM training consists of three stages: (1) image-text alignment pre-training; (2) image-text supervised fine-tuning; and (3) mask-text supervised fine-tuning.
DEEM is trained on 32 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly, always keeping the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
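For example, a quick check that a lower-GPU configuration preserves the reference global batch size (the numbers below are purely illustrative, not the values from the released configs):

```python
# Illustrative numbers only -- substitute the values from the released training configs.
reference = {"per_device_train_batch_size": 8, "gradient_accumulation_steps": 1, "num_gpus": 32}
scaled = {"per_device_train_batch_size": 4, "gradient_accumulation_steps": 4, "num_gpus": 16}

def global_batch_size(cfg):
    return (cfg["per_device_train_batch_size"]
            * cfg["gradient_accumulation_steps"]
            * cfg["num_gpus"])

# Both configurations yield the same global batch size of 256.
assert global_batch_size(reference) == global_batch_size(scaled) == 256
print("Global batch size:", global_batch_size(scaled))
```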
Please make sure you download and organize the data following the Preparation section above before training and evaluation.
bash scripts/train.sh
We perform evaluation on several image-based benchmarks. Please see our paper for more details.
If you want to evaluate the model on image-based benchmarks, please use the evaluation scripts for automatic evaluation.
bash scripts/evaluate.sh
For GQA, you need to run the following commands separately for evaluation:
unzip -d ./uni_interleaved/utils/gqa_metrics_src/ ./uni_interleaved/utils/gqa_metrics_src/train_choices.zip
python ./uni_interleaved/utils/gqa_eval.py
We provide some examples in this section. More examples can be found in our paper.
- Release training & evaluation code
- Release stage 1 image-text alignment pre-training model weights
- Release stage 2 image-text SFT model weights
- Release stage 3 mask-text SFT model weights
If you find this repo useful for your research, please consider citing the paper:
@article{luo2024deem,
title={Deem: Diffusion models serve as the eyes of large language models for image perception},
author={Luo, Run and Li, Yunshui and Chen, Longze and He, Wanwei and Lin, Ting-En and Liu, Ziqiang and Zhang, Lei and Song, Zikai and Xia, Xiaobo and Liu, Tongliang and others},
journal={arXiv preprint arXiv:2405.15232},
year={2024}
}
We would like to thank the following repos for their great work:
- This work is built upon MM-Interleaved.
- This work utilizes LLMs from Vicuna.
- This work utilizes the great work from OpenFlamingo, transformers, diffusers, LLaMA, CLIP, BLIP, ViT-Adapter and Osprey.
This project is released under the Apache 2.0 license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.