DreamLIP: Language-Image Pre-training with Long Captions
Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, Yujun Shen
Project Page | Paper | Data
- [2024/11/26] Long captions (LLaVA-1.5, InstructBLIP, and ShareGPT4V) of COYO24M/LAION49M are released on Hugging Face!
- [2024/08/26] Long captions (LLaVA-1.5, InstructBLIP, and ShareGPT4V) of CC3M/CC12M/YFCC15M are released on Hugging Face!
- [2024/07/16] Uploaded the pretrained weights of ViT-B/16 trained on CC3M, CC12M, YFCC15M, and merged-30M (with long captions from ShareGPT4V)!
- [2024/07/08] DreamLIP is accepted to ECCV 2024!
- 🔥 Exploring how language-image pre-training could benefit from long captions.
- 🔥 Strong improvements on semantic segmentation, image-text retrieval, and image understanding in MLLMs.
- 🔥 Trained with 30M image-text pairs, DreamLIP performs on par with or even better than CLIP trained with 400M pairs.
- Release long captions of CC3M, CC12M, YFCC15M, COYO24M and LAION49M.
- Release training code.
| Dataset | Hugging Face Dataset |
|---|---|
| CC3M | Raw/Long/Short Caption |
| CC12M | Raw/Long/Short Caption |
| YFCC15M | Raw/Long/Short Caption |
| LAION49M | Long Caption |
| COYO24M | Long Caption |
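The released captions can be loaded with the Hugging Face `datasets` library. Below is a minimal sketch: the repository ID is a placeholder (substitute the actual dataset link from the table above), and the available fields may differ between datasets.

```python
# Minimal sketch of loading the released long captions with the Hugging Face
# `datasets` library. The repository ID below is a placeholder -- replace it
# with the actual dataset from the table above.
from datasets import load_dataset

# hypothetical repo ID; substitute the real one from the Hugging Face link
ds = load_dataset("path/to/dreamlip-cc3m-long-captions", split="train")

sample = ds[0]
print(sample.keys())  # inspect available fields (e.g., raw/long/short captions)
```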
| Dataset | Model | ShareGPT4V | InstructBLIP + LLaVA-1.5 + ShareGPT4V |
|---|---|---|---|
| CC3M | ViT-B/16 | Link | TODO |
| CC12M | ViT-B/16 | Link | TODO |
| YFCC15M | ViT-B/16 | Link | TODO |
| CC30M | ViT-B/16 | Link | TODO |
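Since this project is built on open_clip, the released ViT-B/16 weights can in principle be loaded through its standard interface. The sketch below uses a hypothetical local checkpoint path and assumes the released weights are directly compatible with open_clip's `create_model_and_transforms` loader; adapt as needed for the actual checkpoint format.

```python
# Minimal sketch of loading a released ViT-B/16 checkpoint via open_clip.
# The checkpoint path is a placeholder, and direct compatibility with the
# open_clip loader is an assumption.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16",
    pretrained="checkpoints/dreamlip_vitb16_cc30m.pt",  # hypothetical path
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()
```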
Environment installation
pip install -r requirements.txt
Evaluate zero-shot classification
bash eval_zs.sh
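For reference, the sketch below shows the kind of zero-shot classification that `eval_zs.sh` automates, using plain open_clip primitives. The checkpoint path, class names, and image file are illustrative placeholders, not part of the released evaluation pipeline.

```python
# Minimal zero-shot classification sketch with open_clip: encode class-name
# prompts and an image, then compare their normalized embeddings.
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16",
    pretrained="checkpoints/dreamlip_vitb16_cc30m.pt",  # hypothetical path
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

class_names = ["dog", "cat", "car"]                          # placeholder classes
texts = tokenizer([f"a photo of a {c}" for c in class_names])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))  # per-class probabilities
```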
The project is released under a standard Creative Commons CC-BY-4.0 license.
We open-source this library to facilitate research in the community. If you find our work useful and use this codebase in your projects, please cite it as follows.
@inproceedings{DreamLIP,
title={DreamLIP: Language-Image Pre-training with Long Captions},
author={Zheng, Kecheng and Zhang, Yifei and Wu, Wei and Lu, Fan and Ma, Shuailei and Jin, Xin and Chen, Wei and Shen, Yujun},
booktitle={ECCV},
year={2024}
}
This project is built on open_clip; thanks for the nice work! We also thank InstructBLIP, ShareGPT4V, and LLaVA for their pretrained models and code.