Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning with language and image instructions, ignoring the critical pretraining phase in which models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs that enhances the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens. We then treat visual tokens as analogous to a "foreign language" for the LLM and propose a mixed attention mechanism, with bidirectional visual attention and unidirectional textual attention, to comprehensively enhance the understanding of visual tokens. In addition, we integrate a detailed caption generation task, leveraging rich descriptions to further help the LLM understand visual semantic information. After pretraining on 1.5 million publicly accessible data samples, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance on a broad set of vision-language benchmarks. To support reproducibility and facilitate further research, we will release the training code and pre-trained model weights.
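The mixed attention idea described above can be illustrated with a small mask-building sketch. This is not taken from the Croc codebase: the function name `mixed_attention_mask`, the token layout, and the exact rule (visual tokens attend to every other visual token, all tokens attend causally to earlier positions) are assumptions chosen to match the description of bidirectional visual attention and unidirectional textual attention.

```python
def mixed_attention_mask(is_visual):
    """Build a boolean attention mask (hypothetical helper, not the Croc API).

    is_visual: list of bools, True where the token at that position is visual.
    Returns mask such that mask[i][j] is True when query position i
    may attend to key position j.
    """
    n = len(is_visual)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Unidirectional (causal) rule: every token sees itself and the past.
            causal = j <= i
            # Bidirectional rule: a visual token also sees later visual tokens.
            both_visual = is_visual[i] and is_visual[j]
            mask[i][j] = causal or both_visual
    return mask


# Assumed layout: 2 text tokens, 3 visual tokens, 2 text tokens.
layout = [False, False, True, True, True, False, False]
mask = mixed_attention_mask(layout)

# Print the mask: "1" means attention is allowed, "." means it is blocked.
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In the printed mask, the three visual rows can see all three visual columns, while the text rows remain lower-triangular. In a real model, a mask like this would typically be converted to a tensor and passed to the attention layers.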
[2024/10/21] The paper and code are released!💥
- Better model based on Croc
- Checkpoints of Croc-13B
- Checkpoints of Croc-7B
- Training code for Croc
| Name | LLM | Checkpoint | MMBench | MMBench-CN | SEED | MM-Vet | SQA-image | VQA-v2 | POPE | GQA | LLaVA-W |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Croc-7B | Vicuna-7B | Croc-7B | 69.1 | 60.5 | 63.0 | 36.8 | 72.3 | 80.1 | 86.9 | 63.5 | 73.3 |
git clone https://github.com/deepglint/Croc.git
cd Croc
conda create -n croc python=3.10 -y
conda activate croc
pip install --upgrade pip
pip install -e .
pip install flash-attn --no-build-isolation
Stage 1: Pretraining MLP
bash scripts/pretrain_mlp.sh
Stage 1.5: Pretraining Croc
bash scripts/pretrain_croc.sh
Stage 2: Instructional Finetuning
bash scripts/finetune.sh
@misc{xie2024crocpretraininglargemultimodal,
title={Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension},
author={Yin Xie and Kaicheng Yang and Ninghua Yang and Weimo Deng and Xiangzi Dai and Tiancheng Gu and Yumeng Wang and Xiang An and Yongle Zhao and Ziyong Feng and Jiankang Deng},
year={2024},
eprint={2410.14332},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.14332},
}
We extend our deepest gratitude to the creators and contributors of the following projects:
- LLaVA: The comprehensive codebase for training Vision-Language Models (VLMs).
Their exceptional work has been instrumental to our research and development efforts.