
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension

arXiv | Hugging Face


Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning with language and image instructions, overlooking the critical pretraining phase in which models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs that enhances the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens. We then treat visual tokens as analogous to a "foreign language" for the LLM and propose a mixed attention mechanism, with bidirectional attention over visual tokens and unidirectional attention over text, to comprehensively enhance the understanding of visual tokens. Meanwhile, we integrate a detailed caption generation task, leveraging rich descriptions to further help the LLM understand visual semantic information. After pretraining on 1.5 million publicly accessible samples, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance on a wide range of vision-language benchmarks. To support reproducibility and facilitate further research, we will release the training code and pre-trained model weights.
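The cross-modal comprehension stage combines two ideas: Hungarian matching between visual tokens and a learnable prompt pool, and a mixed attention mask that is bidirectional over visual tokens but causal over text. The sketch below is a minimal illustration of both, not the released implementation; the function names, the cosine matching cost, the 30% replacement ratio, and the [visual tokens, text tokens] sequence layout are all assumptions made for the example.

```python
# Minimal sketch of the two ideas above; NOT the official Croc code.
# Assumptions: cosine-distance matching cost, a fixed replacement ratio,
# and a sequence laid out as [visual tokens, text tokens].
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def replace_with_prompt_tokens(visual_tokens, prompt_pool, replace_ratio=0.3):
    """Swap a fraction of visual tokens for their best-matching pool entries.

    visual_tokens: (N, d) patch embeddings from the vision encoder.
    prompt_pool:   (M, d) learnable prompt token pool, M >= N assumed.
    """
    n = visual_tokens.size(0)
    n_replace = int(n * replace_ratio)

    # Cost matrix for the Hungarian algorithm: 1 - cosine similarity.
    v = F.normalize(visual_tokens, dim=-1)
    p = F.normalize(prompt_pool, dim=-1)
    cost = (1.0 - v @ p.T).detach().cpu().numpy()   # (N, M)
    rows, cols = linear_sum_assignment(cost)        # optimal one-to-one matching

    # Replace only the lowest-cost pairs, i.e. the most relevant matches.
    keep = cost[rows, cols].argsort()[:n_replace]
    rows_t = torch.from_numpy(rows[keep])
    cols_t = torch.from_numpy(cols[keep])

    out = visual_tokens.clone()
    out[rows_t] = prompt_pool[cols_t]
    return out


def build_mixed_attention_mask(num_visual, num_text):
    """Bidirectional attention inside the visual block, causal attention for text.

    Returns an (L, L) boolean mask where True means "query may attend to key".
    """
    L = num_visual + num_text
    mask = torch.ones(L, L).tril().bool()       # causal baseline for the whole sequence
    mask[:num_visual, :num_visual] = True       # visual tokens attend to each other fully
    return mask
```

For example, `build_mixed_attention_mask(num_visual=576, num_text=64)` (576 being a typical LLaVA-style visual token count) yields a mask in which the visual positions attend to one another in both directions while each text position attends only to earlier positions, mirroring the mixed attention described above.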

📜 News

[2024/10/21] The paper and code are released!💥

👨‍💻 Todo

  • Better models based on Croc
  • Checkpoints of Croc-13B
  • Checkpoints of Croc-7B
  • Training code for Croc

🤖 Model Zoo

| Name | LLM | Checkpoint | MMBench | MMBench-CN | SEED | MM-Vet | SQA-image | VQA-v2 | POPE | GQA | LLaVA-W |
|------|-----|------------|---------|------------|------|--------|-----------|--------|------|-----|---------|
| Croc-7B | Vicuna-7B | Croc-7B | 69.1 | 60.5 | 63.0 | 36.8 | 72.3 | 80.1 | 86.9 | 63.5 | 73.3 |

Install

git clone https://github.com/deepglint/Croc.git
cd Croc
conda create -n croc python=3.10 -y
conda activate croc

pip install --upgrade pip
pip install -e .
pip install flash-attn --no-build-isolation

Training

Stage 1: Pretraining MLP

bash scripts/pretrain_mlp.sh

Stage 1.5: Pretraining Croc

bash scripts/pretrain_croc.sh

Stage 2: Instructional Finetuning

bash scripts/finetune.sh

Citation

@misc{xie2024crocpretraininglargemultimodal,
      title={Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension}, 
      author={Yin Xie and Kaicheng Yang and Ninghua Yang and Weimo Deng and Xiangzi Dai and Tiancheng Gu and Yumeng Wang and Xiang An and Yongle Zhao and Ziyong Feng and Jiankang Deng},
      year={2024},
      eprint={2410.14332},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.14332}, 
}

Acknowledgement

We extend our deepest gratitude to the creators and contributors of the following projects:

  1. LLaVA: The comprehensive codebase for training Vision-Language Models (VLMs).

Their exceptional work has been instrumental to our research and development efforts.
