👁️ Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

Xin Xiao<sup>1,2*</sup>, Bohong Wu<sup>2*</sup>, Jiacong Wang<sup>2,3</sup>, Chunyuan Li<sup>2</sup>, Xun Zhou<sup>2</sup>, Haoyuan Guo<sup>2</sup>

<sup>1</sup>School of Computer Science, Wuhan University, <sup>2</sup>ByteDance Inc

<sup>3</sup>School of Artificial Intelligence, University of Chinese Academy of Sciences

Abstract: Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by over-emphasizing text tokens that are less correlated with, or even contradictory to, the input images. In this paper, we advocate assigning a distinct contribution to each text token based on its visual correlation. Specifically, we show that by contrasting image inputs, the difference in prediction logits on each text token provides strong guidance on its visual correlation. We therefore introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes training visually correlated tokens. Our experimental results demonstrate that CAL consistently improves different types of VLMs across different resolutions and model sizes on various benchmark datasets. Importantly, our method incurs minimal additional computational overhead, rendering it highly efficient compared to alternative data scaling strategies.
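In essence, CAL runs an extra contrast forward pass without the image, measures how much the image shifts the logit of each ground-truth token, and uses that shift to weight the per-token language-modeling loss. The snippet below is a minimal PyTorch sketch of this idea, not the repository's actual implementation: the `model(input_ids, images=...)` signature and the mapping from logit differences to weights are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def cal_weighted_loss(model, input_ids, labels, images):
    """Sketch: re-weight each text token's LM loss by how much the image
    changes its prediction logit (the CAL idea, simplified)."""
    # Regular training forward pass, conditioned on the image (assumed signature).
    logits = model(input_ids, images=images).logits            # (B, T, V)

    # Contrast pass without the image; no gradients are needed here.
    with torch.no_grad():
        logits_noimg = model(input_ids, images=None).logits    # (B, T, V)

    # Standard causal shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_noimg = logits_noimg[:, :-1, :]
    shift_labels = labels[:, 1:]

    # Logit difference on the ground-truth token gauges visual correlation.
    idx = shift_labels.clamp(min=0).unsqueeze(-1)               # map -100 to 0; masked below
    diff = (shift_logits.gather(-1, idx) - shift_noimg.gather(-1, idx)).squeeze(-1)

    # One simple mapping from differences to non-negative weights
    # (hypothetical; the paper defines its own re-weighting scheme).
    weights = diff.detach().clamp(min=0.0) + 1.0

    # Per-token cross entropy, then a CAL-weighted average over valid tokens.
    token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
        ignore_index=-100,
    ).view(shift_labels.shape)
    mask = (shift_labels != -100).float()
    return (weights * token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```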

News and Updates

  • 2024.09 🔥 CAL is accepted by NeurIPS 2024.
  • 2024.06 The code is released.

Results

We provide a results comparison for LLaVA-NeXT below.

| Method | LLM | OCRB. | VQA<sup>Doc</sup> | VQA<sup>Chart</sup> | VQA<sup>Text</sup> | SQA | MMS. | MMT. | Win/All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-NeXT | Vicuna-7B | 542 | 75.1 | 62.2 | 64.2 | 68.5 | 33.7 | 49.5 | |
| LLaVA-NeXT+CAL | Vicuna-7B | 561 | 77.3 | 64.3 | 65.0 | 70.1 | 35.5 | 50.7 | 7 / 7 |
| LLaVA-NeXT | Vicuna-13B | 553 | 78.4 | 63.8 | 67.0 | 71.8 | 37.5 | 50.4 | |
| LLaVA-NeXT+CAL | Vicuna-13B | 574 | 80.1 | 67.2 | 67.1 | 71.5 | 38.1 | 52.4 | 6 / 7 |

| Method | LLM | COCO Caption | TextCaps | RefCOCOg_val | RefCOCOg_test | Win/All |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA-NeXT | Vicuna-7B | 112.0 | 115.0 | 77.6 | 77.5 | |
| LLaVA-NeXT+CAL | Vicuna-7B | 114.7 | 124.7 | 78.4 | 78.1 | 4 / 4 |
| LLaVA-NeXT | Vicuna-13B | 118.5 | 118.2 | 79.8 | 79.6 | |
| LLaVA-NeXT+CAL | Vicuna-13B | 120.6 | 124.4 | 80.4 | 80.3 | 4 / 4 |

Install

```bash
conda create -n CAL python=3.10 -y
conda activate CAL
bash install.sh
```

Dataset

Please follow the instructions in LLaVA to prepare the data.

For customized data preparation, please refer to this guide.
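If you build your own data, note that LLaVA-style training JSON is a list of samples, each pairing an image path (relative to the image directory) with a conversation. The sketch below writes one made-up sample using the standard LLaVA field names; adapt paths and contents to your own data.

```python
import json

# One LLaVA-style sample: an image path plus a multi-turn conversation.
# The values here are illustrative only.
sample = {
    "id": "000000001",
    "image": "coco/train2017/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
        {"from": "gpt", "value": "A dog playing with a ball on the grass."},
    ],
}

# Write a training JSON containing this single sample.
with open("custom_finetune.json", "w") as f:
    json.dump([sample], f, indent=2)
```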

Training

You can execute the demo bash scripts in this directory to train LLaVA models.

1. Customize base settings

Before training, you need to customize some settings in the following table. Otherwise, the code will use the default paths specified in run.sh. When using multiple data sources, simply concatenate their paths with a space.

| Setting | Usage |
| --- | --- |
| base_dir | Root directory of the saving path |
| exp_name | Experiment name, associated with the saving path |
| pretrain_json | Pretraining JSON data |
| pretrain_imagedir | Image directory for the pretraining data |
| finetune_json | Finetuning JSON data |
| finetune_imagedir | Image directory for the finetuning data |

If you cannot access Hugging Face directly, download the checkpoint files from ModelScope and set the LLM path to their absolute local path.

2. Running

For example, to train LLaVA-NeXT with the Vicuna-7B LLM, run:

```bash
bash run_scripts/llava16_7b.sh
```

Note: The code dynamically calculates the batch size for each GPU from total_batchsize, grad_acumsteps, and the number of GPUs. If your resources are limited, reduce total_batchsize or increase grad_acumsteps in the settings.

$\text{batchsize}_{\text{singleGPU}} = \text{batchsize}_{\text{total}} / (\text{grad\_acumsteps} \times \text{GPU}_{\text{num}})$
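For instance, with the hypothetical values below, each GPU processes 8 samples per forward pass:

```python
# Hypothetical numbers, only to illustrate how the per-GPU batch size is derived.
total_batchsize = 128
grad_acumsteps = 2
gpu_num = 8

batchsize_per_gpu = total_batchsize // (grad_acumsteps * gpu_num)
print(batchsize_per_gpu)  # 8
```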

For multinode training, you need to prepare a hostfile. We provide an example here. Customize it based on your environment.

Evaluation

We evaluate our model using lmms-eval. This tool is quick and easy to use, requiring no dataset preparation. For more details, please refer to the lmms-eval repository.

Customization

The core modifications primarily involve three files:

  1. llava/constants.py
  2. llava/model/llava_arch.py
  3. llava/model/language_model/llava_llama.py

You can apply CAL to other LLaVA-style codebases, such as MGM, by modifying the corresponding files.

Acknowledgement

  • LLaVA: the codebase we built upon.
  • lmms-eval: the toolkit we use to evaluate our model.

Thanks a lot for their great work.

Citation

```bibtex
@misc{xiao2024seeing,
      title={Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment},
      author={Xin Xiao and Bohong Wu and Jiacong Wang and Chunyuan Li and Xun Zhou and Haoyuan Guo},
      publisher={NeurIPS},
      year={2024},
}
```
