Xin Xiao1,2*, Bohong Wu2*, Jiacong Wang2,3, Chunyuan Li2, Xun Zhou2, Haoyuan Guo2
1School of Computer Science, Wuhan University, 2ByteDance Inc
3School of Artificial Intelligence, University of Chinese Academy of Sciences
Abstract: Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by over-emphasizing text tokens that are less correlated with, or even contradictory to, the input images. In this paper, we advocate assigning distinct contributions to each text token based on its visual correlation. Specifically, we show that by contrasting image inputs, the difference in prediction logits on each text token provides strong guidance on its visual correlation. We therefore introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes training visually correlated tokens. Our experimental results demonstrate that CAL consistently improves different types of VLMs across different resolutions and model sizes on various benchmark datasets. Importantly, our method incurs minimal additional computational overhead, rendering it highly efficient compared to alternative data scaling strategies.
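To make the re-weighting idea concrete, below is a minimal PyTorch sketch of the mechanism described in the abstract, not the repository's implementation: logits obtained with and without the image are contrasted, and the log-probability gap on each ground-truth token is used as its loss weight. The function name, the clipping at zero, and the omission of the usual next-token shift are assumptions made for brevity; the paper defines the exact contrast and weighting.

```python
import torch
import torch.nn.functional as F

def cal_token_weighted_loss(logits_with_image, logits_without_image, labels,
                            ignore_index=-100):
    """Illustrative sketch of CAL-style contrastive token re-weighting.

    logits_*: [batch, seq_len, vocab] from the same VLM, conditioned on the
    image and on a contrasted (e.g. removed) image.
    labels: [batch, seq_len] text token ids, with ignore_index on positions
    that should not contribute to the loss.
    """
    mask = labels.ne(ignore_index)
    safe_labels = labels.clamp(min=0)  # placeholder index for ignored positions

    # Log-probability of each ground-truth token under both conditions.
    logp_img = torch.log_softmax(logits_with_image, dim=-1)
    logp_noimg = torch.log_softmax(logits_without_image, dim=-1)
    tok_img = logp_img.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    tok_noimg = logp_noimg.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)

    # Tokens that benefit from seeing the image get larger weights; clipping
    # the gap at zero (down-weighting contradictory tokens to zero) is an
    # assumption made for this sketch.
    weights = (tok_img - tok_noimg).clamp(min=0.0).detach()

    per_token_ce = F.cross_entropy(
        logits_with_image.transpose(1, 2),  # [batch, vocab, seq_len]
        labels, ignore_index=ignore_index, reduction="none",
    )
    return (weights * per_token_ce * mask).sum() / mask.sum().clamp(min=1)
```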
- 2024.09: 🔥 CAL is accepted by NeurIPS 2024.
- 2024.06: The code is released.
We provide a results comparison for LLaVA-NeXT here.
Method | LLM | OCRB. | VQADoc | VQAChart | VQAText | SQA | MMS. | MMT. | Win/All |
---|---|---|---|---|---|---|---|---|---|
LLaVA-NeXT | Vicuna-7B | 542 | 75.1 | 62.2 | 64.2 | 68.5 | 33.7 | 49.5 | |
LLaVA-NeXT+CAL | Vicuna-7B | 561 | 77.3 | 64.3 | 65.0 | 70.1 | 35.5 | 50.7 | 7 / 7 |
LLaVA-NeXT | Vicuna-13B | 553 | 78.4 | 63.8 | 67.0 | 71.8 | 37.5 | 50.4 | |
LLaVA-NeXT+CAL | Vicuna-13B | 574 | 80.1 | 67.2 | 67.1 | 71.5 | 38.1 | 52.4 | 6 / 7 |
Method | LLM | COCO Caption | TextCaps | Refcocog_val | Refcocog_test | Win/All |
---|---|---|---|---|---|---|
LLaVA-NeXT | Vicuna-7B | 112.0 | 115.0 | 77.6 | 77.5 | |
LLaVA-NeXT+CAL | Vicuna-7B | 114.7 | 124.7 | 78.4 | 78.1 | 4 / 4 |
LLaVA-NeXT | Vicuna-13B | 118.5 | 118.2 | 79.8 | 79.6 | |
LLaVA-NeXT+CAL | Vicuna-13B | 120.6 | 124.4 | 80.4 | 80.3 | 4 / 4 |
conda create -n CAL python=3.10 -y
conda activate CAL
bash install.sh
Please follow the instructions in LLaVA to prepare the data.
For customized data preparation, please refer to this guide.
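If it helps to see the shape of the data, here is a small Python sketch of one training sample in the LLaVA-style conversation format that these JSON files typically follow. The field names and the <image> placeholder come from the common LLaVA convention; the linked guide remains authoritative for this repo, and the file paths below are made up.

```python
import json

# One LLaVA-style training sample (illustrative values).
sample = {
    "id": "000000001",
    "image": "coco/train2017/000000001.jpg",  # relative to *_imagedir
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this image?"},
        {"from": "gpt", "value": "A dog running on the beach."},
    ],
}

# The *_json settings point to a file containing a list of such samples.
with open("custom_finetune.json", "w") as f:
    json.dump([sample], f, indent=2)
```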
You can execute the demo bash scripts in this directory to train LLaVA models.
Before training, you need to customize some settings in the following table. Otherwise, the code will use the default paths specified in run.sh. When using multiple data sources, simply concatenate their paths with a space.
Setting | Usage |
---|---|
base_dir | Root directory for saving outputs |
exp_name | Experiment name, associated with the saving path |
pretrain_json | Pretrain JSON data |
pretrain_imagedir | Pretrain data image directory |
finetune_json | Finetune JSON data |
finetune_imagedir | Finetune data image directory |
For developers who cannot access Hugging Face directly, you can download the checkpoint files (modelscope) and change the LLM path setting to the local absolute path.
For example, to train LLaVA-NeXT with the Vicuna-7B LLM, run:
bash run_scripts/llava16_7b.sh
Note: The code dynamically calculates the per-GPU batch size from total_batchsize, grad_acumsteps, and the number of GPUs. If your resources are limited, you can reduce total_batchsize or increase grad_acumsteps in the settings.
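As a concrete illustration of that relationship (plain arithmetic, not the repo's code; the numbers are made up, and the variable names follow the settings listed above):

```python
# How a per-GPU micro-batch size can be derived from the run.sh settings.
total_batchsize = 128   # global batch size per optimizer step
grad_acumsteps = 2      # gradient accumulation steps
num_gpus = 8            # total GPUs across all nodes

assert total_batchsize % (grad_acumsteps * num_gpus) == 0, \
    "total_batchsize must be divisible by grad_acumsteps * num_gpus"
per_gpu_batchsize = total_batchsize // (grad_acumsteps * num_gpus)
print(per_gpu_batchsize)  # -> 8
```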
For multinode training, you need to prepare a hostfile. We provide an example here. Customize it based on your environment.
We evaluate our model using lmms-eval. This tool is quick and easy to use, requiring no dataset preparation. For more details, please refer to the lmms-eval repository.
The core modifications primarily involve three files:
- llava/constants.py
- llava/model/llava_arch.py
- llava/model/language_model/llava_llama.py
You can run CAL by modifying the corresponding files in a LLaVA-style codebase, such as MGM.
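For orientation, the sketch below shows, in hypothetical Python, where such a change typically lands in a LLaVA-style forward pass: the language model is run once with and once without the projected image features, and the two logit tensors feed a re-weighted loss like the one sketched earlier. The forward_with_cal name and the image_features keyword are illustrative only, not real LLaVA or MGM APIs.

```python
import torch

def forward_with_cal(model, input_ids, labels, image_features):
    # Pass 1: standard multimodal forward, with projected image features
    # spliced into the input sequence (hypothetical keyword argument).
    logits_img = model(input_ids=input_ids, image_features=image_features).logits

    # Pass 2: the same text with the visual evidence contrasted away; no
    # gradients are needed for the reference pass.
    with torch.no_grad():
        logits_noimg = model(input_ids=input_ids, image_features=None).logits

    # Re-weighted objective, e.g. cal_token_weighted_loss from the earlier sketch.
    return cal_token_weighted_loss(logits_img, logits_noimg, labels)
```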
Thanks a lot to LLaVA, MGM, and lmms-eval for their great work.
@misc{xiao2024seeing,
title={Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment},
author={Xin Xiao and Bohong Wu and Jiacong Wang and Chunyuan Li and Xun Zhou and Haoyuan Guo},
publisher={NeurIPS},
year={2024},
}