Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

Extending the functionality of MLLMs by integrating an additional region-level vision encoder.

Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Install

Clone this repository and navigate to the PVIT folder

git clone https://github.com/THUNLP-MT/PVIT.git 
cd PVIT

Install Package

conda create -n pvit python=3.9.6
conda activate pvit
pip install -r requirements.txt

Install RegionCLIP

git clone https://github.com/microsoft/RegionCLIP.git
pip install -e RegionCLIP

See the RegionCLIP repository for more details.

PVIT Weights

To get the PVIT weights, first download the LLaMA and RegionCLIP weights. For RegionCLIP, download regionclip_pretrained-cc_rn50x4.pth.

Click here for PVIT checkpoints. Put all the weights in the model_weights folder, then merge the PVIT weights with the LLaMA weights using the following command.

BASE_MODEL=model_weights/llama-7b TARGET_MODEL=model_weights/pvit DELTA=model_weights/pvit-delta ./scripts/delta_apply.sh
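As a rough illustration of what a delta merge does, the sketch below assumes the delta checkpoint stores the element-wise difference between the fine-tuned and base weights, which is a common convention for LLaMA-based releases. This is an assumption, not a description of scripts/delta_apply.sh; the function name apply_delta and the toy parameter names are hypothetical, and plain Python lists stand in for tensors.

```python
def apply_delta(base, delta):
    """Return target weights as base + delta, parameter by parameter.

    Assumption: parameters present only in `delta` (for example, newly
    added modules) are copied unchanged.
    """
    target = {}
    for name, delta_values in delta.items():
        if name in base:
            base_values = base[name]
            if len(base_values) != len(delta_values):
                raise ValueError(f"shape mismatch for {name}")
            target[name] = [b + d for b, d in zip(base_values, delta_values)]
        else:
            target[name] = list(delta_values)
    return target


# Toy example with hypothetical parameter names.
base = {"layer.weight": [1.0, 2.0]}
delta = {"layer.weight": [0.5, -0.5], "new_module.weight": [0.25]}
print(apply_delta(base, delta))
```

Use the provided script for the real merge; this sketch only shows the arithmetic behind the idea.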

Data Generation

We provide prompts and few-shot examples used when querying ChatGPT in both task-specific instruction data generation and general instruction data generation (Figure 3 (b) and Figure 3 (c) in our paper).

The data_generation/task-specific folder includes seeds, prompts and examples for single-turn and multi-turn conversation generation. Single-turn conversation generation covers five types of tasks: small object recognition, object relationship-based reasoning, optical character recognition (OCR), object attribute-based reasoning, and same-category object discrimination.

The data_generation/general folder includes seeds, prompts and examples used in general instruction data generation.

Demo

To run our demo, you need the PVIT checkpoints available locally. Follow the instructions in the PVIT Weights section to download and merge the checkpoints.

Web Server

To run the demo, please first launch a web server with the following command.

MODEL_PATH=model_weights/pvit CONTROLLER_PORT=39996 WORKER_PORT=40004 ./scripts/model_up.sh

Streamlit Web UI

Run the following command to start a Streamlit demo locally. The port in MODEL_ADDR must match WORKER_PORT.

MODEL_ADDR=http://0.0.0.0:40004 ./scripts/run_demo.sh
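A mismatch between the port in MODEL_ADDR and WORKER_PORT is easy to introduce when editing the commands by hand. The following is a minimal sketch of a pre-launch check; the helper name port_matches is hypothetical and not part of this repository.

```python
from urllib.parse import urlparse


def port_matches(model_addr: str, worker_port: int) -> bool:
    """Return True if the port in model_addr equals worker_port."""
    return urlparse(model_addr).port == worker_port


# Values taken from the commands in this README.
print(port_matches("http://0.0.0.0:40004", 40004))  # True
print(port_matches("http://0.0.0.0:39996", 40004))  # False
```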

CLI Inference

Run the following command to run CLI inference locally. The port in MODEL_ADDR must match WORKER_PORT.

MODEL_ADDR=http://0.0.0.0:40004 ./scripts/run_cli.sh

Data

You can download the stage 1 and stage 2 training data from Hugging Face. You also need to download the images of the COCO2017 Train, SBU Captioned Photo, Visual Genome, GQA and Visual Commonsense Reasoning datasets. Put the stage 1 and stage 2 data (data/stage1 and data/stage2) and the downloaded images in the data folder. To change where the downloaded images are read from, modify image_paths in data/stage1/mapping.yaml and data/stage2/mapping.yaml.

Train

Our model is trained in two stages. In stage 1, we initialize the model with the pre-trained LLaVA and train only the linear projection layer that transforms the region features. In stage 2, we keep only the image encoder and the region encoder frozen and fine-tune the rest of the model.

To train PVIT, download the pretrained LLaVA checkpoints and put them in the model_weights folder.
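The two-stage schedule can be summarized as a rule for which parameters receive gradients. The sketch below is illustrative only: the module-name prefixes (region_projector., image_encoder., region_encoder.) and the function is_trainable are hypothetical and may not match the names used in the PVIT code.

```python
# Hypothetical parameter-name prefixes; the real names may differ.
REGION_PROJECTION = "region_projector."
FROZEN_IN_STAGE2 = ("image_encoder.", "region_encoder.")


def is_trainable(param_name: str, stage: int) -> bool:
    """Decide whether a parameter is updated in the given training stage."""
    if stage == 1:
        # Stage 1: only the region-feature projection layer is trained.
        return param_name.startswith(REGION_PROJECTION)
    if stage == 2:
        # Stage 2: everything except the two vision encoders is trained.
        return not param_name.startswith(FROZEN_IN_STAGE2)
    raise ValueError(f"unknown stage: {stage}")


names = [
    "image_encoder.conv1.weight",
    "region_encoder.layer1.weight",
    "region_projector.weight",
    "language_model.layers.0.weight",
]
for stage in (1, 2):
    print(stage, [n for n in names if is_trainable(n, stage)])
```

With a PyTorch model, such a rule would typically be applied by setting param.requires_grad for each (name, param) pair from model.named_parameters().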

The following commands are for stage 1 training.

export MODEL_PATH="model_weights/llava-lightning-7b-v1"
export REGION_CLIP_PATH="model_weights/regionclip_pretrained-cc_rn50x4.pth"
export DATA_PATH="data/stage1"
export OUTPUT_DIR="checkpoints/stage1_ckpt"
export PORT=25001
./scripts/train_stage1.sh

The following commands are for stage 2 training.

export MODEL_PATH="checkpoints/stage1_ckpt"
export REGION_CLIP_PATH="model_weights/regionclip_pretrained-cc_rn50x4.pth"
export DATA_PATH="data/stage2"
export OUTPUT_DIR="checkpoints/stage2_ckpt"
export PORT=25001
./scripts/train_stage2.sh

Evaluation

We propose the FineEval dataset for human evaluation. See the fine_eval folder for the dataset and model outputs. The files in the folder are as follows.

images: Image files of the FineEval dataset.
instructions.jsonl: Questions of the FineEval dataset.
pvit.jsonl: Results of the PVIT model (ours).
llava.jsonl: Results of the LLaVA model.
shikra.jsonl: Results of the Shikra model.
gpt4roi.jsonl: Results of the GPT4RoI model.
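The .jsonl files listed above can be inspected with a few lines of Python. The sketch below assumes only that each non-empty line is one JSON value, which is what the .jsonl extension conventionally means; it makes no assumption about field names, and the helper read_jsonl is hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of the file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    path = Path("fine_eval/instructions.jsonl")
    if path.exists():
        records = list(read_jsonl(path))
        print(len(records), "records")
        # Show the field names of the first record without assuming a schema.
        if records and isinstance(records[0], dict):
            print(sorted(records[0]))
```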

To run PVIT on the FineEval dataset, launch a web server and run the following command. The port in MODEL_ADDR must match WORKER_PORT.

MODEL_ADDR=http://0.0.0.0:40004 ./scripts/run_fine_eval.sh

Citation

If you find PVIT useful for your research and applications, please cite using this BibTeX:

@misc{chen2023positionenhanced,
      title={Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models}, 
      author={Chi Chen and Ruoyu Qin and Fuwen Luo and Xiaoyue Mi and Peng Li and Maosong Sun and Yang Liu},
      year={2023},
      eprint={2308.13437},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

LLaVA: the codebase we built upon, which has amazing multimodal capabilities!
Vicuna: the codebase LLaVA is built upon, and the base model Vicuna-13B, which has amazing language capabilities!
RegionCLIP: Our prompt encoder.
