Pink: Unveiling The Power of Referential Comprehension for Multi-modal LLMs.

[arXiv][Paper]

Pink: Unveiling The Power of Referential Comprehension for Multi-modal LLMs
Shiyu Xuan, Qingpei Guo, Ming Yang, Shiliang Zhang
CVPR 2024

Pink Weights

Base: Pink_Base
Base_Object365: Pink_Object365
Base_RefCOCO: Pink_Refcoco

Data Download

Pretraining Dataset

The pretraining dataset used in this release is the same one used by LLaVA, which is a subset of the CC-3M dataset. See the LLaVA repository for a detailed description of the dataset structure and instructions for downloading the images.

Instruction Tuning Dataset

The following datasets need to be downloaded manually.

COCO: train2017
VisualGenome: part1, part2, objects, relationships, region descriptions
Object365: Object365
A-OKVQA: A-OKVQA
LLaVA-158K: LLaVA-158K

We also provide the converted dataset used in the instruction tuning:

https://huggingface.co/datasets/SY-Xuan/Pink_sft/
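
Before starting instruction tuning, it can help to confirm that every manually downloaded dataset is in place. The sketch below is one way to do that. The sub-directory names in `EXPECTED_DIRS` and the `find_missing` helper are hypothetical and chosen only for illustration; the Pink scripts may expect a different layout, so adjust the paths to match your setup.

```python
from pathlib import Path

# Hypothetical layout: these relative paths are assumptions for illustration,
# not locations required by the Pink code.
EXPECTED_DIRS = {
    "COCO": "coco/train2017",
    "VisualGenome": "visual_genome",
    "Object365": "object365",
    "A-OKVQA": "aokvqa",
    "LLaVA-158K": "llava_158k",
}


def find_missing(data_root, expected=EXPECTED_DIRS):
    """Return the names of datasets whose directory is absent under data_root."""
    root = Path(data_root)
    return [name for name, rel in expected.items() if not (root / rel).is_dir()]
```

Calling `find_missing("/path/to/data")` returns the list of dataset names that still need to be downloaded, and an empty list when all of the assumed directories exist.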

LLaMA2 Weight Download

Our model is based on Llama-2-7b-chat-hf. You need to download the weights manually.

Llama-2-7b-chat-hf: Llama-2-7b-chat-hf

Install

Install Package

    conda create -n pink python=3.10 -y
    conda activate pink
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .

Training

Stage 1

    bash scripts/stage1.sh

Stage 2

    bash scripts/stage2.sh

Stage 2 with Object365

    bash scripts/stage2_with_object365.sh
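
The stage scripts above can also be chained from a small Python driver. This is a minimal sketch, assuming that Stage 2 starts from the Stage 1 output and that the Object365 script is an alternative to the plain Stage 2 script; the README does not state either point, so check the scripts before relying on it. The helper names `build_commands` and `run_training` are hypothetical.

```python
import subprocess

# Script paths as listed in this README.
STAGE1 = "scripts/stage1.sh"
STAGE2 = "scripts/stage2.sh"
STAGE2_OBJECT365 = "scripts/stage2_with_object365.sh"


def build_commands(use_object365=False):
    """Return the bash commands for Stage 1 followed by the chosen Stage 2 variant."""
    stage2 = STAGE2_OBJECT365 if use_object365 else STAGE2
    return [["bash", STAGE1], ["bash", stage2]]


def run_training(use_object365=False, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order.

    subprocess.run(..., check=True) raises CalledProcessError on a non-zero
    exit code, so a failed Stage 1 prevents Stage 2 from starting.
    """
    for cmd in build_commands(use_object365):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

With the default `dry_run=True` the driver only prints the commands, which is a cheap way to confirm the order before launching a long training run.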

Self-consistent Bootstrapping

We first convert the *.json annotation files of Object365. Please refer to dataset_generation/object365_detection.py.

Bootstrapping

    bash scripts/object365_generate.sh

Self-consistent

Please refer to pink/eval/object365_filter.py
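
To illustrate the general idea of a self-consistency check, the sketch below keeps a generated sample only when the box predicted back from the generated description overlaps the box it was generated for. This is an assumption about the principle, not a description of what pink/eval/object365_filter.py actually does; the field names `source_box` and `predicted_box`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are all hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def self_consistent_filter(samples, threshold=0.5):
    """Keep samples whose re-predicted box agrees with the source box.

    Each sample is a dict with hypothetical keys "source_box" (the box the
    description was generated for) and "predicted_box" (the box the model
    returns when asked to locate that description).
    """
    return [
        s for s in samples
        if box_iou(s["source_box"], s["predicted_box"]) >= threshold
    ]
```

A higher threshold keeps fewer but more reliable samples; the right value depends on how noisy the generated descriptions are.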

Evaluation

Please refer to inference.ipynb and scripts/eval_refcoco.sh.

Demo

To launch a Gradio web demo, use the following command.

    python demo.py --checkpoint-path /path/to/pink --llama-path /path/to/llama2

Citation

If you find Pink useful for your research and applications, please cite using this BibTeX:

@InProceedings{Xuan_2024_CVPR,
    author    = {Xuan, Shiyu and Guo, Qingpei and Yang, Ming and Zhang, Shiliang},
    title     = {Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {13838-13848}
}

Acknowledgement

This code builds on code from LLaVA and Shikra. Thanks to their authors for these outstanding implementations.

Contact me

If you have any questions about this code or paper, feel free to contact me at shiyu_xuan@stu.pku.edu.cn.

Related Projects

LocLLM: We leverage LLMs for human keypoint localization. LocLLM shows remarkable performance on standard 2D/3D keypoint localization benchmarks. Moreover, incorporating language clues into localization gives LocLLM superior flexibility and generalization in cross-dataset keypoint localization, and even lets it detect novel types of keypoints unseen during training.

Ant-Multi-Modal-Framework: This repository contains code for multi-modality learning from the Multimodal Cognition group of Ant Group that has been integrated into AntMMF.
