[NeurIPS 2024] Official Code for the Paper "Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning"
We present Multimodal Task Vectors (MTV), a novel technique for compressing many-shot multimodal in-context examples. We find that this approach not only outperforms vanilla ICL for Large Multimodal Language Models but also requires significantly less time and memory.
More details can be found in our paper.
Our method consists of three steps. The first step performs forward passes on many-shot ICL examples and averages the activations across these passes. The second step runs REINFORCE to locate the attention heads in the language backbone that capture the given multimodal task. During zero-shot inference, we intervene on the selected attention heads, replacing their current activations with the averaged activations, which we call the Multimodal Task Vector.
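The sketch below illustrates this pipeline using baukit's `TraceDict` to read and edit activations. It is a minimal, assumed version rather than the repository's actual implementation: the helper names (`mean_head_activations`, `replace_selected_heads`), the layer-level (rather than per-head) granularity, and the choice of traced modules are all illustrative, and the REINFORCE head-selection step is represented only by a precomputed `selected` set.

```python
# Illustrative sketch of the MTV pipeline (assumed names, not the repo's API).
import torch
from baukit import TraceDict

def mean_head_activations(model, icl_batches, layer_names):
    """Step 1 (sketch): average attention-output activations over forward
    passes on many-shot ICL example batches. `layer_names` is assumed to
    point at modules whose output is a plain tensor (e.g. attention output
    projections)."""
    sums, count = {}, 0
    for batch in icl_batches:                      # each batch: dict of input tensors
        with torch.no_grad(), TraceDict(model, layer_names) as traces:
            model(**batch)                         # one forward pass on ICL examples
        for name in layer_names:
            last = traces[name].output[:, -1, :]   # keep the last-token activation
            sums[name] = sums.get(name, 0) + last
        count += 1
    return {name: s / count for name, s in sums.items()}

def replace_selected_heads(mtv_acts, selected):
    """Step 3 (sketch): build an edit hook that overwrites the activations of
    the selected attention locations with the averaged (MTV) activations.
    Assumes matching batch sizes (e.g. batch size 1)."""
    def edit(output, layer):
        if layer in selected:
            output[:, -1, :] = mtv_acts[layer].to(output.dtype)
        return output
    return edit

# Step 2 (head selection via REINFORCE) is omitted here; assume `selected`
# holds the chosen locations. Zero-shot inference with intervention:
#   mtv = mean_head_activations(model, icl_batches, layer_names)
#   with TraceDict(model, layer_names,
#                  edit_output=replace_selected_heads(mtv, selected)):
#       out = model.generate(**query_inputs)
```

In the actual method, the intervention operates on individual attention heads inside the language backbone; the layer-level edit above only keeps the sketch short.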
For VizWiz and OKVQA, please follow the instructions in the Qwen-VL repository. For Flower, CUB, and DTD, please download the images from their respective official websites. We provide the 2-way 1-shot text annotations in the data file.
- For the models used in the paper, please follow the installation steps outlined in their official repositories.
- Install [baukit](https://github.com/davidbau/baukit) by David Bau at Northeastern University:

  ```bash
  pip install git+https://github.com/davidbau/baukit@main#egg=baukit
  ```
Please refer to `models.py` if you would like to use custom models.
If you find our work useful, please consider starring this repository and citing our paper. Thank you!
```bibtex
@inproceedings{huang2024multimodal,
  title={Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning},
  author={Huang, Brandon and Mitra, Chancharik and Arbelle, Assaf and Karlinsky, Leonid and Darrell, Trevor and Herzig, Roei},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024}
}
```