MTV ⚡⚡

[NeurIPS 2024] Official Code for the Paper "Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning"

We present Multimodal Task Vectors (MTV), a novel technique for compressing many-shot multimodal in-context examples. We find that this approach not only outperforms vanilla ICL for large multimodal language models but also requires significantly less time and memory.

More details can be found in our paper.

Method Description


Our method consists of three steps. First, we perform a number of forward passes on ICL examples and take the average of the resulting activations. Second, we run REINFORCE to locate the attention heads in the language backbone that capture the given multimodal task. Third, during zero-shot inference, we intervene on the selected attention heads, replacing their current activations with the averaged activations, which we call the Multimodal Task Vector.
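The sketch below illustrates steps 1 and 3 using baukit (installed in Setup below). It is a minimal illustration, not the repository's exact implementation: `LAYER_NAMES` assumes LLaMA-style module naming, `mean_activations`, `intervene`, and the `selected` head map are illustrative names, and step 2 (REINFORCE head selection) is assumed to have already produced `selected`.

```python
# Minimal sketch (not the repo's exact code) of MTV steps 1 and 3,
# using baukit's TraceDict to read and edit activations.
import torch
from baukit import TraceDict

# Assumption: LLaMA-style attention module names; adjust for your backbone.
LAYER_NAMES = [f"model.layers.{i}.self_attn" for i in range(32)]

def mean_activations(model, icl_batches):
    """Step 1: average attention activations over forward passes on ICL examples."""
    sums, count = {}, 0
    for inputs in icl_batches:
        with torch.no_grad(), TraceDict(model, LAYER_NAMES) as traces:
            model(**inputs)
        for name in LAYER_NAMES:
            out = traces[name].output
            out = out[0] if isinstance(out, tuple) else out  # some attn modules return tuples
            act = out[:, -1].mean(dim=0)                     # last-token activation, averaged over batch
            sums[name] = sums.get(name, 0) + act
        count += 1
    return {name: s / count for name, s in sums.items()}

def intervene(model, inputs, mtv, selected):
    """Step 3: zero-shot inference, replacing activations of the REINFORCE-selected
    heads with the stored mean activations (the Multimodal Task Vector).
    `selected` maps layer name -> [(head_idx, (lo, hi)), ...] hidden-dim slices."""
    def edit(output, layer):
        out = output[0] if isinstance(output, tuple) else output
        for _head, (lo, hi) in selected.get(layer, []):
            out[:, -1, lo:hi] = mtv[layer][lo:hi]           # overwrite the selected head's slice
        return output
    with torch.no_grad(), TraceDict(model, list(selected), edit_output=edit):
        return model(**inputs)
```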

💻 Setup


Datasets

For VizWiz and OK-VQA, please follow the instructions in the Qwen-VL repository. For Flower, CUB, and DTD, please download the images from their respective official websites. We provide the 2-way 1-shot text annotations in the data folder.

Models

  1. For the models used in the paper, please follow the installation steps outlined in their official repositories.
  2. Install the baukit package by David Bau (Northeastern University):
pip install git+https://github.com/davidbau/baukit@main#egg=baukit

Please refer to models.py if you would like to use custom models.
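For a custom model, the main requirement is knowing the names of its attention modules so their activations can be traced and edited. A small, hedged helper using plain PyTorch (the default regex pattern is an assumption matching LLaMA/Qwen-style `self_attn` names; adjust it for your architecture):

```python
import re
import torch.nn as nn

def attention_layer_names(model: nn.Module, pattern: str = r"(self_)?attn$"):
    """List module names matching an attention-like pattern, suitable for
    passing to baukit's TraceDict. The default pattern is an assumption;
    it varies by architecture."""
    return [name for name, _ in model.named_modules() if re.search(pattern, name)]
```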

📝 Citation


If you find our work useful, please consider starring the repository and citing our paper. Thank you!

@inproceedings{huang2024multimodal,
  title={Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning},
  author={Huang, Brandon and Mitra, Chancharik and Arbelle, Assaf and Karlinsky, Leonid and Darrell, Trevor and Herzig, Roei},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024}
}
