This repository will host the official PyTorch implementation of Cross-Modal Adapter.
Title: Cross-Modal Adapter for Text-Video Retrieval
Authors: Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni,
Jiwen Lu, Jie Zhou, Shiji Song, Gao Huang (Corresponding Author)
Institutes: Tsinghua University, BNRist, and Beijing Institute of Technology
Publication: arXiv preprint (arXiv:2211.09623)
Contact: jhj20 at mails dot tsinghua dot edu dot cn
In this paper, we present a novel Cross-Modal Adapter for parameter-efficient fine-tuning. Although surprisingly simple, our approach has three notable benefits: (1) it reduces the number of fine-tuned parameters by 99.6% and alleviates overfitting, (2) it saves approximately 30% of training time, and (3) it keeps all pre-trained parameters fixed, so the pre-trained model can be shared across datasets.
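The exact Cross-Modal Adapter architecture is described in the paper; as a rough illustration of the underlying parameter-efficient fine-tuning idea (a frozen pre-trained backbone plus small trainable bottleneck modules), a minimal sketch could look like the following. The class name `BottleneckAdapter`, the reduction ratio, and the `"adapter"` parameter-name convention are illustrative assumptions, not the actual API of this repository.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: down-project, non-linearity, up-project,
    added residually to the output of a frozen layer (illustrative only)."""
    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.act = nn.GELU()
        # Zero-init the up-projection so the adapter starts as an identity mapping
        # and training begins from the pre-trained model's behavior.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


def freeze_backbone_except_adapters(model: nn.Module) -> None:
    """Freeze all pre-trained parameters; only parameters whose names contain
    'adapter' (an assumed naming convention) remain trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```

Under this kind of setup, only the adapter weights receive gradients, which is what makes the fraction of fine-tuned parameters so small and lets a single frozen copy of the pre-trained model be reused across datasets.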
Our implementation is mainly based on the following codebases. We sincerely thank the authors for their wonderful work.
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.
- hyperformer: Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks.