Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models
Raza Imam, Hanan Gani, Muhammad Huzaifa, Karthik Nandakumar
Mohamed Bin Zayed University of Artificial Intelligence
This repository provides the official PyTorch implementation of our TTL paper:
Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models
Authors: Raza Imam, Hanan Gani, Muhammad Huzaifa, Karthik Nandakumar
Our proposed TTL vs. Existing zero-shot optimization methods.
For more details, please check out our paper.
This repository contains the implementation of TTL for image classification with a pre-trained CLIP model, comparing Test-Time (Low-rank) Adaptation against Test-Time Prompt Tuning. We consider the following configurations for TTL:
- Learnable low-rank weights are initialized with random Xavier initialization and rank r=16 (see the sketch after this list)
- The text prompt is kept frozen and initialized with a hand-crafted template (e.g., "a photo of a ___")
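The following is a minimal, illustrative sketch of this configuration, not the official ttl.py implementation: a frozen linear layer wrapped with rank-16 LoRA factors that are Xavier-initialized. Which CLIP projection layers get wrapped, and the scaling factor, are assumptions here.

```python
# Minimal sketch (not the official ttl.py): a frozen linear layer augmented
# with rank-16 low-rank factors, both initialized with Xavier initialization
# as described above. The choice of wrapped layers and scaling is illustrative.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 1.0):
        super().__init__()
        self.base = base                                # frozen pre-trained layer
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Learnable low-rank factors A (in -> r) and B (r -> out)
        self.lora_A = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.empty(base.out_features, rank))
        nn.init.xavier_uniform_(self.lora_A)            # random Xavier init
        nn.init.xavier_uniform_(self.lora_B)
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen path plus the low-rank update B(Ax)
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())
```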
This implementation is for the single-GPU configuration.
To evaluate on ImageNet, ImageNet-V2, and ImageNet-Sketch (each with 1000 classes), you will need a GPU with more than 16GB of memory (16GB alone is not enough). This codebase is tested on a GPU with 24GB of memory. To evaluate the other datasets (each with at most a few hundred classes), a GPU with 16GB of memory works fine.
The code is tested on PyTorch 2.2.1.
We suggest downloading all datasets to a root directory (${DATA_ROOT}) and renaming the directory of each dataset as suggested in ${ID_to_DIRNAME} in ./data/datautils.py. This would allow you to evaluate multiple datasets within the same run. If this is not feasible, you could evaluate different datasets separately and change ${DATA_ROOT} accordingly in the bash script.
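As a quick sanity check of the layout described above, the snippet below (illustrative, assuming ID_to_DIRNAME is the dataset-ID-to-directory-name mapping in ./data/datautils.py) lists which expected dataset directories are present under your ${DATA_ROOT}:

```python
# Illustrative check that each dataset folder under DATA_ROOT matches the
# directory name expected by ID_to_DIRNAME in ./data/datautils.py.
import os
from data.datautils import ID_to_DIRNAME  # mapping provided by this repo

DATA_ROOT = os.path.expanduser("~/datasets")  # replace with your own root dir

for set_id, dirname in ID_to_DIRNAME.items():
    path = os.path.join(DATA_ROOT, dirname)
    status = "found" if os.path.isdir(path) else "MISSING"
    print(f"{set_id:>12} -> {path} [{status}]")
```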
For out-of-distribution generalization, we consider 5 datasets: ImageNet, ImageNet-A, ImageNet-V2, ImageNet-R, and ImageNet-Sketch.
For cross-dataset generalization, we consider 10 datasets:
For cross-dataset generalization, we adopt the same train/val/test splits as CoOp. Please refer to this page, look for the download links of split_zhou_${dataset_name}.json, and put the json files under ./data/data_splits/.
We provide a script to run ttl.py under ./scripts. You can modify the paths and other arguments in the scripts.
An example to run TTL with LoRA initialization on out-of-distribution datasets:
bash ./scripts/test_ttl.sh I/A/V/R/K
The command-line argument ${TEST_SETS} can specify multiple test datasets separated by "/" (all stored under the same root directory ${DATA_ROOT}). Note that, for simplicity, we use set_id to denote different datasets. A complete list of set_id values can be found in ${ID_to_DIRNAME} in ./data/datautils.py.
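For reference, the "/"-separated string simply expands to individual set_id values that are evaluated one after another. A minimal sketch is shown below; the actual loop lives in ttl.py and the bash script, and the I/A/V/R/K shorthands are assumed here to follow the TPT convention for ImageNet and its variants.

```python
# Expand a "/"-separated TEST_SETS string into individual set_id values.
# In the TPT convention (which this repo follows for data preparation),
# I/A/V/R/K likely denote ImageNet, ImageNet-A, ImageNet-V2, ImageNet-R,
# and ImageNet-Sketch; the authoritative mapping is ID_to_DIRNAME.
test_sets = "I/A/V/R/K"
for set_id in test_sets.split("/"):
    print(f"Evaluating set_id: {set_id}")
    # ... build the dataloader for this set_id and run TTL on it ...
```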
| Method | ImageNet (IN) | IN-A | IN-V2 | IN-R | IN-Sketch | Average | OOD Average |
|---|---|---|---|---|---|---|---|
| CLIP-ViT-B/16 | 67.30 | 47.14 | 59.90 | 71.20 | 43.00 | 57.71 | 55.31 |
| Ensemble | 68.50 | 48.44 | 62.70 | 73.50 | 45.50 | 59.73 | 57.53 |
| CoOp | 72.30 | 49.25 | 65.70 | 71.50 | 47.60 | 61.27 | 58.51 |
| CoCoOp | 71.40 | 50.05 | 63.80 | 73.10 | 46.70 | 61.01 | 58.41 |
| TPT | 68.90 | 54.59 | 63.13 | 77.05 | 47.99 | 62.33 | 60.69 |
| CALIP | 66.74 | 47.76 | 60.76 | 73.99 | 46.12 | 59.07 | 57.16 |
| PromptAlign | 60.02 | 45.52 | 54.53 | 72.84 | 37.72 | 54.13 | 52.65 |
| TTL (Ours) | 70.23 | 60.51 | 64.55 | 77.54 | 48.61 | 64.29 | 62.80 |
TTL outperforms various prompt-tuning approaches (text, visual, and multi-modal prompts) in strict zero-shot scenarios. While text prompt tuning such as TPT improves zero-shot adaptation, visual and multi-modal prompt methods often underperform without pre-training. Even though PromptAlign shows potential when pre-trained, it struggles without it, reducing its effectiveness in real-world zero-shot tasks. TTL outperforms all of these approaches without relying on any form of pre-training.
At test time, TTL produces more linearly separable features for zero-shot generalization than existing baselines such as TPT and PromptAlign.
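For intuition, the sketch below shows a generic test-time confidence-maximization step in the style the paper builds on (TPT-style entropy minimization over the most confident augmented views), with only the low-rank parameters passed to the optimizer. This is an assumption-laden illustration; the exact TTL loss, augmentation policy, and hyper-parameters are defined in ttl.py and the paper.

```python
# Generic sketch of one test-time confidence-maximization step on a batch of
# augmented views of a single test image. Only the LoRA parameters should be
# in the optimizer; everything else in the CLIP model stays frozen.
import torch


def confidence_maximization_step(model, optimizer, aug_views: torch.Tensor, top_frac: float = 0.1):
    logits = model(aug_views)                              # [n_views, n_classes]
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

    # Keep only the most confident (lowest-entropy) views, as in TPT.
    k = max(1, int(top_frac * aug_views.size(0)))
    idx = entropy.topk(k, largest=False).indices

    # Minimize the entropy of the averaged (marginal) prediction, i.e.
    # maximize the confidence of the adapted model on this test sample.
    avg_probs = probs[idx].mean(dim=0)
    loss = -(avg_probs * avg_probs.clamp_min(1e-12).log()).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```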
If you find our code useful or our work relevant, please consider citing:
@misc{imam2024testtimelowrankadaptation,
title={Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models},
author={Raza Imam and Hanan Gani and Muhammad Huzaifa and Karthik Nandakumar},
year={2024},
eprint={2407.15913},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.15913},
}
We thank the authors of TPT and DeYO for their open-source implementation and instructions on data preparation.