Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models
Raza Imam, Hanan Gani, Muhammad Huzaifa, Karthik Nandakumar
Mohamed Bin Zayed University of Artificial Intelligence
This repository provides the official PyTorch implementation of our TTL paper:
Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models
Authors: Raza Imam, Hanan Gani, Muhammad Huzaifa, Karthik Nandakumar
Our proposed TTL vs. Existing zero-shot optimization methods.
For more details, please check out our paper.
This repository contains the implementation of TTL for image classification with a pre-trained CLIP model, comparing Test-Time (Low-rank) Adaptation against Test-Time Prompt Tuning. We consider the following configurations for TTL:
- Learnable low-rank weights are initialized with random Xavier initialization and rank r=16 (see the sketch after this list)
- The text prompt is kept frozen and initialized with a hand-crafted template (e.g., "a photo of a ___")
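The following is a minimal, illustrative sketch of this configuration, not the official ttl.py implementation: a frozen linear layer wrapped with rank-16 LoRA factors that are Xavier-initialized. Which CLIP projection layers get wrapped, and the scaling factor, are assumptions here.

```python
# Minimal sketch (not the official ttl.py): a frozen linear layer augmented
# with rank-16 low-rank factors, both initialized with Xavier initialization
# as described above. The choice of wrapped layers and scaling is illustrative.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 1.0):
        super().__init__()
        self.base = base                                # frozen pre-trained layer
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Learnable low-rank factors A (in -> r) and B (r -> out)
        self.lora_A = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.empty(base.out_features, rank))
        nn.init.xavier_uniform_(self.lora_A)            # random Xavier init
        nn.init.xavier_uniform_(self.lora_B)
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen path plus the low-rank update B(Ax)
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())
```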
This implementation is for the single-GPU configuration.
To evaluate on ImageNet, ImageNet-V2, and ImageNet-Sketch (each with 1000 classes), you will need a GPU with more than 16GB of memory (16GB alone is not enough). This codebase is tested on a GPU with 24GB of memory. To evaluate the other datasets (each with at most a few hundred classes), a GPU with 16GB of memory works fine.
The code is tested on PyTorch 2.2.1.
We suggest downloading all datasets to a root directory (${DATA_ROOT}) and renaming the directory of each dataset as suggested in ${ID_to_DIRNAME} in ./data/datautils.py. This would allow you to evaluate multiple datasets within the same run. If this is not feasible, you could evaluate different datasets separately and change ${DATA_ROOT} accordingly in the bash script.
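As a quick sanity check of the layout described above, the snippet below (illustrative, assuming ID_to_DIRNAME is the dataset-ID-to-directory-name mapping in ./data/datautils.py) lists which expected dataset directories are present under your ${DATA_ROOT}:

```python
# Illustrative check that each dataset folder under DATA_ROOT matches the
# directory name expected by ID_to_DIRNAME in ./data/datautils.py.
import os
from data.datautils import ID_to_DIRNAME  # mapping provided by this repo

DATA_ROOT = os.path.expanduser("~/datasets")  # replace with your own root dir

for set_id, dirname in ID_to_DIRNAME.items():
    path = os.path.join(DATA_ROOT, dirname)
    status = "found" if os.path.isdir(path) else "MISSING"
    print(f"{set_id:>12} -> {path} [{status}]")
```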
For out-of-distribution generalization, we consider 5 datasets: ImageNet, ImageNet-A, ImageNet-V2, ImageNet-R, and ImageNet-Sketch.
For cross-dataset generalization, we consider 10 datasets:
For cross-dataset generalization, we adopt the same train/val/test splits as CoOp. Please refer to this page, look for the download links of split_zhou_${dataset_name}.json, and put the json files under ./data/data_splits/.
We provide a script to run ttl.py under ./scripts. You can modify the paths and other arguments in the scripts.
An example to run TTL with LoRA initialization on out-of-distribution datasets:
bash ./scripts/test_ttl.sh I/A/V/R/K
The command-line argument ${TEST_SETS} can specify multiple test datasets separated by "/" (all stored under the same root directory ${DATA_ROOT}). Note that, for simplicity, we use set_id to denote different datasets. A complete list of set_id values can be found in ${ID_to_DIRNAME} in ./data/datautils.py.
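For reference, the "/"-separated string simply expands to individual set_id values that are evaluated one after another. A minimal sketch is shown below; the actual loop lives in ttl.py and the bash script, and the I/A/V/R/K shorthands are assumed here to follow the TPT convention for ImageNet and its variants.

```python
# Expand a "/"-separated TEST_SETS string into individual set_id values.
# In the TPT convention (which this repo follows for data preparation),
# I/A/V/R/K likely denote ImageNet, ImageNet-A, ImageNet-V2, ImageNet-R,
# and ImageNet-Sketch; the authoritative mapping is ID_to_DIRNAME.
test_sets = "I/A/V/R/K"
for set_id in test_sets.split("/"):
    print(f"Evaluating set_id: {set_id}")
    # ... build the dataloader for this set_id and run TTL on it ...
```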
| Method | ImageNet (IN) | IN-A | IN-V2 | IN-R | IN-Sketch | Average | OOD Average |
|---|---|---|---|---|---|---|---|
| CLIP-ViT-B/16 | 67.30 | 47.14 | 59.90 | 71.20 | 43.00 | 57.71 | 55.31 |
| Ensemble | 68.50 | 48.44 | 62.70 | 73.50 | 45.50 | 59.73 | 57.53 |
| CoOp | 72.30 | 49.25 | 65.70 | 71.50 | 47.60 | 61.27 | 58.51 |
| CoCoOp | 71.40 | 50.05 | 63.80 | 73.10 | 46.70 | 61.01 | 58.41 |
| TPT | 68.90 | 54.59 | 63.13 | 77.05 | 47.99 | 62.33 | 60.69 |
| CALIP | 66.74 | 47.76 | 60.76 | 73.99 | 46.12 | 59.07 | 57.16 |
| PromptAlign | 60.02 | 45.52 | 54.53 | 72.84 | 37.72 | 54.13 | 52.65 |
| TTL (Ours) | 70.23 | 60.51 | 64.55 | 77.54 | 48.61 | 64.29 | 62.80 |
TTL outperforms various prompt-tuning approaches (text, visual, and multi-modal prompts) in strict zero-shot scenarios. While text prompt tuning such as TPT improves zero-shot adaptation, visual and multi-modal prompt methods often underperform without pre-training. Even though PromptAlign shows potential when pre-trained, it struggles without it, reducing its effectiveness in real-world zero-shot tasks. TTL outperforms all of these approaches without relying on any form of pre-training.
At test time, TTL produces more linearly separable features for zero-shot generalization than existing baselines such as TPT and PromptAlign.
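For intuition, the sketch below shows a generic test-time confidence-maximization step in the style the paper builds on (TPT-style entropy minimization over the most confident augmented views), with only the low-rank parameters passed to the optimizer. This is an assumption-laden illustration; the exact TTL loss, augmentation policy, and hyper-parameters are defined in ttl.py and the paper.

```python
# Generic sketch of one test-time confidence-maximization step on a batch of
# augmented views of a single test image. Only the LoRA parameters should be
# in the optimizer; everything else in the CLIP model stays frozen.
import torch


def confidence_maximization_step(model, optimizer, aug_views: torch.Tensor, top_frac: float = 0.1):
    logits = model(aug_views)                              # [n_views, n_classes]
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

    # Keep only the most confident (lowest-entropy) views, as in TPT.
    k = max(1, int(top_frac * aug_views.size(0)))
    idx = entropy.topk(k, largest=False).indices

    # Minimize the entropy of the averaged (marginal) prediction, i.e.
    # maximize the confidence of the adapted model on this test sample.
    avg_probs = probs[idx].mean(dim=0)
    loss = -(avg_probs * avg_probs.clamp_min(1e-12).log()).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```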
If you find our code useful or our work relevant, please consider citing:
@misc{imam2024testtimelowrankadaptation,
title={Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models},
author={Raza Imam and Hanan Gani and Muhammad Huzaifa and Karthik Nandakumar},
year={2024},
eprint={2407.15913},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.15913},
}
We thank the authors of TPT and DeYO for their open-source implementation and instructions on data preparation.