This repository contains the code to reproduce the experiments from the paper "Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning". The paper explores how pseudolabels can be leveraged to adapt vision-language models such as CLIP to downstream tasks in a unified way across prompt modalities, learning paradigms, and training strategies.
To set up the project environment, follow these steps:
- Ensure that you have Python 3.7.4 installed. You can check the Python version by running:

  ```bash
  python --version
  ```
- Clone the repository:

  ```bash
  git clone https://github.com/BatsResearch/menghini-enhanceCLIPwithCLIP-code.git
  ```
- Navigate to the root folder and execute the `setup.sh` script to install the required dependencies, including `pytorch`. Note that the script assumes a CUDA-compatible version of `pytorch`, since GPUs are recommended for running the experiments. If you don't have access to GPUs, you can modify the script to remove the CUDA requirement (see the note after this list).

  ```bash
  cd menghini-enhanceCLIPwithCLIP-code/
  bash setup.sh
  ```
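If you do go the CPU-only route, keep in mind that `setup.sh` pins its own package versions. As a rough sketch (an assumption based on PyTorch's standard distribution channels, not a line taken from the repository's script), a CPU-only install typically looks like:

```bash
# Hypothetical CPU-only PyTorch install; match the versions pinned in setup.sh.
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
```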
The experiments are conducted on the following six datasets: Flowers102, RESICS45, FGVC-Aircraft, MNIST, EuroSAT, and DTD, which we collectively refer to as FRAMED. We use the train and test splits provided in the paper "ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models".

To access the FRAMED data, you can download it here. After downloading, unzip the folder to obtain the required data.
If you encounter any issues with the download or prefer an alternative method, you can follow these steps:
- Download the data by following the instructions provided here.
- Rename the folders as follows (a helper script covering these steps is sketched after this list):
  - `dtd/` to `DTD/`
  - `eurosat_clip/` to `EuroSAT/`
  - `fgvc-aircraft-2013b-variants102/` to `FGVCAircraft/`
  - `oxford-flower-102/` to `Flowers102/`
  - `mnist/` to `MNIST/`
  - `resisc45_clip/` to `RESICS45/`
- Ensure that each folder contains the following files:
  - `DTD/` should contain the `class_names.txt` file
  - `EuroSAT/` should contain the `class_names.txt` file
  - `FGVCAircraft/` should contain the `labels.txt` file
  - `Flowers102/` should contain the `class_names.txt` file
  - `MNIST/` should contain the `labels.txt` file
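The renaming and the checks above can be scripted. Below is a minimal sketch, assuming the unzipped folders sit in your current working directory:

```bash
# Rename the downloaded folders to the names expected by the code.
mv dtd DTD
mv eurosat_clip EuroSAT
mv fgvc-aircraft-2013b-variants102 FGVCAircraft
mv oxford-flower-102 Flowers102
mv mnist MNIST
mv resisc45_clip RESICS45

# Verify that each folder contains its expected metadata file.
for f in DTD/class_names.txt EuroSAT/class_names.txt FGVCAircraft/labels.txt \
         Flowers102/class_names.txt MNIST/labels.txt; do
  [ -f "$f" ] || echo "Missing: $f"
done
```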
Before running the experiments, create the following folders to store pseudolabels, logs, trained prompts, and evaluation results:

```bash
mkdir pseudolabels
mkdir logs
mkdir trained_prompts
mkdir evaluation
```
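Equivalently, a single command creates all four folders (the `-p` flag makes it safe to re-run):

```bash
mkdir -p pseudolabels logs trained_prompts evaluation
```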
The code is organized so that, for each learning paradigm, i.e., semi-supervised learning (SSL), unsupervised learning (UL), and transductive zero-shot learning (TRZSL), you can run any combination of prompt modality and training strategy.
- CLIP [1]:

  ```bash
  bash scripts/run_clip.sh
  ```

- Standard prompt tuning without pseudolabels: CoOp [2], VPT [3], UPT [4].
  - For SSL:

    ```bash
    bash scripts/run_prompts_ssl.sh
    ```

  - For TRZSL:

    ```bash
    bash scripts/run_prompts_trzsl.sh
    ```
To execute the training strategies employing pseudolabels across prompt modalities, run the following:

- For SSL:

  ```bash
  bash scripts/run_pseudolabels_ssl.sh
  ```

- For UL:

  ```bash
  bash scripts/run_pseudolabels_ul.sh
  ```

- For TRZSL:

  ```bash
  bash scripts/run_pseudolabels_trzsl.sh
  ```
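To launch all three paradigms sequentially, a small convenience loop (not part of the repository's scripts) does the job:

```bash
# Run the pseudolabel-based training strategies for each learning paradigm.
for paradigm in ssl ul trzsl; do
  bash "scripts/run_pseudolabels_${paradigm}.sh"
done
```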
Logs of the runs are saved in `logs/`.

The `pseudolabels/` folder gathers the pseudolabels used for each prompt modality, learning paradigm, and training strategy. For iterative methods, we store them at each iteration.

In `trained_prompts/`, we save the prompts used to make predictions. For iterative methods, we save the prompts at each iteration.

The `evaluation/` folder contains the predictions of each method.
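For orientation, the output folders are organized roughly as follows (the annotations summarize the description above; the exact file names are determined by the scripts):

```
pseudolabels/     # pseudolabels per prompt modality, learning paradigm, and training strategy
logs/             # logs of the runs
trained_prompts/  # prompts used to make predictions (per iteration for iterative methods)
evaluation/       # predictions of each method
```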
[1] Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021.

[2] Learning to Prompt for Vision-Language Models, Zhou et al., 2021.

[3] Visual Prompt Tuning, Jia et al., 2022.

[4] Unified Vision and Language Prompt Learning, Zang et al., 2022.
The tables below report the results obtained from the experiments for each prompt modality under the three learning paradigms (SSL, UL, TRZSL).
Textual prompts
|            | Flowers102 |      |       | RESICS45 |      |       | FGVCAircraft |      |       |
|------------|------------|------|-------|----------|------|-------|--------------|------|-------|
| **Method** | SSL        | UL   | TRZSL | SSL      | UL   | TRZSL | SSL          | UL   | TRZSL |
| CLIP [1]   | 63.7       | 63.7 | 63.4  | 54.5     | 54.5 | 54.5  | 17.6         | 17.6 | 17.9  |
| CoOp [2]   | 76.8       | -    | 63.2  | 58.5     | -    | 63.4  | 14.9         | -    | 21.7  |
| GRIP       | 83.6       | 69.8 | 86.3  | 74.1     | 70.6 | 81.1  | 17.0         | 15.2 | 26.1  |

|            | MNIST |      |       | EuroSAT |      |       | DTD  |      |       |
|------------|-------|------|-------|---------|------|-------|------|------|-------|
| **Method** | SSL   | UL   | TRZSL | SSL     | UL   | TRZSL | SSL  | UL   | TRZSL |
| CLIP [1]   | 25.1  | 25.1 | 20.8  | 32.9    | 32.9 | 30.5  | 43.2 | 43.2 | 43.4  |
| CoOp [2]   | 56.4  | -    | 21.2  | 59.5    | -    | 49.7  | 37.1 | -    | 46.3  |
| GRIP       | 71.8  | 67.9 | 74.1  | 58.7    | 57.2 | 92.3  | 56.1 | 46.1 | 65.3  |
Visual prompts
|            | Flowers102 |      |       | RESICS45 |      |       | FGVCAircraft |      |       |
|------------|------------|------|-------|----------|------|-------|--------------|------|-------|
| **Method** | SSL        | UL   | TRZSL | SSL      | UL   | TRZSL | SSL          | UL   | TRZSL |
| CLIP [1]   | 63.7       | 63.7 | 63.4  | 54.5     | 54.5 | 54.5  | 17.6         | 17.6 | 17.9  |
| VPT [3]    | 63.7       | -    | 64.7  | 60.8     | -    | 67.1  | 17.8         | -    | 26.7  |
| GRIP       | 67.9       | 63.1 | 77.2  | 71.2     | 68.4 | 82.2  | 19.4         | 17.5 | 26.4  |

|            | MNIST |      |       | EuroSAT |      |       | DTD  |      |       |
|------------|-------|------|-------|---------|------|-------|------|------|-------|
| **Method** | SSL   | UL   | TRZSL | SSL     | UL   | TRZSL | SSL  | UL   | TRZSL |
| CLIP [1]   | 25.1  | 25.1 | 20.8  | 32.9    | 32.9 | 30.5  | 43.2 | 43.2 | 43.4  |
| VPT [3]    | 42.5  | -    | 25.5  | 47.1    | -    | 62.2  | 36.4 | -    | 44.2  |
| GRIP       | 69.7  | 68.0 | 69.5  | 63.5    | 63.7 | 97.0  | 54.6 | 50.5 | 62.8  |
Multimodal prompts
|            | Flowers102 |      |       | RESICS45 |      |       | FGVCAircraft |      |       |
|------------|------------|------|-------|----------|------|-------|--------------|------|-------|
| **Method** | SSL        | UL   | TRZSL | SSL      | UL   | TRZSL | SSL          | UL   | TRZSL |
| CLIP [1]   | 63.7       | 63.7 | 63.4  | 54.5     | 54.5 | 54.5  | 17.6         | 17.6 | 17.9  |
| UPT [4]    | 68.0       | -    | 61.1  | 62.8     | -    | 58.8  | 11.1         | -    | 15.9  |
| GRIP       | 74.6       | 64.8 | 82.0  | 73.7     | 69.4 | 82.2  | 17.4         | 14.7 | 17.9  |

|            | MNIST |      |       | EuroSAT |      |       | DTD  |      |       |
|------------|-------|------|-------|---------|------|-------|------|------|-------|
| **Method** | SSL   | UL   | TRZSL | SSL     | UL   | TRZSL | SSL  | UL   | TRZSL |
| CLIP [1]   | 25.1  | 25.1 | 20.8  | 32.9    | 32.9 | 30.5  | 43.2 | 43.2 | 43.4  |
| UPT [4]    | 64.4  | -    | 63.6  | 68.9    | -    | 60.4  | 43.7 | -    | 36.9  |
| GRIP       | 65.9  | 68.2 | 73.8  | 60.4    | 61.5 | 95.5  | 54.1 | 47.4 | 64.4  |
If you find this work helpful, please consider citing the following paper:
```bibtex
@inproceedings{
menghini2023enhancing,
title={Enhancing {CLIP} with {CLIP}: Exploring Pseudolabeling for Limited-Label Prompt Tuning},
author={Cristina Menghini and Andrew Delworth and Stephen Bach},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=2b9aY2NgXE}
}
```