SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models

Official implementation of SVL-Adapter, presented in our BMVC 2022 paper. A summary video can be found here.

Figure: Overview of the SVL-Adapter approach.

Quantitative Results

Here you can find the numerical results of the low-shot and zero-shot evaluation of SVL-Adapter across all datasets explored. The numbers correspond to Figures 3 and 4 of the paper.

Prerequisites

Set up Environment

We start by creating a conda environment with the required dependencies.

git clone https://github.com/omipan/svl_adapter/
cd svl_adapter

# Create conda environment
conda create -n svl_adapter python=3.8
conda activate svl_adapter

# Install Requirements
pip install -r requirements.txt
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

# Install CLIP
pip install git+https://github.com/openai/CLIP.git

Fetch Datasets

Follow DATASET.md for instructions on setting up the 16 datasets used in the paper. Data loading follows the procedure described in the CoOp repository. We thank its authors for their great work.

Data Preparation

First, we generate a metadata file with a common format across datasets, which is used in all stages (self-supervised pretraining, low-shot learning, etc.):

python data_preparation.py --dataset eurosat
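
To illustrate why a single metadata file is convenient, here is a minimal sketch of one table that every stage can filter by split. The column names (`img_path`, `label`, `split`) and the helper functions are hypothetical and chosen for illustration only; the file actually written by `data_preparation.py` may use a different schema.

```python
import csv
import io

# Hypothetical columns; the real metadata file may differ.
FIELDS = ["img_path", "label", "split"]


def write_metadata(rows, handle):
    """Write one row per image to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_split(handle, split):
    """Return only the rows belonging to the requested split."""
    return [row for row in csv.DictReader(handle) if row["split"] == split]


if __name__ == "__main__":
    rows = [
        {"img_path": "images/a.jpg", "label": "forest", "split": "train"},
        {"img_path": "images/b.jpg", "label": "river", "split": "test"},
    ]
    buffer = io.StringIO()
    write_metadata(rows, buffer)
    buffer.seek(0)
    print(read_split(buffer, "train"))
```

With a layout like this, self-supervised pretraining could read every row regardless of label, while low-shot adaptation would read only a labeled subset of the training split.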

Self-Supervised Learning Pretraining

We train a self-supervised learning model (here SimCLR) given the images that are available for each task at hand:

python  ssl_pretraining.py --train_loss simclr --backbone resnet50 --dataset eurosat --output_dir data/eurosat/ssl_pretraining/models/

Low-Shot Learning

Feature Extraction

To save time during low-shot adaptation, we extract features from the self-supervised pretrained encoder and the CLIP visual encoder only once per dataset. These features are kept frozen while SVL-Adapter is trained. For SSL features, we run:

python feature_extraction.py --dataset 'eurosat'  --model_type 'ssl'  --model_subtype 'RN50' --feature_path 'data/eurosat/pretrained_features/'  --model_dir 'data/eurosat/ssl_pretraining/models/'

and for CLIP visual features:

python feature_extraction.py --dataset 'eurosat'  --model_type 'clip' --model_subtype 'RN50' --feature_path 'data/eurosat/pretrained_features/'
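
The extract-once pattern described above can be sketched as a small disk cache. This is an illustration only, not the logic of `feature_extraction.py`: the helper `load_or_extract`, the pickle format, and the example file name are all hypothetical.

```python
import os
import pickle


def load_or_extract(cache_path, extract_fn):
    """Return cached features if they exist; otherwise compute and store them.

    extract_fn is any zero-argument callable that runs the (frozen) encoder
    over the dataset and returns a picklable object.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    features = extract_fn()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as handle:
        pickle.dump(features, handle)
    return features


if __name__ == "__main__":
    import tempfile

    def fake_encoder():
        # Stand-in for a real forward pass over the dataset.
        return {"img_0": [0.1, 0.2, 0.3]}

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "pretrained_features", "example.pkl")
        print(load_or_extract(path, fake_encoder))  # computed and cached
        print(load_or_extract(path, fake_encoder))  # read back from disk
```

Because the encoders stay frozen, the cached features never go stale during adapter training, so repeated low-shot runs can skip the expensive forward passes.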

SVL-Adapter and SVL-Adapter* for Low-Shot Learning

Given features from the self-supervised (here SimCLR) encoder trained on the task's own data and features from the general-purpose CLIP visual encoder, we train our SVL-Adapter model:

python low_shot_adapt.py --dataset 'eurosat' --feature_path 'data/eurosat/pretrained_features/' --pretrained_model 'simclr_RN50' --finetune_type 'mlp_adapter' --epochs 50 --tune_alpha

Similarly, to train SVL-Adapter*, the variant that does not need an additional validation set to choose the parameter balancing CLIP against the self-supervised adapter, we run:

python low_shot_adapt.py --dataset 'eurosat'  --feature_path 'data/eurosat/pretrained_features/' --pretrained_model 'simclr_RN50' --finetune_type 'mlp_adapter' --epochs 50 --confidence_alpha
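
The role of the balancing parameter can be sketched as follows. This is one plausible reading of the two flags, not the code in `low_shot_adapt.py`: it assumes that alpha weights the CLIP class probabilities and (1 - alpha) weights the adapter's, that `--tune_alpha` corresponds to a search on held-out labeled data, and that `--confidence_alpha` derives alpha from CLIP's own confidence. All function names here are hypothetical.

```python
def blend(clip_probs, adapter_probs, alpha):
    """Mix two class-probability vectors; alpha is the weight given to CLIP."""
    return [alpha * c + (1 - alpha) * a for c, a in zip(clip_probs, adapter_probs)]


def tune_alpha(clip_batch, adapter_batch, labels, grid=None):
    """Pick the alpha with the best accuracy on labeled validation data."""
    grid = grid or [i / 10 for i in range(11)]

    def accuracy(alpha):
        correct = 0
        for c, a, y in zip(clip_batch, adapter_batch, labels):
            probs = blend(c, a, alpha)
            correct += int(probs.index(max(probs)) == y)
        return correct / len(labels)

    # max() keeps the first grid value among ties.
    return max(grid, key=accuracy)


def confidence_alpha(clip_batch):
    """Assumed validation-free rule: mean of CLIP's top class probability."""
    return sum(max(probs) for probs in clip_batch) / len(clip_batch)


if __name__ == "__main__":
    clip_batch = [[0.9, 0.1], [0.6, 0.4]]
    adapter_batch = [[0.2, 0.8], [0.7, 0.3]]
    alpha = confidence_alpha(clip_batch)
    print(alpha, [blend(c, a, alpha) for c, a in zip(clip_batch, adapter_batch)])
```

Under this reading, a confident CLIP model keeps most of the weight, while an unsure one hands more of the decision to the task-specific adapter, without requiring extra labeled validation images.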

Zero-Shot Learning

For zero-shot learning, we suggest using zero-shot CLIP to generate pseudolabels for SVL-Adapter* and then proceeding as in low-shot learning.

Generate a metadata file that uses pseudolabeled data during low-shot adaptation

First, we generate a metadata file that treats CLIP's pseudolabels (its zero-shot predictions) as the ground-truth labels used during the adaptation stage. In the example below, we keep the 16 most confident pseudolabels for each predicted category.

python generate_clip_pseudolabels.py --dataset eurosat --model_subtype RN50 --imgs_per_label 16
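
The selection step described above, keeping the most confident pseudolabels per predicted category, can be sketched as below. The function name and the `(image_id, predicted_label, confidence)` input layout are assumptions made for illustration; `generate_clip_pseudolabels.py` may organize its data differently.

```python
from collections import defaultdict


def select_confident_pseudolabels(predictions, imgs_per_label=16):
    """Keep the imgs_per_label most confident images for each predicted class.

    predictions is an iterable of (image_id, predicted_label, confidence)
    tuples, e.g. obtained from zero-shot CLIP softmax scores.
    """
    by_label = defaultdict(list)
    for image_id, label, confidence in predictions:
        by_label[label].append((confidence, image_id))

    selected = {}
    for label, items in by_label.items():
        # Sort by confidence, highest first; ties keep their input order.
        items.sort(key=lambda pair: -pair[0])
        selected[label] = [image_id for _, image_id in items[:imgs_per_label]]
    return selected


if __name__ == "__main__":
    preds = [("a", 0, 0.9), ("b", 0, 0.5), ("c", 0, 0.7), ("d", 1, 0.6)]
    print(select_confident_pseudolabels(preds, imgs_per_label=2))
```

Note that a class CLIP rarely predicts may end up with fewer than `imgs_per_label` images, and a class it never predicts gets none, so the resulting pseudo-training set is not guaranteed to be balanced.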

Feature Extraction

Again, we extract features from the self-supervised pretrained encoder and the CLIP visual encoder for low-shot adaptation, but the training set is now replaced by images paired with their CLIP-generated pseudolabels.

python feature_extraction.py --dataset 'eurosat'  --model_type 'ssl' --model_subtype 'RN50' --feature_path 'data/eurosat/pretrained_features/' --use_pseudo --pseudo_conf '16shot' --pseudolabel_model 'clip_RN50' --model_dir 'data/eurosat/ssl_pretraining/models/'
python feature_extraction.py --dataset 'eurosat'  --model_type 'clip' --model_subtype 'RN50' --feature_path 'data/eurosat/pretrained_features/' --use_pseudo --pseudo_conf '16shot' --pseudolabel_model 'clip_RN50'

SVL-Adapter* for Zero-Shot Learning

python zero_shot_pseudo_adapt.py --dataset 'eurosat' --feature_path 'data/eurosat/pretrained_features/' --pretrained_model 'simclr_RN50'  --pseudolabel_model 'clip_RN50' --pseudo_conf '16shot' --finetune_type 'mlp_adapter'

Reference

If you find our work useful in your research, please consider citing our paper:

@inproceedings{svladapterbmvc2022,
    author    = {Pantazis, Omiros and Brostow, Gabriel and Jones, Kate and Mac Aodha, Oisin},
    title     = {SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models},
    booktitle = {British Machine Vision Conference (BMVC)},
    year      = {2022}
}

Acknowledgements

This codebase has benefited from the CLIP, CoOp and FocusOnThePositives GitHub repositories.
