Authors: Ximeng Sun, Ping Hu, Kate Saenko
In this work, we utilize the strong alignment of textual and visual features pretrained with millions of auxiliary image-text pairs and propose Dual Context Optimization (DualCoOp) as a unified framework for partial-label MLR and zero-shot MLR. DualCoOp encodes positive and negative contexts with class names as part of the linguistic input (i.e. prompts). Since DualCoOp only introduces a very light learnable overhead upon the pretrained vision-language framework, it can quickly adapt to multi-label recognition tasks that have limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the advantages of our approach over state-of-the-art methods.
Please cite our work if you find it helpful for your research.
@inproceedings{
sun2022dualcoop,
title={DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations},
author={Ximeng Sun and Ping Hu and Kate Saenko},
booktitle={Advances in Neural Information Processing Systems},
editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
year={2022},
url={https://openreview.net/forum?id=QnajmHkhegH}
}
Our implementation uses PyTorch with Python 3.9.
Use conda env create -f environment.yml to create the conda environment.
In the conda environment, install pycocotools and randaugment with pip:
pip install pycocotools
pip install randaugment
Then follow the link to install dassl.
- MS-COCO: We use the official train2014 (82K images) and val2014 (40K images) for training and test.
- VOC2007: We use the official trainval (5K images) and test (5K images) splits for training and test.
- MS-COCO: We follow [1, 2] to split the dataset into 48 seen classes and 17 unseen classes. We provide the JSON files of the seen and unseen annotations on Google Drive. Download them and move all files into <coco_dataroot>/annotations/ for use in training and inference.
- NUS-WIDE: Following [2, 3], we use 81 human-annotated categories as unseen classes and an additional set of 925 labels obtained from Flickr tags as seen classes. We provide the class split on Google Drive. Download those folders and move them into <nus_wide_dataroot>/annotations/ for use in training and inference.
Use the following code to learn a model for MLR with partial labels:
python train.py --config_file configs/models/rn101_ep50.yaml \
--datadir <your_dataset_path> --dataset_config_file configs/datasets/<dataset>.yaml \
--input_size 448 --lr <lr_value> --loss_w <loss_weight> \
-pp <portion_of_avail_label> --csc
Some Args:
- dataset_config_file: currently the code supports configs/datasets/coco.yaml and configs/datasets/voc2007.yaml.
- lr: 0.001 for VOC2007 and 0.002 for MS-COCO.
- pp: from 0 to 1. It specifies the portion of labels that are available during training.
- loss_w: balances the loss scale across different pp values. We use a larger loss_w for a smaller pp.
- csc: specify this flag if you want to use class-specific prompts. We suggest using class-agnostic prompts when pp is very small.

Please refer to opts.py for the full argument list. For example:
python train.py --config_file configs/models/rn101_ep50.yaml \
--datadir ../datasets/mscoco_2014/ --dataset_config_file configs/datasets/coco.yaml \
--input_size 448 --lr 0.002 --loss_w 0.03 -pp 0.5
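
To make the role of pp more concrete, here is a minimal sketch of how a partial-label setting can be simulated from fully annotated data. It is not taken from train.py: the function name mask_labels and the +1 / -1 / 0 encoding (present / absent / unknown) are assumptions made for illustration.

```python
import random


def mask_labels(labels, pp, rng):
    """Keep each label with probability pp and hide the rest.

    labels: list with +1 (class present) or -1 (class absent) per class.
    Returns a new list in which hidden entries are 0 (unknown).
    NOTE: the encoding and this helper are illustrative assumptions,
    not the repository's actual data pipeline.
    """
    if not 0.0 <= pp <= 1.0:
        raise ValueError("pp must be in [0, 1]")
    return [y if rng.random() < pp else 0 for y in labels]


rng = random.Random(0)  # fixed seed so the mask is reproducible
full = [1, -1, -1, 1, -1, -1, -1, 1]
partial = mask_labels(full, pp=0.5, rng=rng)
known = sum(1 for y in partial if y != 0)
print(partial, f"{known}/{len(full)} labels kept")
```

With a small pp, fewer entries contribute to a loss that skips unknown labels, which is one plausible reading of why a larger loss_w is suggested for a smaller pp.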
Use the following code to learn a model for zero-shot MLR:
python train_zsl.py --config_file configs/models/rn50_ep50.yaml \
--datadir <your_dataset_path> --dataset_config_file configs/datasets/<dataset>.yaml \
--input_size 224 --lr <lr_value> --loss_w 0.01 --n_ctx_pos 64 --n_ctx_neg 64 \
--num_train_cls <some_value_or_not_specified>
Some Args:
- lr: 0.002 for MS-COCO and 0.001 for NUS-WIDE.
- n_ctx_pos: the length of the learnable positive prompt template.
- n_ctx_neg: the length of the learnable negative prompt template.
- num_train_cls: set as an int n. The algorithm randomly picks n classes to compute the ASL loss when the number of seen classes is very large during training, e.g. on NUS-WIDE.

Note that csc does not work for zero-shot MLR since some classes are never seen during training.
For example:
python train_zsl.py --config_file configs/models/rn50_ep50.yaml \
--datadir ../datasets/mscoco_2014/ --dataset_config_file configs/datasets/coco.yaml \
--input_size 224 --lr 0.002 --loss_w 0.01 --n_ctx_pos 64 --n_ctx_neg 64
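
The num_train_cls option described above can be pictured with the following sketch, which draws a random subset of classes and averages a commonly used form of the asymmetric loss (ASL) over that subset only. This is an illustration in plain Python, not the code in train_zsl.py: the helper names, the gamma_pos / gamma_neg / clip values, and the choice to average rather than sum are all assumptions.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def asl_term(logit, target, gamma_pos=1.0, gamma_neg=4.0, clip=0.05, eps=1e-8):
    """Asymmetric loss for one class; target is 1 (present) or 0 (absent).

    Hyperparameter values are illustrative assumptions.
    """
    p = sigmoid(logit)
    if target == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    p_m = max(p - clip, 0.0)  # shifted probability: easy negatives contribute 0
    return -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))


def subsampled_asl(logits, targets, num_train_cls, rng):
    """Average ASL over num_train_cls randomly picked classes (all if None)."""
    n_cls = len(logits)
    if num_train_cls is None or num_train_cls >= n_cls:
        picked = list(range(n_cls))
    else:
        picked = rng.sample(range(n_cls), num_train_cls)
    total = sum(asl_term(logits[c], targets[c]) for c in picked)
    return total / len(picked), picked


rng = random.Random(0)
n_seen = 925  # e.g. the number of seen labels on NUS-WIDE
logits = [rng.uniform(-3.0, 3.0) for _ in range(n_seen)]
targets = [1 if rng.random() < 0.01 else 0 for _ in range(n_seen)]
loss, picked = subsampled_asl(logits, targets, num_train_cls=100, rng=rng)
print(f"loss over {len(picked)} of {n_seen} classes: {loss:.4f}")
```

Sampling a fresh subset at every step keeps the per-step cost bounded by n while still visiting all seen classes over the course of training.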
Use the following code to evaluate a model for MLR with partial labels:
python val.py --config_file configs/models/rn101_ep50.yaml \
--datadir <your_dataset_path> --dataset_config_file configs/datasets/<dataset>.yaml \
--input_size 224 --pretrained <ckpt_path> --csc
Use the following code to evaluate a model for zero-shot MLR:
python val_zsl.py --config_file configs/models/rn50_ep50.yaml \
--datadir <your_dataset_path> --dataset_config_file configs/datasets/<dataset>.yaml \
--input_size 224 --n_ctx_pos 64 --n_ctx_neg 64 --pretrained <ckpt_path> --top_k 5
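
For readers unfamiliar with top-k evaluation, the sketch below shows one common way to compute overall precision, recall, and F1 when each image is assigned its k highest-scoring classes. How val_zsl.py computes its metrics is not described here, so treat the function name topk_metrics and the exact averaging as assumptions.

```python
def topk_metrics(scores, labels, k):
    """Overall precision, recall, and F1 when each image predicts its top-k classes.

    scores: list of per-image score lists.
    labels: matching list of per-image 0/1 ground-truth lists.
    NOTE: an illustrative protocol, not necessarily the one in val_zsl.py.
    """
    tp = n_pred = n_true = 0
    for s, y in zip(scores, labels):
        # indices of the k highest-scoring classes for this image
        top = sorted(range(len(s)), key=lambda c: s[c], reverse=True)[:k]
        tp += sum(y[c] for c in top)
        n_pred += len(top)
        n_true += sum(y)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


scores = [[0.9, 0.1, 0.8, 0.3], [0.2, 0.7, 0.1, 0.6]]
labels = [[1, 0, 0, 1], [0, 1, 0, 1]]
p, r, f1 = topk_metrics(scores, labels, k=2)
print(f"P@2={p:.2f} R@2={r:.2f} F1@2={f1:.2f}")
```

Because every image contributes exactly k predictions, precision is bounded by the average number of true labels per image divided by k, so a larger top_k generally trades precision for recall.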
[1] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. In ECCV, 2018.
[2] Avi Ben-Cohen, Nadav Zamir, Emanuel Ben-Baruch, Itamar Friedman, and Lihi Zelnik-Manor. Semantic diversity learning for zero-shot multi-label classification. In ICCV, 2021.
[3] Dat Huynh and Ehsan Elhamifar. A shared multi-attention framework for multi-label zero-shot learning. In CVPR, 2020.
We would like to thank Kaiyang Zhou for providing code for CoOp. We borrowed and refactored a large portion of his code in the implementation of our work.