Abstract: Weakly-supervised object localization (WSOL) methods aim to capture the extent of the target object without full supervision such as bounding boxes or segmentation masks. Although numerous studies have been conducted in the research field of WSOL, we find that most existing methods are less effective at localizing small objects. In this paper, we first analyze why previous studies have overlooked this problem. Based on the analysis, we propose two remedies: 1) new evaluation metrics and a dataset to accurately measure localization performance for small objects, and 2) a novel consistency learning framework to zoom in on small objects so the model can perceive them more clearly. Our extensive experimental results demonstrate that the proposed method significantly improves small object localization on four different backbone networks and four different datasets, without sacrificing the performance of medium and large objects. In addition to these gains, our method can be easily applied to existing WSOL methods as it does not require any changes to the model architecture or data input pipeline.
Official implementation of "Small Object Matters in Weakly Supervised Object Localization"
Most of our code originates from this repository.
Run the following commands to build and start the Docker container. Modify the `pytorch/pytorch:latest` tag in `Dockerfile`, if necessary.
```bash
docker build . -t wsol_test
docker run -it -d --gpus '"device=0"' --shm-size=16G --name wsol_test wsol_test:latest
docker exec -it wsol_test /bin/bash
```
Environment (as reported by `pip freeze`):
```
munch==2.5.0
sklearn==0.0
opencv-python==4.5.5.64
torch==1.11.0
torchvision==0.12.0
```
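If you prefer to set up the environment without Docker, here is a minimal installation sketch (it assumes a Python version compatible with torch 1.11, and installs `scikit-learn` directly instead of the deprecated `sklearn` stub listed above):

```bash
# Install the packages listed above without Docker (sketch only).
pip install munch==2.5.0 scikit-learn opencv-python==4.5.5.64 \
    torch==1.11.0 torchvision==0.12.0
```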
We borrowed the dataset preparation scripts from the original repository.
ImageNet
To prepare the ImageNet data, download the ImageNet "train" and "val" splits from here and put the downloaded files at `dataset/ILSVRC2012_img_train.tar` and `dataset/ILSVRC2012_img_val.tar`. Then, run the following command from the root directory to extract the images.

```bash
sh dataset/prepare_imagenet.sh
```
CUB
Run the following command to download the original CUB dataset and extract the image files in the root directory.

```bash
sh dataset/prepare_cub.sh
```
Note: you can also download the CUBV2 dataset from here and the CUBSmall dataset from here. Put the downloaded file at `dataset/CUBV2.tar` and then run the above script.
OpenImages
To download and extract the files, run the following command from the root directory.

```bash
sh dataset/prepare_openimages.sh
```
Note: you can also download the OpenImages30k dataset from here (images, masks). Put the downloaded `OpenImages_images.zip` and `OpenImages_annotations.zip` files in the `dataset` directory and run the above script.
Run the following command to train the ResNet50 network on the ImageNet dataset.

```bash
sh scripts/resnet_imagenet_ours.sh
```
To reproduce our experimental results, we include all training scripts for the three backbones and three datasets in `./scripts/`.
- You must modify the `--data_root` and `--mask_root` arguments to point to your own local paths (see the sketch below).
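A minimal sketch of how those lines might look inside one of the training scripts; the paths below are placeholders for your local dataset and mask locations:

```bash
--data_root /your/path/to/dataset \
--mask_root /your/path/to/masks \
```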
Training logs and checkpoints will be saved in `./train_log/`.
In our paper, we applied our method to three state-of-the-art methods:
Domain Adaptation (DA), Bridging the Gap (Brid) and IVR.
To reproduce the results, first download the pretrained models here. Then, run the corresponding command:
- Domain Adaptation (ResNet50, ImageNet): `sh scripts/resnet_imagenet_ours_with_da.sh`
- Bridging the Gap (ResNet50, ImageNet): `sh scripts/resnet_imagenet_ours_with_brid.sh`
- IVR (ResNet50, ImageNet): `sh scripts/resnet_imagenet_ours_with_ivr.sh`
The `percentile` values for each dataset and architecture are reported in the following table.
| ImageNet | ResNet50 | VGG16 | Inception V3 |
|---|---|---|---|
| `percentile` | 0.3 | 0.2 | 0.4 |
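If you need to set this value manually, we assume (based on the `percentile=self.args.percentile` argument in the integration snippet further below) that it is exposed as a `--percentile` flag; a sketch for ResNet50 on ImageNet:

```bash
# Hypothetical flag; verify the flag name against the provided scripts.
--percentile 0.3 \
```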
To evaluate a trained checkpoint with the size-based metrics, add the following arguments to the script.
```bash
--checkpoint_path train_log/resnet_imagenet_ours/last_checkpoint.pth.tar \
--eval_on_val_and_test False \
--eval_size_ratio True
```
Hyperparameters (λ₁, λ₂, λ₃, τ, ν) for each architecture and dataset:

ImageNet
| Architecture | λ₁ | λ₂ | λ₃ | τ | ν |
|---|---|---|---|---|---|
| ResNet50 | 0.90 | 0.10 | 0.90 | 0.15 | 0.30 |
| VGG16 | 0.80 | 0.50 | 0.80 | 0.50 | 0.90 |
| Inception | 0.60 | 0.70 | 0.70 | 0.50 | 0.60 |

CUB
| Architecture | λ₁ | λ₂ | λ₃ | τ | ν |
|---|---|---|---|---|---|
| ResNet50 | 0.50 | 0.20 | 0.80 | 0.20 | 0.80 |
| VGG16 | 0.90 | 0.10 | 0.70 | 0.70 | 0.20 |
| Inception | 1.00 | 0.20 | 0.80 | 0.40 | 0.10 |

OpenImages
| Architecture | λ₁ | λ₂ | λ₃ | τ | ν |
|---|---|---|---|---|---|
| ResNet50 | 1.50 | 0.50 | 0.50 | 0.30 | 1.00 |
| VGG16 | 1.00 | 0.70 | 0.70 | 0.05 | 0.10 |
| Inception | 0.20 | 1.30 | 0.90 | 0.15 | 0.10 |
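If you want to plug these values into your own scripts, note that the flag names below are our assumption, inferred from the `CropCAM` arguments in the integration snippet further below (λ₁/λ₂/λ₃ as `--loss_ratio`/`--loss_pos`/`--loss_neg`, τ as `--crop_threshold`, ν as `--crop_ratio`); please verify the mapping against the provided scripts:

```bash
# Hypothetical flags for the ResNet50 / ImageNet row of the table above.
--loss_ratio 0.90 \
--loss_pos 0.10 \
--loss_neg 0.90 \
--crop_threshold 0.15 \
--crop_ratio 0.30 \
```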
- `--eval_on_val_and_test`: whether to evaluate on the val or the test split.
- `--eval_size_ratio`: print the scores evaluated with the `MaxBoxAcc^S` and `MaxBoxAcc^mean` metrics.
You can download pre-trained models here.
- Pretrained models trained on ImageNet, CUB, and OpenImages using three architectures (ResNet50, VGG16, InceptionV3) are available.
- File name example: `resnet_imagenet_ours.pth.tar` is the model trained on the ImageNet dataset using ResNet50.
- We uploaded all the models to the Google Drive of an anonymous account.
To evaluate a pre-trained model, set the `--checkpoint_path` argument to the path of the downloaded file, for example:
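A minimal sketch, assuming the downloaded checkpoint was saved to a local `checkpoints/` directory (placeholder path):

```bash
--checkpoint_path checkpoints/resnet_imagenet_ours.pth.tar \
```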
Our method can be easily applied to other methods:
- First, copy `./wsol/method/crop.py` into your code repository.
- Next, add the following snippets to your code.
```python
# main.py
self.crop_module = wsol.method.CropCAM(self.args.large_feature_map,
                                       self.args.original_feature_map,
                                       architecture=self.args.architecture,
                                       # Hyperparameters
                                       loss_ratio=self.args.loss_ratio,
                                       loss_pos=self.args.loss_pos,
                                       loss_neg=self.args.loss_neg,
                                       crop_threshold=self.args.crop_threshold,
                                       crop_ratio=self.args.crop_ratio,
                                       # For CAAM
                                       attention_cam=self.args.attention_cam,
                                       # For attaching the module to other WSOL methods
                                       wsol_method=self.args.wsol_method,
                                       other_method_loss_ratio=self.args.other_method_loss_ratio,
                                       crop_method_loss_ratio=self.args.crop_method_loss_ratio,
                                       # For the different normalization methods
                                       norm_method=self.args.norm_method,
                                       percentile=self.args.percentile,
                                       crop_with_norm=self.args.crop_with_norm)
```
```python
# main.py
# Use the crop module once crop_start_epoch is reached.
if epoch >= self.args.crop_start_epoch:
    output_dict = self.crop_module.forward(self.model, images, target)
    logits = output_dict["logits"]
    loss, att_loss, cls_loss = self.crop_module.get_loss(output_dict=output_dict, target=target)
    return logits, loss, att_loss, cls_loss
```
```python
# resnet.py
if crop:
    # Return the tensors the crop module needs to build CAMs.
    return {'cam_weights': self.fc.weight[labels],
            'logits': logits, 'feature_map': x}
```
- We assume that your model instance is stored in the variable `self.model`.
- You might need to modify the above snippets to apply them to your code repository.
- In that case, you can refer to our implementation in `main.py` and `wsol/resnet.py`.