Simple Image-level Classification Improves Open-vocabulary Object Detection (arXiv 2312.10439)
Our environment: Unbuntu18.04, cuda10.2, python3.8, pytorch1.12.0, detectron2.0.6
conda create -n cads python=3.8
conda activate cads
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=10.2 -c pytorch
git clone https://github.com/facebookresearch/detectron2.git
cd detectron2
pip install -e .
cd ..
git clone https://github.com/openai/CLIP.git
cd CLIP
pip install -e .
cd ..
pip install -r requirements.txt
Please follow prepare datasets
Download ImageNet-21K pretrained ResNet-50 from MIIL
convert the pretrained models into d2 style:
python tools/convert-thirdparty-pretrained-model-to-d2.py --path models/resnet50_miil_21k.pth
Download the pretrained BoxSup\ Detic for OV-LVIS and Detic for OV-COCO
Train on LVIS-Base
python train_net.py --resume --dist-url auto --num-gpus 4 \
--config configs/ovmlr_lvis_base.yaml OUTPUT_DIR training_dir/ovmlr_lvis_base
Train on LVIS-Base + ImageNet-21K
python train_net.py --resume --dist-url auto --num-gpus 4 \
--config configs/ovmlr_lvis_base_in21k.yaml OUTPUT_DIR training_dir/ovmlr_lvis_base_in21k
Train on COCO-Base + COCO Caption
python train_net.py --resume --dist-url auto --num-gpus 4 \
--config configs/ovmlr_coco_base_caption.yaml OUTPUT_DIR training_dir/ovmlr_coco_base_caption
Merge the weight of MLR model and pretrained OVOD models on OV-LVIS
# for Boxsup
python tools/merge_weights.py --weight_path_mlr training_dir/ovmlr_lvis_base/model_final.pth \
--weight_path_ovod models/BoxSup-C2_Lbase_CLIP_R5021k_640b64_4x.pth \
--output_path models/BoxSup-C2_Lbase_CLIP_R5021k_640b64_4x_mlr.pth
# for Detic
python tools/merge_weights.py --weight_path_mlr training_dir/ovmlr_lvis_base_in21k/model_final.pth \
--weight_path_ovod models/Detic_LbaseI_CLIP_R5021k_640b64_4x_ft4x_max-size.pth \
--output_path models/Detic_LbaseI_CLIP_R5021k_640b64_4x_ft4x_max-size_mlr.pth
Evaluation on OV-LVIS
python train_net.py --resume --eval-only --dist-url auto --num-gpus 4 \
--config configs/boxsup_cads_ovlvis.yaml OUTPUT_DIR training_dir/boxsup_lvis_base_cads \
MODEL.WEIGHTS models/BoxSup-C2_Lbase_CLIP_R5021k_640b64_4x_mlr.pth
python train_net.py --resume --eval-only --dist-url auto --num-gpus 4 \
--config configs/boxsup_cads_ovlvis.yaml OUTPUT_DIR training_dir/detic_lvis_cads \
MODEL.WEIGHTS models/Detic_LbaseI_CLIP_R5021k_640b64_4x_ft4x_max-size_mlr.pth
Merge the weight of MLR model and pretrained OVOD models on OV-COCO
python tools/merge_weights.py --weight_path_mlr training_dir/ovmlr_coco_base_caption/model_final.pth \
--weight_path_ovod models/Detic_OVCOCO_CLIP_R50_1x_max-size_caption.pth \
--output_path models/Detic_OVCOCO_CLIP_R50_1x_max-size_caption_mlr.pth
Evaluation on OV-COCO
python train_net.py --resume --eval-only --dist-url auto --num-gpus 4 \
--config configs/detic_cads_ovcoco.yaml OUTPUT_DIR training_dir/detic_ovcoco_cads \
MODEL.WEIGHTS models/Detic_OVCOCO_CLIP_R50_1x_max-size_caption_mlr.pth
Our code is mainly based on Detic and Detectron2.
@inproceedings{fang2023simple,
title={Simple Image-level Classification Improves Open-vocabulary Object Detection},
author={Fang, Ruohuan and Pang, Guansong and Bai, Xiao},
booktitle={The 38th Annual AAAI Conference on Artificial Intelligence},
year={2024}
}