This repository is a PyTorch implementation of *On Equivariant and Invariant Learning of Object Landmark Representations* by Zezhou Cheng, Jong-Chyi Su, and Subhransu Maji (ICCV 2021).
[arXiv] [Project page] [Poster] [Supplementary material]
The implementation is based on DVE [Thewlis et al., ICCV 2019] and CMC [Tian et al., 2019]. Dependencies: tensorboard-logger, pytorch=1.4.0, torchfile.
To install:
conda env create -f environment.yml
conda activate ContrastLandmark
- Please follow the instructions from DVE to download the datasets.
- iNaturalist Aves 2017 for training. [source images] [100K image list]
- CUB dataset for evaluation. [source images] [train/val/test set]
To train the MoCo models:
- CelebA
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_moco.py --batch_size 256 --num_workers 12 --nce_k 4096 --cosine --epochs 800 --model resnet50 --image_crop 20 --image_size 136 --model_name moco_CelebA --model_path /path/to/save/model --dataset CelebA --data_folder datasets/celeba
- iNat Aves
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_moco.py --batch_size 256 --num_workers 12 --nce_k 4096 --cosine --epochs 800 --model resnet50 --image_crop 0 --image_size 96 --model_name moco_InatAve --model_path /path/to/save/model --dataset InatAve --imagelist /path/to/imagelist/inat_train_100K.txt
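For reference, MoCo-style training optimizes the InfoNCE objective: each image's embedding should match its augmented view (the positive key) rather than any of the `--nce_k` negatives kept in a memory queue. A minimal sketch of that loss, assuming L2-normalized embeddings (the function name and tensor layout are illustrative, not the repo's code):

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, T=0.07):
    """q, k_pos: (B, D) L2-normalized embeddings; queue: (D, K) negative keys."""
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)   # (B, 1) positive logits
    l_neg = q @ queue                              # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / T  # temperature-scaled scores
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)         # the positive is class 0
```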
To train the feature projector on top of a pretrained MoCo model:
- CelebA
CUDA_VISIBLE_DEVICES=0,1 python train_feature_projector.py --model resnet50 --feat_distill --image_crop 20 --image_size 136 --train_layer 4 --val_layer 4 --trained_model_path /path/to/pretrained_moco --adam --epochs 10 --cosine --batch_size 32 --log_path /path/to/logfile.log --model_name feature_projector --model_path /path/to/save/checkpoint --train_use_hypercol --val_use_hypercol --vis_path /path/to/save/visualization --train_out_size 24 --val_out_size 96 --distill_mode softmax --kernel_size 1 --out_dim 128 --softargmax_mul 7. --temperature 7.
Note:
- `--train_layer 4 --val_layer 4 --train_use_hypercol --val_use_hypercol`: use hypercolumn representations (which consist of features from 4 intermediate layers) as the input to the feature projector; a sketch of the hypercolumn follows this note.
- To visualize the landmark matching, add `--visualize_matching --vis_path /path/to/save/visualization` to the above command.
- `--out_dim 128 --softargmax_mul 7. --temperature 7.`: project the hypercolumn to a 128-dimensional space. We use `--softargmax_mul 7. --temperature 7.` for `--out_dim 128` or `--out_dim 256`, and `--softargmax_mul 6.5 --temperature 8.` for `--out_dim 64`. These hyperparameters were searched on a validation set.
- We provide the training logs of the feature projector for ResNet50-half as a reference: [training logs]
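To make the hypercolumn concrete, here is a minimal sketch of the idea, assuming a torchvision ResNet-50 backbone: activations from the four residual stages are upsampled to a common resolution and concatenated along the channel dimension (the repo's feature extractor may differ in detail):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

backbone = resnet50().eval()

def hypercolumn(x, out_size=24):
    """Concatenate conv2_x..conv5_x activations, upsampled to a common size."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    feats = []
    for stage in (backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4):
        x = stage(x)
        feats.append(F.interpolate(x, size=(out_size, out_size),
                                   mode='bilinear', align_corners=False))
    return torch.cat(feats, dim=1)  # (B, 256+512+1024+2048, out_size, out_size)
```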
- Use hypercolumn as the representation (`--use_hypercol`) with activations from conv2_x to conv5_x (`--layer 4`)
CUDA_VISIBLE_DEVICES=0,1 python eval_face.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path /path/to/pretrainedMoCo --learning_rate 0.001 --weight_decay 0.0005 --adam --epochs 200 --cosine --batch_size 32 --log_path /path/to/logfile --dataset AFLW_MTFL --model_name AFLW_M_regressor --model_path /path/to/save/regressor --image_crop 20 --image_size 136 --use_hypercol
- When only limited annotations are available (e.g., only 50 annotated face images from AFLW_MTFL), we use thin-plate spline data augmentation (`--TPS_aug`); a simplified sketch of the consistent image/landmark warping follows the command below.
CUDA_VISIBLE_DEVICES=0 python eval_face.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path /path/to/pretrainedMoCo --learning_rate 0.01 --weight_decay 0.05 --adam --epochs 1000 --cosine --batch_size 32 --log_path /path/to/logfile --dataset AFLW_MTFL --model_name AFLW_M_regressor --model_path /path/to/save/regressor --image_crop 20 --image_size 136 --restrict_annos 50 --repeat --TPS_aug --use_hypercol
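The essential property of the TPS augmentation is that the image and its landmark annotations are warped consistently. The repo implements a true thin-plate spline; the sketch below substitutes a simpler random affine warp to show the mechanism (the function name and the `jitter` parameter are illustrative):

```python
import torch
import torch.nn.functional as F

def random_warp(img, kpts, jitter=0.1):
    """img: (B, 3, H, W); kpts: (B, K, 2) as normalized (x, y) in [-1, 1]."""
    B = img.size(0)
    # Random affine near the identity (a real TPS adds local deformations).
    theta = torch.eye(2, 3).repeat(B, 1, 1)
    theta[:, :, :2] += jitter * (torch.rand(B, 2, 2) * 2 - 1)
    theta[:, :, 2] += jitter * (torch.rand(B, 2) * 2 - 1)
    grid = F.affine_grid(theta, img.shape, align_corners=False)
    warped = F.grid_sample(img, grid, align_corners=False)
    # affine_grid maps output coords to input coords, so annotations move by
    # the inverse transform: p_out = A^{-1} (p_in - t).
    A, t = theta[:, :, :2], theta[:, :, 2:]
    kpts_out = (kpts - t.transpose(1, 2)) @ torch.inverse(A).transpose(1, 2)
    return warped, kpts_out
```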
Note: the number of GPUs used to train the linear regressor affects the convergence rate, likely because batch normalization is computed separately on each GPU. On 2 GPUs we stop training at epoch 120, 45, and 80 on the MAFL, AFLW, and 300W benchmarks respectively (determined from our initial results and kept fixed in our experiments). These stopping points may be suboptimal if you train the regressor on a different number of GPUs.
CUDA_VISIBLE_DEVICES=0,1 python eval_animal.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path /path/to/pretrainedMoCo --learning_rate 0.01 --weight_decay 0.005 --adam --epochs 2000 --cosine --batch_size 32 --log_path /path/to/logfile --dataset CUB --model_name CUB_regressor --model_path /path/to/save/regressor --image_crop 0 --image_size 96 --imagelist /path/to/trainlist/train.txt --use_hypercol
Note: check out `data_loaders_animal.py`; place the annotation files (train.dat, val.dat) and the train/val/test text files under `./datasets/CUB-200-2011`. Regarding hyperparameter settings on the bird benchmarks: if the number of annotations is at most 100 (e.g., 10, 50, 100), use lr=0.01 and weight decay=0.05 for ResNet18, ResNet50, and DVE; if more annotations are available (e.g., 250, 500, 1241), use lr=0.01 and weight decay=0.005 for ResNet18 and ResNet50, but lr=0.01 and weight decay=0.0005 for DVE (DVE performs much better with WD=0.0005 than with WD=0.05 or 0.005). A sketch of the linear readout trained in these evaluations follows.
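In both the face and bird evaluations only a linear readout is trained on frozen features: roughly, a 1×1 convolution maps the (hyper)column to one heatmap per landmark, and a spatial soft-argmax turns each heatmap into coordinates, trained with an L2 loss against the annotations. A minimal sketch under those assumptions (the class name, `in_dim` default, and `beta` are illustrative, not the repo's exact module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearLandmarkRegressor(nn.Module):
    """1x1 conv to K heatmaps on frozen features, then spatial soft-argmax."""
    def __init__(self, in_dim=3840, n_kpts=5, beta=7.0):
        super().__init__()
        self.to_heatmaps = nn.Conv2d(in_dim, n_kpts, kernel_size=1)
        self.beta = beta  # softmax sharpness (illustrative default)

    def forward(self, feats):
        B, _, H, W = feats.shape
        hm = self.to_heatmaps(feats)                         # (B, K, H, W)
        prob = F.softmax(self.beta * hm.flatten(2), dim=-1)  # (B, K, H*W)
        ys = torch.linspace(-1, 1, H, device=feats.device)
        xs = torch.linspace(-1, 1, W, device=feats.device)
        gy, gx = torch.meshgrid(ys, xs)
        grid = torch.stack([gx, gy], dim=-1).view(1, 1, -1, 2)
        return (prob.unsqueeze(-1) * grid).sum(dim=2)        # (B, K, 2) coords
```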
To evaluate landmark matching with a pretrained feature projector:
- CelebA
CUDA_VISIBLE_DEVICES=0,1 python train_feature_projector.py --model resnet50 --feat_distill --image_crop 20 --image_size 136 --train_layer 4 --val_layer 4 --trained_model_path /path/to/pretrained_moco --log_path /path/to/logfile.log --model_name feature_projector --model_path /path/to/save/tmpfile --train_use_hypercol --val_use_hypercol --train_out_size 24 --val_out_size 96 --distill_mode softmax --kernel_size 1 --out_dim 128 --softargmax_mul 7. --temperature 7. --evaluation_mode --trained_feat_model_path /path/to/pretrained-feature-projector --visualize_matching --vis_path /path/to/save/visualization
Note:
- Some arguments can be set to arbitrary strings, e.g. `--model_name feature_projector`.
- `--visualize_matching --vis_path /path/to/save/visualization`: visualize the landmark matching results; remove `--visualize_matching` to turn off the visualization.
- To test the performance of the hypercolumn without feature projection, remove `--feat_distill`.
- Modify `--out_dim 128 --softargmax_mul 7. --temperature 7.` accordingly when testing other feature projection dimensions (e.g., 64, 256): use `--softargmax_mul 7. --temperature 7.` for `--out_dim 256`, and `--softargmax_mul 6.5 --temperature 8.` for `--out_dim 64`.
- See examples on how to run the landmark matching with hypercolumn or projected features; a minimal sketch of the matching step follows this note.
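Matching itself is nearest-neighbour search in feature space: the descriptor at an annotated landmark in the source image is compared by cosine similarity against every spatial location of the target feature map. A minimal sketch (the function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def match_landmark(src_feats, tgt_feats, x, y):
    """src_feats, tgt_feats: (C, H, W) dense descriptors; (x, y): source pixel."""
    query = F.normalize(src_feats[:, y, x], dim=0)   # (C,) unit-norm descriptor
    keys = F.normalize(tgt_feats.flatten(1), dim=0)  # (C, H*W), one column per pixel
    sim = query @ keys                               # cosine similarity per location
    idx = sim.argmax().item()
    W = tgt_feats.shape[2]
    return idx % W, idx // W                         # best-matching (x, y) in target
```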
- Contrastively learned models:
- CelebA: [MoCo-ResNet18-CelebA] [MoCo-ResNet50-CelebA] [MoCo-ResNet50-CelebA-In-the-Wild]
- iNat Aves: [MoCo-ResNet18-iNat] [MoCo-ResNet50-iNat] [DVE-Hourglass-iNat]
- Linear regressors: [Face benchmarks] [Bird benchmarks]
Note: on the face benchmarks, the numbers in Table 1 of the main text are reported at epochs 120, 45, and 80 for MAFL, AFLW, and 300W. These epochs are indexed from 0, but the index started from 1 when we saved the models, so the saved models give slightly different scores from those in Table 1 (either slightly better or slightly worse).
- Pretrained feature projectors: [Feature projectors]
The feature projectors are trained for different network architectures (e.g., ResNet18, ResNet50, ResNet50-half) and pretraining methods (e.g., MoCo, ImageNet, random init). These settings correspond to Tables 4 and 5 in the supplementary material.
After downloading the pretrained models, run the following commands to evaluate and visualize them.
pretrained_MOCO_FACE=./Pretrained/ckpt_epoch_800_resnet50_celeba.pth
pretrained_MOCO_inat=./Pretrained/ckpt_epoch_800_resnet50_inat.pth
pretrained_AFLW_R=./Pretrained/ckpt_epoch_45_resnet50_AFLW_R.pth
pretrained_CUB=./Pretrained/CUB_resnet50.pth
visdir=./Visualization
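If you want to inspect a downloaded checkpoint outside the provided scripts, something like the following usually works for MoCo-style checkpoints; note that the 'model' key and the 'module.encoder_q.' prefix are assumptions about how the weights were saved, so check them against the actual file:

```python
import torch
from torchvision.models import resnet50

ckpt = torch.load('./Pretrained/ckpt_epoch_800_resnet50_celeba.pth', map_location='cpu')
state = ckpt.get('model', ckpt)  # assumption: weights may be nested under 'model'
# Assumption: keep only query-encoder weights and strip their prefix.
state = {k.replace('module.encoder_q.', ''): v for k, v in state.items()
         if k.startswith('module.encoder_q.')}
model = resnet50()
missing, unexpected = model.load_state_dict(state, strict=False)
print(f'{len(missing)} missing / {len(unexpected)} unexpected keys')
```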
- Face benchmarks (visualize predicted keypoints):
CUDA_VISIBLE_DEVICES=0 python vis_face.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path $pretrained_MOCO_FACE --batch_size 32 --log_path $log_file --dataset AFLW --image_crop 20 --image_size 136 --ckpt_path $pretrained_AFLW_R --vis_path $visdir --use_hypercol --vis_keypoints
- Bird benchmarks (visualize predicted keypoints):
CUDA_VISIBLE_DEVICES=0 python vis_animal.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path $pretrained_MOCO_inat --batch_size 32 --log_path $log_file --dataset CUB --image_crop 0 --image_size 96 --ckpt_path $pretrained_CUB --vis_path $visdir --use_hypercol --vis_keypoints
- Face benchmarks (visualize feature PCA projections):
CUDA_VISIBLE_DEVICES=0 python vis_face.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path $pretrained_MOCO_FACE --batch_size 32 --log_path $log_file --dataset MAFLAligned --image_crop 20 --image_size 136 --vis_path $visdir --use_hypercol --vis_PCA
- Bird benchmarks (visualize feature PCA projections):
CUDA_VISIBLE_DEVICES=0 python vis_animal.py --model resnet50 --num_workers 8 --layer 4 --trained_model_path $pretrained_MOCO_inat --batch_size 32 --log_path $log_file --dataset CUB --image_crop 0 --image_size 96 --vis_path $visdir --use_hypercol --vis_PCA
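`--vis_PCA` renders dense features as color: each spatial descriptor is projected onto its first three principal components, which are then normalized to [0, 1] and displayed as RGB. A minimal sketch of that idea using torch.pca_lowrank (the repo's implementation may differ):

```python
import torch

def pca_rgb(feats):
    """feats: (C, H, W) dense features -> (H, W, 3) pseudo-color image."""
    C, H, W = feats.shape
    flat = feats.flatten(1).t()                   # (H*W, C), one row per pixel
    flat = flat - flat.mean(dim=0, keepdim=True)  # center before PCA
    _, _, V = torch.pca_lowrank(flat, q=3)        # top-3 principal directions
    proj = flat @ V                               # (H*W, 3) projected features
    lo, hi = proj.min(dim=0).values, proj.max(dim=0).values
    proj = (proj - lo) / (hi - lo + 1e-8)         # normalize each channel to [0, 1]
    return proj.view(H, W, 3)
```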
If you use this code for your research, please cite the following paper.
@inproceedings{cheng2021equivariant,
title={On Equivariant and Invariant Learning of Object Landmark Representations},
author={Cheng, Zezhou and Su, Jong-Chyi and Maji, Subhransu},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={9897--9906},
year={2021}
}