The models are summarized in the following table. Note that the performance reported is "raw", without any fine-tuning. For each dataset, we report the class-agnostic box AP@50, which measures how well the model finds the boxes mentioned in the text. All performances are reported on the respective validation sets of each dataset.
| | Backbone | GQA AP | Flickr AP | Flickr R@1 | Refcoco AP | Refcoco R@1 | Refcoco+ R@1 | Refcocog R@1 | Url | Size |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | R101 | 58.9 | 75.6 | 82.5 | 60.3 | 72.1 | 58.0 | 55.7 | model | 3GB |
| 2 | ENB3 | 59.5 | 76.6 | 82.9 | 57.6 | 70.2 | 56.7 | 53.8 | model | 2.4GB |
| 3 | ENB5 | 59.9 | 76.4 | 83.7 | 61.8 | 73.4 | 58.8 | 57.1 | model | 2.7GB |
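The checkpoints linked in the table are standard PyTorch files, so a quick way to verify a download is to load it and inspect its contents. Below is a minimal sketch; the filename is a placeholder for whatever you saved the checkpoint as, and the exact set of keys may differ between checkpoints:

```python
import torch

# Load a downloaded pretraining checkpoint on CPU (the filename is a placeholder).
ckpt = torch.load("pretrained_resnet101_checkpoint.pth", map_location="cpu")

# The checkpoint is a plain dict; its top-level keys typically include the model
# weights and, for models trained with --ema, the EMA weights as well.
print(list(ckpt.keys()))

# Count the parameter tensors in the model state dict, if present.
if "model" in ckpt:
    print(len(ckpt["model"]), "tensors in the model state dict")
```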
The config file for pretraining is `configs/pretrain.json` and looks like this:
{
"combine_datasets": ["flickr", "mixed"],
"combine_datasets_val": ["flickr", "gqa", "refexp"],
"coco_path": "",
"vg_img_path": "",
"flickr_img_path": "",
"refexp_ann_path": "mdetr_annotations/",
"flickr_ann_path": "mdetr_annotations/",
"gqa_ann_path": "mdetr_annotations/",
"num_queries": 100,
"refexp_dataset_name": "all",
"GT_type": "separate",
"flickr_dataset_path": ""
}
- Download the original Flickr30k image dataset from the Flickr30K webpage and update `flickr_img_path` to the folder containing the images.
- Download the original Flickr30k entities annotations from Flickr30k annotations and update `flickr_dataset_path` to the folder containing the annotations.
- Download the GQA images from GQA images and update `vg_img_path` to point to the folder containing the images.
- Download the COCO images from Coco train2014 and update `coco_path` to the folder containing the downloaded images.
- Download our pre-processed annotations, which are converted to COCO format (all datasets are in the same zip folder for MDETR annotations): Pre-processed annotations. Update `flickr_ann_path`, `gqa_ann_path` and `refexp_ann_path` to the folder with these pre-processed annotations (an example of the filled-in config is shown below).
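As a concrete illustration, once all datasets are downloaded the path entries of `configs/pretrain.json` would be filled in with the corresponding locations. The directories below are placeholders; substitute the paths where you actually stored the data:

```json
{
    "combine_datasets": ["flickr", "mixed"],
    "combine_datasets_val": ["flickr", "gqa", "refexp"],
    "coco_path": "/path/to/coco",
    "vg_img_path": "/path/to/gqa/images",
    "flickr_img_path": "/path/to/flickr30k/images",
    "refexp_ann_path": "/path/to/mdetr_annotations/",
    "flickr_ann_path": "/path/to/mdetr_annotations/",
    "gqa_ann_path": "/path/to/mdetr_annotations/",
    "num_queries": 100,
    "refexp_dataset_name": "all",
    "GT_type": "separate",
    "flickr_dataset_path": "/path/to/flickr30k/annotations"
}
```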
The following command reproduces the training of the ResNet-101 model:
python run_with_submitit.py --dataset_config configs/pretrain.json --ngpus 8 --nodes 4 --ema
To run on a single node with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --dataset_config configs/pretrain.json --ema
We provide an interface to the Timm library. Most stride-32 models that support the "features_only" mode should work out of the box, including ResNet variants as well as EfficientNets. Simply pass `--backbone timm_modelname`, where `modelname` is taken from the list of Timm models here.
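To browse the available model names and check that a candidate backbone supports the "features_only" mode, you can query timm directly. This snippet is only an illustration and not part of the training scripts; it assumes the timm package is installed, and uses the same backbone name as the command below (without the `timm_` prefix):

```python
import timm

# List candidate backbone names, e.g. all EfficientNet variants known to timm.
print(timm.list_models("*efficientnet*")[:10])

# Instantiate a backbone in "features_only" mode (no classification head) and
# inspect the channels and strides of the feature maps it exposes; the final
# stage should have stride 32, as noted above.
backbone = timm.create_model("tf_efficientnet_b3_ns", features_only=True, pretrained=False)
print(backbone.feature_info.channels())   # channels per feature stage
print(backbone.feature_info.reduction())  # strides per feature stage
```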
In the paper, we train an EfficientNet-B3 with Noisy Student pre-training as follows (note that the backbone learning rate is slightly different):
python run_with_submitit.py --dataset_config configs/pretrain.json --ngpus 8 --nodes 4 --ema --backbone timm_tf_efficientnet_b3_ns --lr_backbone 5e-5
To run on a single node with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --dataset_config configs/pretrain.json --ema --backbone timm_tf_efficientnet_b3_ns --lr_backbone 5e-5