diff --git a/README.md b/README.md
index 09e20cf70fe..15f71dad5fb 100644
--- a/README.md
+++ b/README.md
@@ -103,50 +103,16 @@ Apart from MMDetection, we also released [MMEngine](https://github.com/open-mmla
### Highlight
-**v3.2.0** was released in 12/10/2023:
+**v3.3.0** was released in 5/1/2024:
-**1. Detection Transformer SOTA Model Collection**
-(1) Supported four updated and stronger SOTA Transformer models: [DDQ](configs/ddq/README.md), [CO-DETR](projects/CO-DETR/README.md), [AlignDETR](projects/AlignDETR/README.md), and [H-DINO](projects/HDINO/README.md).
-(2) Based on CO-DETR, MMDet released a model with a COCO performance of 64.1 mAP.
-(3) Algorithms such as DINO support `AMP/Checkpoint/FrozenBN`, which can effectively reduce memory usage.
+**[MM-Grounding-DINO: An Open and Comprehensive Pipeline for Unified Object Grounding and Detection](https://arxiv.org/abs/2401.02361)**
-**2. [Comprehensive Performance Comparison between CNN and Transformer](<(projects/RF100-Benchmark/README.md)>)**
-RF100 consists of a dataset collection of 100 real-world datasets, including 7 domains. It can be used to assess the performance differences of Transformer models like DINO and CNN-based algorithms under different scenarios and data volumes. Users can utilize this benchmark to quickly evaluate the robustness of their algorithms in various scenarios.
+Grounding DINO is a grounding pre-training model that unifies 2d open vocabulary object detection and phrase grounding, with wide applications. However, its training part has not been open sourced. Therefore, we propose MM-Grounding-DINO, which not only serves as an open source replication version of Grounding DINO, but also achieves significant performance improvement based on reconstructed data types, exploring different dataset combinations and initialization strategies. Moreover, we conduct evaluations from multiple dimensions, including OOD, REC, Phrase Grounding, OVD, and Fine-tune, to fully excavate the advantages and disadvantages of Grounding pre-training, hoping to provide inspiration for future work.
-
-
-
-
-**3. Support for [GLIP](configs/glip/README.md) and [Grounding DINO](configs/grounding_dino/README.md) fine-tuning, the only algorithm library that supports Grounding DINO fine-tuning**
-The Grounding DINO algorithm in MMDet is the only library that supports fine-tuning. Its performance is one point higher than the official version, and of course, GLIP also outperforms the official version.
-We also provide a detailed process for training and evaluating Grounding DINO on custom datasets. Everyone is welcome to give it a try.
-
-| Model | Backbone | Style | COCO mAP | Official COCO mAP |
-| :----------------: | :------: | :-------: | :--------: | :---------------: |
-| Grounding DINO-T | Swin-T | Zero-shot | 48.5 | 48.4 |
-| Grounding DINO-T | Swin-T | Finetune | 58.1(+0.9) | 57.2 |
-| Grounding DINO-B | Swin-B | Zero-shot | 56.9 | 56.7 |
-| Grounding DINO-B | Swin-B | Finetune | 59.7 | |
-| Grounding DINO-R50 | R50 | Scratch | 48.9(+0.8) | 48.1 |
-
-**4. Support for the open-vocabulary detection algorithm [Detic](projects/Detic_new/README.md) and multi-dataset joint training.**
-**5. Training detection models using [FSDP and DeepSpeed](<(projects/example_largemodel/README.md)>).**
-
-| ID | AMP | GC of Backbone | GC of Encoder | FSDP | Peak Mem (GB) | Iter Time (s) |
-| :-: | :-: | :------------: | :-----------: | :--: | :-----------: | :-----------: |
-| 1 | | | | | 49 (A100) | 0.9 |
-| 2 | √ | | | | 39 (A100) | 1.2 |
-| 3 | | √ | | | 33 (A100) | 1.1 |
-| 4 | √ | √ | | | 25 (A100) | 1.3 |
-| 5 | | √ | √ | | 18 | 2.2 |
-| 6 | √ | √ | √ | | 13 | 1.6 |
-| 7 | | √ | √ | √ | 14 | 2.9 |
-| 8 | √ | √ | √ | √ | 8.5 | 2.4 |
-
-**6. Support for the [V3Det](configs/v3det/README.md) dataset, a large-scale detection dataset with over 13,000 categories.**
+code: [mm_grounding_dino/README.md](configs/mm_grounding_dino/README.md)
-
+
We are excited to announce our latest work on real-time object recognition tasks, **RTMDet**, a family of fully convolutional single-stage detectors. RTMDet not only achieves the best parameter-accuracy trade-off on object detection from tiny to extra-large model sizes but also obtains new state-of-the-art performance on instance segmentation and rotated object detection tasks. Details can be found in the [technical report](https://arxiv.org/abs/2212.07784). Pre-trained models are [here](configs/rtmdet).
diff --git a/README_zh-CN.md b/README_zh-CN.md
index ccf1cbf0082..885d1f22617 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -102,51 +102,18 @@ MMDetection 是一个基于 PyTorch 的目标检测开源工具箱。它是 [Ope
### 亮点
-**v3.2.0** 版本已经在 2023.10.12 发布:
+**v3.3.0** 版本已经在 2024.1.5 发布:
-**1. 检测 Transformer SOTA 模型大合集**
-(1) 支持了 [DDQ](configs/ddq/README.md)、[CO-DETR](projects/CO-DETR/README.md)、[AlignDETR](projects/AlignDETR/README.md) 和 [H-DINO](projects/HDINO/README.md) 4 个更新更强的 SOTA Transformer 模型
-(2) 基于 CO-DETR, MMDet 中发布了 COCO 性能为 64.1 mAP 的模型
-(3) DINO 等算法支持 AMP/Checkpoint/FrozenBN,可以有效降低显存
+**MM-Grounding-DINO: 轻松涨点,数据到评测全面开源**
-**2. [提供了全面的 CNN 和 Transformer 的性能对比](projects/RF100-Benchmark/README_zh-CN.md)**
-RF100 是由 100 个现实收集的数据集组成,包括 7 个域,可以验证 DINO 等 Transformer 模型和 CNN 类算法在不同场景不同数据量下的性能差异。用户可以用这个 Benchmark 快速验证自己的算法在不同场景下的鲁棒性。
+Grounding DINO 是一个统一了 2d 开放词汇目标检测和 Phrase Grounding 的检测预训练模型,应用广泛,但是其训练部分并未开源,为此提出了 MM-Grounding-DINO。其不仅作为 Grounding DINO 的开源复现版,MM-Grounding-DINO 基于重新构建的数据类型出发,在探索了不同数据集组合和初始化策略基础上实现了 Grounding DINO 的性能极大提升,并且从多个维度包括 OOD、REC、Phrase Grounding、OVD 和 Finetune 等方面进行评测,充分挖掘 Grounding 预训练优缺点,希望能为后续工作提供启发。
-
-
-
-
-**3. 支持了 [GLIP](configs/glip/README.md) 和 [Grounding DINO](configs/grounding_dino/README.md) 微调,全网唯一支持 Grounding DINO 微调**
-MMDet 中的 Grounding DINO 是全网唯一支持微调的算法库,且性能高于官方 1 个点,当然 GLIP 也比官方高。
-我们还提供了详细的 Grounding DINO 在自定义数据集上训练评估的流程,欢迎大家试用。
-
-| Model | Backbone | Style | COCO mAP | Official COCO mAP |
-| :----------------: | :------: | :-------: | :--------: | :---------------: |
-| Grounding DINO-T | Swin-T | Zero-shot | 48.5 | 48.4 |
-| Grounding DINO-T | Swin-T | Finetune | 58.1(+0.9) | 57.2 |
-| Grounding DINO-B | Swin-B | Zero-shot | 56.9 | 56.7 |
-| Grounding DINO-B | Swin-B | Finetune | 59.7 | |
-| Grounding DINO-R50 | R50 | Scratch | 48.9(+0.8) | 48.1 |
-
-**4. 支持开放词汇检测算法 [Detic](projects/Detic_new/README.md) 并提供多数据集联合训练可能**
-
-**5. 轻松使用 [FSDP 和 DeepSpeed 训练检测模型](projects/example_largemodel/README_zh-CN.md)**
-
-| ID | AMP | GC of Backbone | GC of Encoder | FSDP | Peak Mem (GB) | Iter Time (s) |
-| :-: | :-: | :------------: | :-----------: | :--: | :-----------: | :-----------: |
-| 1 | | | | | 49 (A100) | 0.9 |
-| 2 | √ | | | | 39 (A100) | 1.2 |
-| 3 | | √ | | | 33 (A100) | 1.1 |
-| 4 | √ | √ | | | 25 (A100) | 1.3 |
-| 5 | | √ | √ | | 18 | 2.2 |
-| 6 | √ | √ | √ | | 13 | 1.6 |
-| 7 | | √ | √ | √ | 14 | 2.9 |
-| 8 | √ | √ | √ | √ | 8.5 | 2.4 |
+arxiv 技术报告:https://arxiv.org/abs/2401.02361
-**6. 支持了 [V3Det](configs/v3det/README.md) 1.3w+ 类别的超大词汇检测数据集**
+代码地址: [mm_grounding_dino/README.md](configs/mm_grounding_dino/README.md)
-
+
我们很高兴向大家介绍我们在实时目标识别任务方面的最新成果 RTMDet,包含了一系列的全卷积单阶段检测模型。 RTMDet 不仅在从 tiny 到 extra-large 尺寸的目标检测模型上实现了最佳的参数量和精度的平衡,而且在实时实例分割和旋转目标检测任务上取得了最先进的成果。 更多细节请参阅[技术报告](https://arxiv.org/abs/2212.07784)。 预训练模型可以在[这里](configs/rtmdet)找到。
diff --git a/configs/faster_rcnn/README.md b/configs/faster_rcnn/README.md
index 0d9912db29d..8bcdcf6d512 100644
--- a/configs/faster_rcnn/README.md
+++ b/configs/faster_rcnn/README.md
@@ -14,50 +14,50 @@ State-of-the-art object detection networks depend on region proposal algorithms
## Results and Models
-| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | Config | Download |
-| :-------------: | :-----: | :-----: | :------: | :------------: | :----: | :-----------------------------------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
-| R-50-C4 | caffe | 1x | - | - | 35.6 | [config](./faster-rcnn_r50-caffe_c4-1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_c4_1x_coco/faster_rcnn_r50_caffe_c4_1x_coco_20220316_150152-3f885b85.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_c4_1x_coco/faster_rcnn_r50_caffe_c4_1x_coco_20220316_150152.log.json) |
-| R-50-DC5 | caffe | 1x | - | - | 37.2 | [config](./faster-rcnn_r50-caffe-dc5_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_1x_coco/faster_rcnn_r50_caffe_dc5_1x_coco_20201030_151909-531f0f43.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_1x_coco/faster_rcnn_r50_caffe_dc5_1x_coco_20201030_151909.log.json) |
-| R-50-FPN | caffe | 1x | 3.8 | | 37.8 | [config](./faster-rcnn_r50-caffe_fpn_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_fpn_1x_coco/faster_rcnn_r50_caffe_fpn_1x_coco_bbox_mAP-0.378_20200504_180032-c5925ee5.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_fpn_1x_coco/faster_rcnn_r50_caffe_fpn_1x_coco_20200504_180032.log.json) |
-| R-50-FPN | pytorch | 1x | 4.0 | 21.4 | 37.4 | [config](./faster-rcnn_r50_fpn_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130_204655.log.json) |
-| R-50-FPN (FP16) | pytorch | 1x | 3.4 | 28.8 | 37.5 | [config](./faster-rcnn_r50_fpn_amp-1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/fp16/faster_rcnn_r50_fpn_fp16_1x_coco/faster_rcnn_r50_fpn_fp16_1x_coco_20200204-d4dc1471.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/fp16/faster_rcnn_r50_fpn_fp16_1x_coco/faster_rcnn_r50_fpn_fp16_1x_coco_20200204_143530.log.json) |
-| R-50-FPN | pytorch | 2x | - | - | 38.4 | [config](./faster-rcnn_r50_fpn_2x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_2x_coco/faster_rcnn_r50_fpn_2x_coco_bbox_mAP-0.384_20200504_210434-a5d8aa15.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_2x_coco/faster_rcnn_r50_fpn_2x_coco_20200504_210434.log.json) |
-| R-101-FPN | caffe | 1x | 5.7 | | 39.8 | [config](./faster-rcnn_r101-caffe_fpn_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_caffe_fpn_1x_coco/faster_rcnn_r101_caffe_fpn_1x_coco_bbox_mAP-0.398_20200504_180057-b269e9dd.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_caffe_fpn_1x_coco/faster_rcnn_r101_caffe_fpn_1x_coco_20200504_180057.log.json) |
-| R-101-FPN | pytorch | 1x | 6.0 | 15.6 | 39.4 | [config](./faster-rcnn_r101_fpn_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_1x_coco/faster_rcnn_r101_fpn_1x_coco_20200130-f513f705.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_1x_coco/faster_rcnn_r101_fpn_1x_coco_20200130_204655.log.json) |
-| R-101-FPN | pytorch | 2x | - | - | 39.8 | [config](./faster-rcnn_r101_fpn_2x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_2x_coco/faster_rcnn_r101_fpn_2x_coco_bbox_mAP-0.398_20200504_210455-1d2dac9c.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_2x_coco/faster_rcnn_r101_fpn_2x_coco_20200504_210455.log.json) |
-| X-101-32x4d-FPN | pytorch | 1x | 7.2 | 13.8 | 41.2 | [config](./faster-rcnn_x101-32x4d_fpn_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x4d_fpn_1x_coco/faster_rcnn_x101_32x4d_fpn_1x_coco_20200203-cff10310.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x4d_fpn_1x_coco/faster_rcnn_x101_32x4d_fpn_1x_coco_20200203_000520.log.json) |
-| X-101-32x4d-FPN | pytorch | 2x | - | - | 41.2 | [config](./faster-rcnn_x101-32x4d_fpn_2x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x4d_fpn_2x_coco/faster_rcnn_x101_32x4d_fpn_2x_coco_bbox_mAP-0.412_20200506_041400-64a12c0b.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x4d_fpn_2x_coco/faster_rcnn_x101_32x4d_fpn_2x_coco_20200506_041400.log.json) |
-| X-101-64x4d-FPN | pytorch | 1x | 10.3 | 9.4 | 42.1 | [config](./faster-rcnn_x101-64x4d_fpn_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_64x4d_fpn_1x_coco/faster_rcnn_x101_64x4d_fpn_1x_coco_20200204-833ee192.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_64x4d_fpn_1x_coco/faster_rcnn_x101_64x4d_fpn_1x_coco_20200204_134340.log.json) |
-| X-101-64x4d-FPN | pytorch | 2x | - | - | 41.6 | [config](./faster-rcnn_x101-64x4d_fpn_2x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_64x4d_fpn_2x_coco/faster_rcnn_x101_64x4d_fpn_2x_coco_20200512_161033-5961fa95.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_64x4d_fpn_2x_coco/faster_rcnn_x101_64x4d_fpn_2x_coco_20200512_161033.log.json) |
+| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | Config | Download |
+| :-------------: | :-----: | :-----: | :------: | :------------: | :----: | :-----------------------------------------------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| R-50-C4 | caffe | 1x | - | - | 35.6 | [config](./faster-rcnn_r50-caffe_c4-1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r50-caffe-c4_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_c4_1x_coco/faster_rcnn_r50_caffe_c4_1x_coco_20220316_150152.log.json) |
+| R-50-DC5 | caffe | 1x | - | - | 37.2 | [config](./faster-rcnn_r50-caffe-dc5_1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r50-caffe-dc5_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_1x_coco/faster_rcnn_r50_caffe_dc5_1x_coco_20201030_151909.log.json) |
+| R-50-FPN | caffe | 1x | 3.8 | | 37.8 | [config](./faster-rcnn_r50-caffe_fpn_1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r50-caffe_fpn_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_fpn_1x_coco/faster_rcnn_r50_caffe_fpn_1x_coco_20200504_180032.log.json) |
+| R-50-FPN | pytorch | 1x | 4.0 | 21.4 | 37.4 | [config](./faster-rcnn_r50_fpn_1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r50_fpn_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130_204655.log.json) |
+| R-50-FPN (FP16) | pytorch | 1x | 3.4 | 28.8 | 37.5 | [config](./faster-rcnn_r50_fpn_amp-1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/fp16/faster_rcnn_r50_fpn_fp16_1x_coco/faster_rcnn_r50_fpn_fp16_1x_coco_20200204-d4dc1471.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/fp16/faster_rcnn_r50_fpn_fp16_1x_coco/faster_rcnn_r50_fpn_fp16_1x_coco_20200204_143530.log.json) |
+| R-50-FPN | pytorch | 2x | - | - | 38.4 | [config](./faster-rcnn_r50_fpn_2x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r50_fpn_2x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_2x_coco/faster_rcnn_r50_fpn_2x_coco_20200504_210434.log.json) |
+| R-101-FPN | caffe | 1x | 5.7 | | 39.8 | [config](./faster-rcnn_r101-caffe_fpn_1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r101-caffe_fpn_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_caffe_fpn_1x_coco/faster_rcnn_r101_caffe_fpn_1x_coco_20200504_180057.log.json) |
+| R-101-FPN | pytorch | 1x | 6.0 | 15.6 | 39.4 | [config](./faster-rcnn_r101_fpn_1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r101_fpn_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_1x_coco/faster_rcnn_r101_fpn_1x_coco_20200130_204655.log.json) |
+| R-101-FPN | pytorch | 2x | - | - | 39.8 | [config](./faster-rcnn_r101_fpn_2x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r101_fpn_2x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_2x_coco/faster_rcnn_r101_fpn_2x_coco_20200504_210455.log.json) |
+| X-101-32x4d-FPN | pytorch | 1x | 7.2 | 13.8 | 41.2 | [config](./faster-rcnn_x101-32x4d_fpn_1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_x101-32x4d_fpn_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x4d_fpn_1x_coco/faster_rcnn_x101_32x4d_fpn_1x_coco_20200203_000520.log.json) |
+| X-101-32x4d-FPN | pytorch | 2x | - | - | 41.2 | [config](./faster-rcnn_x101-32x4d_fpn_2x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_x101-32x4d_fpn_2x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x4d_fpn_2x_coco/faster_rcnn_x101_32x4d_fpn_2x_coco_20200506_041400.log.json) |
+| X-101-64x4d-FPN | pytorch | 1x | 10.3 | 9.4 | 42.1 | [config](./faster-rcnn_x101-64x4d_fpn_1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_x101-64x4d_fpn_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_64x4d_fpn_1x_coco/faster_rcnn_x101_64x4d_fpn_1x_coco_20200204_134340.log.json) |
+| X-101-64x4d-FPN | pytorch | 2x | - | - | 41.6 | [config](./faster-rcnn_x101-64x4d_fpn_2x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_x101-64x4d_fpn_2x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_64x4d_fpn_2x_coco/faster_rcnn_x101_64x4d_fpn_2x_coco_20200512_161033.log.json) |
## Different regression loss
We trained with R-50-FPN pytorch style backbone for 1x schedule.
-| Backbone | Loss type | Mem (GB) | Inf time (fps) | box AP | Config | Download |
-| :------: | :------------: | :------: | :------------: | :----: | :----------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
-| R-50-FPN | L1Loss | 4.0 | 21.4 | 37.4 | [config](./faster-rcnn_r50_fpn_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130_204655.log.json) |
-| R-50-FPN | IoULoss | | | 37.9 | [config](./faster-rcnn_r50_fpn_iou_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_iou_1x_coco/faster_rcnn_r50_fpn_iou_1x_coco_20200506_095954-938e81f0.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_iou_1x_coco/faster_rcnn_r50_fpn_iou_1x_coco_20200506_095954.log.json) |
-| R-50-FPN | GIoULoss | | | 37.6 | [config](./faster-rcnn_r50_fpn_giou_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_giou_1x_coco-0eada910.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_giou_1x_coco_20200505_161120.log.json) |
-| R-50-FPN | BoundedIoULoss | | | 37.4 | [config](./faster-rcnn_r50_fpn_bounded-iou_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_bounded_iou_1x_coco-98ad993b.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_bounded_iou_1x_coco_20200505_160738.log.json) |
+| Backbone | Loss type | Mem (GB) | Inf time (fps) | box AP | Config | Download |
+| :------: | :------------: | :------: | :------------: | :----: | :----------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| R-50-FPN | L1Loss | 4.0 | 21.4 | 37.4 | [config](./faster-rcnn_r50_fpn_1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r50_fpn_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130_204655.log.json) |
+| R-50-FPN | IoULoss | | | 37.9 | [config](./faster-rcnn_r50_fpn_iou_1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r50_fpn_iou_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_iou_1x_coco/faster_rcnn_r50_fpn_iou_1x_coco_20200506_095954.log.json) |
+| R-50-FPN | GIoULoss | | | 37.6 | [config](./faster-rcnn_r50_fpn_giou_1x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r50_fpn_giou_1x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_giou_1x_coco_20200505_161120.log.json) |
+| R-50-FPN | BoundedIoULoss | | | 37.4 | [config](./faster-rcnn_r50_fpn_bounded-iou_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_bounded_iou_1x_coco-98ad993b.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_bounded_iou_1x_coco_20200505_160738.log.json) |
## Pre-trained Models
We also train some models with longer schedules and multi-scale training. The users could finetune them for downstream tasks.
-| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | Config | Download |
-| :-----------------------------------------------------------: | :-----: | :-----: | :------: | :------------: | :----: | :--------------------------------------------------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
-| [R-50-C4](./faster-rcnn_r50-caffe-c4_ms-1x_coco.py) | caffe | 1x | - | | 35.9 | [config](./faster-rcnn_r50-caffe-c4_ms-1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_c4_mstrain_1x_coco/faster_rcnn_r50_caffe_c4_mstrain_1x_coco_20220316_150527-db276fed.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_c4_mstrain_1x_coco/faster_rcnn_r50_caffe_c4_mstrain_1x_coco_20220316_150527.log.json) |
-| [R-50-DC5](./faster-rcnn_r50-caffe-dc5_ms-1x_coco.py) | caffe | 1x | - | | 37.4 | [config](./faster-rcnn_r50-caffe-dc5_ms-1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_mstrain_1x_coco/faster_rcnn_r50_caffe_dc5_mstrain_1x_coco_20201028_233851-b33d21b9.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_mstrain_1x_coco/faster_rcnn_r50_caffe_dc5_mstrain_1x_coco_20201028_233851.log.json) |
-| [R-50-DC5](./faster-rcnn_r50-caffe-dc5_ms-3x_coco.py) | caffe | 3x | - | | 38.7 | [config](./faster-rcnn_r50-caffe-dc5_ms-3x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_mstrain_3x_coco/faster_rcnn_r50_caffe_dc5_mstrain_3x_coco_20201028_002107-34a53b2c.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_mstrain_3x_coco/faster_rcnn_r50_caffe_dc5_mstrain_3x_coco_20201028_002107.log.json) |
-| [R-50-FPN](./faster-rcnn_r50-caffe_fpn_ms-2x_coco.py) | caffe | 2x | 3.7 | | 39.7 | [config](./faster-rcnn_r50-caffe_fpn_ms-2x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_fpn_mstrain_2x_coco/faster_rcnn_r50_caffe_fpn_mstrain_2x_coco_bbox_mAP-0.397_20200504_231813-10b2de58.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_fpn_mstrain_2x_coco/faster_rcnn_r50_caffe_fpn_mstrain_2x_coco_20200504_231813.log.json) |
-| [R-50-FPN](./faster-rcnn_r50-caffe_fpn_ms-3x_coco.py) | caffe | 3x | 3.7 | | 39.9 | [config](./faster-rcnn_r50-caffe_fpn_ms-3x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_fpn_mstrain_3x_coco/faster_rcnn_r50_caffe_fpn_mstrain_3x_coco_20210526_095054-1f77628b.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_fpn_mstrain_3x_coco/faster_rcnn_r50_caffe_fpn_mstrain_3x_coco_20210526_095054.log.json) |
-| [R-50-FPN](./faster-rcnn_r50_fpn_ms-3x_coco.py) | pytorch | 3x | 3.9 | | 40.3 | [config](./faster-rcnn_r50_fpn_ms-3x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_mstrain_3x_coco/faster_rcnn_r50_fpn_mstrain_3x_coco_20210524_110822-e10bd31c.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_mstrain_3x_coco/faster_rcnn_r50_fpn_mstrain_3x_coco_20210524_110822.log.json) |
-| [R-101-FPN](./faster-rcnn_r101-caffe_fpn_ms-3x_coco.py) | caffe | 3x | 5.6 | | 42.0 | [config](./faster-rcnn_r101-caffe_fpn_ms-3x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_caffe_fpn_mstrain_3x_coco/faster_rcnn_r101_caffe_fpn_mstrain_3x_coco_20210526_095742-a7ae426d.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_caffe_fpn_mstrain_3x_coco/faster_rcnn_r101_caffe_fpn_mstrain_3x_coco_20210526_095742.log.json) |
-| [R-101-FPN](./faster-rcnn_r101_fpn_ms-3x_coco.py) | pytorch | 3x | 5.8 | | 41.8 | [config](./faster-rcnn_r101_fpn_ms-3x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_mstrain_3x_coco/faster_rcnn_r101_fpn_mstrain_3x_coco_20210524_110822-4d4d2ca8.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_mstrain_3x_coco/faster_rcnn_r101_fpn_mstrain_3x_coco_20210524_110822.log.json) |
-| [X-101-32x4d-FPN](./faster-rcnn_x101-32x4d_fpn_ms-3x_coco.py) | pytorch | 3x | 7.0 | | 42.5 | [config](./faster-rcnn_x101-32x4d_fpn_ms-3x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x4d_fpn_mstrain_3x_coco/faster_rcnn_x101_32x4d_fpn_mstrain_3x_coco_20210524_124151-16b9b260.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x4d_fpn_mstrain_3x_coco/faster_rcnn_x101_32x4d_fpn_mstrain_3x_coco_20210524_124151.log.json) |
-| [X-101-32x8d-FPN](./faster-rcnn_x101-32x8d_fpn_ms-3x_coco.py) | pytorch | 3x | 10.1 | | 42.4 | [config](./faster-rcnn_x101-32x8d_fpn_ms-3x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x8d_fpn_mstrain_3x_coco/faster_rcnn_x101_32x8d_fpn_mstrain_3x_coco_20210604_182954-002e082a.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x8d_fpn_mstrain_3x_coco/faster_rcnn_x101_32x8d_fpn_mstrain_3x_coco_20210604_182954.log.json) |
-| [X-101-64x4d-FPN](./faster-rcnn_x101-64x4d_fpn_ms-3x_coco.py) | pytorch | 3x | 10.0 | | 43.1 | [config](./faster-rcnn_x101-64x4d_fpn_ms-3x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_64x4d_fpn_mstrain_3x_coco/faster_rcnn_x101_64x4d_fpn_mstrain_3x_coco_20210524_124528-26c63de6.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_64x4d_fpn_mstrain_3x_coco/faster_rcnn_x101_64x4d_fpn_mstrain_3x_coco_20210524_124528.log.json) |
+| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | Config | Download |
+| :-----------------------------------------------------------: | :-----: | :-----: | :------: | :------------: | :----: | :--------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| [R-50-C4](./faster-rcnn_r50-caffe-c4_ms-1x_coco.py) | caffe | 1x | - | | 35.9 | [config](./faster-rcnn_r50-caffe-c4_ms-1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_c4_mstrain_1x_coco/faster_rcnn_r50_caffe_c4_mstrain_1x_coco_20220316_150527-db276fed.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_c4_mstrain_1x_coco/faster_rcnn_r50_caffe_c4_mstrain_1x_coco_20220316_150527.log.json) |
+| [R-50-DC5](./faster-rcnn_r50-caffe-dc5_ms-1x_coco.py) | caffe | 1x | - | | 37.4 | [config](./faster-rcnn_r50-caffe-dc5_ms-1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_mstrain_1x_coco/faster_rcnn_r50_caffe_dc5_mstrain_1x_coco_20201028_233851-b33d21b9.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_mstrain_1x_coco/faster_rcnn_r50_caffe_dc5_mstrain_1x_coco_20201028_233851.log.json) |
+| [R-50-DC5](./faster-rcnn_r50-caffe-dc5_ms-3x_coco.py) | caffe | 3x | - | | 38.7 | [config](./faster-rcnn_r50-caffe-dc5_ms-3x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_mstrain_3x_coco/faster_rcnn_r50_caffe_dc5_mstrain_3x_coco_20201028_002107-34a53b2c.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_dc5_mstrain_3x_coco/faster_rcnn_r50_caffe_dc5_mstrain_3x_coco_20201028_002107.log.json) |
+| [R-50-FPN](./faster-rcnn_r50-caffe_fpn_ms-2x_coco.py) | caffe | 2x | 3.7 | | 39.7 | [config](./faster-rcnn_r50-caffe_fpn_ms-2x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r50-caffe_fpn_ms-2x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_fpn_mstrain_2x_coco/faster_rcnn_r50_caffe_fpn_mstrain_2x_coco_20200504_231813.log.json) |
+| [R-50-FPN](./faster-rcnn_r50-caffe_fpn_ms-3x_coco.py) | caffe | 3x | 3.7 | | 39.9 | [config](./faster-rcnn_r50-caffe_fpn_ms-3x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r50-caffe_fpn_ms-3x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_caffe_fpn_mstrain_3x_coco/faster_rcnn_r50_caffe_fpn_mstrain_3x_coco_20210526_095054.log.json) |
+| [R-50-FPN](./faster-rcnn_r50_fpn_ms-3x_coco.py) | pytorch | 3x | 3.9 | | 40.3 | [config](./faster-rcnn_r50_fpn_ms-3x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_mstrain_3x_coco/faster_rcnn_r50_fpn_mstrain_3x_coco_20210524_110822-e10bd31c.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_mstrain_3x_coco/faster_rcnn_r50_fpn_mstrain_3x_coco_20210524_110822.log.json) |
+| [R-101-FPN](./faster-rcnn_r101-caffe_fpn_ms-3x_coco.py) | caffe | 3x | 5.6 | | 42.0 | [config](./faster-rcnn_r101-caffe_fpn_ms-3x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r101-caffe_fpn_ms-3x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_caffe_fpn_mstrain_3x_coco/faster_rcnn_r101_caffe_fpn_mstrain_3x_coco_20210526_095742.log.json) |
+| [R-101-FPN](./faster-rcnn_r101_fpn_ms-3x_coco.py) | pytorch | 3x | 5.8 | | 41.8 | [config](./faster-rcnn_r101_fpn_ms-3x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_r101_fpn_ms-3x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_mstrain_3x_coco/faster_rcnn_r101_fpn_mstrain_3x_coco_20210524_110822.log.json) |
+| [X-101-32x4d-FPN](./faster-rcnn_x101-32x4d_fpn_ms-3x_coco.py) | pytorch | 3x | 7.0 | | 42.5 | [config](./faster-rcnn_x101-32x4d_fpn_ms-3x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_x101-32x4d_fpn_ms-3x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x4d_fpn_mstrain_3x_coco/faster_rcnn_x101_32x4d_fpn_mstrain_3x_coco_20210524_124151.log.json) |
+| [X-101-32x8d-FPN](./faster-rcnn_x101-32x8d_fpn_ms-3x_coco.py) | pytorch | 3x | 10.1 | | 42.4 | [config](./faster-rcnn_x101-32x8d_fpn_ms-3x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_x101-32x8d_fpn_ms-3x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_32x8d_fpn_mstrain_3x_coco/faster_rcnn_x101_32x8d_fpn_mstrain_3x_coco_20210604_182954.log.json) |
+| [X-101-64x4d-FPN](./faster-rcnn_x101-64x4d_fpn_ms-3x_coco.py) | pytorch | 3x | 10.0 | | 43.1 | [config](./faster-rcnn_x101-64x4d_fpn_ms-3x_coco.py) | [model](https://download.openxlab.org.cn/models/mmdetection/FasterR-CNN/weight/faster-rcnn_x101-64x4d_fpn_ms-3x_coco) \| [log](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_x101_64x4d_fpn_mstrain_3x_coco/faster_rcnn_x101_64x4d_fpn_mstrain_3x_coco_20210524_124528.log.json) |
We further finetune some pre-trained models on the COCO subsets, which only contain only a few of the 80 categories.
diff --git a/configs/glip/README.md b/configs/glip/README.md
index 1252d922ac8..e74e98d1b57 100644
--- a/configs/glip/README.md
+++ b/configs/glip/README.md
@@ -56,7 +56,7 @@ model.save_pretrained("your path/bert-base-uncased")
tokenizer.save_pretrained("your path/bert-base-uncased")
```
-## Results and Models
+## COCO Results and Models
| Model | Zero-shot or Finetune | COCO mAP | Official COCO mAP | Pre-Train Data | Config | Download |
| :--------: | :-------------------: | :------: | ----------------: | :------------------------: | :---------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
@@ -78,3 +78,96 @@ Note:
3. Taking the GLIP-T(A) model as an example, I trained it twice using the official code, and the fine-tuning mAP were 52.5 and 52.6. Therefore, the mAP we achieved in our reproduction is higher than the official results. The main reason is that we modified the `weight_decay` parameter.
4. Our experiments revealed that training for 24 epochs leads to overfitting. Therefore, we chose the best-performing model. If users want to train on a custom dataset, it is advisable to shorten the number of epochs and save the best-performing model.
5. Due to the official absence of fine-tuning hyperparameters for the GLIP-L model, we have not yet reproduced the official accuracy. I have found that overfitting can also occur, so it may be necessary to consider custom modifications to data augmentation and model enhancement. Given the high cost of training, we have not conducted any research on this matter at the moment.
+
+## LVIS Results
+
+| Model | Official | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | Pre-Train Data | Config | Download |
+| :--------: | :------: | :---------: | :---------: | :---------: | :--------: | :--------: | :--------: | :--------: | :-------: | :------------------------: | :---------------------------------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| GLIP-T (A) | ✔ | | | | | | | | | O365 | [config](lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth) |
+| GLIP-T (A) | | 12.1 | 15.5 | 25.8 | 20.2 | 6.2 | 10.9 | 22.8 | 14.7 | O365 | [config](lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth) |
+| GLIP-T (B) | ✔ | | | | | | | | | O365 | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_b_mmdet-6dfbd102.pth) |
+| GLIP-T (B) | | 8.6 | 13.9 | 26.0 | 19.3 | 4.6 | 9.8 | 22.6 | 13.9 | O365 | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_b_mmdet-6dfbd102.pth) |
+| GLIP-T (C) | ✔ | 14.3 | 19.4 | 31.1 | 24.6 | | | | | O365,GoldG | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_c_mmdet-2fc427dd.pth) |
+| GLIP-T (C) | | 14.4 | 19.8 | 31.9 | 25.2 | 8.3 | 13.2 | 28.1 | 18.2 | O365,GoldG | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_c_mmdet-2fc427dd.pth) |
+| GLIP-T | ✔ | | | | | | | | | O365,GoldG,CC3M,SBU | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_mmdet-c24ce662.pth) |
+| GLIP-T | | 18.1 | 21.2 | 33.1 | 26.7 | 10.8 | 14.7 | 29.0 | 19.6 | O365,GoldG,CC3M,SBU | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_mmdet-c24ce662.pth) |
+| GLIP-L | ✔ | 29.2 | 34.9 | 42.1 | 37.9 | | | | | FourODs,GoldG,CC3M+12M,SBU | [config](lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_l_mmdet-abfe026b.pth) |
+| GLIP-L | | 27.9 | 33.7 | 39.7 | 36.1 | 20.2 | 25.8 | 35.3 | 28.5 | FourODs,GoldG,CC3M+12M,SBU | [config](lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_l_mmdet-abfe026b.pth) |
+
+Note:
+
+1. The above are zero-shot evaluation results.
+2. The evaluation metric we used is LVIS FixAP. For specific details, please refer to [Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details](https://arxiv.org/pdf/2102.01066.pdf).
+3. We found that the performance on small models is better than the official results, but it is lower on large models. This is mainly due to the incomplete alignment of the GLIP post-processing.
+
+## ODinW (Object Detection in the Wild) Results
+
+Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferablity of these models due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER 1 , the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER is composed of three components. (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to facilitate model evaluation on downstream tasks. (iii) Metrics. A variety of evaluation metrics are used to measure sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). ELEVATER is platform for Computer Vision in the Wild (CVinW), and is publicly released at https://computer-vision-in-the-wild.github.io/ELEVATER/
+
+### Results and models of ODinW13
+
+| Method | GLIP-T(A) | Official | GLIP-T(B) | Official | GLIP-T(C) | Official | GroundingDINO-T | GroundingDINO-B |
+| --------------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------------- | --------------- |
+| AerialMaritimeDrone | 0.123 | 0.122 | 0.110 | 0.110 | 0.130 | 0.130 | 0.173 | 0.281 |
+| Aquarium | 0.175 | 0.174 | 0.173 | 0.169 | 0.191 | 0.190 | 0.195 | 0.445 |
+| CottontailRabbits | 0.686 | 0.686 | 0.688 | 0.688 | 0.744 | 0.744 | 0.799 | 0.808 |
+| EgoHands | 0.013 | 0.013 | 0.003 | 0.004 | 0.314 | 0.315 | 0.608 | 0.764 |
+| NorthAmericaMushrooms | 0.502 | 0.502 | 0.367 | 0.367 | 0.297 | 0.296 | 0.507 | 0.675 |
+| Packages | 0.589 | 0.589 | 0.083 | 0.083 | 0.699 | 0.699 | 0.687 | 0.670 |
+| PascalVOC | 0.512 | 0.512 | 0.541 | 0.540 | 0.565 | 0.565 | 0.563 | 0.711 |
+| pistols | 0.339 | 0.339 | 0.502 | 0.501 | 0.503 | 0.504 | 0.726 | 0.771 |
+| pothole | 0.007 | 0.007 | 0.030 | 0.030 | 0.058 | 0.058 | 0.215 | 0.478 |
+| Raccoon | 0.075 | 0.074 | 0.285 | 0.288 | 0.241 | 0.244 | 0.549 | 0.541 |
+| ShellfishOpenImages | 0.253 | 0.253 | 0.337 | 0.338 | 0.300 | 0.302 | 0.393 | 0.650 |
+| thermalDogsAndPeople | 0.372 | 0.372 | 0.475 | 0.475 | 0.510 | 0.510 | 0.657 | 0.633 |
+| VehiclesOpenImages | 0.574 | 0.566 | 0.562 | 0.547 | 0.549 | 0.534 | 0.613 | 0.647 |
+| Average | **0.325** | **0.324** | **0.320** | **0.318** | **0.392** | **0.392** | **0.514** | **0.621** |
+
+### Results and models of ODinW35
+
+| Method | GLIP-T(A) | Official | GLIP-T(B) | Official | GLIP-T(C) | Official | GroundingDINO-T | GroundingDINO-B |
+| --------------------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------------- | --------------- |
+| AerialMaritimeDrone_large | 0.123 | 0.122 | 0.110 | 0.110 | 0.130 | 0.130 | 0.173 | 0.281 |
+| AerialMaritimeDrone_tiled | 0.174 | 0.174 | 0.172 | 0.172 | 0.172 | 0.172 | 0.206 | 0.364 |
+| AmericanSignLanguageLetters | 0.001 | 0.001 | 0.003 | 0.003 | 0.009 | 0.009 | 0.002 | 0.096 |
+| Aquarium | 0.175 | 0.175 | 0.173 | 0.171 | 0.192 | 0.182 | 0.195 | 0.445 |
+| BCCD | 0.016 | 0.016 | 0.001 | 0.001 | 0.000 | 0.000 | 0.161 | 0.584 |
+| boggleBoards | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.134 |
+| brackishUnderwater | 0.016 | 0..013 | 0.021 | 0.027 | 0.020 | 0.022 | 0.021 | 0.454 |
+| ChessPieces | 0.001 | 0.001 | 0.000 | 0.000 | 0.001 | 0.001 | 0.000 | 0.000 |
+| CottontailRabbits | 0.710 | 0.709 | 0.683 | 0.683 | 0.752 | 0.752 | 0.806 | 0.797 |
+| dice | 0.005 | 0.005 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.082 |
+| DroneControl | 0.016 | 0.017 | 0.006 | 0.008 | 0.005 | 0.007 | 0.042 | 0.638 |
+| EgoHands_generic | 0.009 | 0.010 | 0.005 | 0.006 | 0.510 | 0.508 | 0.608 | 0.764 |
+| EgoHands_specific | 0.001 | 0.001 | 0.004 | 0.006 | 0.003 | 0.004 | 0.002 | 0.687 |
+| HardHatWorkers | 0.029 | 0.029 | 0.023 | 0.023 | 0.033 | 0.033 | 0.046 | 0.439 |
+| MaskWearing | 0.007 | 0.007 | 0.003 | 0.002 | 0.005 | 0.005 | 0.004 | 0.406 |
+| MountainDewCommercial | 0.218 | 0.227 | 0.199 | 0.197 | 0.478 | 0.463 | 0.430 | 0.580 |
+| NorthAmericaMushrooms | 0.502 | 0.502 | 0.450 | 0.450 | 0.497 | 0.497 | 0.471 | 0.501 |
+| openPoetryVision | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.051 |
+| OxfordPets_by_breed | 0.001 | 0.002 | 0.002 | 0.004 | 0.001 | 0.002 | 0.003 | 0.799 |
+| OxfordPets_by_species | 0.016 | 0.011 | 0.012 | 0.009 | 0.013 | 0.009 | 0.011 | 0.872 |
+| PKLot | 0.002 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | 0.774 |
+| Packages | 0.569 | 0.569 | 0.279 | 0.279 | 0.712 | 0.712 | 0.695 | 0.728 |
+| PascalVOC | 0.512 | 0.512 | 0.541 | 0.540 | 0.565 | 0.565 | 0.563 | 0.711 |
+| pistols | 0.339 | 0.339 | 0.502 | 0.501 | 0.503 | 0.504 | 0.726 | 0.771 |
+| plantdoc | 0.002 | 0.002 | 0.007 | 0.007 | 0.009 | 0.009 | 0.005 | 0.376 |
+| pothole | 0.007 | 0.010 | 0.024 | 0.025 | 0.085 | 0.101 | 0.215 | 0.478 |
+| Raccoons | 0.075 | 0.074 | 0.285 | 0.288 | 0.241 | 0.244 | 0.549 | 0.541 |
+| selfdrivingCar | 0.071 | 0.072 | 0.074 | 0.074 | 0.081 | 0.080 | 0.089 | 0.318 |
+| ShellfishOpenImages | 0.253 | 0.253 | 0.337 | 0.338 | 0.300 | 0.302 | 0.393 | 0.650 |
+| ThermalCheetah | 0.028 | 0.028 | 0.000 | 0.000 | 0.028 | 0.028 | 0.087 | 0.290 |
+| thermalDogsAndPeople | 0.372 | 0.372 | 0.475 | 0.475 | 0.510 | 0.510 | 0.657 | 0.633 |
+| UnoCards | 0.000 | 0.000 | 0.000 | 0.001 | 0.002 | 0.003 | 0.006 | 0.754 |
+| VehiclesOpenImages | 0.574 | 0.566 | 0.562 | 0.547 | 0.549 | 0.534 | 0.613 | 0.647 |
+| WildfireSmoke | 0.000 | 0.000 | 0.000 | 0.000 | 0.017 | 0.017 | 0.134 | 0.410 |
+| websiteScreenshots | 0.003 | 0.004 | 0.003 | 0.005 | 0.005 | 0.006 | 0.012 | 0.175 |
+| Average | **0.134** | **0.134** | **0.138** | **0.138** | **0.179** | **0.178** | **0.227** | **0.492** |
+
+### Results on Flickr30k
+
+| Model | Official | Pre-Train Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 |
+| ------------- | -------- | ------------------- | ------- | ------- | -------- | -------- | -------- | --------- |
+| **GLIP-T(C)** | ✔ | O365, GoldG | 84.8 | 94.9 | 96.3 | 85.5 | 95.4 | 96.6 |
+| **GLIP-T(C)** | | O365, GoldG | 84.9 | 94.9 | 96.3 | 85.6 | 95.4 | 96.7 |
+| **GLIP-T** | | O365,GoldG,CC3M,SBU | 85.3 | 95.5 | 96.9 | 86.0 | 95.9 | 97.2 |
diff --git a/configs/glip/flickr30k/glip_atss_swin-t_c_fpn_dyhead_pretrain_obj365-goldg_zeroshot_flickr30k.py b/configs/glip/flickr30k/glip_atss_swin-t_c_fpn_dyhead_pretrain_obj365-goldg_zeroshot_flickr30k.py
new file mode 100644
index 00000000000..14d6e8aaa63
--- /dev/null
+++ b/configs/glip/flickr30k/glip_atss_swin-t_c_fpn_dyhead_pretrain_obj365-goldg_zeroshot_flickr30k.py
@@ -0,0 +1,61 @@
+_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'
+
+lang_model_name = 'bert-base-uncased'
+
+model = dict(bbox_head=dict(early_fuse=True))
+
+dataset_type = 'Flickr30kDataset'
+data_root = 'data/flickr30k_entities/'
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities',
+ 'tokens_positive', 'phrase_ids', 'phrases'))
+]
+
+dataset_Flickr30k_val = dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='final_flickr_separateGT_val.json',
+ data_prefix=dict(img='flickr30k_images/'),
+ pipeline=test_pipeline,
+)
+
+dataset_Flickr30k_test = dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='final_flickr_separateGT_test.json',
+ data_prefix=dict(img='flickr30k_images/'),
+ pipeline=test_pipeline,
+)
+
+val_evaluator_Flickr30k = dict(type='Flickr30kMetric', )
+
+test_evaluator_Flickr30k = dict(type='Flickr30kMetric', )
+
+# ----------Config---------- #
+dataset_prefixes = ['Flickr30kVal', 'Flickr30kTest']
+datasets = [dataset_Flickr30k_val, dataset_Flickr30k_test]
+metrics = [val_evaluator_Flickr30k, test_evaluator_Flickr30k]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/glip/lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py b/configs/glip/lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py
new file mode 100644
index 00000000000..1f79e447d3f
--- /dev/null
+++ b/configs/glip/lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py
@@ -0,0 +1,12 @@
+_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py'
+
+model = dict(
+ backbone=dict(
+ embed_dims=192,
+ depths=[2, 2, 18, 2],
+ num_heads=[6, 12, 24, 48],
+ window_size=12,
+ drop_path_rate=0.4,
+ ),
+ neck=dict(in_channels=[384, 768, 1536]),
+ bbox_head=dict(early_fuse=True, num_dyhead_blocks=8))
diff --git a/configs/glip/lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_mini-lvis.py b/configs/glip/lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_mini-lvis.py
new file mode 100644
index 00000000000..13f1a69082b
--- /dev/null
+++ b/configs/glip/lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_mini-lvis.py
@@ -0,0 +1,12 @@
+_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py'
+
+model = dict(
+ backbone=dict(
+ embed_dims=192,
+ depths=[2, 2, 18, 2],
+ num_heads=[6, 12, 24, 48],
+ window_size=12,
+ drop_path_rate=0.4,
+ ),
+ neck=dict(in_channels=[384, 768, 1536]),
+ bbox_head=dict(early_fuse=True, num_dyhead_blocks=8))
diff --git a/configs/glip/lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py b/configs/glip/lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py
new file mode 100644
index 00000000000..4d526d59008
--- /dev/null
+++ b/configs/glip/lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py
@@ -0,0 +1,24 @@
+_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'
+
+model = dict(test_cfg=dict(
+ max_per_img=300,
+ chunked_size=40,
+))
+
+dataset_type = 'LVISV1Dataset'
+data_root = 'data/coco/'
+
+val_dataloader = dict(
+ dataset=dict(
+ data_root=data_root,
+ type=dataset_type,
+ ann_file='annotations/lvis_od_val.json',
+ data_prefix=dict(img='')))
+test_dataloader = val_dataloader
+
+# numpy < 1.24.0
+val_evaluator = dict(
+ _delete_=True,
+ type='LVISFixedAPMetric',
+ ann_file=data_root + 'annotations/lvis_od_val.json')
+test_evaluator = val_evaluator
diff --git a/configs/glip/lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py b/configs/glip/lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py
new file mode 100644
index 00000000000..70a57a3f581
--- /dev/null
+++ b/configs/glip/lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py
@@ -0,0 +1,25 @@
+_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'
+
+model = dict(test_cfg=dict(
+ max_per_img=300,
+ chunked_size=40,
+))
+
+dataset_type = 'LVISV1Dataset'
+data_root = 'data/coco/'
+
+val_dataloader = dict(
+ dataset=dict(
+ data_root=data_root,
+ type=dataset_type,
+ ann_file='annotations/lvis_v1_minival_inserted_image_name.json',
+ data_prefix=dict(img='')))
+test_dataloader = val_dataloader
+
+# numpy < 1.24.0
+val_evaluator = dict(
+ _delete_=True,
+ type='LVISFixedAPMetric',
+ ann_file=data_root +
+ 'annotations/lvis_v1_minival_inserted_image_name.json')
+test_evaluator = val_evaluator
diff --git a/configs/glip/lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py b/configs/glip/lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py
new file mode 100644
index 00000000000..6dc712b3bcb
--- /dev/null
+++ b/configs/glip/lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py
@@ -0,0 +1,3 @@
+_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py'
+
+model = dict(bbox_head=dict(early_fuse=True))
diff --git a/configs/glip/lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_mini-lvis.py b/configs/glip/lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_mini-lvis.py
new file mode 100644
index 00000000000..3babb91101a
--- /dev/null
+++ b/configs/glip/lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_mini-lvis.py
@@ -0,0 +1,3 @@
+_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py'
+
+model = dict(bbox_head=dict(early_fuse=True))
diff --git a/configs/glip/odinw/glip_atss_swin-t_a_fpn_dyhead_pretrain_odinw13.py b/configs/glip/odinw/glip_atss_swin-t_a_fpn_dyhead_pretrain_odinw13.py
new file mode 100644
index 00000000000..d38effba8c1
--- /dev/null
+++ b/configs/glip/odinw/glip_atss_swin-t_a_fpn_dyhead_pretrain_odinw13.py
@@ -0,0 +1,338 @@
+_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'
+
+dataset_type = 'CocoDataset'
+data_root = 'data/odinw/'
+
+base_test_pipeline = _base_.test_pipeline
+base_test_pipeline[-1]['meta_keys'] = ('img_id', 'img_path', 'ori_shape',
+ 'img_shape', 'scale_factor', 'text',
+ 'custom_entities', 'caption_prompt')
+
+# ---------------------1 AerialMaritimeDrone---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/large/'
+dataset_AerialMaritimeDrone = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ test_mode=True,
+ pipeline=base_test_pipeline,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------2 Aquarium---------------------#
+class_name = ('fish', 'jellyfish', 'penguin', 'puffin', 'shark', 'starfish',
+ 'stingray')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Aquarium/Aquarium Combined.v2-raw-1024.coco/'
+
+caption_prompt = None
+# caption_prompt = {
+# 'penguin': {
+# 'suffix': ', which is black and white'
+# },
+# 'puffin': {
+# 'suffix': ' with orange beaks'
+# },
+# 'stingray': {
+# 'suffix': ' which is flat and round'
+# },
+# }
+dataset_Aquarium = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Aquarium = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------3 CottontailRabbits---------------------#
+class_name = ('Cottontail-Rabbit', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'CottontailRabbits/'
+
+caption_prompt = None
+# caption_prompt = {'Cottontail-Rabbit': {'name': 'rabbit'}}
+
+dataset_CottontailRabbits = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_CottontailRabbits = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------4 EgoHands---------------------#
+class_name = ('hand', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/generic/'
+
+caption_prompt = None
+# caption_prompt = {'hand': {'suffix': ' of a person'}}
+
+dataset_EgoHands = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------5 NorthAmericaMushrooms---------------------#
+class_name = ('CoW', 'chanterelle')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+
+caption_prompt = None
+# caption_prompt = {
+# 'CoW': {
+# 'name': 'flat mushroom'
+# },
+# 'chanterelle': {
+# 'name': 'yellow mushroom'
+# }
+# }
+
+dataset_NorthAmericaMushrooms = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_NorthAmericaMushrooms = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------6 Packages---------------------#
+class_name = ('package', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Packages/Raw/'
+
+caption_prompt = None
+# caption_prompt = {
+# 'package': {
+# 'prefix': 'there is a ',
+# 'suffix': ' on the porch'
+# }
+# }
+
+dataset_Packages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Packages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------7 PascalVOC---------------------#
+class_name = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
+ 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
+ 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train',
+ 'tvmonitor')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PascalVOC/'
+dataset_PascalVOC = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PascalVOC = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------8 pistols---------------------#
+class_name = ('pistol', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pistols/export/'
+dataset_pistols = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pistols = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------9 pothole---------------------#
+class_name = ('pothole', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pothole/'
+
+caption_prompt = None
+# caption_prompt = {
+# 'pothole': {
+# 'prefix': 'there are some ',
+# 'name': 'holes',
+# 'suffix': ' on the road'
+# }
+# }
+
+dataset_pothole = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pothole = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------10 Raccoon---------------------#
+class_name = ('raccoon', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Raccoon/Raccoon.v2-raw.coco/'
+dataset_Raccoon = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Raccoon = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------11 ShellfishOpenImages---------------------#
+class_name = ('Crab', 'Lobster', 'Shrimp')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ShellfishOpenImages/raw/'
+dataset_ShellfishOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ShellfishOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------12 thermalDogsAndPeople---------------------#
+class_name = ('dog', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'thermalDogsAndPeople/'
+dataset_thermalDogsAndPeople = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_thermalDogsAndPeople = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------13 VehiclesOpenImages---------------------#
+class_name = ('Ambulance', 'Bus', 'Car', 'Motorcycle', 'Truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'VehiclesOpenImages/416x416/'
+dataset_VehiclesOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_VehiclesOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# --------------------- Config---------------------#
+dataset_prefixes = [
+ 'AerialMaritimeDrone', 'Aquarium', 'CottontailRabbits', 'EgoHands',
+ 'NorthAmericaMushrooms', 'Packages', 'PascalVOC', 'pistols', 'pothole',
+ 'Raccoon', 'ShellfishOpenImages', 'thermalDogsAndPeople',
+ 'VehiclesOpenImages'
+]
+datasets = [
+ dataset_AerialMaritimeDrone, dataset_Aquarium, dataset_CottontailRabbits,
+ dataset_EgoHands, dataset_NorthAmericaMushrooms, dataset_Packages,
+ dataset_PascalVOC, dataset_pistols, dataset_pothole, dataset_Raccoon,
+ dataset_ShellfishOpenImages, dataset_thermalDogsAndPeople,
+ dataset_VehiclesOpenImages
+]
+metrics = [
+ val_evaluator_AerialMaritimeDrone, val_evaluator_Aquarium,
+ val_evaluator_CottontailRabbits, val_evaluator_EgoHands,
+ val_evaluator_NorthAmericaMushrooms, val_evaluator_Packages,
+ val_evaluator_PascalVOC, val_evaluator_pistols, val_evaluator_pothole,
+ val_evaluator_Raccoon, val_evaluator_ShellfishOpenImages,
+ val_evaluator_thermalDogsAndPeople, val_evaluator_VehiclesOpenImages
+]
+
+# -------------------------------------------------#
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/glip/odinw/glip_atss_swin-t_a_fpn_dyhead_pretrain_odinw35.py b/configs/glip/odinw/glip_atss_swin-t_a_fpn_dyhead_pretrain_odinw35.py
new file mode 100644
index 00000000000..2eaf09ed771
--- /dev/null
+++ b/configs/glip/odinw/glip_atss_swin-t_a_fpn_dyhead_pretrain_odinw35.py
@@ -0,0 +1,794 @@
+_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'
+
+dataset_type = 'CocoDataset'
+data_root = 'data/odinw/'
+
+base_test_pipeline = _base_.test_pipeline
+base_test_pipeline[-1]['meta_keys'] = ('img_id', 'img_path', 'ori_shape',
+ 'img_shape', 'scale_factor', 'text',
+ 'custom_entities', 'caption_prompt')
+
+# ---------------------1 AerialMaritimeDrone_large---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/large/'
+dataset_AerialMaritimeDrone_large = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone_large = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------2 AerialMaritimeDrone_tiled---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/tiled/'
+dataset_AerialMaritimeDrone_tiled = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone_tiled = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------3 AmericanSignLanguageLetters---------------------#
+class_name = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
+ 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AmericanSignLanguageLetters/American Sign Language Letters.v1-v1.coco/' # noqa
+dataset_AmericanSignLanguageLetters = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AmericanSignLanguageLetters = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------4 Aquarium---------------------#
+class_name = ('fish', 'jellyfish', 'penguin', 'puffin', 'shark', 'starfish',
+ 'stingray')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Aquarium/Aquarium Combined.v2-raw-1024.coco/'
+dataset_Aquarium = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Aquarium = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------5 BCCD---------------------#
+class_name = ('Platelets', 'RBC', 'WBC')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'BCCD/BCCD.v3-raw.coco/'
+dataset_BCCD = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_BCCD = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------6 boggleBoards---------------------#
+class_name = ('Q', 'a', 'an', 'b', 'c', 'd', 'e', 'er', 'f', 'g', 'h', 'he',
+ 'i', 'in', 'j', 'k', 'l', 'm', 'n', 'o', 'o ', 'p', 'q', 'qu',
+ 'r', 's', 't', 't\\', 'th', 'u', 'v', 'w', 'wild', 'x', 'y', 'z')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'boggleBoards/416x416AutoOrient/export/'
+dataset_boggleBoards = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_boggleBoards = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------7 brackishUnderwater---------------------#
+class_name = ('crab', 'fish', 'jellyfish', 'shrimp', 'small_fish', 'starfish')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'brackishUnderwater/960x540/'
+dataset_brackishUnderwater = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_brackishUnderwater = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------8 ChessPieces---------------------#
+class_name = (' ', 'black bishop', 'black king', 'black knight', 'black pawn',
+ 'black queen', 'black rook', 'white bishop', 'white king',
+ 'white knight', 'white pawn', 'white queen', 'white rook')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ChessPieces/Chess Pieces.v23-raw.coco/'
+dataset_ChessPieces = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ChessPieces = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------9 CottontailRabbits---------------------#
+class_name = ('rabbit', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'CottontailRabbits/'
+dataset_CottontailRabbits = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_CottontailRabbits = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------10 dice---------------------#
+class_name = ('1', '2', '3', '4', '5', '6')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'dice/mediumColor/export/'
+dataset_dice = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_dice = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------11 DroneControl---------------------#
+class_name = ('follow', 'follow_hand', 'land', 'land_hand', 'null', 'object',
+ 'takeoff', 'takeoff-hand')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'DroneControl/Drone Control.v3-raw.coco/'
+dataset_DroneControl = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_DroneControl = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------12 EgoHands_generic---------------------#
+class_name = ('hand', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/generic/'
+caption_prompt = {'hand': {'suffix': ' of a person'}}
+dataset_EgoHands_generic = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands_generic = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------13 EgoHands_specific---------------------#
+class_name = ('myleft', 'myright', 'yourleft', 'yourright')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/specific/'
+dataset_EgoHands_specific = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands_specific = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------14 HardHatWorkers---------------------#
+class_name = ('head', 'helmet', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'HardHatWorkers/raw/'
+dataset_HardHatWorkers = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_HardHatWorkers = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------15 MaskWearing---------------------#
+class_name = ('mask', 'no-mask')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'MaskWearing/raw/'
+dataset_MaskWearing = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_MaskWearing = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------16 MountainDewCommercial---------------------#
+class_name = ('bottle', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'MountainDewCommercial/'
+dataset_MountainDewCommercial = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_MountainDewCommercial = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------17 NorthAmericaMushrooms---------------------#
+class_name = ('flat mushroom', 'yellow mushroom')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+dataset_NorthAmericaMushrooms = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_NorthAmericaMushrooms = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------18 openPoetryVision---------------------#
+class_name = ('American Typewriter', 'Andale Mono', 'Apple Chancery', 'Arial',
+ 'Avenir', 'Baskerville', 'Big Caslon', 'Bradley Hand',
+ 'Brush Script MT', 'Chalkboard', 'Comic Sans MS', 'Copperplate',
+ 'Courier', 'Didot', 'Futura', 'Geneva', 'Georgia', 'Gill Sans',
+ 'Helvetica', 'Herculanum', 'Impact', 'Kefa', 'Lucida Grande',
+ 'Luminari', 'Marker Felt', 'Menlo', 'Monaco', 'Noteworthy',
+ 'Optima', 'PT Sans', 'PT Serif', 'Palatino', 'Papyrus',
+ 'Phosphate', 'Rockwell', 'SF Pro', 'SignPainter', 'Skia',
+ 'Snell Roundhand', 'Tahoma', 'Times New Roman', 'Trebuchet MS',
+ 'Verdana')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'openPoetryVision/512x512/'
+dataset_openPoetryVision = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_openPoetryVision = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------19 OxfordPets_by_breed---------------------#
+class_name = ('cat-Abyssinian', 'cat-Bengal', 'cat-Birman', 'cat-Bombay',
+ 'cat-British_Shorthair', 'cat-Egyptian_Mau', 'cat-Maine_Coon',
+ 'cat-Persian', 'cat-Ragdoll', 'cat-Russian_Blue', 'cat-Siamese',
+ 'cat-Sphynx', 'dog-american_bulldog',
+ 'dog-american_pit_bull_terrier', 'dog-basset_hound',
+ 'dog-beagle', 'dog-boxer', 'dog-chihuahua',
+ 'dog-english_cocker_spaniel', 'dog-english_setter',
+ 'dog-german_shorthaired', 'dog-great_pyrenees', 'dog-havanese',
+ 'dog-japanese_chin', 'dog-keeshond', 'dog-leonberger',
+ 'dog-miniature_pinscher', 'dog-newfoundland', 'dog-pomeranian',
+ 'dog-pug', 'dog-saint_bernard', 'dog-samoyed',
+ 'dog-scottish_terrier', 'dog-shiba_inu',
+ 'dog-staffordshire_bull_terrier', 'dog-wheaten_terrier',
+ 'dog-yorkshire_terrier')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'OxfordPets/by-breed/' # noqa
+dataset_OxfordPets_by_breed = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_OxfordPets_by_breed = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------20 OxfordPets_by_species---------------------#
+class_name = ('cat', 'dog')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'OxfordPets/by-species/' # noqa
+dataset_OxfordPets_by_species = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_OxfordPets_by_species = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------21 PKLot---------------------#
+class_name = ('space-empty', 'space-occupied')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PKLot/640/' # noqa
+dataset_PKLot = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PKLot = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------22 Packages---------------------#
+class_name = ('package', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Packages/Raw/'
+caption_prompt = {
+ 'package': {
+ 'prefix': 'there is a ',
+ 'suffix': ' on the porch'
+ }
+}
+dataset_Packages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Packages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------23 PascalVOC---------------------#
+class_name = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
+ 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
+ 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train',
+ 'tvmonitor')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PascalVOC/'
+dataset_PascalVOC = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PascalVOC = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------24 pistols---------------------#
+class_name = ('pistol', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pistols/export/'
+dataset_pistols = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pistols = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------25 plantdoc---------------------#
+class_name = ('Apple Scab Leaf', 'Apple leaf', 'Apple rust leaf',
+ 'Bell_pepper leaf', 'Bell_pepper leaf spot', 'Blueberry leaf',
+ 'Cherry leaf', 'Corn Gray leaf spot', 'Corn leaf blight',
+ 'Corn rust leaf', 'Peach leaf', 'Potato leaf',
+ 'Potato leaf early blight', 'Potato leaf late blight',
+ 'Raspberry leaf', 'Soyabean leaf', 'Soybean leaf',
+ 'Squash Powdery mildew leaf', 'Strawberry leaf',
+ 'Tomato Early blight leaf', 'Tomato Septoria leaf spot',
+ 'Tomato leaf', 'Tomato leaf bacterial spot',
+ 'Tomato leaf late blight', 'Tomato leaf mosaic virus',
+ 'Tomato leaf yellow virus', 'Tomato mold leaf',
+ 'Tomato two spotted spider mites leaf', 'grape leaf',
+ 'grape leaf black rot')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'plantdoc/416x416/'
+dataset_plantdoc = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_plantdoc = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------26 pothole---------------------#
+class_name = ('pothole', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pothole/'
+caption_prompt = {
+ 'pothole': {
+ 'name': 'holes',
+ 'prefix': 'there are some ',
+ 'suffix': ' on the road'
+ }
+}
+dataset_pothole = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ caption_prompt=caption_prompt,
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pothole = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------27 Raccoon---------------------#
+class_name = ('raccoon', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Raccoon/Raccoon.v2-raw.coco/'
+dataset_Raccoon = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Raccoon = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------28 selfdrivingCar---------------------#
+class_name = ('biker', 'car', 'pedestrian', 'trafficLight',
+ 'trafficLight-Green', 'trafficLight-GreenLeft',
+ 'trafficLight-Red', 'trafficLight-RedLeft',
+ 'trafficLight-Yellow', 'trafficLight-YellowLeft', 'truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'selfdrivingCar/fixedLarge/export/'
+dataset_selfdrivingCar = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_selfdrivingCar = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------29 ShellfishOpenImages---------------------#
+class_name = ('Crab', 'Lobster', 'Shrimp')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ShellfishOpenImages/raw/'
+dataset_ShellfishOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ShellfishOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------30 ThermalCheetah---------------------#
+class_name = ('cheetah', 'human')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ThermalCheetah/'
+dataset_ThermalCheetah = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ThermalCheetah = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------31 thermalDogsAndPeople---------------------#
+class_name = ('dog', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'thermalDogsAndPeople/'
+dataset_thermalDogsAndPeople = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_thermalDogsAndPeople = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------32 UnoCards---------------------#
+class_name = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11',
+ '12', '13', '14')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'UnoCards/raw/'
+dataset_UnoCards = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_UnoCards = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------33 VehiclesOpenImages---------------------#
+class_name = ('Ambulance', 'Bus', 'Car', 'Motorcycle', 'Truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'VehiclesOpenImages/416x416/'
+dataset_VehiclesOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_VehiclesOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------34 WildfireSmoke---------------------#
+class_name = ('smoke', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'WildfireSmoke/'
+dataset_WildfireSmoke = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_WildfireSmoke = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------35 websiteScreenshots---------------------#
+class_name = ('button', 'field', 'heading', 'iframe', 'image', 'label', 'link',
+ 'text')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'websiteScreenshots/'
+dataset_websiteScreenshots = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_websiteScreenshots = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# --------------------- Config---------------------#
+
+dataset_prefixes = [
+ 'AerialMaritimeDrone_large',
+ 'AerialMaritimeDrone_tiled',
+ 'AmericanSignLanguageLetters',
+ 'Aquarium',
+ 'BCCD',
+ 'boggleBoards',
+ 'brackishUnderwater',
+ 'ChessPieces',
+ 'CottontailRabbits',
+ 'dice',
+ 'DroneControl',
+ 'EgoHands_generic',
+ 'EgoHands_specific',
+ 'HardHatWorkers',
+ 'MaskWearing',
+ 'MountainDewCommercial',
+ 'NorthAmericaMushrooms',
+ 'openPoetryVision',
+ 'OxfordPets_by_breed',
+ 'OxfordPets_by_species',
+ 'PKLot',
+ 'Packages',
+ 'PascalVOC',
+ 'pistols',
+ 'plantdoc',
+ 'pothole',
+ 'Raccoons',
+ 'selfdrivingCar',
+ 'ShellfishOpenImages',
+ 'ThermalCheetah',
+ 'thermalDogsAndPeople',
+ 'UnoCards',
+ 'VehiclesOpenImages',
+ 'WildfireSmoke',
+ 'websiteScreenshots',
+]
+
+datasets = [
+ dataset_AerialMaritimeDrone_large, dataset_AerialMaritimeDrone_tiled,
+ dataset_AmericanSignLanguageLetters, dataset_Aquarium, dataset_BCCD,
+ dataset_boggleBoards, dataset_brackishUnderwater, dataset_ChessPieces,
+ dataset_CottontailRabbits, dataset_dice, dataset_DroneControl,
+ dataset_EgoHands_generic, dataset_EgoHands_specific,
+ dataset_HardHatWorkers, dataset_MaskWearing, dataset_MountainDewCommercial,
+ dataset_NorthAmericaMushrooms, dataset_openPoetryVision,
+ dataset_OxfordPets_by_breed, dataset_OxfordPets_by_species, dataset_PKLot,
+ dataset_Packages, dataset_PascalVOC, dataset_pistols, dataset_plantdoc,
+ dataset_pothole, dataset_Raccoon, dataset_selfdrivingCar,
+ dataset_ShellfishOpenImages, dataset_ThermalCheetah,
+ dataset_thermalDogsAndPeople, dataset_UnoCards, dataset_VehiclesOpenImages,
+ dataset_WildfireSmoke, dataset_websiteScreenshots
+]
+
+metrics = [
+ val_evaluator_AerialMaritimeDrone_large,
+ val_evaluator_AerialMaritimeDrone_tiled,
+ val_evaluator_AmericanSignLanguageLetters, val_evaluator_Aquarium,
+ val_evaluator_BCCD, val_evaluator_boggleBoards,
+ val_evaluator_brackishUnderwater, val_evaluator_ChessPieces,
+ val_evaluator_CottontailRabbits, val_evaluator_dice,
+ val_evaluator_DroneControl, val_evaluator_EgoHands_generic,
+ val_evaluator_EgoHands_specific, val_evaluator_HardHatWorkers,
+ val_evaluator_MaskWearing, val_evaluator_MountainDewCommercial,
+ val_evaluator_NorthAmericaMushrooms, val_evaluator_openPoetryVision,
+ val_evaluator_OxfordPets_by_breed, val_evaluator_OxfordPets_by_species,
+ val_evaluator_PKLot, val_evaluator_Packages, val_evaluator_PascalVOC,
+ val_evaluator_pistols, val_evaluator_plantdoc, val_evaluator_pothole,
+ val_evaluator_Raccoon, val_evaluator_selfdrivingCar,
+ val_evaluator_ShellfishOpenImages, val_evaluator_ThermalCheetah,
+ val_evaluator_thermalDogsAndPeople, val_evaluator_UnoCards,
+ val_evaluator_VehiclesOpenImages, val_evaluator_WildfireSmoke,
+ val_evaluator_websiteScreenshots
+]
+
+# -------------------------------------------------#
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/glip/odinw/glip_atss_swin-t_bc_fpn_dyhead_pretrain_odinw13.py b/configs/glip/odinw/glip_atss_swin-t_bc_fpn_dyhead_pretrain_odinw13.py
new file mode 100644
index 00000000000..c3479b62b78
--- /dev/null
+++ b/configs/glip/odinw/glip_atss_swin-t_bc_fpn_dyhead_pretrain_odinw13.py
@@ -0,0 +1,3 @@
+_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_odinw13.py'
+
+model = dict(bbox_head=dict(early_fuse=True))
diff --git a/configs/glip/odinw/glip_atss_swin-t_bc_fpn_dyhead_pretrain_odinw35.py b/configs/glip/odinw/glip_atss_swin-t_bc_fpn_dyhead_pretrain_odinw35.py
new file mode 100644
index 00000000000..182afc66c93
--- /dev/null
+++ b/configs/glip/odinw/glip_atss_swin-t_bc_fpn_dyhead_pretrain_odinw35.py
@@ -0,0 +1,3 @@
+_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_odinw35.py'
+
+model = dict(bbox_head=dict(early_fuse=True))
diff --git a/configs/glip/odinw/override_category.py b/configs/glip/odinw/override_category.py
new file mode 100644
index 00000000000..9ff05fc6e5e
--- /dev/null
+++ b/configs/glip/odinw/override_category.py
@@ -0,0 +1,109 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import argparse
+
+import mmengine
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='Override Category')
+ parser.add_argument('data_root')
+ return parser.parse_args()
+
+
+def main():
+ args = parse_args()
+
+ ChessPieces = [{
+ 'id': 1,
+ 'name': ' ',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 2,
+ 'name': 'black bishop',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 3,
+ 'name': 'black king',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 4,
+ 'name': 'black knight',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 5,
+ 'name': 'black pawn',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 6,
+ 'name': 'black queen',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 7,
+ 'name': 'black rook',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 8,
+ 'name': 'white bishop',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 9,
+ 'name': 'white king',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 10,
+ 'name': 'white knight',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 11,
+ 'name': 'white pawn',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 12,
+ 'name': 'white queen',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 13,
+ 'name': 'white rook',
+ 'supercategory': 'pieces'
+ }]
+
+ _data_root = args.data_root + 'ChessPieces/Chess Pieces.v23-raw.coco/'
+ json_data = mmengine.load(_data_root +
+ 'valid/annotations_without_background.json')
+ json_data['categories'] = ChessPieces
+ mmengine.dump(json_data,
+ _data_root + 'valid/new_annotations_without_background.json')
+
+ CottontailRabbits = [{
+ 'id': 1,
+ 'name': 'rabbit',
+ 'supercategory': 'Cottontail-Rabbit'
+ }]
+
+ _data_root = args.data_root + 'CottontailRabbits/'
+ json_data = mmengine.load(_data_root +
+ 'valid/annotations_without_background.json')
+ json_data['categories'] = CottontailRabbits
+ mmengine.dump(json_data,
+ _data_root + 'valid/new_annotations_without_background.json')
+
+ NorthAmericaMushrooms = [{
+ 'id': 1,
+ 'name': 'flat mushroom',
+ 'supercategory': 'mushroom'
+ }, {
+ 'id': 2,
+ 'name': 'yellow mushroom',
+ 'supercategory': 'mushroom'
+ }]
+
+ _data_root = args.data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+ json_data = mmengine.load(_data_root +
+ 'valid/annotations_without_background.json')
+ json_data['categories'] = NorthAmericaMushrooms
+ mmengine.dump(json_data,
+ _data_root + 'valid/new_annotations_without_background.json')
+
+
+if __name__ == '__main__':
+ main()
diff --git a/configs/grounding_dino/README.md b/configs/grounding_dino/README.md
index 715b630cc79..2a527828a46 100644
--- a/configs/grounding_dino/README.md
+++ b/configs/grounding_dino/README.md
@@ -59,7 +59,7 @@ python demo/image_demo.py \
-## Results and Models
+## COCO Results and Models
| Model | Backbone | Style | COCO mAP | Official COCO mAP | Pre-Train Data | Config | Download |
| :----------------: | :------: | :-------: | :--------: | :---------------: | :----------------------------------------------: | :------------------------------------------------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
@@ -75,6 +75,151 @@ Note:
2. Finetune refers to fine-tuning on the COCO 2017 dataset. The R50 model is trained using 8 NVIDIA GeForce 3090 GPUs, while the remaining models are trained using 16 NVIDIA GeForce 3090 GPUs. The GPU memory usage is approximately 8.5GB.
3. Our performance is higher than the official model due to two reasons: we modified the initialization strategy and introduced a log scaler.
+## LVIS Results
+
+| Model | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | Pre-Train Data | Config | Download |
+| :--------------: | :---------: | :---------: | :---------: | :--------: | :--------: | :--------: | :--------: | :-------: | :----------------------------------------------: | :-----------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------: |
+| Grounding DINO-T | 18.8 | 24.2 | 34.7 | 28.8 | 10.1 | 15.3 | 29.9 | 20.1 | O365,GoldG,Cap4M | [config](lvis/grounding_dino_swin-t_pretrain_zeroshot_mini-lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/groundingdino_swint_ogc_mmdet-822d7e9d.pth) |
+| Grounding DINO-B | 27.9 | 33.4 | 37.2 | 34.7 | 19.0 | 24.1 | 32.9 | 26.7 | COCO,O365,GoldG,Cap4M,OpenImage,ODinW-35,RefCOCO | [config](lvis/grounding_dino_swin-b_pretrain_zeroshot_mini-lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/groundingdino_swinb_cogcoor_mmdet-55949c9c.pth) |
+
+Note:
+
+1. The above are zero-shot evaluation results.
+2. The evaluation metric we used is LVIS FixAP. For specific details, please refer to [Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details](https://arxiv.org/pdf/2102.01066.pdf).
+
+## ODinW (Object Detection in the Wild) Results
+
+Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferablity of these models due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER 1 , the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER is composed of three components. (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to facilitate model evaluation on downstream tasks. (iii) Metrics. A variety of evaluation metrics are used to measure sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). ELEVATER is platform for Computer Vision in the Wild (CVinW), and is publicly released at https://computer-vision-in-the-wild.github.io/ELEVATER/
+
+### Results and models of ODinW13
+
+| Method | GLIP-T(A) | Official | GLIP-T(B) | Official | GLIP-T(C) | Official | GroundingDINO-T | GroundingDINO-B |
+| --------------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------------- | --------------- |
+| AerialMaritimeDrone | 0.123 | 0.122 | 0.110 | 0.110 | 0.130 | 0.130 | 0.173 | 0.281 |
+| Aquarium | 0.175 | 0.174 | 0.173 | 0.169 | 0.191 | 0.190 | 0.195 | 0.445 |
+| CottontailRabbits | 0.686 | 0.686 | 0.688 | 0.688 | 0.744 | 0.744 | 0.799 | 0.808 |
+| EgoHands | 0.013 | 0.013 | 0.003 | 0.004 | 0.314 | 0.315 | 0.608 | 0.764 |
+| NorthAmericaMushrooms | 0.502 | 0.502 | 0.367 | 0.367 | 0.297 | 0.296 | 0.507 | 0.675 |
+| Packages | 0.589 | 0.589 | 0.083 | 0.083 | 0.699 | 0.699 | 0.687 | 0.670 |
+| PascalVOC | 0.512 | 0.512 | 0.541 | 0.540 | 0.565 | 0.565 | 0.563 | 0.711 |
+| pistols | 0.339 | 0.339 | 0.502 | 0.501 | 0.503 | 0.504 | 0.726 | 0.771 |
+| pothole | 0.007 | 0.007 | 0.030 | 0.030 | 0.058 | 0.058 | 0.215 | 0.478 |
+| Raccoon | 0.075 | 0.074 | 0.285 | 0.288 | 0.241 | 0.244 | 0.549 | 0.541 |
+| ShellfishOpenImages | 0.253 | 0.253 | 0.337 | 0.338 | 0.300 | 0.302 | 0.393 | 0.650 |
+| thermalDogsAndPeople | 0.372 | 0.372 | 0.475 | 0.475 | 0.510 | 0.510 | 0.657 | 0.633 |
+| VehiclesOpenImages | 0.574 | 0.566 | 0.562 | 0.547 | 0.549 | 0.534 | 0.613 | 0.647 |
+| Average | **0.325** | **0.324** | **0.320** | **0.318** | **0.392** | **0.392** | **0.514** | **0.621** |
+
+### Results and models of ODinW35
+
+| Method | GLIP-T(A) | Official | GLIP-T(B) | Official | GLIP-T(C) | Official | GroundingDINO-T | GroundingDINO-B |
+| --------------------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------------- | --------------- |
+| AerialMaritimeDrone_large | 0.123 | 0.122 | 0.110 | 0.110 | 0.130 | 0.130 | 0.173 | 0.281 |
+| AerialMaritimeDrone_tiled | 0.174 | 0.174 | 0.172 | 0.172 | 0.172 | 0.172 | 0.206 | 0.364 |
+| AmericanSignLanguageLetters | 0.001 | 0.001 | 0.003 | 0.003 | 0.009 | 0.009 | 0.002 | 0.096 |
+| Aquarium | 0.175 | 0.175 | 0.173 | 0.171 | 0.192 | 0.182 | 0.195 | 0.445 |
+| BCCD | 0.016 | 0.016 | 0.001 | 0.001 | 0.000 | 0.000 | 0.161 | 0.584 |
+| boggleBoards | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.134 |
+| brackishUnderwater | 0.016 | 0..013 | 0.021 | 0.027 | 0.020 | 0.022 | 0.021 | 0.454 |
+| ChessPieces | 0.001 | 0.001 | 0.000 | 0.000 | 0.001 | 0.001 | 0.000 | 0.000 |
+| CottontailRabbits | 0.710 | 0.709 | 0.683 | 0.683 | 0.752 | 0.752 | 0.806 | 0.797 |
+| dice | 0.005 | 0.005 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.082 |
+| DroneControl | 0.016 | 0.017 | 0.006 | 0.008 | 0.005 | 0.007 | 0.042 | 0.638 |
+| EgoHands_generic | 0.009 | 0.010 | 0.005 | 0.006 | 0.510 | 0.508 | 0.608 | 0.764 |
+| EgoHands_specific | 0.001 | 0.001 | 0.004 | 0.006 | 0.003 | 0.004 | 0.002 | 0.687 |
+| HardHatWorkers | 0.029 | 0.029 | 0.023 | 0.023 | 0.033 | 0.033 | 0.046 | 0.439 |
+| MaskWearing | 0.007 | 0.007 | 0.003 | 0.002 | 0.005 | 0.005 | 0.004 | 0.406 |
+| MountainDewCommercial | 0.218 | 0.227 | 0.199 | 0.197 | 0.478 | 0.463 | 0.430 | 0.580 |
+| NorthAmericaMushrooms | 0.502 | 0.502 | 0.450 | 0.450 | 0.497 | 0.497 | 0.471 | 0.501 |
+| openPoetryVision | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.051 |
+| OxfordPets_by_breed | 0.001 | 0.002 | 0.002 | 0.004 | 0.001 | 0.002 | 0.003 | 0.799 |
+| OxfordPets_by_species | 0.016 | 0.011 | 0.012 | 0.009 | 0.013 | 0.009 | 0.011 | 0.872 |
+| PKLot | 0.002 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | 0.774 |
+| Packages | 0.569 | 0.569 | 0.279 | 0.279 | 0.712 | 0.712 | 0.695 | 0.728 |
+| PascalVOC | 0.512 | 0.512 | 0.541 | 0.540 | 0.565 | 0.565 | 0.563 | 0.711 |
+| pistols | 0.339 | 0.339 | 0.502 | 0.501 | 0.503 | 0.504 | 0.726 | 0.771 |
+| plantdoc | 0.002 | 0.002 | 0.007 | 0.007 | 0.009 | 0.009 | 0.005 | 0.376 |
+| pothole | 0.007 | 0.010 | 0.024 | 0.025 | 0.085 | 0.101 | 0.215 | 0.478 |
+| Raccoons | 0.075 | 0.074 | 0.285 | 0.288 | 0.241 | 0.244 | 0.549 | 0.541 |
+| selfdrivingCar | 0.071 | 0.072 | 0.074 | 0.074 | 0.081 | 0.080 | 0.089 | 0.318 |
+| ShellfishOpenImages | 0.253 | 0.253 | 0.337 | 0.338 | 0.300 | 0.302 | 0.393 | 0.650 |
+| ThermalCheetah | 0.028 | 0.028 | 0.000 | 0.000 | 0.028 | 0.028 | 0.087 | 0.290 |
+| thermalDogsAndPeople | 0.372 | 0.372 | 0.475 | 0.475 | 0.510 | 0.510 | 0.657 | 0.633 |
+| UnoCards | 0.000 | 0.000 | 0.000 | 0.001 | 0.002 | 0.003 | 0.006 | 0.754 |
+| VehiclesOpenImages | 0.574 | 0.566 | 0.562 | 0.547 | 0.549 | 0.534 | 0.613 | 0.647 |
+| WildfireSmoke | 0.000 | 0.000 | 0.000 | 0.000 | 0.017 | 0.017 | 0.134 | 0.410 |
+| websiteScreenshots | 0.003 | 0.004 | 0.003 | 0.005 | 0.005 | 0.006 | 0.012 | 0.175 |
+| Average | **0.134** | **0.134** | **0.138** | **0.138** | **0.179** | **0.178** | **0.227** | **0.492** |
+
+## Flickr30k Results
+
+| Model | Pre-Train Data | Val R@1 | Val R@5 | Val R@10 | Tesst R@1 | Test R@5 | Test R@10 | Config | Download |
+| :--------------: | :--------------: | ------- | ------- | -------- | --------- | -------- | --------- | :-------------------------------------------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| Grounding DINO-T | O365,GoldG,Cap4M | 87.8 | 96.6 | 98.0 | 88.1 | 96.9 | 98.2 | [config](grounding_dino_swin-t_finetune_16xb2_1x_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/grounding_dino_swin-t_finetune_16xb2_1x_coco/grounding_dino_swin-t_finetune_16xb2_1x_coco_20230921_152544-5f234b20.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/grounding_dino_swin-t_finetune_16xb2_1x_coco/grounding_dino_swin-t_finetune_16xb2_1x_coco_20230921_152544.log.json) |
+
+Note:
+
+1. `@1,5,10` refers to precision at the top 1, 5, and 10 positions in a predicted ranked list.
+2. The pretraining data used by Grounding DINO-T is `O365,GoldG,Cap4M`, and the corresponding evaluation configuration is (grounding_dino_swin-t_pretrain_zeroshot_refcoco)\[refcoco/grounding_dino_swin-t_pretrain_zeroshot_refcoco.py\].
+
+Test Command
+
+```shell
+cd mmdetection
+bash tools/dist_test.sh configs/grounding_dino/flickr30k/grounding_dino_swin-t-pretrain_zeroshot_flickr30k.py checkpoints/groundingdino_swint_ogc_mmdet-822d7e9d.pth 8
+```
+
+## Referring Expression Comprehension Results
+
+| Method | Grounding DINO-T (O365,GoldG,Cap4M) | Grounding DINO-B (COCO,O365,GoldG,Cap4M,OpenImage,ODinW-35,RefCOCO) |
+| --------------------------------------- | ----------------------------------------- | ------------------------------------------------------------------------- |
+| RefCOCO val @1,5,10 | 50.77/89.45/94.86 | 84.61/97.88/99.10 |
+| RefCOCO testA @1,5,10 | 57.45/91.29/95.62 | 88.65/98.89/99.63 |
+| RefCOCO testB @1,5,10 | 44.97/86.54/92.88 | 80.51/96.64/98.51 |
+| RefCOCO+ val @1,5,10 | 51.64/86.35/92.57 | 73.67/96.60/98.65 |
+| RefCOCO+ testA @1,5,10 | 57.25/86.74/92.65 | 82.19/97.92/99.09 |
+| RefCOCO+ testB @1,5,10 | 46.35/84.05/90.67 | 64.10/94.25/97.46 |
+| RefCOCOg val @1,5,10 | 60.42/92.10/96.18 | 78.33/97.28/98.57 |
+| RefCOCOg test @1,5,10 | 59.74/92.08/96.28 | 78.11/97.06/98.65 |
+| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 41.32/91.82 | 46.18/81.44 |
+| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 27.23/90.24 | 38.60/76.06 |
+| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 29.70/93.49 | 35.87/80.58 |
+
+Note:
+
+1. `@1,5,10` refers to precision at the top 1, 5, and 10 positions in a predicted ranked list.
+2. `Pr@(F1=1, IoU≥0.5),N-acc` from the paper [GREC: Generalized Referring Expression Comprehension](https://arxiv.org/pdf/2308.16182.pdf)
+3. The pretraining data used by Grounding DINO-T is `O365,GoldG,Cap4M`, and the corresponding evaluation configuration is (grounding_dino_swin-t_pretrain_zeroshot_refcoco)\[refcoco/grounding_dino_swin-t_pretrain_zeroshot_refcoco.py\].
+4. The pretraining data used by Grounding DINO-B is `COCO,O365,GoldG,Cap4M,OpenImage,ODinW-35,RefCOCO`, and the corresponding evaluation configuration is (grounding_dino_swin-t_pretrain_zeroshot_refcoco)\[refcoco/grounding_dino_swin-b_pretrain_zeroshot_refcoco.py\].
+
+Test Command
+
+```shell
+cd mmdetection
+./tools/dist_test.sh configs/grounding_dino/refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp.py https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/groundingdino_swint_ogc_mmdet-822d7e9d.pth 8
+./tools/dist_test.sh configs/grounding_dino/refcoco/grounding_dino_swin-b_pretrain_zeroshot_refexp.py https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/groundingdino_swinb_cogcoor_mmdet-55949c9c.pth 8
+```
+
+## Description Detection Dataset
+
+```shell
+pip install ddd-dataset
+```
+
+| Method | mode | Grounding DINO-T (O365,GoldG,Cap4M) | Grounding DINO-B (COCO,O365,GoldG,Cap4M,OpenImage,ODinW-35,RefCOCO) |
+| -------------------------------- | -------- | ----------------------------------------- | ------------------------------------------------------------------------- |
+| FULL/short/middle/long/very long | concat | 17.2/18.0/18.7/14.8/16.3 | 20.2/20.4/21.1/18.8/19.8 |
+| FULL/short/middle/long/very long | parallel | 22.3/28.2/24.8/19.1/13.9 | 25.0/26.4/27.2/23.5/19.7 |
+| PRES/short/middle/long/very long | concat | 17.8/18.3/19.2/15.2/17.3 | 20.7/21.7/21.4/19.1/20.3 |
+| PRES/short/middle/long/very long | parallel | 21.0/27.0/22.8/17.5/12.5 | 23.7/25.8/25.1/21.9/19.3 |
+| ABS/short/middle/long/very long | concat | 15.4/17.1/16.4/13.6/14.9 | 18.6/16.1/19.7/18.1/19.1 |
+| ABS/short/middle/long/very long | parallel | 26.0/32.0/33.0/23.6/15.5 | 28.8/28.1/35.8/28.2/20.2 |
+
+Note:
+
+1. Considering that the evaluation time for Inter-scenario is very long and the performance is low, it is temporarily not supported. The mentioned metrics are for Intra-scenario.
+2. `concat` is the default inference mode for Grounding DINO, where it concatenates multiple sub-sentences with "." to form a single sentence for inference. On the other hand, "parallel" performs inference on each sub-sentence in a for-loop.
+
## Custom Dataset
To facilitate fine-tuning on custom datasets, we use a simple cat dataset as an example, as shown in the following steps.
diff --git a/configs/grounding_dino/dod/grounding_dino_swin-b_pretrain_zeroshot_concat_dod.py b/configs/grounding_dino/dod/grounding_dino_swin-b_pretrain_zeroshot_concat_dod.py
new file mode 100644
index 00000000000..ac655b74aa6
--- /dev/null
+++ b/configs/grounding_dino/dod/grounding_dino_swin-b_pretrain_zeroshot_concat_dod.py
@@ -0,0 +1,14 @@
+_base_ = 'grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py'
+
+model = dict(
+ type='GroundingDINO',
+ backbone=dict(
+ pretrain_img_size=384,
+ embed_dims=128,
+ depths=[2, 2, 18, 2],
+ num_heads=[4, 8, 16, 32],
+ window_size=12,
+ drop_path_rate=0.3,
+ patch_norm=True),
+ neck=dict(in_channels=[256, 512, 1024]),
+)
diff --git a/configs/grounding_dino/dod/grounding_dino_swin-b_pretrain_zeroshot_parallel_dod.py b/configs/grounding_dino/dod/grounding_dino_swin-b_pretrain_zeroshot_parallel_dod.py
new file mode 100644
index 00000000000..9a1c8f2ac74
--- /dev/null
+++ b/configs/grounding_dino/dod/grounding_dino_swin-b_pretrain_zeroshot_parallel_dod.py
@@ -0,0 +1,3 @@
+_base_ = 'grounding_dino_swin-b_pretrain_zeroshot_concat_dod.py'
+
+model = dict(test_cfg=dict(chunked_size=1))
diff --git a/configs/grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py b/configs/grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py
new file mode 100644
index 00000000000..bb418011bf4
--- /dev/null
+++ b/configs/grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py
@@ -0,0 +1,78 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py'
+
+data_root = 'data/d3/'
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities', 'sent_ids'))
+]
+
+# -------------------------------------------------#
+val_dataset_full = dict(
+ type='DODDataset',
+ data_root=data_root,
+ ann_file='d3_json/d3_full_annotations.json',
+ data_prefix=dict(img='d3_images/', anno='d3_pkl'),
+ pipeline=test_pipeline,
+ test_mode=True,
+ backend_args=None,
+ return_classes=True)
+
+val_evaluator_full = dict(
+ type='DODCocoMetric',
+ ann_file=data_root + 'd3_json/d3_full_annotations.json')
+
+# -------------------------------------------------#
+val_dataset_pres = dict(
+ type='DODDataset',
+ data_root=data_root,
+ ann_file='d3_json/d3_pres_annotations.json',
+ data_prefix=dict(img='d3_images/', anno='d3_pkl'),
+ pipeline=test_pipeline,
+ test_mode=True,
+ backend_args=None,
+ return_classes=True)
+val_evaluator_pres = dict(
+ type='DODCocoMetric',
+ ann_file=data_root + 'd3_json/d3_pres_annotations.json')
+
+# -------------------------------------------------#
+val_dataset_abs = dict(
+ type='DODDataset',
+ data_root=data_root,
+ ann_file='d3_json/d3_abs_annotations.json',
+ data_prefix=dict(img='d3_images/', anno='d3_pkl'),
+ pipeline=test_pipeline,
+ test_mode=True,
+ backend_args=None,
+ return_classes=True)
+val_evaluator_abs = dict(
+ type='DODCocoMetric',
+ ann_file=data_root + 'd3_json/d3_abs_annotations.json')
+
+# -------------------------------------------------#
+datasets = [val_dataset_full, val_dataset_pres, val_dataset_abs]
+dataset_prefixes = ['FULL', 'PRES', 'ABS']
+metrics = [val_evaluator_full, val_evaluator_pres, val_evaluator_abs]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_parallel_dod.py b/configs/grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_parallel_dod.py
new file mode 100644
index 00000000000..3d680091162
--- /dev/null
+++ b/configs/grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_parallel_dod.py
@@ -0,0 +1,3 @@
+_base_ = 'grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py'
+
+model = dict(test_cfg=dict(chunked_size=1))
diff --git a/configs/grounding_dino/flickr30k/grounding_dino_swin-t-pretrain_zeroshot_flickr30k.py b/configs/grounding_dino/flickr30k/grounding_dino_swin-t-pretrain_zeroshot_flickr30k.py
new file mode 100644
index 00000000000..c1996567588
--- /dev/null
+++ b/configs/grounding_dino/flickr30k/grounding_dino_swin-t-pretrain_zeroshot_flickr30k.py
@@ -0,0 +1,57 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py'
+
+dataset_type = 'Flickr30kDataset'
+data_root = 'data/flickr30k_entities/'
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities',
+ 'tokens_positive', 'phrase_ids', 'phrases'))
+]
+
+dataset_Flickr30k_val = dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='final_flickr_separateGT_val.json',
+ data_prefix=dict(img='flickr30k_images/'),
+ pipeline=test_pipeline,
+)
+
+dataset_Flickr30k_test = dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='final_flickr_separateGT_test.json',
+ data_prefix=dict(img='flickr30k_images/'),
+ pipeline=test_pipeline,
+)
+
+val_evaluator_Flickr30k = dict(type='Flickr30kMetric')
+
+test_evaluator_Flickr30k = dict(type='Flickr30kMetric')
+
+# ----------Config---------- #
+dataset_prefixes = ['Flickr30kVal', 'Flickr30kTest']
+datasets = [dataset_Flickr30k_val, dataset_Flickr30k_test]
+metrics = [val_evaluator_Flickr30k, test_evaluator_Flickr30k]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py b/configs/grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py
index 1117cb06d39..7448764ef7e 100644
--- a/configs/grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py
+++ b/configs/grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py
@@ -119,7 +119,8 @@
dict(
type='PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
- 'scale_factor', 'text', 'custom_entities'))
+ 'scale_factor', 'text', 'custom_entities',
+ 'tokens_positive'))
]
val_dataloader = dict(
diff --git a/configs/grounding_dino/lvis/grounding_dino_swin-b_pretrain_zeroshot_lvis.py b/configs/grounding_dino/lvis/grounding_dino_swin-b_pretrain_zeroshot_lvis.py
new file mode 100644
index 00000000000..6084159044e
--- /dev/null
+++ b/configs/grounding_dino/lvis/grounding_dino_swin-b_pretrain_zeroshot_lvis.py
@@ -0,0 +1,14 @@
+_base_ = './grounding_dino_swin-t_pretrain_zeroshot_lvis.py'
+
+model = dict(
+ type='GroundingDINO',
+ backbone=dict(
+ pretrain_img_size=384,
+ embed_dims=128,
+ depths=[2, 2, 18, 2],
+ num_heads=[4, 8, 16, 32],
+ window_size=12,
+ drop_path_rate=0.3,
+ patch_norm=True),
+ neck=dict(in_channels=[256, 512, 1024]),
+)
diff --git a/configs/grounding_dino/lvis/grounding_dino_swin-b_pretrain_zeroshot_mini-lvis.py b/configs/grounding_dino/lvis/grounding_dino_swin-b_pretrain_zeroshot_mini-lvis.py
new file mode 100644
index 00000000000..68467a7237c
--- /dev/null
+++ b/configs/grounding_dino/lvis/grounding_dino_swin-b_pretrain_zeroshot_mini-lvis.py
@@ -0,0 +1,14 @@
+_base_ = './grounding_dino_swin-t_pretrain_zeroshot_mini-lvis.py'
+
+model = dict(
+ type='GroundingDINO',
+ backbone=dict(
+ pretrain_img_size=384,
+ embed_dims=128,
+ depths=[2, 2, 18, 2],
+ num_heads=[4, 8, 16, 32],
+ window_size=12,
+ drop_path_rate=0.3,
+ patch_norm=True),
+ neck=dict(in_channels=[256, 512, 1024]),
+)
diff --git a/configs/grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_lvis.py b/configs/grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_lvis.py
new file mode 100644
index 00000000000..3d05f0ce1c0
--- /dev/null
+++ b/configs/grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_lvis.py
@@ -0,0 +1,24 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py'
+
+model = dict(test_cfg=dict(
+ max_per_img=300,
+ chunked_size=40,
+))
+
+dataset_type = 'LVISV1Dataset'
+data_root = 'data/coco/'
+
+val_dataloader = dict(
+ dataset=dict(
+ data_root=data_root,
+ type=dataset_type,
+ ann_file='annotations/lvis_od_val.json',
+ data_prefix=dict(img='')))
+test_dataloader = val_dataloader
+
+# numpy < 1.24.0
+val_evaluator = dict(
+ _delete_=True,
+ type='LVISFixedAPMetric',
+ ann_file=data_root + 'annotations/lvis_od_val.json')
+test_evaluator = val_evaluator
diff --git a/configs/grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_mini-lvis.py b/configs/grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_mini-lvis.py
new file mode 100644
index 00000000000..0aac6cf33a9
--- /dev/null
+++ b/configs/grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_mini-lvis.py
@@ -0,0 +1,25 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py'
+
+model = dict(test_cfg=dict(
+ max_per_img=300,
+ chunked_size=40,
+))
+
+dataset_type = 'LVISV1Dataset'
+data_root = 'data/coco/'
+
+val_dataloader = dict(
+ dataset=dict(
+ data_root=data_root,
+ type=dataset_type,
+ ann_file='annotations/lvis_v1_minival_inserted_image_name.json',
+ data_prefix=dict(img='')))
+test_dataloader = val_dataloader
+
+# numpy < 1.24.0
+val_evaluator = dict(
+ _delete_=True,
+ type='LVISFixedAPMetric',
+ ann_file=data_root +
+ 'annotations/lvis_v1_minival_inserted_image_name.json')
+test_evaluator = val_evaluator
diff --git a/configs/grounding_dino/odinw/grounding_dino_swin-b_pretrain_odinw13.py b/configs/grounding_dino/odinw/grounding_dino_swin-b_pretrain_odinw13.py
new file mode 100644
index 00000000000..65a6bc2a078
--- /dev/null
+++ b/configs/grounding_dino/odinw/grounding_dino_swin-b_pretrain_odinw13.py
@@ -0,0 +1,338 @@
+_base_ = '../grounding_dino_swin-b_pretrain_mixeddata.py'
+
+dataset_type = 'CocoDataset'
+data_root = 'data/odinw/'
+
+base_test_pipeline = _base_.test_pipeline
+base_test_pipeline[-1]['meta_keys'] = ('img_id', 'img_path', 'ori_shape',
+ 'img_shape', 'scale_factor', 'text',
+ 'custom_entities', 'caption_prompt')
+
+# ---------------------1 AerialMaritimeDrone---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/large/'
+dataset_AerialMaritimeDrone = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ test_mode=True,
+ pipeline=base_test_pipeline,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------2 Aquarium---------------------#
+class_name = ('fish', 'jellyfish', 'penguin', 'puffin', 'shark', 'starfish',
+ 'stingray')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Aquarium/Aquarium Combined.v2-raw-1024.coco/'
+
+caption_prompt = None
+# caption_prompt = {
+# 'penguin': {
+# 'suffix': ', which is black and white'
+# },
+# 'puffin': {
+# 'suffix': ' with orange beaks'
+# },
+# 'stingray': {
+# 'suffix': ' which is flat and round'
+# },
+# }
+dataset_Aquarium = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Aquarium = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------3 CottontailRabbits---------------------#
+class_name = ('Cottontail-Rabbit', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'CottontailRabbits/'
+
+caption_prompt = None
+# caption_prompt = {'Cottontail-Rabbit': {'name': 'rabbit'}}
+
+dataset_CottontailRabbits = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_CottontailRabbits = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------4 EgoHands---------------------#
+class_name = ('hand', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/generic/'
+
+caption_prompt = None
+# caption_prompt = {'hand': {'suffix': ' of a person'}}
+
+dataset_EgoHands = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------5 NorthAmericaMushrooms---------------------#
+class_name = ('CoW', 'chanterelle')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+
+caption_prompt = None
+# caption_prompt = {
+# 'CoW': {
+# 'name': 'flat mushroom'
+# },
+# 'chanterelle': {
+# 'name': 'yellow mushroom'
+# }
+# }
+
+dataset_NorthAmericaMushrooms = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_NorthAmericaMushrooms = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------6 Packages---------------------#
+class_name = ('package', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Packages/Raw/'
+
+caption_prompt = None
+# caption_prompt = {
+# 'package': {
+# 'prefix': 'there is a ',
+# 'suffix': ' on the porch'
+# }
+# }
+
+dataset_Packages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Packages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------7 PascalVOC---------------------#
+class_name = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
+ 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
+ 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train',
+ 'tvmonitor')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PascalVOC/'
+dataset_PascalVOC = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PascalVOC = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------8 pistols---------------------#
+class_name = ('pistol', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pistols/export/'
+dataset_pistols = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pistols = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------9 pothole---------------------#
+class_name = ('pothole', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pothole/'
+
+caption_prompt = None
+# caption_prompt = {
+# 'pothole': {
+# 'prefix': 'there are some ',
+# 'name': 'holes',
+# 'suffix': ' on the road'
+# }
+# }
+
+dataset_pothole = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pothole = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------10 Raccoon---------------------#
+class_name = ('raccoon', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Raccoon/Raccoon.v2-raw.coco/'
+dataset_Raccoon = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Raccoon = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------11 ShellfishOpenImages---------------------#
+class_name = ('Crab', 'Lobster', 'Shrimp')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ShellfishOpenImages/raw/'
+dataset_ShellfishOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ShellfishOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------12 thermalDogsAndPeople---------------------#
+class_name = ('dog', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'thermalDogsAndPeople/'
+dataset_thermalDogsAndPeople = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_thermalDogsAndPeople = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------13 VehiclesOpenImages---------------------#
+class_name = ('Ambulance', 'Bus', 'Car', 'Motorcycle', 'Truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'VehiclesOpenImages/416x416/'
+dataset_VehiclesOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_VehiclesOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# --------------------- Config---------------------#
+dataset_prefixes = [
+ 'AerialMaritimeDrone', 'Aquarium', 'CottontailRabbits', 'EgoHands',
+ 'NorthAmericaMushrooms', 'Packages', 'PascalVOC', 'pistols', 'pothole',
+ 'Raccoon', 'ShellfishOpenImages', 'thermalDogsAndPeople',
+ 'VehiclesOpenImages'
+]
+datasets = [
+ dataset_AerialMaritimeDrone, dataset_Aquarium, dataset_CottontailRabbits,
+ dataset_EgoHands, dataset_NorthAmericaMushrooms, dataset_Packages,
+ dataset_PascalVOC, dataset_pistols, dataset_pothole, dataset_Raccoon,
+ dataset_ShellfishOpenImages, dataset_thermalDogsAndPeople,
+ dataset_VehiclesOpenImages
+]
+metrics = [
+ val_evaluator_AerialMaritimeDrone, val_evaluator_Aquarium,
+ val_evaluator_CottontailRabbits, val_evaluator_EgoHands,
+ val_evaluator_NorthAmericaMushrooms, val_evaluator_Packages,
+ val_evaluator_PascalVOC, val_evaluator_pistols, val_evaluator_pothole,
+ val_evaluator_Raccoon, val_evaluator_ShellfishOpenImages,
+ val_evaluator_thermalDogsAndPeople, val_evaluator_VehiclesOpenImages
+]
+
+# -------------------------------------------------#
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/grounding_dino/odinw/grounding_dino_swin-b_pretrain_odinw35.py b/configs/grounding_dino/odinw/grounding_dino_swin-b_pretrain_odinw35.py
new file mode 100644
index 00000000000..e73cd8e61ba
--- /dev/null
+++ b/configs/grounding_dino/odinw/grounding_dino_swin-b_pretrain_odinw35.py
@@ -0,0 +1,796 @@
+_base_ = '../grounding_dino_swin-b_pretrain_mixeddata.py'
+
+dataset_type = 'CocoDataset'
+data_root = 'data/odinw/'
+
+base_test_pipeline = _base_.test_pipeline
+base_test_pipeline[-1]['meta_keys'] = ('img_id', 'img_path', 'ori_shape',
+ 'img_shape', 'scale_factor', 'text',
+ 'custom_entities', 'caption_prompt')
+
+# ---------------------1 AerialMaritimeDrone_large---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/large/'
+dataset_AerialMaritimeDrone_large = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone_large = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------2 AerialMaritimeDrone_tiled---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/tiled/'
+dataset_AerialMaritimeDrone_tiled = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone_tiled = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------3 AmericanSignLanguageLetters---------------------#
+class_name = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
+ 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AmericanSignLanguageLetters/American Sign Language Letters.v1-v1.coco/' # noqa
+dataset_AmericanSignLanguageLetters = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AmericanSignLanguageLetters = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------4 Aquarium---------------------#
+class_name = ('fish', 'jellyfish', 'penguin', 'puffin', 'shark', 'starfish',
+ 'stingray')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Aquarium/Aquarium Combined.v2-raw-1024.coco/'
+dataset_Aquarium = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Aquarium = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------5 BCCD---------------------#
+class_name = ('Platelets', 'RBC', 'WBC')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'BCCD/BCCD.v3-raw.coco/'
+dataset_BCCD = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_BCCD = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------6 boggleBoards---------------------#
+class_name = ('Q', 'a', 'an', 'b', 'c', 'd', 'e', 'er', 'f', 'g', 'h', 'he',
+ 'i', 'in', 'j', 'k', 'l', 'm', 'n', 'o', 'o ', 'p', 'q', 'qu',
+ 'r', 's', 't', 't\\', 'th', 'u', 'v', 'w', 'wild', 'x', 'y', 'z')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'boggleBoards/416x416AutoOrient/export/'
+dataset_boggleBoards = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_boggleBoards = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------7 brackishUnderwater---------------------#
+class_name = ('crab', 'fish', 'jellyfish', 'shrimp', 'small_fish', 'starfish')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'brackishUnderwater/960x540/'
+dataset_brackishUnderwater = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_brackishUnderwater = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------8 ChessPieces---------------------#
+class_name = (' ', 'black bishop', 'black king', 'black knight', 'black pawn',
+ 'black queen', 'black rook', 'white bishop', 'white king',
+ 'white knight', 'white pawn', 'white queen', 'white rook')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ChessPieces/Chess Pieces.v23-raw.coco/'
+dataset_ChessPieces = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ChessPieces = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------9 CottontailRabbits---------------------#
+class_name = ('rabbit', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'CottontailRabbits/'
+dataset_CottontailRabbits = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_CottontailRabbits = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------10 dice---------------------#
+class_name = ('1', '2', '3', '4', '5', '6')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'dice/mediumColor/export/'
+dataset_dice = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_dice = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------11 DroneControl---------------------#
+class_name = ('follow', 'follow_hand', 'land', 'land_hand', 'null', 'object',
+ 'takeoff', 'takeoff-hand')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'DroneControl/Drone Control.v3-raw.coco/'
+dataset_DroneControl = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_DroneControl = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------12 EgoHands_generic---------------------#
+class_name = ('hand', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/generic/'
+caption_prompt = {'hand': {'suffix': ' of a person'}}
+dataset_EgoHands_generic = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ # NOTE w. prompt 0.548; wo. prompt 0.764
+ # caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands_generic = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------13 EgoHands_specific---------------------#
+class_name = ('myleft', 'myright', 'yourleft', 'yourright')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/specific/'
+dataset_EgoHands_specific = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands_specific = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------14 HardHatWorkers---------------------#
+class_name = ('head', 'helmet', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'HardHatWorkers/raw/'
+dataset_HardHatWorkers = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_HardHatWorkers = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------15 MaskWearing---------------------#
+class_name = ('mask', 'no-mask')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'MaskWearing/raw/'
+dataset_MaskWearing = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_MaskWearing = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------16 MountainDewCommercial---------------------#
+class_name = ('bottle', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'MountainDewCommercial/'
+dataset_MountainDewCommercial = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_MountainDewCommercial = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------17 NorthAmericaMushrooms---------------------#
+class_name = ('flat mushroom', 'yellow mushroom')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+dataset_NorthAmericaMushrooms = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_NorthAmericaMushrooms = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------18 openPoetryVision---------------------#
+class_name = ('American Typewriter', 'Andale Mono', 'Apple Chancery', 'Arial',
+ 'Avenir', 'Baskerville', 'Big Caslon', 'Bradley Hand',
+ 'Brush Script MT', 'Chalkboard', 'Comic Sans MS', 'Copperplate',
+ 'Courier', 'Didot', 'Futura', 'Geneva', 'Georgia', 'Gill Sans',
+ 'Helvetica', 'Herculanum', 'Impact', 'Kefa', 'Lucida Grande',
+ 'Luminari', 'Marker Felt', 'Menlo', 'Monaco', 'Noteworthy',
+ 'Optima', 'PT Sans', 'PT Serif', 'Palatino', 'Papyrus',
+ 'Phosphate', 'Rockwell', 'SF Pro', 'SignPainter', 'Skia',
+ 'Snell Roundhand', 'Tahoma', 'Times New Roman', 'Trebuchet MS',
+ 'Verdana')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'openPoetryVision/512x512/'
+dataset_openPoetryVision = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_openPoetryVision = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------19 OxfordPets_by_breed---------------------#
+class_name = ('cat-Abyssinian', 'cat-Bengal', 'cat-Birman', 'cat-Bombay',
+ 'cat-British_Shorthair', 'cat-Egyptian_Mau', 'cat-Maine_Coon',
+ 'cat-Persian', 'cat-Ragdoll', 'cat-Russian_Blue', 'cat-Siamese',
+ 'cat-Sphynx', 'dog-american_bulldog',
+ 'dog-american_pit_bull_terrier', 'dog-basset_hound',
+ 'dog-beagle', 'dog-boxer', 'dog-chihuahua',
+ 'dog-english_cocker_spaniel', 'dog-english_setter',
+ 'dog-german_shorthaired', 'dog-great_pyrenees', 'dog-havanese',
+ 'dog-japanese_chin', 'dog-keeshond', 'dog-leonberger',
+ 'dog-miniature_pinscher', 'dog-newfoundland', 'dog-pomeranian',
+ 'dog-pug', 'dog-saint_bernard', 'dog-samoyed',
+ 'dog-scottish_terrier', 'dog-shiba_inu',
+ 'dog-staffordshire_bull_terrier', 'dog-wheaten_terrier',
+ 'dog-yorkshire_terrier')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'OxfordPets/by-breed/' # noqa
+dataset_OxfordPets_by_breed = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_OxfordPets_by_breed = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------20 OxfordPets_by_species---------------------#
+class_name = ('cat', 'dog')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'OxfordPets/by-species/' # noqa
+dataset_OxfordPets_by_species = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_OxfordPets_by_species = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------21 PKLot---------------------#
+class_name = ('space-empty', 'space-occupied')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PKLot/640/' # noqa
+dataset_PKLot = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PKLot = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------22 Packages---------------------#
+class_name = ('package', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Packages/Raw/'
+caption_prompt = {
+ 'package': {
+ 'prefix': 'there is a ',
+ 'suffix': ' on the porch'
+ }
+}
+dataset_Packages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt, # NOTE w. prompt 0.728; wo. prompt 0.670
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Packages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------23 PascalVOC---------------------#
+class_name = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
+ 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
+ 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train',
+ 'tvmonitor')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PascalVOC/'
+dataset_PascalVOC = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PascalVOC = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------24 pistols---------------------#
+class_name = ('pistol', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pistols/export/'
+dataset_pistols = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pistols = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------25 plantdoc---------------------#
+class_name = ('Apple Scab Leaf', 'Apple leaf', 'Apple rust leaf',
+ 'Bell_pepper leaf', 'Bell_pepper leaf spot', 'Blueberry leaf',
+ 'Cherry leaf', 'Corn Gray leaf spot', 'Corn leaf blight',
+ 'Corn rust leaf', 'Peach leaf', 'Potato leaf',
+ 'Potato leaf early blight', 'Potato leaf late blight',
+ 'Raspberry leaf', 'Soyabean leaf', 'Soybean leaf',
+ 'Squash Powdery mildew leaf', 'Strawberry leaf',
+ 'Tomato Early blight leaf', 'Tomato Septoria leaf spot',
+ 'Tomato leaf', 'Tomato leaf bacterial spot',
+ 'Tomato leaf late blight', 'Tomato leaf mosaic virus',
+ 'Tomato leaf yellow virus', 'Tomato mold leaf',
+ 'Tomato two spotted spider mites leaf', 'grape leaf',
+ 'grape leaf black rot')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'plantdoc/416x416/'
+dataset_plantdoc = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_plantdoc = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------26 pothole---------------------#
+class_name = ('pothole', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pothole/'
+caption_prompt = {
+ 'pothole': {
+ 'name': 'holes',
+ 'prefix': 'there are some ',
+ 'suffix': ' on the road'
+ }
+}
+dataset_pothole = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ # NOTE w. prompt 0.221; wo. prompt 0.478
+ # caption_prompt=caption_prompt,
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pothole = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------27 Raccoon---------------------#
+class_name = ('raccoon', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Raccoon/Raccoon.v2-raw.coco/'
+dataset_Raccoon = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Raccoon = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------28 selfdrivingCar---------------------#
+class_name = ('biker', 'car', 'pedestrian', 'trafficLight',
+ 'trafficLight-Green', 'trafficLight-GreenLeft',
+ 'trafficLight-Red', 'trafficLight-RedLeft',
+ 'trafficLight-Yellow', 'trafficLight-YellowLeft', 'truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'selfdrivingCar/fixedLarge/export/'
+dataset_selfdrivingCar = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_selfdrivingCar = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------29 ShellfishOpenImages---------------------#
+class_name = ('Crab', 'Lobster', 'Shrimp')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ShellfishOpenImages/raw/'
+dataset_ShellfishOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ShellfishOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------30 ThermalCheetah---------------------#
+class_name = ('cheetah', 'human')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ThermalCheetah/'
+dataset_ThermalCheetah = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ThermalCheetah = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------31 thermalDogsAndPeople---------------------#
+class_name = ('dog', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'thermalDogsAndPeople/'
+dataset_thermalDogsAndPeople = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_thermalDogsAndPeople = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------32 UnoCards---------------------#
+class_name = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11',
+ '12', '13', '14')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'UnoCards/raw/'
+dataset_UnoCards = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_UnoCards = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------33 VehiclesOpenImages---------------------#
+class_name = ('Ambulance', 'Bus', 'Car', 'Motorcycle', 'Truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'VehiclesOpenImages/416x416/'
+dataset_VehiclesOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_VehiclesOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------34 WildfireSmoke---------------------#
+class_name = ('smoke', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'WildfireSmoke/'
+dataset_WildfireSmoke = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_WildfireSmoke = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------35 websiteScreenshots---------------------#
+class_name = ('button', 'field', 'heading', 'iframe', 'image', 'label', 'link',
+ 'text')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'websiteScreenshots/'
+dataset_websiteScreenshots = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_websiteScreenshots = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# --------------------- Config---------------------#
+
+dataset_prefixes = [
+ 'AerialMaritimeDrone_large',
+ 'AerialMaritimeDrone_tiled',
+ 'AmericanSignLanguageLetters',
+ 'Aquarium',
+ 'BCCD',
+ 'boggleBoards',
+ 'brackishUnderwater',
+ 'ChessPieces',
+ 'CottontailRabbits',
+ 'dice',
+ 'DroneControl',
+ 'EgoHands_generic',
+ 'EgoHands_specific',
+ 'HardHatWorkers',
+ 'MaskWearing',
+ 'MountainDewCommercial',
+ 'NorthAmericaMushrooms',
+ 'openPoetryVision',
+ 'OxfordPets_by_breed',
+ 'OxfordPets_by_species',
+ 'PKLot',
+ 'Packages',
+ 'PascalVOC',
+ 'pistols',
+ 'plantdoc',
+ 'pothole',
+ 'Raccoons',
+ 'selfdrivingCar',
+ 'ShellfishOpenImages',
+ 'ThermalCheetah',
+ 'thermalDogsAndPeople',
+ 'UnoCards',
+ 'VehiclesOpenImages',
+ 'WildfireSmoke',
+ 'websiteScreenshots',
+]
+
+datasets = [
+ dataset_AerialMaritimeDrone_large, dataset_AerialMaritimeDrone_tiled,
+ dataset_AmericanSignLanguageLetters, dataset_Aquarium, dataset_BCCD,
+ dataset_boggleBoards, dataset_brackishUnderwater, dataset_ChessPieces,
+ dataset_CottontailRabbits, dataset_dice, dataset_DroneControl,
+ dataset_EgoHands_generic, dataset_EgoHands_specific,
+ dataset_HardHatWorkers, dataset_MaskWearing, dataset_MountainDewCommercial,
+ dataset_NorthAmericaMushrooms, dataset_openPoetryVision,
+ dataset_OxfordPets_by_breed, dataset_OxfordPets_by_species, dataset_PKLot,
+ dataset_Packages, dataset_PascalVOC, dataset_pistols, dataset_plantdoc,
+ dataset_pothole, dataset_Raccoon, dataset_selfdrivingCar,
+ dataset_ShellfishOpenImages, dataset_ThermalCheetah,
+ dataset_thermalDogsAndPeople, dataset_UnoCards, dataset_VehiclesOpenImages,
+ dataset_WildfireSmoke, dataset_websiteScreenshots
+]
+
+metrics = [
+ val_evaluator_AerialMaritimeDrone_large,
+ val_evaluator_AerialMaritimeDrone_tiled,
+ val_evaluator_AmericanSignLanguageLetters, val_evaluator_Aquarium,
+ val_evaluator_BCCD, val_evaluator_boggleBoards,
+ val_evaluator_brackishUnderwater, val_evaluator_ChessPieces,
+ val_evaluator_CottontailRabbits, val_evaluator_dice,
+ val_evaluator_DroneControl, val_evaluator_EgoHands_generic,
+ val_evaluator_EgoHands_specific, val_evaluator_HardHatWorkers,
+ val_evaluator_MaskWearing, val_evaluator_MountainDewCommercial,
+ val_evaluator_NorthAmericaMushrooms, val_evaluator_openPoetryVision,
+ val_evaluator_OxfordPets_by_breed, val_evaluator_OxfordPets_by_species,
+ val_evaluator_PKLot, val_evaluator_Packages, val_evaluator_PascalVOC,
+ val_evaluator_pistols, val_evaluator_plantdoc, val_evaluator_pothole,
+ val_evaluator_Raccoon, val_evaluator_selfdrivingCar,
+ val_evaluator_ShellfishOpenImages, val_evaluator_ThermalCheetah,
+ val_evaluator_thermalDogsAndPeople, val_evaluator_UnoCards,
+ val_evaluator_VehiclesOpenImages, val_evaluator_WildfireSmoke,
+ val_evaluator_websiteScreenshots
+]
+
+# -------------------------------------------------#
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py b/configs/grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py
new file mode 100644
index 00000000000..216b8059726
--- /dev/null
+++ b/configs/grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py
@@ -0,0 +1,338 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py' # noqa
+
+dataset_type = 'CocoDataset'
+data_root = 'data/odinw/'
+
+base_test_pipeline = _base_.test_pipeline
+base_test_pipeline[-1]['meta_keys'] = ('img_id', 'img_path', 'ori_shape',
+ 'img_shape', 'scale_factor', 'text',
+ 'custom_entities', 'caption_prompt')
+
+# ---------------------1 AerialMaritimeDrone---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/large/'
+dataset_AerialMaritimeDrone = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ test_mode=True,
+ pipeline=base_test_pipeline,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------2 Aquarium---------------------#
+class_name = ('fish', 'jellyfish', 'penguin', 'puffin', 'shark', 'starfish',
+ 'stingray')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Aquarium/Aquarium Combined.v2-raw-1024.coco/'
+
+caption_prompt = None
+# caption_prompt = {
+# 'penguin': {
+# 'suffix': ', which is black and white'
+# },
+# 'puffin': {
+# 'suffix': ' with orange beaks'
+# },
+# 'stingray': {
+# 'suffix': ' which is flat and round'
+# },
+# }
+dataset_Aquarium = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Aquarium = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------3 CottontailRabbits---------------------#
+class_name = ('Cottontail-Rabbit', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'CottontailRabbits/'
+
+caption_prompt = None
+# caption_prompt = {'Cottontail-Rabbit': {'name': 'rabbit'}}
+
+dataset_CottontailRabbits = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_CottontailRabbits = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------4 EgoHands---------------------#
+class_name = ('hand', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/generic/'
+
+caption_prompt = None
+# caption_prompt = {'hand': {'suffix': ' of a person'}}
+
+dataset_EgoHands = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------5 NorthAmericaMushrooms---------------------#
+class_name = ('CoW', 'chanterelle')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+
+caption_prompt = None
+# caption_prompt = {
+# 'CoW': {
+# 'name': 'flat mushroom'
+# },
+# 'chanterelle': {
+# 'name': 'yellow mushroom'
+# }
+# }
+
+dataset_NorthAmericaMushrooms = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_NorthAmericaMushrooms = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------6 Packages---------------------#
+class_name = ('package', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Packages/Raw/'
+
+caption_prompt = None
+# caption_prompt = {
+# 'package': {
+# 'prefix': 'there is a ',
+# 'suffix': ' on the porch'
+# }
+# }
+
+dataset_Packages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Packages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------7 PascalVOC---------------------#
+class_name = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
+ 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
+ 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train',
+ 'tvmonitor')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PascalVOC/'
+dataset_PascalVOC = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PascalVOC = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------8 pistols---------------------#
+class_name = ('pistol', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pistols/export/'
+dataset_pistols = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pistols = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------9 pothole---------------------#
+class_name = ('pothole', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pothole/'
+
+caption_prompt = None
+# caption_prompt = {
+# 'pothole': {
+# 'prefix': 'there are some ',
+# 'name': 'holes',
+# 'suffix': ' on the road'
+# }
+# }
+
+dataset_pothole = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pothole = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------10 Raccoon---------------------#
+class_name = ('raccoon', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Raccoon/Raccoon.v2-raw.coco/'
+dataset_Raccoon = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Raccoon = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------11 ShellfishOpenImages---------------------#
+class_name = ('Crab', 'Lobster', 'Shrimp')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ShellfishOpenImages/raw/'
+dataset_ShellfishOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ShellfishOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------12 thermalDogsAndPeople---------------------#
+class_name = ('dog', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'thermalDogsAndPeople/'
+dataset_thermalDogsAndPeople = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_thermalDogsAndPeople = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------13 VehiclesOpenImages---------------------#
+class_name = ('Ambulance', 'Bus', 'Car', 'Motorcycle', 'Truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'VehiclesOpenImages/416x416/'
+dataset_VehiclesOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_VehiclesOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# --------------------- Config---------------------#
+dataset_prefixes = [
+ 'AerialMaritimeDrone', 'Aquarium', 'CottontailRabbits', 'EgoHands',
+ 'NorthAmericaMushrooms', 'Packages', 'PascalVOC', 'pistols', 'pothole',
+ 'Raccoon', 'ShellfishOpenImages', 'thermalDogsAndPeople',
+ 'VehiclesOpenImages'
+]
+datasets = [
+ dataset_AerialMaritimeDrone, dataset_Aquarium, dataset_CottontailRabbits,
+ dataset_EgoHands, dataset_NorthAmericaMushrooms, dataset_Packages,
+ dataset_PascalVOC, dataset_pistols, dataset_pothole, dataset_Raccoon,
+ dataset_ShellfishOpenImages, dataset_thermalDogsAndPeople,
+ dataset_VehiclesOpenImages
+]
+metrics = [
+ val_evaluator_AerialMaritimeDrone, val_evaluator_Aquarium,
+ val_evaluator_CottontailRabbits, val_evaluator_EgoHands,
+ val_evaluator_NorthAmericaMushrooms, val_evaluator_Packages,
+ val_evaluator_PascalVOC, val_evaluator_pistols, val_evaluator_pothole,
+ val_evaluator_Raccoon, val_evaluator_ShellfishOpenImages,
+ val_evaluator_thermalDogsAndPeople, val_evaluator_VehiclesOpenImages
+]
+
+# -------------------------------------------------#
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw35.py b/configs/grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw35.py
new file mode 100644
index 00000000000..3df0394a204
--- /dev/null
+++ b/configs/grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw35.py
@@ -0,0 +1,796 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py' # noqa
+
+dataset_type = 'CocoDataset'
+data_root = 'data/odinw/'
+
+base_test_pipeline = _base_.test_pipeline
+base_test_pipeline[-1]['meta_keys'] = ('img_id', 'img_path', 'ori_shape',
+ 'img_shape', 'scale_factor', 'text',
+ 'custom_entities', 'caption_prompt')
+
+# ---------------------1 AerialMaritimeDrone_large---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/large/'
+dataset_AerialMaritimeDrone_large = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone_large = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------2 AerialMaritimeDrone_tiled---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/tiled/'
+dataset_AerialMaritimeDrone_tiled = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone_tiled = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------3 AmericanSignLanguageLetters---------------------#
+class_name = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
+ 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AmericanSignLanguageLetters/American Sign Language Letters.v1-v1.coco/' # noqa
+dataset_AmericanSignLanguageLetters = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AmericanSignLanguageLetters = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------4 Aquarium---------------------#
+class_name = ('fish', 'jellyfish', 'penguin', 'puffin', 'shark', 'starfish',
+ 'stingray')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Aquarium/Aquarium Combined.v2-raw-1024.coco/'
+dataset_Aquarium = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Aquarium = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------5 BCCD---------------------#
+class_name = ('Platelets', 'RBC', 'WBC')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'BCCD/BCCD.v3-raw.coco/'
+dataset_BCCD = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_BCCD = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------6 boggleBoards---------------------#
+class_name = ('Q', 'a', 'an', 'b', 'c', 'd', 'e', 'er', 'f', 'g', 'h', 'he',
+ 'i', 'in', 'j', 'k', 'l', 'm', 'n', 'o', 'o ', 'p', 'q', 'qu',
+ 'r', 's', 't', 't\\', 'th', 'u', 'v', 'w', 'wild', 'x', 'y', 'z')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'boggleBoards/416x416AutoOrient/export/'
+dataset_boggleBoards = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_boggleBoards = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------7 brackishUnderwater---------------------#
+class_name = ('crab', 'fish', 'jellyfish', 'shrimp', 'small_fish', 'starfish')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'brackishUnderwater/960x540/'
+dataset_brackishUnderwater = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_brackishUnderwater = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------8 ChessPieces---------------------#
+class_name = (' ', 'black bishop', 'black king', 'black knight', 'black pawn',
+ 'black queen', 'black rook', 'white bishop', 'white king',
+ 'white knight', 'white pawn', 'white queen', 'white rook')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ChessPieces/Chess Pieces.v23-raw.coco/'
+dataset_ChessPieces = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ChessPieces = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------9 CottontailRabbits---------------------#
+class_name = ('rabbit', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'CottontailRabbits/'
+dataset_CottontailRabbits = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_CottontailRabbits = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------10 dice---------------------#
+class_name = ('1', '2', '3', '4', '5', '6')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'dice/mediumColor/export/'
+dataset_dice = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_dice = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------11 DroneControl---------------------#
+class_name = ('follow', 'follow_hand', 'land', 'land_hand', 'null', 'object',
+ 'takeoff', 'takeoff-hand')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'DroneControl/Drone Control.v3-raw.coco/'
+dataset_DroneControl = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_DroneControl = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------12 EgoHands_generic---------------------#
+class_name = ('hand', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/generic/'
+caption_prompt = {'hand': {'suffix': ' of a person'}}
+dataset_EgoHands_generic = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ # NOTE w. prompt 0.526, wo. prompt 0.608
+ # caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands_generic = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------13 EgoHands_specific---------------------#
+class_name = ('myleft', 'myright', 'yourleft', 'yourright')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/specific/'
+dataset_EgoHands_specific = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands_specific = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------14 HardHatWorkers---------------------#
+class_name = ('head', 'helmet', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'HardHatWorkers/raw/'
+dataset_HardHatWorkers = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_HardHatWorkers = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------15 MaskWearing---------------------#
+class_name = ('mask', 'no-mask')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'MaskWearing/raw/'
+dataset_MaskWearing = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_MaskWearing = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------16 MountainDewCommercial---------------------#
+class_name = ('bottle', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'MountainDewCommercial/'
+dataset_MountainDewCommercial = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_MountainDewCommercial = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------17 NorthAmericaMushrooms---------------------#
+class_name = ('flat mushroom', 'yellow mushroom')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+dataset_NorthAmericaMushrooms = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_NorthAmericaMushrooms = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------18 openPoetryVision---------------------#
+class_name = ('American Typewriter', 'Andale Mono', 'Apple Chancery', 'Arial',
+ 'Avenir', 'Baskerville', 'Big Caslon', 'Bradley Hand',
+ 'Brush Script MT', 'Chalkboard', 'Comic Sans MS', 'Copperplate',
+ 'Courier', 'Didot', 'Futura', 'Geneva', 'Georgia', 'Gill Sans',
+ 'Helvetica', 'Herculanum', 'Impact', 'Kefa', 'Lucida Grande',
+ 'Luminari', 'Marker Felt', 'Menlo', 'Monaco', 'Noteworthy',
+ 'Optima', 'PT Sans', 'PT Serif', 'Palatino', 'Papyrus',
+ 'Phosphate', 'Rockwell', 'SF Pro', 'SignPainter', 'Skia',
+ 'Snell Roundhand', 'Tahoma', 'Times New Roman', 'Trebuchet MS',
+ 'Verdana')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'openPoetryVision/512x512/'
+dataset_openPoetryVision = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_openPoetryVision = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------19 OxfordPets_by_breed---------------------#
+class_name = ('cat-Abyssinian', 'cat-Bengal', 'cat-Birman', 'cat-Bombay',
+ 'cat-British_Shorthair', 'cat-Egyptian_Mau', 'cat-Maine_Coon',
+ 'cat-Persian', 'cat-Ragdoll', 'cat-Russian_Blue', 'cat-Siamese',
+ 'cat-Sphynx', 'dog-american_bulldog',
+ 'dog-american_pit_bull_terrier', 'dog-basset_hound',
+ 'dog-beagle', 'dog-boxer', 'dog-chihuahua',
+ 'dog-english_cocker_spaniel', 'dog-english_setter',
+ 'dog-german_shorthaired', 'dog-great_pyrenees', 'dog-havanese',
+ 'dog-japanese_chin', 'dog-keeshond', 'dog-leonberger',
+ 'dog-miniature_pinscher', 'dog-newfoundland', 'dog-pomeranian',
+ 'dog-pug', 'dog-saint_bernard', 'dog-samoyed',
+ 'dog-scottish_terrier', 'dog-shiba_inu',
+ 'dog-staffordshire_bull_terrier', 'dog-wheaten_terrier',
+ 'dog-yorkshire_terrier')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'OxfordPets/by-breed/' # noqa
+dataset_OxfordPets_by_breed = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_OxfordPets_by_breed = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------20 OxfordPets_by_species---------------------#
+class_name = ('cat', 'dog')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'OxfordPets/by-species/' # noqa
+dataset_OxfordPets_by_species = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_OxfordPets_by_species = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------21 PKLot---------------------#
+class_name = ('space-empty', 'space-occupied')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PKLot/640/' # noqa
+dataset_PKLot = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PKLot = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------22 Packages---------------------#
+class_name = ('package', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Packages/Raw/'
+caption_prompt = {
+ 'package': {
+ 'prefix': 'there is a ',
+ 'suffix': ' on the porch'
+ }
+}
+dataset_Packages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt, # NOTE w. prompt 0.695; wo. prompt 0.687
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Packages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------23 PascalVOC---------------------#
+class_name = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
+ 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
+ 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train',
+ 'tvmonitor')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PascalVOC/'
+dataset_PascalVOC = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PascalVOC = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------24 pistols---------------------#
+class_name = ('pistol', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pistols/export/'
+dataset_pistols = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pistols = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------25 plantdoc---------------------#
+class_name = ('Apple Scab Leaf', 'Apple leaf', 'Apple rust leaf',
+ 'Bell_pepper leaf', 'Bell_pepper leaf spot', 'Blueberry leaf',
+ 'Cherry leaf', 'Corn Gray leaf spot', 'Corn leaf blight',
+ 'Corn rust leaf', 'Peach leaf', 'Potato leaf',
+ 'Potato leaf early blight', 'Potato leaf late blight',
+ 'Raspberry leaf', 'Soyabean leaf', 'Soybean leaf',
+ 'Squash Powdery mildew leaf', 'Strawberry leaf',
+ 'Tomato Early blight leaf', 'Tomato Septoria leaf spot',
+ 'Tomato leaf', 'Tomato leaf bacterial spot',
+ 'Tomato leaf late blight', 'Tomato leaf mosaic virus',
+ 'Tomato leaf yellow virus', 'Tomato mold leaf',
+ 'Tomato two spotted spider mites leaf', 'grape leaf',
+ 'grape leaf black rot')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'plantdoc/416x416/'
+dataset_plantdoc = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_plantdoc = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------26 pothole---------------------#
+class_name = ('pothole', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pothole/'
+caption_prompt = {
+ 'pothole': {
+ 'name': 'holes',
+ 'prefix': 'there are some ',
+ 'suffix': ' on the road'
+ }
+}
+dataset_pothole = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ # NOTE w. prompt 0.137; wo. prompt 0.215
+ # caption_prompt=caption_prompt,
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pothole = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------27 Raccoon---------------------#
+class_name = ('raccoon', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Raccoon/Raccoon.v2-raw.coco/'
+dataset_Raccoon = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Raccoon = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------28 selfdrivingCar---------------------#
+class_name = ('biker', 'car', 'pedestrian', 'trafficLight',
+ 'trafficLight-Green', 'trafficLight-GreenLeft',
+ 'trafficLight-Red', 'trafficLight-RedLeft',
+ 'trafficLight-Yellow', 'trafficLight-YellowLeft', 'truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'selfdrivingCar/fixedLarge/export/'
+dataset_selfdrivingCar = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_selfdrivingCar = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------29 ShellfishOpenImages---------------------#
+class_name = ('Crab', 'Lobster', 'Shrimp')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ShellfishOpenImages/raw/'
+dataset_ShellfishOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ShellfishOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------30 ThermalCheetah---------------------#
+class_name = ('cheetah', 'human')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ThermalCheetah/'
+dataset_ThermalCheetah = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ThermalCheetah = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------31 thermalDogsAndPeople---------------------#
+class_name = ('dog', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'thermalDogsAndPeople/'
+dataset_thermalDogsAndPeople = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_thermalDogsAndPeople = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------32 UnoCards---------------------#
+class_name = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11',
+ '12', '13', '14')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'UnoCards/raw/'
+dataset_UnoCards = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_UnoCards = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------33 VehiclesOpenImages---------------------#
+class_name = ('Ambulance', 'Bus', 'Car', 'Motorcycle', 'Truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'VehiclesOpenImages/416x416/'
+dataset_VehiclesOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_VehiclesOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------34 WildfireSmoke---------------------#
+class_name = ('smoke', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'WildfireSmoke/'
+dataset_WildfireSmoke = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_WildfireSmoke = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------35 websiteScreenshots---------------------#
+class_name = ('button', 'field', 'heading', 'iframe', 'image', 'label', 'link',
+ 'text')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'websiteScreenshots/'
+dataset_websiteScreenshots = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_websiteScreenshots = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# --------------------- Config---------------------#
+
+dataset_prefixes = [
+ 'AerialMaritimeDrone_large',
+ 'AerialMaritimeDrone_tiled',
+ 'AmericanSignLanguageLetters',
+ 'Aquarium',
+ 'BCCD',
+ 'boggleBoards',
+ 'brackishUnderwater',
+ 'ChessPieces',
+ 'CottontailRabbits',
+ 'dice',
+ 'DroneControl',
+ 'EgoHands_generic',
+ 'EgoHands_specific',
+ 'HardHatWorkers',
+ 'MaskWearing',
+ 'MountainDewCommercial',
+ 'NorthAmericaMushrooms',
+ 'openPoetryVision',
+ 'OxfordPets_by_breed',
+ 'OxfordPets_by_species',
+ 'PKLot',
+ 'Packages',
+ 'PascalVOC',
+ 'pistols',
+ 'plantdoc',
+ 'pothole',
+ 'Raccoons',
+ 'selfdrivingCar',
+ 'ShellfishOpenImages',
+ 'ThermalCheetah',
+ 'thermalDogsAndPeople',
+ 'UnoCards',
+ 'VehiclesOpenImages',
+ 'WildfireSmoke',
+ 'websiteScreenshots',
+]
+
+datasets = [
+ dataset_AerialMaritimeDrone_large, dataset_AerialMaritimeDrone_tiled,
+ dataset_AmericanSignLanguageLetters, dataset_Aquarium, dataset_BCCD,
+ dataset_boggleBoards, dataset_brackishUnderwater, dataset_ChessPieces,
+ dataset_CottontailRabbits, dataset_dice, dataset_DroneControl,
+ dataset_EgoHands_generic, dataset_EgoHands_specific,
+ dataset_HardHatWorkers, dataset_MaskWearing, dataset_MountainDewCommercial,
+ dataset_NorthAmericaMushrooms, dataset_openPoetryVision,
+ dataset_OxfordPets_by_breed, dataset_OxfordPets_by_species, dataset_PKLot,
+ dataset_Packages, dataset_PascalVOC, dataset_pistols, dataset_plantdoc,
+ dataset_pothole, dataset_Raccoon, dataset_selfdrivingCar,
+ dataset_ShellfishOpenImages, dataset_ThermalCheetah,
+ dataset_thermalDogsAndPeople, dataset_UnoCards, dataset_VehiclesOpenImages,
+ dataset_WildfireSmoke, dataset_websiteScreenshots
+]
+
+metrics = [
+ val_evaluator_AerialMaritimeDrone_large,
+ val_evaluator_AerialMaritimeDrone_tiled,
+ val_evaluator_AmericanSignLanguageLetters, val_evaluator_Aquarium,
+ val_evaluator_BCCD, val_evaluator_boggleBoards,
+ val_evaluator_brackishUnderwater, val_evaluator_ChessPieces,
+ val_evaluator_CottontailRabbits, val_evaluator_dice,
+ val_evaluator_DroneControl, val_evaluator_EgoHands_generic,
+ val_evaluator_EgoHands_specific, val_evaluator_HardHatWorkers,
+ val_evaluator_MaskWearing, val_evaluator_MountainDewCommercial,
+ val_evaluator_NorthAmericaMushrooms, val_evaluator_openPoetryVision,
+ val_evaluator_OxfordPets_by_breed, val_evaluator_OxfordPets_by_species,
+ val_evaluator_PKLot, val_evaluator_Packages, val_evaluator_PascalVOC,
+ val_evaluator_pistols, val_evaluator_plantdoc, val_evaluator_pothole,
+ val_evaluator_Raccoon, val_evaluator_selfdrivingCar,
+ val_evaluator_ShellfishOpenImages, val_evaluator_ThermalCheetah,
+ val_evaluator_thermalDogsAndPeople, val_evaluator_UnoCards,
+ val_evaluator_VehiclesOpenImages, val_evaluator_WildfireSmoke,
+ val_evaluator_websiteScreenshots
+]
+
+# -------------------------------------------------#
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/grounding_dino/odinw/override_category.py b/configs/grounding_dino/odinw/override_category.py
new file mode 100644
index 00000000000..9ff05fc6e5e
--- /dev/null
+++ b/configs/grounding_dino/odinw/override_category.py
@@ -0,0 +1,109 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import argparse
+
+import mmengine
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='Override Category')
+ parser.add_argument('data_root')
+ return parser.parse_args()
+
+
+def main():
+ args = parse_args()
+
+ ChessPieces = [{
+ 'id': 1,
+ 'name': ' ',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 2,
+ 'name': 'black bishop',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 3,
+ 'name': 'black king',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 4,
+ 'name': 'black knight',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 5,
+ 'name': 'black pawn',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 6,
+ 'name': 'black queen',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 7,
+ 'name': 'black rook',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 8,
+ 'name': 'white bishop',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 9,
+ 'name': 'white king',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 10,
+ 'name': 'white knight',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 11,
+ 'name': 'white pawn',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 12,
+ 'name': 'white queen',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 13,
+ 'name': 'white rook',
+ 'supercategory': 'pieces'
+ }]
+
+ _data_root = args.data_root + 'ChessPieces/Chess Pieces.v23-raw.coco/'
+ json_data = mmengine.load(_data_root +
+ 'valid/annotations_without_background.json')
+ json_data['categories'] = ChessPieces
+ mmengine.dump(json_data,
+ _data_root + 'valid/new_annotations_without_background.json')
+
+ CottontailRabbits = [{
+ 'id': 1,
+ 'name': 'rabbit',
+ 'supercategory': 'Cottontail-Rabbit'
+ }]
+
+ _data_root = args.data_root + 'CottontailRabbits/'
+ json_data = mmengine.load(_data_root +
+ 'valid/annotations_without_background.json')
+ json_data['categories'] = CottontailRabbits
+ mmengine.dump(json_data,
+ _data_root + 'valid/new_annotations_without_background.json')
+
+ NorthAmericaMushrooms = [{
+ 'id': 1,
+ 'name': 'flat mushroom',
+ 'supercategory': 'mushroom'
+ }, {
+ 'id': 2,
+ 'name': 'yellow mushroom',
+ 'supercategory': 'mushroom'
+ }]
+
+ _data_root = args.data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+ json_data = mmengine.load(_data_root +
+ 'valid/annotations_without_background.json')
+ json_data['categories'] = NorthAmericaMushrooms
+ mmengine.dump(json_data,
+ _data_root + 'valid/new_annotations_without_background.json')
+
+
+if __name__ == '__main__':
+ main()
diff --git a/configs/grounding_dino/refcoco/grounding_dino_swin-b_pretrain_zeroshot_refexp.py b/configs/grounding_dino/refcoco/grounding_dino_swin-b_pretrain_zeroshot_refexp.py
new file mode 100644
index 00000000000..dea0bad08c0
--- /dev/null
+++ b/configs/grounding_dino/refcoco/grounding_dino_swin-b_pretrain_zeroshot_refexp.py
@@ -0,0 +1,14 @@
+_base_ = './grounding_dino_swin-t_pretrain_zeroshot_refexp.py'
+
+model = dict(
+ type='GroundingDINO',
+ backbone=dict(
+ pretrain_img_size=384,
+ embed_dims=128,
+ depths=[2, 2, 18, 2],
+ num_heads=[4, 8, 16, 32],
+ window_size=12,
+ drop_path_rate=0.3,
+ patch_norm=True),
+ neck=dict(in_channels=[256, 512, 1024]),
+)
diff --git a/configs/grounding_dino/refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp.py b/configs/grounding_dino/refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp.py
new file mode 100644
index 00000000000..4b5c46574a3
--- /dev/null
+++ b/configs/grounding_dino/refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp.py
@@ -0,0 +1,228 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py'
+
+# 30 is an empirical value, just set it to the maximum value
+# without affecting the evaluation result
+model = dict(test_cfg=dict(max_per_img=30))
+
+data_root = 'data/coco/'
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities',
+ 'tokens_positive'))
+]
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/final_refexp_val.json'
+val_dataset_all_val = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+val_evaluator_all_val = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco_testA.json'
+val_dataset_refcoco_testA = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_testA = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco_testB.json'
+val_dataset_refcoco_testB = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_testB = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco+_testA.json'
+val_dataset_refcoco_plus_testA = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_plus_testA = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco+_testB.json'
+val_dataset_refcoco_plus_testB = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_plus_testB = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcocog_test.json'
+val_dataset_refcocog_test = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcocog_test = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_grefcoco_val.json'
+val_dataset_grefcoco_val = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_grefcoco_val = dict(
+ type='gRefCOCOMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ thresh_score=0.7,
+ thresh_f1=1.0)
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_grefcoco_testA.json'
+val_dataset_grefcoco_testA = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_grefcoco_testA = dict(
+ type='gRefCOCOMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ thresh_score=0.7,
+ thresh_f1=1.0)
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_grefcoco_testB.json'
+val_dataset_grefcoco_testB = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_grefcoco_testB = dict(
+ type='gRefCOCOMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ thresh_score=0.7,
+ thresh_f1=1.0)
+
+# -------------------------------------------------#
+datasets = [
+ val_dataset_all_val, val_dataset_refcoco_testA, val_dataset_refcoco_testB,
+ val_dataset_refcoco_plus_testA, val_dataset_refcoco_plus_testB,
+ val_dataset_refcocog_test, val_dataset_grefcoco_val,
+ val_dataset_grefcoco_testA, val_dataset_grefcoco_testB
+]
+dataset_prefixes = [
+ 'val', 'refcoco_testA', 'refcoco_testB', 'refcoco+_testA',
+ 'refcoco+_testB', 'refcocog_test', 'grefcoco_val', 'grefcoco_testA',
+ 'grefcoco_testB'
+]
+metrics = [
+ val_evaluator_all_val, val_evaluator_refcoco_testA,
+ val_evaluator_refcoco_testB, val_evaluator_refcoco_plus_testA,
+ val_evaluator_refcoco_plus_testB, val_evaluator_refcocog_test,
+ val_evaluator_grefcoco_val, val_evaluator_grefcoco_testA,
+ val_evaluator_grefcoco_testB
+]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/mm_grounding_dino/README.md b/configs/mm_grounding_dino/README.md
new file mode 100644
index 00000000000..bcc913446dc
--- /dev/null
+++ b/configs/mm_grounding_dino/README.md
@@ -0,0 +1,363 @@
+# MM Grounding DINO
+
+> [An Open and Comprehensive Pipeline for Unified Object Grounding and Detection](https://arxiv.org/abs/2401.02361)
+
+
+
+## Abstract
+
+Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community.
+
+
+
+
+
+
+
+
+
+## Dataset Preparation
+
+Please refer to [dataset_prepare.md](dataset_prepare.md) or [中文版数据准备](dataset_prepare_zh-CN.md)
+
+## Usage
+
+Please refer to [usage.md](usage.md) or [中文版用法说明](usage_zh-CN.md)
+
+## Zero-Shot COCO Results and Models
+
+| Model | Backbone | Style | COCO mAP | Pre-Train Data | Config | Download |
+| :--------: | :------: | :-------: | :--------: | :-------------------: | :------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| GDINO-T | Swin-T | Zero-shot | 46.7 | O365 | | |
+| GDINO-T | Swin-T | Zero-shot | 48.1 | O365,GoldG | | |
+| GDINO-T | Swin-T | Zero-shot | 48.4 | O365,GoldG,Cap4M | [config](../grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/groundingdino_swint_ogc_mmdet-822d7e9d.pth) |
+| MM-GDINO-T | Swin-T | Zero-shot | 48.5(+1.8) | O365 | [config](grounding_dino_swin-t_pretrain_obj365.py) | |
+| MM-GDINO-T | Swin-T | Zero-shot | 50.4(+2.3) | O365,GoldG | [config](grounding_dino_swin-t_pretrain_obj365_goldg.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg/grounding_dino_swin-t_pretrain_obj365_goldg_20231122_132602-4ea751ce.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg/grounding_dino_swin-t_pretrain_obj365_goldg_20231122_132602.log.json) |
+| MM-GDINO-T | Swin-T | Zero-shot | 50.5(+2.1) | O365,GoldG,GRIT | [config](grounding_dino_swin-t_pretrain_obj365_goldg_grit9m.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_20231128_200818-169cc352.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_20231128_200818.log.json) |
+| MM-GDINO-T | Swin-T | Zero-shot | 50.6(+2.2) | O365,GoldG,V3Det | [config](grounding_dino_swin-t_pretrain_obj365_goldg_v3det.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_v3det_20231218_095741-e316e297.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_v3det_20231218_095741.log.json) |
+| MM-GDINO-T | Swin-T | Zero-shot | 50.4(+2.0) | O365,GoldG,GRIT,V3Det | [config](grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047.log.json) |
+
+## Zero-Shot LVIS Results
+
+| Model | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | Pre-Train Data |
+| :--------: | :---------: | :---------: | :---------: | :---------: | :--------: | :--------: | :--------: | :---------: | :-------------------: |
+| GDINO-T | 18.8 | 24.2 | 34.7 | 28.8 | 10.1 | 15.3 | 29.9 | 20.1 | O365,GoldG,Cap4M |
+| MM-GDINO-T | 28.1 | 30.2 | 42.0 | 35.7(+6.9) | 17.1 | 22.4 | 36.5 | 27.0(+6.9) | O365,GoldG |
+| MM-GDINO-T | 26.6 | 32.4 | 41.8 | 36.5(+7.7) | 17.3 | 22.6 | 36.4 | 27.1(+7.0) | O365,GoldG,GRIT |
+| MM-GDINO-T | 33.0 | 36.0 | 45.9 | 40.5(+11.7) | 21.5 | 25.5 | 40.2 | 30.6(+10.5) | O365,GoldG,V3Det |
+| MM-GDINO-T | 34.2 | 37.4 | 46.2 | 41.4(+12.6) | 23.6 | 27.6 | 40.5 | 31.9(+11.8) | O365,GoldG,GRIT,V3Det |
+
+- The MM-GDINO-T config file is [mini-lvis](lvis/grounding_dino_swin-t_pretrain_zeroshot_mini-lvis.py) and [lvis 1.0](lvis/grounding_dino_swin-t_pretrain_zeroshot_lvis.py)
+
+## Zero-Shot ODinW (Object Detection in the Wild) Results
+
+### Results and models of ODinW13
+
+| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
+| --------------------- | -------------------------------- | ----------------------------- | ---------------------------------- | ----------------------------------- | ---------------------------------------- |
+| AerialMaritimeDrone | 0.173 | 0.133 | 0.155 | 0.177 | 0.151 |
+| Aquarium | 0.195 | 0.252 | 0.261 | 0.266 | 0.283 |
+| CottontailRabbits | 0.799 | 0.771 | 0.810 | 0.778 | 0.786 |
+| EgoHands | 0.608 | 0.499 | 0.537 | 0.506 | 0.519 |
+| NorthAmericaMushrooms | 0.507 | 0.331 | 0.462 | 0.669 | 0.767 |
+| Packages | 0.687 | 0.707 | 0.687 | 0.710 | 0.706 |
+| PascalVOC | 0.563 | 0.565 | 0.580 | 0.556 | 0.566 |
+| pistols | 0.726 | 0.585 | 0.709 | 0.671 | 0.729 |
+| pothole | 0.215 | 0.136 | 0.285 | 0.199 | 0.243 |
+| Raccoon | 0.549 | 0.469 | 0.511 | 0.553 | 0.535 |
+| ShellfishOpenImages | 0.393 | 0.321 | 0.437 | 0.519 | 0.488 |
+| thermalDogsAndPeople | 0.657 | 0.556 | 0.603 | 0.493 | 0.542 |
+| VehiclesOpenImages | 0.613 | 0.566 | 0.603 | 0.614 | 0.615 |
+| Average | **0.514** | **0.453** | **0.511** | **0.516** | **0.533** |
+
+- The MM-GDINO-T config file is [odinw13](odinw/grounding_dino_swin-t_pretrain_odinw13.py)
+
+### Results and models of ODinW35
+
+| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
+| --------------------------- | -------------------------------- | ----------------------------- | ---------------------------------- | ----------------------------------- | ---------------------------------------- |
+| AerialMaritimeDrone_large | 0.173 | 0.133 | 0.155 | 0.177 | 0.151 |
+| AerialMaritimeDrone_tiled | 0.206 | 0.170 | 0.225 | 0.184 | 0.206 |
+| AmericanSignLanguageLetters | 0.002 | 0.016 | 0.020 | 0.011 | 0.007 |
+| Aquarium | 0.195 | 0.252 | 0.261 | 0.266 | 0.283 |
+| BCCD | 0.161 | 0.069 | 0.118 | 0.083 | 0.077 |
+| boggleBoards | 0.000 | 0.002 | 0.001 | 0.001 | 0.002 |
+| brackishUnderwater | 0.021 | 0.033 | 0.021 | 0.025 | 0.025 |
+| ChessPieces | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
+| CottontailRabbits | 0.806 | 0.771 | 0.810 | 0.778 | 0.786 |
+| dice | 0.004 | 0.002 | 0.005 | 0.001 | 0.001 |
+| DroneControl | 0.042 | 0.047 | 0.097 | 0.088 | 0.074 |
+| EgoHands_generic | 0.608 | 0.527 | 0.537 | 0.506 | 0.519 |
+| EgoHands_specific | 0.002 | 0.001 | 0.005 | 0.007 | 0.003 |
+| HardHatWorkers | 0.046 | 0.048 | 0.070 | 0.070 | 0.108 |
+| MaskWearing | 0.004 | 0.009 | 0.004 | 0.011 | 0.009 |
+| MountainDewCommercial | 0.430 | 0.453 | 0.465 | 0.194 | 0.430 |
+| NorthAmericaMushrooms | 0.471 | 0.331 | 0.462 | 0.669 | 0.767 |
+| openPoetryVision | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 |
+| OxfordPets_by_breed | 0.003 | 0.002 | 0.004 | 0.006 | 0.004 |
+| OxfordPets_by_species | 0.011 | 0.019 | 0.016 | 0.020 | 0.015 |
+| PKLot | 0.001 | 0.004 | 0.002 | 0.008 | 0.007 |
+| Packages | 0.695 | 0.707 | 0.687 | 0.710 | 0.706 |
+| PascalVOC | 0.563 | 0.565 | 0.580 | 0.566 | 0.566 |
+| pistols | 0.726 | 0.585 | 0.709 | 0.671 | 0.729 |
+| plantdoc | 0.005 | 0.005 | 0.007 | 0.008 | 0.011 |
+| pothole | 0.215 | 0.136 | 0.219 | 0.077 | 0.168 |
+| Raccoons | 0.549 | 0.469 | 0.511 | 0.553 | 0.535 |
+| selfdrivingCar | 0.089 | 0.091 | 0.076 | 0.094 | 0.083 |
+| ShellfishOpenImages | 0.393 | 0.321 | 0.437 | 0.519 | 0.488 |
+| ThermalCheetah | 0.087 | 0.063 | 0.081 | 0.030 | 0.045 |
+| thermalDogsAndPeople | 0.657 | 0.556 | 0.603 | 0.493 | 0.543 |
+| UnoCards | 0.006 | 0.012 | 0.010 | 0.009 | 0.005 |
+| VehiclesOpenImages | 0.613 | 0.566 | 0.603 | 0.614 | 0.615 |
+| WildfireSmoke | 0.134 | 0.106 | 0.154 | 0.042 | 0.127 |
+| websiteScreenshots | 0.012 | 0.02 | 0.016 | 0.016 | 0.016 |
+| Average | **0.227** | **0.202** | **0.228** | **0.214** | **0.284** |
+
+- The MM-GDINO-T config file is [odinw35](odinw/grounding_dino_swin-t_pretrain_odinw35.py)
+
+## Zero-Shot Referring Expression Comprehension Results
+
+| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
+| ---------------------- | -------------------------------- | ----------------------------- | ---------------------------------- | ----------------------------------- | ---------------------------------------- |
+| RefCOCO val @1,5,10 | 50.8/89.5/94.9 | 53.1/89.9/94.7 | 53.4/90.3/95.5 | 52.1/89.8/95.0 | 53.1/89.7/95.1 |
+| RefCOCO testA @1,5,10 | 57.4/91.3/95.6 | 59.7/91.5/95.9 | 58.8/91.70/96.2 | 58.4/86.8/95.6 | 59.1/91.0/95.5 |
+| RefCOCO testB @1,5,10 | 45.0/86.5/92.9 | 46.4/86.9/92.2 | 46.8/87.7/93.3 | 45.4/86.2/92.6 | 46.8/87.8/93.6 |
+| RefCOCO+ val @1,5,10 | 51.6/86.4/92.6 | 53.1/87.0/92.8 | 53.5/88.0/93.7 | 52.5/86.8/93.2 | 52.7/87.7/93.5 |
+| RefCOCO+ testA @1,5,10 | 57.3/86.7/92.7 | 58.9/87.3/92.9 | 59.0/88.1/93.7 | 58.1/86.7/93.5 | 58.7/87.2/93.1 |
+| RefCOCO+ testB @1,5,10 | 46.4/84.1/90.7 | 47.9/84.3/91.0 | 47.9/85.5/92.7 | 46.9/83.7/91.5 | 48.4/85.8/92.1 |
+| RefCOCOg val @1,5,10 | 60.4/92.1/96.2 | 61.2/92.6/96.1 | 62.7/93.3/97.0 | 61.7/92.9/96.6 | 62.9/93.3/97.2 |
+| RefCOCOg test @1,5,10 | 59.7/92.1/96.3 | 61.1/93.3/96.7 | 62.6/94.9/97.1 | 61.0/93.1/96.8 | 62.9/93.9/97.4 |
+
+| Method | thresh_score | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
+| --------------------------------------- | ------------ | -------------------------------- | ----------------------------- | ---------------------------------- | ----------------------------------- | ---------------------------------------- |
+| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 39.3/70.4 | | | | 39.4/67.5 |
+| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 40.5/83.8 | | | | 40.6/83.1 |
+| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 41.3/91.8 | 39.8/84.7 | 40.7/89.7 | 40.3/88.8 | 41.0/91.3 |
+| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 41.5/96.8 | | | | 41.1/96.4 |
+| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 31.9/70.4 | | | | 33.1/69.5 |
+| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 29.3/82.9 | | | | 29.2/84.3 |
+| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 27.2/90.2 | 26.3/89.0 | 26.0/91.9 | 25.4/91.8 | 26.1/93.0 |
+| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 25.1/96.3 | | | | 23.8/97.2 |
+| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 30.9/72.5 | | | | 33.0/69.6 |
+| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 30.0/86.1 | | | | 31.6/96.7 |
+| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 29.7/93.5 | 31.3/84.8 | 30.6/90.2 | 30.7/89.9 | 30.4/92.3 |
+| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 29.1/97.4 | | | | 29.5/84.2 |
+
+- The MM-GDINO-T config file is [here](refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp.py)
+
+## Zero-Shot Description Detection Dataset(DOD)
+
+```shell
+pip install ddd-dataset
+```
+
+| Method | mode | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
+| -------------------------------- | -------- | -------------------------------- | ----------------------------- | ---------------------------------- | ----------------------------------- | ---------------------------------------- |
+| FULL/short/middle/long/very long | concat | 17.2/18.0/18.7/14.8/16.3 | 15.6/17.3/16.7/14.3/13.1 | 17.0/17.7/18.0/15.7/15.7 | 16.2/17.4/16.8/14.9/15.4 | 17.5/23.4/18.3/14.7/13.8 |
+| FULL/short/middle/long/very long | parallel | 22.3/28.2/24.8/19.1/13.9 | 21.7/24.7/24.0/20.2/13.7 | 22.5/25.6/25.1/20.5/14.9 | 22.3/25.6/24.5/20.6/14.7 | 22.9/28.1/25.4/20.4/14.4 |
+| PRES/short/middle/long/very long | concat | 17.8/18.3/19.2/15.2/17.3 | 16.4/18.4/17.3/14.5/14.2 | 17.9/19.0/18.3/16.5/17.5 | 16.6/18.8/17.1/15.1/15.0 | 18.0/23.7/18.6/15.4/13.3 |
+| PRES/short/middle/long/very long | parallel | 21.0/27.0/22.8/17.5/12.5 | 21.3/25.5/22.8/19.2/12.9 | 21.5/25.2/23.0/19.0/15.0 | 21.6/25.7/23.0/19.5/14.8 | 21.9/27.4/23.2/19.1/14.2 |
+| ABS/short/middle/long/very long | concat | 15.4/17.1/16.4/13.6/14.9 | 13.4/13.4/14.5/13.5/11.9 | 14.5/13.1/16.7/13.6/13.3 | 14.8/12.5/15.6/14.3/15.8 | 15.9/22.2/17.1/12.5/14.4 |
+| ABS/short/middle/long/very long | parallel | 26.0/32.0/33.0/23.6/15.5 | 22.8/22.2/28.7/22.9/14.7 | 25.6/26.8/33.9/24.5/14.7 | 24.1/24.9/30.7/23.8/14.7 | 26.0/30.3/34.1/23.9/14.6 |
+
+Note:
+
+1. Considering that the evaluation time for Inter-scenario is very long and the performance is low, it is temporarily not supported. The mentioned metrics are for Intra-scenario.
+2. `concat` is the default inference mode for Grounding DINO, where it concatenates multiple sub-sentences with "." to form a single sentence for inference. On the other hand, "parallel" performs inference on each sub-sentence in a for-loop.
+3. The MM-GDINO-T config file is [concat_dod](dod/grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py) and [parallel_dod](dod/grounding_dino_swin-t_pretrain_zeroshot_parallel_dod.py)
+
+## Pretrain Flickr30k Results
+
+| Model | Pre-Train Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 |
+| :--------: | :-------------------: | ------- | ------- | -------- | -------- | -------- | --------- |
+| GLIP-T | O365,GoldG | 84.9 | 94.9 | 96.3 | 85.6 | 95.4 | 96.7 |
+| GLIP-T | O365,GoldG,CC3M,SBU | 85.3 | 95.5 | 96.9 | 86.0 | 95.9 | 97.2 |
+| GDINO-T | O365,GoldG,Cap4M | 87.8 | 96.6 | 98.0 | 88.1 | 96.9 | 98.2 |
+| MM-GDINO-T | O365,GoldG | 85.5 | 95.6 | 97.2 | 86.2 | 95.7 | 97.4 |
+| MM-GDINO-T | O365,GoldG,GRIT | 86.7 | 95.8 | 97.6 | 87.0 | 96.2 | 97.7 |
+| MM-GDINO-T | O365,GoldG,V3Det | 85.9 | 95.7 | 97.4 | 86.3 | 95.7 | 97.4 |
+| MM-GDINO-T | O365,GoldG,GRIT,V3Det | 86.7 | 96.0 | 97.6 | 87.2 | 96.2 | 97.7 |
+
+Note:
+
+1. `@1,5,10` refers to precision at the top 1, 5, and 10 positions in a predicted ranked list.
+2. The MM-GDINO-T config file is [here](flickr30k/grounding_dino_swin-t-pretrain_flickr30k.py)
+
+## Validating the generalization of a pre-trained model through fine-tuning
+
+### RTTS
+
+| Architecture | Backbone | Lr schd | box AP |
+| :-----------------: | :------: | ------- | -------- |
+| Faster R-CNN | R-50 | 1x | 48.1 |
+| Cascade R-CNN | R-50 | 1x | 50.8 |
+| ATSS | R-50 | 1x | 48.2 |
+| TOOD | R-50 | 1X | 50.8 |
+| MM-GDINO(zero-shot) | Swin-T | | 49.8 |
+| MM-GDINO | Swin-T | 1x | **69.1** |
+
+- The reference metrics come from https://github.com/BIGWangYuDong/lqit/tree/main/configs/detection/rtts_dataset
+- The MM-GDINO-T config file is [here](rtts/grounding_dino_swin-t_finetune_8xb4_1x_rtts.py)
+
+### RUOD
+
+| Architecture | Backbone | Lr schd | box AP |
+| :-----------------: | :------: | ------- | -------- |
+| Faster R-CNN | R-50 | 1x | 52.4 |
+| Cascade R-CNN | R-50 | 1x | 55.3 |
+| ATSS | R-50 | 1x | 55.7 |
+| TOOD | R-50 | 1X | 57.4 |
+| MM-GDINO(zero-shot) | Swin-T | | 29.8 |
+| MM-GDINO | Swin-T | 1x | **65.5** |
+
+- The reference metrics come from https://github.com/BIGWangYuDong/lqit/tree/main/configs/detection/ruod_dataset
+- The MM-GDINO-T config file is [here](ruod/grounding_dino_swin-t_finetune_8xb4_1x_ruod.py)
+
+### Brain Tumor
+
+| Architecture | Backbone | Lr schd | box AP |
+| :-----------: | :------: | ------- | ------ |
+| Faster R-CNN | R-50 | 50e | 43.5 |
+| Cascade R-CNN | R-50 | 50e | 46.2 |
+| DINO | R-50 | 50e | 46.4 |
+| Cascade-DINO | R-50 | 50e | 48.6 |
+| MM-GDINO | Swin-T | 50e | 47.5 |
+
+- The reference metrics come from https://arxiv.org/abs/2307.11035
+- The MM-GDINO-T config file is [here](brain_tumor/grounding_dino_swin-t_finetune_8xb4_50e_brain_tumor.py)
+
+### Cityscapes
+
+| Architecture | Backbone | Lr schd | box AP |
+| :-----------------: | :------: | ------- | -------- |
+| Faster R-CNN | R-50 | 50e | 30.1 |
+| Cascade R-CNN | R-50 | 50e | 31.8 |
+| DINO | R-50 | 50e | 34.5 |
+| Cascade-DINO | R-50 | 50e | 34.8 |
+| MM-GDINO(zero-shot) | Swin-T | | 34.2 |
+| MM-GDINO | Swin-T | 50e | **51.5** |
+
+- The reference metrics come from https://arxiv.org/abs/2307.11035
+- The MM-GDINO-T config file is [here](cityscapes/grounding_dino_swin-t_finetune_8xb4_50e_cityscapes.py)
+
+### People in Painting
+
+| Architecture | Backbone | Lr schd | box AP |
+| :-----------------: | :------: | ------- | -------- |
+| Faster R-CNN | R-50 | 50e | 17.0 |
+| Cascade R-CNN | R-50 | 50e | 18.0 |
+| DINO | R-50 | 50e | 12.0 |
+| Cascade-DINO | R-50 | 50e | 13.4 |
+| MM-GDINO(zero-shot) | Swin-T | | 23.1 |
+| MM-GDINO | Swin-T | 50e | **38.9** |
+
+- The reference metrics come from https://arxiv.org/abs/2307.11035
+- The MM-GDINO-T config file is [here](people_in_painting/grounding_dino_swin-t_finetune_8xb4_50e_people_in_painting.py)
+
+### COCO
+
+**(1) Closed-set performance**
+
+| Architecture | Backbone | Lr schd | box AP |
+| :-----------------: | :------: | ------- | ------ |
+| Faster R-CNN | R-50 | 1x | 37.4 |
+| Cascade R-CNN | R-50 | 1x | 40.3 |
+| ATSS | R-50 | 1x | 39.4 |
+| TOOD | R-50 | 1X | 42.4 |
+| DINO | R-50 | 1X | 50.1 |
+| GLIP(zero-shot) | Swin-T | | 46.6 |
+| GDINO(zero-shot) | Swin-T | | 48.5 |
+| MM-GDINO(zero-shot) | Swin-T | | 50.4 |
+| GLIP | Swin-T | 1x | 55.4 |
+| GDINO | Swin-T | 1x | 58.1 |
+| MM-GDINO | Swin-T | 1x | 58.2 |
+
+- The MM-GDINO-T config file is [here](coco/grounding_dino_swin-t_finetune_16xb4_1x_coco.py)
+
+**(2) Open-set continuing pretraining performance**
+
+| Architecture | Backbone | Lr schd | box AP |
+| :-----------------: | :------: | :-----: | :----: |
+| GLIP(zero-shot) | Swin-T | | 46.7 |
+| GDINO(zero-shot) | Swin-T | | 48.5 |
+| MM-GDINO(zero-shot) | Swin-T | | 50.4 |
+| MM-GDINO | Swin-T | 1x | 54.7 |
+
+- The MM-GDINO-T config file is [here](coco/grounding_dino_swin-t_finetune_16xb4_1x_sft_coco.py)
+- Due to the small size of the COCO dataset, continuing pretraining solely on COCO can easily lead to overfitting. The results shown above are from the third epoch. I do not recommend you train using this approach.
+
+**(3) Open vocabulary performance**
+
+| Architecture | Backbone | Lr schd | box AP | Base box AP | Novel box AP | box AP@50 | Base box AP@50 | Novel box AP@50 |
+| :-----------------: | :------: | :-----: | :----: | :---------: | :----------: | :-------: | :------------: | :-------------: |
+| MM-GDINO(zero-shot) | Swin-T | | 51.1 | 48.4 | 58.9 | 66.7 | 64.0 | 74.2 |
+| MM-GDINO | Swin-T | 1x | 57.2 | 56.1 | 60.4 | 73.6 | 73.0 | 75.3 |
+
+- The MM-GDINO-T config file is [here](coco/grounding_dino_swin-t_finetune_16xb4_1x_coco_48_17.py)
+
+### LVIS 1.0
+
+**(1) Open-set continuing pretraining performance**
+
+| Architecture | Backbone | Lr schd | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP |
+| :-----------------: | :------: | :-----: | :---------: | :---------: | :---------: | :--------: | :--------: | :--------: | :--------: | :-------: |
+| GLIP(zero-shot) | Swin-T | | 18.1 | 21.2 | 33.1 | 26.7 | 10.8 | 14.7 | 29.0 | 19.6 |
+| GDINO(zero-shot) | Swin-T | | 18.8 | 24.2 | 34.7 | 28.8 | 10.1 | 15.3 | 29.9 | 20.1 |
+| MM-GDINO(zero-shot) | Swin-T | | 34.2 | 37.4 | 46.2 | 41.4 | 23.6 | 27.6 | 40.5 | 31.9 |
+| MM-GDINO | Swin-T | 1x | 50.7 | 58.8 | 60.1 | 58.7 | 45.2 | 50.2 | 56.1 | 51.7 |
+
+- The MM-GDINO-T config file is [here](lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis.py)
+
+**(2) Open vocabulary performance**
+
+| Architecture | Backbone | Lr schd | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP |
+| :-----------------: | :------: | :-----: | :---------: | :---------: | :---------: | :--------: |
+| MM-GDINO(zero-shot) | Swin-T | | 34.2 | 37.4 | 46.2 | 41.4 |
+| MM-GDINO | Swin-T | 1x | 43.2 | 57.4 | 59.3 | 57.1 |
+
+- The MM-GDINO-T config file is [here](lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis_866_337.py)
+
+### RefEXP
+
+#### RefCOCO
+
+| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | testA @1 | testA @5 | testA @10 | testB @1 | testB @5 | testB @10 |
+| :-----------------: | :------: | :-----: | :----: | :----: | :-----: | :------: | :------: | :-------: | :------: | :------: | :-------: |
+| GDINO(zero-shot) | Swin-T | | 50.8 | 89.5 | 94.9 | 57.5 | 91.3 | 95.6 | 45.0 | 86.5 | 92.9 |
+| MM-GDINO(zero-shot) | Swin-T | | 53.1 | 89.7 | 95.1 | 59.1 | 91.0 | 95.5 | 46.8 | 87.8 | 93.6 |
+| GDINO | Swin-T | UNK | 89.2 | | | 91.9 | | | 86.0 | | |
+| MM-GDINO | Swin-T | 5e | 89.5 | 98.6 | 99.4 | 91.4 | 99.2 | 99.8 | 86.6 | 97.9 | 99.1 |
+
+- The MM-GDINO-T config file is [here](refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcoco.py)
+
+#### RefCOCO+
+
+| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | testA @1 | testA @5 | testA @10 | testB @1 | testB @5 | testB @10 |
+| :-----------------: | :------: | :-----: | :----: | :----: | :-----: | :------: | :------: | :-------: | :------: | :------: | :-------: |
+| GDINO(zero-shot) | Swin-T | | 51.6 | 86.4 | 92.6 | 57.3 | 86.7 | 92.7 | 46.4 | 84.1 | 90.7 |
+| MM-GDINO(zero-shot) | Swin-T | | 52.7 | 87.7 | 93.5 | 58.7 | 87.2 | 93.1 | 48.4 | 85.8 | 92.1 |
+| GDINO | Swin-T | UNK | 81.1 | | | 87.4 | | | 74.7 | | |
+| MM-GDINO | Swin-T | 5e | 82.1 | 97.8 | 99.2 | 87.5 | 99.2 | 99.7 | 74.0 | 96.3 | 96.4 |
+
+- The MM-GDINO-T config file is [here](refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcoco_plus.py)
+
+#### RefCOCOg
+
+| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | test @1 | test @5 | test @10 |
+| :-----------------: | :------: | :-----: | :----: | :----: | :-----: | :-----: | :-----: | :------: |
+| GDINO(zero-shot) | Swin-T | | 60.4 | 92.1 | 96.2 | 59.7 | 92.1 | 96.3 |
+| MM-GDINO(zero-shot) | Swin-T | | 62.9 | 93.3 | 97.2 | 62.9 | 93.9 | 97.4 |
+| GDINO | Swin-T | UNK | 84.2 | | | 84.9 | | |
+| MM-GDINO | Swin-T | 5e | 85.5 | 98.4 | 99.4 | 85.8 | 98.6 | 99.4 |
+
+- The MM-GDINO-T config file is [here](refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcocog.py)
+
+#### gRefCOCO
+
+| Architecture | Backbone | Lr schd | val Pr@(F1=1, IoU≥0.5) | val N-acc | testA Pr@(F1=1, IoU≥0.5) | testA N-acc | testB Pr@(F1=1, IoU≥0.5) | testB N-acc |
+| :-----------------: | :------: | :-----: | :--------------------: | :-------: | :----------------------: | :---------: | :----------------------: | :---------: |
+| GDINO(zero-shot) | Swin-T | | 41.3 | 91.8 | 27.2 | 90.2 | 29.7 | 93.5 |
+| MM-GDINO(zero-shot) | Swin-T | | 41.0 | 91.3 | 26.1 | 93.0 | 30.4 | 92.3 |
+| MM-GDINO | Swin-T | 5e | 45.1 | 64.7 | 42.5 | 65.5 | 40.3 | 63.2 |
+
+- The MM-GDINO-T config file is [here](refcoco/grounding_dino_swin-t_finetune_8xb4_5e_grefcoco.py)
diff --git a/configs/mm_grounding_dino/brain_tumor/grounding_dino_swin-t_finetune_8xb4_50e_brain_tumor.py b/configs/mm_grounding_dino/brain_tumor/grounding_dino_swin-t_finetune_8xb4_50e_brain_tumor.py
new file mode 100644
index 00000000000..1172da5b641
--- /dev/null
+++ b/configs/mm_grounding_dino/brain_tumor/grounding_dino_swin-t_finetune_8xb4_50e_brain_tumor.py
@@ -0,0 +1,112 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+# https://universe.roboflow.com/roboflow-100/brain-tumor-m2pbp/dataset/2
+data_root = 'data/brain_tumor_v2/'
+class_name = ('label0', 'label1', 'label2')
+label_name = '_annotations.coco.json'
+
+palette = [(220, 20, 60), (255, 0, 0), (0, 0, 142)]
+
+metainfo = dict(classes=class_name, palette=palette)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities'))
+]
+
+train_dataloader = dict(
+ sampler=dict(_delete_=True, type='DefaultSampler', shuffle=True),
+ batch_sampler=dict(type='AspectRatioBatchSampler'),
+ dataset=dict(
+ _delete_=True,
+ type='RepeatDataset',
+ times=10,
+ dataset=dict(
+ type='CocoDataset',
+ data_root=data_root,
+ metainfo=metainfo,
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ pipeline=train_pipeline,
+ return_classes=True,
+ data_prefix=dict(img='train/'),
+ ann_file='train/' + label_name)))
+
+val_dataloader = dict(
+ dataset=dict(
+ metainfo=metainfo,
+ data_root=data_root,
+ return_classes=True,
+ ann_file='valid/' + label_name,
+ data_prefix=dict(img='valid/')))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'valid/' + label_name,
+ metric='bbox',
+ format_only=False)
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0001, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1)
+ }))
+
+# learning policy
+max_epochs = 5
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[4],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=1, save_best='auto'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/cityscapes/grounding_dino_swin-t_finetune_8xb4_50e_cityscapes.py b/configs/mm_grounding_dino/cityscapes/grounding_dino_swin-t_finetune_8xb4_50e_cityscapes.py
new file mode 100644
index 00000000000..c4283413c4b
--- /dev/null
+++ b/configs/mm_grounding_dino/cityscapes/grounding_dino_swin-t_finetune_8xb4_50e_cityscapes.py
@@ -0,0 +1,110 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/cityscapes/'
+class_name = ('person', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle',
+ 'bicycle')
+palette = [(220, 20, 60), (255, 0, 0), (0, 0, 142), (0, 0, 70), (0, 60, 100),
+ (0, 80, 100), (0, 0, 230), (119, 11, 32)]
+
+metainfo = dict(classes=class_name, palette=palette)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities'))
+]
+
+train_dataloader = dict(
+ sampler=dict(_delete_=True, type='DefaultSampler', shuffle=True),
+ batch_sampler=dict(type='AspectRatioBatchSampler'),
+ dataset=dict(
+ _delete_=True,
+ type='RepeatDataset',
+ times=10,
+ dataset=dict(
+ type='CocoDataset',
+ data_root=data_root,
+ metainfo=metainfo,
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ pipeline=train_pipeline,
+ return_classes=True,
+ data_prefix=dict(img='leftImg8bit/train/'),
+ ann_file='annotations/instancesonly_filtered_gtFine_train.json')))
+
+val_dataloader = dict(
+ dataset=dict(
+ metainfo=metainfo,
+ data_root=data_root,
+ return_classes=True,
+ ann_file='annotations/instancesonly_filtered_gtFine_val.json',
+ data_prefix=dict(img='leftImg8bit/val/')))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'annotations/instancesonly_filtered_gtFine_val.json',
+ metric='bbox',
+ format_only=False)
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0001, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1)
+ }))
+
+# learning policy
+max_epochs = 5
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[4],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=1, save_best='auto'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/coco/grounding_dino_swin-t_finetune_16xb4_1x_coco.py b/configs/mm_grounding_dino/coco/grounding_dino_swin-t_finetune_16xb4_1x_coco.py
new file mode 100644
index 00000000000..792297accd3
--- /dev/null
+++ b/configs/mm_grounding_dino/coco/grounding_dino_swin-t_finetune_16xb4_1x_coco.py
@@ -0,0 +1,85 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/coco/'
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities'))
+]
+
+train_dataloader = dict(
+ dataset=dict(
+ _delete_=True,
+ type='CocoDataset',
+ data_root=data_root,
+ ann_file='annotations/instances_train2017.json',
+ data_prefix=dict(img='train2017/'),
+ return_classes=True,
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ pipeline=train_pipeline))
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1),
+ 'language_model': dict(lr_mult=0.1),
+ }))
+
+# learning policy
+max_epochs = 12
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[8, 11],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=1, save_best='auto'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/coco/grounding_dino_swin-t_finetune_16xb4_1x_coco_48_17.py b/configs/mm_grounding_dino/coco/grounding_dino_swin-t_finetune_16xb4_1x_coco_48_17.py
new file mode 100644
index 00000000000..e68afbb4328
--- /dev/null
+++ b/configs/mm_grounding_dino/coco/grounding_dino_swin-t_finetune_16xb4_1x_coco_48_17.py
@@ -0,0 +1,157 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/coco/'
+base_classes = ('person', 'bicycle', 'car', 'motorcycle', 'train', 'truck',
+ 'boat', 'bench', 'bird', 'horse', 'sheep', 'bear', 'zebra',
+ 'giraffe', 'backpack', 'handbag', 'suitcase', 'frisbee',
+ 'skis', 'kite', 'surfboard', 'bottle', 'fork', 'spoon', 'bowl',
+ 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
+ 'pizza', 'donut', 'chair', 'bed', 'toilet', 'tv', 'laptop',
+ 'mouse', 'remote', 'microwave', 'oven', 'toaster',
+ 'refrigerator', 'book', 'clock', 'vase', 'toothbrush') # 48
+novel_classes = ('airplane', 'bus', 'cat', 'dog', 'cow', 'elephant',
+ 'umbrella', 'tie', 'snowboard', 'skateboard', 'cup', 'knife',
+ 'cake', 'couch', 'keyboard', 'sink', 'scissors') # 17
+all_classes = (
+ 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
+ 'truck', 'boat', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
+ 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
+ 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'kite', 'skateboard',
+ 'surfboard', 'bottle', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana',
+ 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'pizza', 'donut',
+ 'cake', 'chair', 'couch', 'bed', 'toilet', 'tv', 'laptop', 'mouse',
+ 'remote', 'keyboard', 'microwave', 'oven', 'toaster', 'sink',
+ 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'toothbrush') # 65
+
+train_metainfo = dict(classes=base_classes)
+test_metainfo = dict(
+ classes=all_classes,
+ base_classes=base_classes,
+ novel_classes=novel_classes)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities'))
+]
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities',
+ 'tokens_positive'))
+]
+
+train_dataloader = dict(
+ dataset=dict(
+ _delete_=True,
+ type='CocoDataset',
+ metainfo=train_metainfo,
+ data_root=data_root,
+ ann_file='annotations/instances_train2017_seen_2.json',
+ data_prefix=dict(img='train2017/'),
+ return_classes=True,
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ pipeline=train_pipeline))
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ dataset=dict(
+ type='CocoDataset',
+ metainfo=test_metainfo,
+ data_root=data_root,
+ ann_file='annotations/instances_val2017_all_2.json',
+ data_prefix=dict(img='val2017/'),
+ test_mode=True,
+ pipeline=test_pipeline,
+ return_classes=True,
+ ))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ type='OVCocoMetric',
+ ann_file=data_root + 'annotations/instances_val2017_all_2.json',
+ metric='bbox',
+ format_only=False)
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.00005, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1),
+ # 'language_model': dict(lr_mult=0),
+ }))
+
+# learning policy
+max_epochs = 12
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[8, 11],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+
+default_hooks = dict(
+ checkpoint=dict(
+ max_keep_ckpts=1, save_best='coco/novel_ap50', rule='greater'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/coco/grounding_dino_swin-t_finetune_16xb4_1x_sft_coco.py b/configs/mm_grounding_dino/coco/grounding_dino_swin-t_finetune_16xb4_1x_sft_coco.py
new file mode 100644
index 00000000000..5505df58b8b
--- /dev/null
+++ b/configs/mm_grounding_dino/coco/grounding_dino_swin-t_finetune_16xb4_1x_sft_coco.py
@@ -0,0 +1,93 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/coco/'
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=20, # ======= important =====
+ label_map_file='data/coco/annotations/coco2017_label_map.json',
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+train_dataloader = dict(
+ dataset=dict(
+ _delete_=True,
+ type='ODVGDataset',
+ need_text=False,
+ data_root=data_root,
+ ann_file='annotations/instances_train2017_od.json',
+ label_map_file='annotations/coco2017_label_map.json',
+ data_prefix=dict(img='train2017/'),
+ return_classes=True,
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ pipeline=train_pipeline))
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.00005, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1),
+ 'language_model': dict(lr_mult=0.0),
+ }))
+
+# learning policy
+max_epochs = 12
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[8, 11],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=1, save_best='auto'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/dataset_prepare.md b/configs/mm_grounding_dino/dataset_prepare.md
new file mode 100644
index 00000000000..af60a8bf4bf
--- /dev/null
+++ b/configs/mm_grounding_dino/dataset_prepare.md
@@ -0,0 +1,1193 @@
+# Data Prepare and Process
+
+## MM-GDINO-T Pre-train Dataset
+
+For the MM-GDINO-T model, we provide a total of 5 different data combination pre-training configurations. The data is trained in a progressive accumulation manner, so users can prepare it according to their actual needs.
+
+### 1 Objects365v1
+
+The corresponding training config is [grounding_dino_swin-t_pretrain_obj365](./grounding_dino_swin-t_pretrain_obj365.py)
+
+Objects365v1 can be downloaded from [opendatalab](https://opendatalab.com/OpenDataLab/Objects365_v1). It offers two methods of download: CLI and SDK.
+
+After downloading and unzipping, place the dataset or create a symbolic link to the `data/objects365v1` directory. The directory structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── objects365v1
+│ │ ├── objects365_train.json
+│ │ ├── objects365_val.json
+│ │ ├── train
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── test
+```
+
+Then, use [coco2odvg.py](../../tools/dataset_converters/coco2odvg.py) to convert it into the ODVG format required for training.
+
+```shell
+python tools/dataset_converters/coco2odvg.py data/objects365v1/objects365_train.json -d o365v1
+```
+
+After the program runs successfully, it will create two new files, `o365v1_train_od.json` and `o365v1_label_map.json`, in the `data/objects365v1` directory. The complete structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── objects365v1
+│ │ ├── objects365_train.json
+│ │ ├── objects365_val.json
+│ │ ├── o365v1_train_od.json
+│ │ ├── o365v1_label_map.json
+│ │ ├── train
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── test
+```
+
+### 2 COCO 2017
+
+The above configuration will evaluate the performance on the COCO 2017 dataset during the training process. Therefore, it is necessary to prepare the COCO 2017 dataset. You can download it from the [COCO](https://cocodataset.org/) official website or from [opendatalab](https://opendatalab.com/OpenDataLab/COCO_2017).
+
+After downloading and unzipping, place the dataset or create a symbolic link to the `data/coco` directory. The directory structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+### 3 GoldG
+
+After downloading the dataset, you can start training with the [grounding_dino_swin-t_pretrain_obj365_goldg](./grounding_dino_swin-t_pretrain_obj365_goldg.py) configuration.
+
+The GoldG dataset includes the `GQA` and `Flickr30k` datasets, which are part of the MixedGrounding dataset mentioned in the GLIP paper, excluding the COCO dataset. The download links are [mdetr_annotations](https://huggingface.co/GLIPModel/GLIP/tree/main/mdetr_annotations), and the specific files currently needed are `mdetr_annotations/final_mixed_train_no_coco.json` and `mdetr_annotations/final_flickr_separateGT_train.json`.
+
+Then download the [GQA images](https://nlp.stanford.edu/data/gqa/images.zip). After downloading and unzipping, place the dataset or create a symbolic link to them in the `data/gqa` directory, with the following directory structure:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── gqa
+| | ├── final_mixed_train_no_coco.json
+│ │ ├── images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+Then download the [Flickr30k images](http://shannon.cs.illinois.edu/DenotationGraph/). You need to apply for access to this dataset and then download it using the provided link. After downloading and unzipping, place the dataset or create a symbolic link to them in the `data/flickr30k_entities` directory, with the following directory structure:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── flickr30k_entities
+│ │ ├── final_flickr_separateGT_train.json
+│ │ ├── flickr30k_images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+For the GQA dataset, you need to use [goldg2odvg.py](../../tools/dataset_converters/goldg2odvg.py) to convert it into the ODVG format required for training:
+
+```shell
+python tools/dataset_converters/goldg2odvg.py data/gqa/final_mixed_train_no_coco.json
+```
+
+After the program has run, a new file `final_mixed_train_no_coco_vg.json` will be created in the `data/gqa` directory, with the complete structure as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── gqa
+| | ├── final_mixed_train_no_coco.json
+| | ├── final_mixed_train_no_coco_vg.json
+│ │ ├── images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+For the Flickr30k dataset, you need to use [goldg2odvg.py](../../tools/dataset_converters/goldg2odvg.py) to convert it into the ODVG format required for training:
+
+```shell
+python tools/dataset_converters/goldg2odvg.py data/flickr30k_entities/final_flickr_separateGT_train.json
+```
+
+After the program has run, a new file `final_flickr_separateGT_train_vg.json` will be created in the `data/flickr30k_entities` directory, with the complete structure as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── flickr30k_entities
+│ │ ├── final_flickr_separateGT_train.json
+│ │ ├── final_flickr_separateGT_train_vg.json
+│ │ ├── flickr30k_images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 4 GRIT-20M
+
+The corresponding training configuration is [grounding_dino_swin-t_pretrain_obj365_goldg_grit9m](./grounding_dino_swin-t_pretrain_obj365_goldg_grit9m.py).
+
+The GRIT dataset can be downloaded using the img2dataset package from [GRIT](https://huggingface.co/datasets/zzliang/GRIT#download-image). By default, the dataset size is 1.1T, and downloading and processing it may require at least 2T of disk space, depending on your available storage capacity. After downloading, the dataset is in its original format, which includes:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── grit_raw
+│ │ ├── 00000_stats.json
+│ │ ├── 00000.parquet
+│ │ ├── 00000.tar
+│ │ ├── 00001_stats.json
+│ │ ├── 00001.parquet
+│ │ ├── 00001.tar
+│ │ ├── ...
+```
+
+After downloading, further format processing is required:
+
+```shell
+python tools/dataset_converters/grit_processing.py data/grit_raw data/grit_processed
+```
+
+The processed format is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── grit_processed
+│ │ ├── annotations
+│ │ │ ├── 00000.json
+│ │ │ ├── 00001.json
+│ │ │ ├── ...
+│ │ ├── images
+│ │ │ ├── 00000
+│ │ │ │ ├── 000000000.jpg
+│ │ │ │ ├── 000000003.jpg
+│ │ │ │ ├── 000000004.jpg
+│ │ │ │ ├── ...
+│ │ │ ├── 00001
+│ │ │ ├── ...
+```
+
+As for the GRIT dataset, you need to use [grit2odvg.py](../../tools/dataset_converters/grit2odvg.py) to convert it to the format of ODVG:
+
+```shell
+python tools/dataset_converters/grit2odvg.py data/grit_processed/
+```
+
+After the program has run, a new file `grit20m_vg.json` will be created in the `data/grit_processed` directory, which has about 9M data, with the complete structure as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── grit_processed
+| | ├── grit20m_vg.json
+│ │ ├── annotations
+│ │ │ ├── 00000.json
+│ │ │ ├── 00001.json
+│ │ │ ├── ...
+│ │ ├── images
+│ │ │ ├── 00000
+│ │ │ │ ├── 000000000.jpg
+│ │ │ │ ├── 000000003.jpg
+│ │ │ │ ├── 000000004.jpg
+│ │ │ │ ├── ...
+│ │ │ ├── 00001
+│ │ │ ├── ...
+```
+
+### 5 V3Det
+
+The corresponding training configurations are:
+
+- [grounding_dino_swin-t_pretrain_obj365_goldg_v3det](./grounding_dino_swin-t_pretrain_obj365_goldg_v3det.py)
+- [grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det](./grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py)
+
+The V3Det dataset can be downloaded from [opendatalab](https://opendatalab.com/V3Det/V3Det). After downloading and unzipping, place the dataset or create a symbolic link to it in the `data/v3det` directory, with the following directory structure:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── v3det
+│ │ ├── annotations
+│ │ | ├── v3det_2023_v1_train.json
+│ │ ├── images
+│ │ │ ├── a00000066
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+Then use [coco2odvg.py](../../tools/dataset_converters/coco2odvg.py) to convert it into the ODVG format required for training:
+
+```shell
+python tools/dataset_converters/coco2odvg.py data/v3det/annotations/v3det_2023_v1_train.json -d v3det
+```
+
+After the program has run, two new files `v3det_2023_v1_train_od.json` and `v3det_2023_v1_label_map.json` will be created in the `data/v3det/annotations` directory, with the complete structure as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── v3det
+│ │ ├── annotations
+│ │ | ├── v3det_2023_v1_train.json
+│ │ | ├── v3det_2023_v1_train_od.json
+│ │ | ├── v3det_2023_v1_label_map.json
+│ │ ├── images
+│ │ │ ├── a00000066
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 6 Data Splitting and Visualization
+
+Considering that users need to prepare many datasets, which is inconvenient for confirming images and annotations before training, we provide a data splitting and visualization tool. This tool can split the dataset into a tiny version and then use a visualization script to check the correctness of the images and labels.
+
+1. Splitting the Dataset
+
+The script is located [here](../../tools/misc/split_odvg.py). Taking `Object365 v1` as an example, the command to split the dataset is as follows:
+
+```shell
+python tools/misc/split_odvg.py data/object365_v1/ o365v1_train_od.json train your_output_dir --label-map-file o365v1_label_map.json -n 200
+```
+
+After running the above script, it will create a folder structure in the `your_output_dir` directory identical to `data/object365_v1/`, but it will only save 200 training images and their corresponding json files for convenient user review.
+
+2. Visualizing the Original Dataset
+
+The script is located [here](../../tools/analysis_tools/browse_grounding_raw.py). Taking `Object365 v1` as an example, the command to visualize the dataset is as follows:
+
+```shell
+python tools/analysis_tools/browse_grounding_raw.py data/object365_v1/ o365v1_train_od.json train --label-map-file o365v1_label_map.json -o your_output_dir --not-show
+```
+
+After running the above script, it will generate images in the `your_output_dir` directory that include both the pictures and their labels, making it convenient for users to review.
+
+3. Visualizing the Output Dataset
+
+The script is located [here](../../tools/analysis_tools/browse_grounding_dataset.py). Users can use this script to view the results of the dataset output, including the results of data augmentation. Taking `Object365 v1` as an example, the command to visualize the dataset is as follows:
+
+```shell
+python tools/analysis_tools/browse_grounding_dataset.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py -o your_output_dir --not-show
+```
+
+After running the above script, it will generate images in the `your_output_dir` directory that include both the pictures and their labels, making it convenient for users to review.
+
+## MM-GDINO-L Pre-training Data Preparation and Processing
+
+### 1 Object365 v2
+
+Objects365_v2 can be downloaded from [opendatalab](https://opendatalab.com/OpenDataLab/Objects365). It offers two download methods: CLI and SDK.
+
+After downloading and unzipping, place the dataset or create a symbolic link to it in the `data/objects365v2` directory, with the following directory structure:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── objects365v2
+│ │ ├── annotations
+│ │ │ ├── zhiyuan_objv2_train.json
+│ │ ├── train
+│ │ │ ├── patch0
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+Since some category names in Objects365v2 are incorrect, it is necessary to correct them first.
+
+```shell
+python tools/dataset_converters/fix_o365_names.py
+```
+
+A new annotation file `zhiyuan_objv2_train_fixname.json` will be generated in the `data/objects365v2/annotations` directory.
+
+Then use [coco2odvg.py](../../tools/dataset_converters/coco2odvg.py) to convert it into the ODVG format required for training:
+
+```shell
+python tools/dataset_converters/coco2odvg.py data/objects365v2/annotations/zhiyuan_objv2_train_fixname.json -d o365v2
+```
+
+After the program has run, two new files `zhiyuan_objv2_train_fixname_od.json` and `o365v2_label_map.json` will be created in the `data/objects365v2` directory, with the complete structure as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── objects365v2
+│ │ ├── annotations
+│ │ │ ├── zhiyuan_objv2_train.json
+│ │ │ ├── zhiyuan_objv2_train_fixname.json
+│ │ │ ├── zhiyuan_objv2_train_fixname_od.json
+│ │ │ ├── o365v2_label_map.json
+│ │ ├── train
+│ │ │ ├── patch0
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 2 OpenImages v6
+
+OpenImages v6 can be downloaded from the [official website](https://storage.googleapis.com/openimages/web/download_v6.html). Due to the large size of the dataset, it may take some time to download. After completion, the file structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── OpenImages
+│ │ ├── annotations
+| │ │ ├── oidv6-train-annotations-bbox.csv
+| │ │ ├── class-descriptions-boxable.csv
+│ │ ├── OpenImages
+│ │ │ ├── train
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+Then use [openimages2odvg.py](../../tools/dataset_converters/openimages2odvg.py) to convert it into the ODVG format required for training:
+
+```shell
+python tools/dataset_converters/openimages2odvg.py data/OpenImages/annotations
+```
+
+After the program has run, two new files `oidv6-train-annotation_od.json` and `openimages_label_map.json` will be created in the `data/OpenImages/annotations` directory, with the complete structure as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── OpenImages
+│ │ ├── annotations
+| │ │ ├── oidv6-train-annotations-bbox.csv
+| │ │ ├── class-descriptions-boxable.csv
+| │ │ ├── oidv6-train-annotations_od.json
+| │ │ ├── openimages_label_map.json
+│ │ ├── OpenImages
+│ │ │ ├── train
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 3 V3Det
+
+Referring to the data preparation section of the previously mentioned MM-GDINO-T pre-training data preparation and processing, the complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── v3det
+│ │ ├── annotations
+│ │ | ├── v3det_2023_v1_train.json
+│ │ | ├── v3det_2023_v1_train_od.json
+│ │ | ├── v3det_2023_v1_label_map.json
+│ │ ├── images
+│ │ │ ├── a00000066
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 4 LVIS 1.0
+
+Please refer to the `2 LVIS 1.0` section of the later `Fine-tuning Dataset Preparation`. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── lvis_v1_train.json
+│ │ │ ├── lvis_v1_val.json
+│ │ │ ├── lvis_v1_train_od.json
+│ │ │ ├── lvis_v1_label_map.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── lvis_v1_minival_inserted_image_name.json
+│ │ │ ├── lvis_od_val.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+### 5 COCO2017 OD
+
+You can refer to the earlier section `MM-GDINO-T Pre-training Data Preparation and Processing` for data preparation. For convenience in subsequent processing, please create a symbolic link or move the downloaded [mdetr_annotations](https://huggingface.co/GLIPModel/GLIP/tree/main/mdetr_annotations) folder to the `data/coco` path. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── ...
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+Due to some overlap between COCO2017 train and RefCOCO/RefCOCO+/RefCOCOg/gRefCOCO val, if not removed in advance, there will be data leakage when evaluating RefExp.
+
+```shell
+python tools/dataset_converters/remove_cocotrain2017_from_refcoco.py data/coco/mdetr_annotations data/coco/annotations/instances_train2017.json
+```
+
+A new file `instances_train2017_norefval.json` will be created in the `data/coco/annotations` directory. Finally, use [coco2odvg.py](../../tools/dataset_converters/coco2odvg.py) to convert it into the ODVG format required for training:
+
+```shell
+python tools/dataset_converters/coco2odvg.py data/coco/annotations/instances_train2017_norefval.json -d coco
+```
+
+Two new files `instances_train2017_norefval_od.json` and `coco_label_map.json` will be created in the `data/coco/annotations` directory, with the complete structure as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2017_norefval_od.json
+│ │ │ ├── coco_label_map.json
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── ...
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+Note: There are 15,000 images that overlap between the COCO2017 train and LVIS 1.0 val datasets. Therefore, if the COCO2017 train dataset is used in training, the evaluation results of LVIS 1.0 val will have a data leakage issue. However, LVIS 1.0 minival does not have this problem.
+
+### 6 GoldG
+
+Please refer to the section on `MM-GDINO-T Pre-training Data Preparation and Processing`.
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── flickr30k_entities
+│ │ ├── final_flickr_separateGT_train.json
+│ │ ├── final_flickr_separateGT_train_vg.json
+│ │ ├── flickr30k_images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ ├── gqa
+| | ├── final_mixed_train_no_coco.json
+| | ├── final_mixed_train_no_coco_vg.json
+│ │ ├── images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 7 COCO2014 VG
+
+MDetr provides a Phrase Grounding version of the COCO2014 train annotations. The original annotation file is named `final_mixed_train.json`, and similar to the previous structure, the file structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_mixed_train.json
+│ │ │ ├── ...
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+We can extract the COCO portion of the data from `final_mixed_train.json`.
+
+```shell
+python tools/dataset_converters/extract_coco_from_mixed.py data/coco/mdetr_annotations/final_mixed_train.json
+```
+
+A new file named `final_mixed_train_only_coco.json` will be created in the `data/coco/mdetr_annotations` directory. Finally, use [goldg2odvg.py](../../tools/dataset_converters/goldg2odvg.py) to convert it into the ODVG format required for training:
+
+```shell
+python tools/dataset_converters/goldg2odvg.py data/coco/mdetr_annotations/final_mixed_train_only_coco.json
+```
+
+A new file named `final_mixed_train_only_coco_vg.json` will be created in the `data/coco/mdetr_annotations` directory, with the complete structure as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_mixed_train.json
+│ │ │ ├── final_mixed_train_only_coco.json
+│ │ │ ├── final_mixed_train_only_coco_vg.json
+│ │ │ ├── ...
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+Note: COCO2014 train and COCO2017 val do not have duplicate images, so there is no need to worry about data leakage issues in COCO evaluation.
+
+### 8 Referring Expression Comprehension
+
+There are a total of 4 datasets included. For data preparation, please refer to the `Fine-tuning Dataset Preparation` section.
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── finetune_refcoco_testB.json
+│ │ │ ├── finetune_refcoco+_testA.json
+│ │ │ ├── finetune_refcoco+_testB.json
+│ │ │ ├── finetune_refcocog_test.json
+│ │ │ ├── finetune_refcoco_train_vg.json
+│ │ │ ├── finetune_refcoco+_train_vg.json
+│ │ │ ├── finetune_refcocog_train_vg.json
+│ │ │ ├── finetune_grefcoco_train_vg.json
+```
+
+### 9 GRIT-20M
+
+Please refer to the `MM-GDINO-T Pre-training Data Preparation and Processing` section.
+
+## Preparation of Evaluation Dataset
+
+### 1 COCO 2017
+
+The data preparation process is consistent with the previous descriptions, and the final structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+### 2 LVIS 1.0
+
+The LVIS 1.0 val dataset includes both mini and full versions. The significance of the mini version is:
+
+1. The full LVIS val evaluation dataset is quite large, and conducting an evaluation with it can take a significant amount of time.
+2. In the full LVIS val dataset, there are 15,000 images from the COCO2017 train dataset. If a user has used the COCO2017 data for training, there can be a data leakage issue when evaluating on the full LVIS val dataset
+
+The LVIS 1.0 dataset contains images that are exactly the same as the COCO2017 dataset, with the addition of new annotations. You can download the minival annotation file from [here](https://huggingface.co/GLIPModel/GLIP/blob/main/lvis_v1_minival_inserted_image_name.json), and the val 1.0 annotation file from [here](https://huggingface.co/GLIPModel/GLIP/blob/main/lvis_od_val.json). The final structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── lvis_v1_minival_inserted_image_name.json
+│ │ │ ├── lvis_od_val.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+### 3 ODinW
+
+ODinW, which stands for Object Detection in the Wild, is a dataset used to evaluate the generalization capability of grounding pre-trained models in different real-world scenarios. It consists of two subsets, ODinW13 and ODinW35, representing datasets composed of 13 and 35 different datasets, respectively. You can download it from [here](https://huggingface.co/GLIPModel/GLIP/tree/main/odinw_35), and then unzip each file. The final structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── odinw
+│ │ ├── AerialMaritimeDrone
+│ │ | |── large
+│ │ | | ├── test
+│ │ | | ├── train
+│ │ | | ├── valid
+│ │ | |── tiled
+│ │ ├── AmericanSignLanguageLetters
+│ │ ├── Aquarium
+│ │ ├── BCCD
+│ │ ├── ...
+```
+
+When evaluating ODinW35, custom prompts are required. Therefore, it's necessary to preprocess the annotated JSON files in advance. You can use the [override_category.py](./odinw/override_category.py) script for this purpose. After processing, it will generate new annotation files without overwriting the original ones.
+
+```shell
+python configs/mm_grounding_dino/odinw/override_category.py data/odinw/
+```
+
+### 4 DOD
+
+DOD stands for Described Object Detection, and it is introduced in the paper titled [Described Object Detection: Liberating Object Detection with Flexible Expressions](https://arxiv.org/abs/2307.12813). You can download the dataset from [here](https://github.com/shikras/d-cube?tab=readme-ov-file). The final structure of the dataset is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── d3
+│ │ ├── d3_images
+│ │ ├── d3_json
+│ │ ├── d3_pkl
+```
+
+### 5 Flickr30k Entities
+
+In the previous GoldG data preparation section, we downloaded the necessary files for training with Flickr30k. For evaluation, you will need 2 JSON files, which you can download from [here](https://huggingface.co/GLIPModel/GLIP/blob/main/mdetr_annotations/final_flickr_separateGT_val.json) and [here](https://huggingface.co/GLIPModel/GLIP/blob/main/mdetr_annotations/final_flickr_separateGT_test.json). The final structure of the dataset is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── flickr30k_entities
+│ │ ├── final_flickr_separateGT_train.json
+│ │ ├── final_flickr_separateGT_val.json
+│ │ ├── final_flickr_separateGT_test.json
+│ │ ├── final_flickr_separateGT_train_vg.json
+│ │ ├── flickr30k_images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 6 Referring Expression Comprehension
+
+Referential Expression Comprehension includes 4 datasets: RefCOCO, RefCOCO+, RefCOCOg, and gRefCOCO. The images used in these 4 datasets are from COCO2014 train, similar to COCO2017. You can download the images from the official COCO website or opendatalab. The annotations can be directly downloaded from [here](https://huggingface.co/GLIPModel/GLIP/tree/main/mdetr_annotations). The mdetr_annotations folder contains a large number of annotations, so you can choose to download only the JSON files you need. The final structure of the dataset is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── finetune_refcoco_testB.json
+│ │ │ ├── finetune_refcoco+_testA.json
+│ │ │ ├── finetune_refcoco+_testB.json
+│ │ │ ├── finetune_refcocog_test.json
+│ │ │ ├── finetune_refcocog_test.json
+```
+
+Please note that gRefCOCO is introduced in [GREC: Generalized Referring Expression Comprehension](https://arxiv.org/abs/2308.16182) and is not available in the `mdetr_annotations` folder. You will need to handle it separately. Here are the specific steps:
+
+1. Download [gRefCOCO](https://github.com/henghuiding/gRefCOCO?tab=readme-ov-file) and unzip it into the `data/coco/` folder.
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ ├── grefs
+│ │ │ ├── grefs(unc).json
+│ │ │ ├── instances.json
+```
+
+2. Convert to COCO format
+
+You can use the official [conversion script](https://github.com/henghuiding/gRefCOCO/blob/b4b1e55b4d3a41df26d6b7d843ea011d581127d4/mdetr/scripts/fine-tuning/grefexp_coco_format.py) provided by gRefCOCO. Please note that you need to uncomment line 161 and comment out line 160 in the script to obtain the full JSON file.
+
+```shell
+# you need to clone the official repo
+git clone https://github.com/henghuiding/gRefCOCO.git
+cd gRefCOCO/mdetr
+python scripts/fine-tuning/grefexp_coco_format.py --data_path ../../data/coco/grefs --out_path ../../data/coco/mdetr_annotations/ --coco_path ../../data/coco
+```
+
+Four JSON files will be generated in the `data/coco/mdetr_annotations/` folder. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── finetune_refcoco_testB.json
+│ │ │ ├── finetune_grefcoco_train.json
+│ │ │ ├── finetune_grefcoco_val.json
+│ │ │ ├── finetune_grefcoco_testA.json
+│ │ │ ├── finetune_grefcoco_testB.json
+```
+
+## Fine-Tuning Dataset Preparation
+
+### 1 COCO 2017
+
+COCO is the most commonly used dataset in the field of object detection, and we aim to explore its fine-tuning modes more comprehensively. From current developments, there are a total of three fine-tuning modes:
+
+1. Closed-set fine-tuning, where the description on the text side cannot be modified after fine-tuning, transforms into a closed-set algorithm. This approach maximizes performance on COCO but loses generality.
+2. Open-set continued pretraining fine-tuning involves using pretraining methods consistent with the COCO dataset. There are two approaches to this: the first is to reduce the learning rate and fix certain modules, fine-tuning only on the COCO dataset; the second is to mix COCO data with some of the pre-trained data. The goal of both approaches is to improve performance on the COCO dataset as much as possible without compromising generalization.
+3. Open-vocabulary fine-tuning involves adopting a common practice in the OVD (Open-Vocabulary Detection) domain. It divides COCO categories into base classes and novel classes. During training, fine-tuning is performed only on the base classes, while evaluation is conducted on both base and novel classes. This approach allows for the assessment of COCO OVD capabilities, with the goal of improving COCO dataset performance without compromising generalization as much as possible.
+
+\*\*(1) Closed-set Fine-tuning \*\*
+
+This section does not require data preparation; you can directly use the data you have prepared previously.
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+**(2) Open-set Continued Pretraining Fine-tuning**
+To use this approach, you need to convert the COCO training data into ODVG format. You can use the following command for conversion:
+
+```shell
+python tools/dataset_converters/coco2odvg.py data/coco/annotations/instances_train2017.json -d coco
+```
+
+This will generate new files, `instances_train2017_od.json` and `coco2017_label_map.json`, in the `data/coco/annotations/` directory. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_train2017_od.json
+│ │ │ ├── coco2017_label_map.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+Once you have obtained the data, you can choose whether to perform individual pretraining or mixed pretraining.
+
+**(3) Open-vocabulary Fine-tuning**
+For this approach, you need to convert the COCO training data into OVD (Open-Vocabulary Detection) format. You can use the following command for conversion:
+
+```shell
+python tools/dataset_converters/coco2ovd.py data/coco/
+```
+
+This will generate new files, `instances_val2017_all_2.json` and `instances_val2017_seen_2.json`, in the `data/coco/annotations/` directory. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_train2017_od.json
+│ │ │ ├── instances_val2017_all_2.json
+│ │ │ ├── instances_val2017_seen_2.json
+│ │ │ ├── coco2017_label_map.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+You can then proceed to train and test directly using the [configuration](coco/grounding_dino_swin-t_finetune_16xb4_1x_coco_48_17.py).
+
+### 2 LVIS 1.0
+
+LVIS is a dataset that includes 1,203 classes, making it a valuable dataset for fine-tuning. Due to its large number of classes, it's not feasible to perform closed-set fine-tuning. Therefore, we can only use open-set continued pretraining fine-tuning and open-vocabulary fine-tuning on LVIS.
+
+You need to prepare the LVIS training JSON files first, which you can download from [here](https://www.lvisdataset.org/dataset). We only need `lvis_v1_train.json` and `lvis_v1_val.json`. After downloading them, place them in the `data/coco/annotations/` directory, and then run the following command:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── lvis_v1_train.json
+│ │ │ ├── lvis_v1_val.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── lvis_v1_minival_inserted_image_name.json
+│ │ │ ├── lvis_od_val.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+(1) Open-set continued pretraining fine-tuning
+
+Convert to ODVG format using the following command:
+
+```shell
+python tools/dataset_converters/lvis2odvg.py data/coco/annotations/lvis_v1_train.json
+```
+
+It will generate new files, `lvis_v1_train_od.json` and `lvis_v1_label_map.json`, in the `data/coco/annotations/` directory, and the complete dataset structure will look like this:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── lvis_v1_train.json
+│ │ │ ├── lvis_v1_val.json
+│ │ │ ├── lvis_v1_train_od.json
+│ │ │ ├── lvis_v1_label_map.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── lvis_v1_minival_inserted_image_name.json
+│ │ │ ├── lvis_od_val.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+You can directly use the provided [configuration](lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis.py) for training and testing, or you can modify the configuration to mix it with some of the pretraining datasets as needed.
+
+**(2) Open Vocabulary Fine-tuning**
+
+Convert to OVD format using the following command:
+
+```shell
+python tools/dataset_converters/lvis2ovd.py data/coco/
+```
+
+New `lvis_v1_train_od_norare.json` and `lvis_v1_label_map_norare.json` will be generated under `data/coco/annotations/`, and the complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── lvis_v1_train.json
+│ │ │ ├── lvis_v1_val.json
+│ │ │ ├── lvis_v1_train_od.json
+│ │ │ ├── lvis_v1_label_map.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── lvis_v1_minival_inserted_image_name.json
+│ │ │ ├── lvis_od_val.json
+│ │ │ ├── lvis_v1_train_od_norare.json
+│ │ │ ├── lvis_v1_label_map_norare.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+然Then you can directly use the [configuration](lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis_866_337.py) for training and testing.
+
+### 3 RTTS
+
+RTTS is a foggy weather dataset, which contains 4,322 foggy images, including five classes: bicycle, bus, car, motorbike, and person. It can be downloaded from [here](https://drive.google.com/file/d/15Ei1cHGVqR1mXFep43BO7nkHq1IEGh1e/view), and then extracted to the `data/RTTS/` folder. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── RTTS
+│ │ ├── annotations_json
+│ │ ├── annotations_xml
+│ │ ├── ImageSets
+│ │ ├── JPEGImages
+```
+
+### 4 RUOD
+
+RUOD is an underwater object detection dataset. You can download it from [here](https://drive.google.com/file/d/1hxtbdgfVveUm_DJk5QXkNLokSCTa_E5o/view), and then extract it to the `data/RUOD/` folder. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── RUOD
+│ │ ├── Environment_pic
+│ │ ├── Environmet_ANN
+│ │ ├── RUOD_ANN
+│ │ ├── RUOD_pic
+```
+
+### 5 Brain Tumor
+
+Brain Tumor is a 2D detection dataset in the medical field. You can download it from [here](https://universe.roboflow.com/roboflow-100/brain-tumor-m2pbp/dataset/2), please make sure to choose the `COCO JSON` format. Then extract it to the `data/brain_tumor_v2/` folder. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── brain_tumor_v2
+│ │ ├── test
+│ │ ├── train
+│ │ ├── valid
+```
+
+### 6 Cityscapes
+
+Cityscapes is an urban street scene dataset. You can download it from [here](https://www.cityscapes-dataset.com/) or from opendatalab, and then extract it to the `data/cityscapes/` folder. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── cityscapes
+│ │ ├── annotations
+│ │ ├── leftImg8bit
+│ │ │ ├── train
+│ │ │ ├── val
+│ │ ├── gtFine
+│ │ │ ├── train
+│ │ │ ├── val
+```
+
+After downloading, you can use the [cityscapes.py](../../tools/dataset_converters/cityscapes.py) script to generate the required JSON format.
+
+```shell
+python tools/dataset_converters/cityscapes.py data/cityscapes/
+```
+
+Three new JSON files will be generated in the annotations directory. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── cityscapes
+│ │ ├── annotations
+│ │ │ ├── instancesonly_filtered_gtFine_train.json
+│ │ │ ├── instancesonly_filtered_gtFine_val.json
+│ │ │ ├── instancesonly_filtered_gtFine_test.json
+│ │ ├── leftImg8bit
+│ │ │ ├── train
+│ │ │ ├── val
+│ │ ├── gtFine
+│ │ │ ├── train
+│ │ │ ├── val
+```
+
+### 7 People in Painting
+
+People in Painting is an oil painting dataset that you can download from [here](https://universe.roboflow.com/roboflow-100/people-in-paintings/dataset/2). Please make sure to choose the `COCO JSON` format. After downloading, unzip the dataset to the `data/people_in_painting_v2/` folder. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── people_in_painting_v2
+│ │ ├── test
+│ │ ├── train
+│ │ ├── valid
+```
+
+### 8 Referring Expression Comprehension
+
+Fine-tuning for Referential Expression Comprehension is similar to what was described earlier and includes four datasets. The dataset preparation for evaluation has already been organized. The complete dataset structure is as follows:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── finetune_refcoco_testB.json
+│ │ │ ├── finetune_refcoco+_testA.json
+│ │ │ ├── finetune_refcoco+_testB.json
+│ │ │ ├── finetune_refcocog_test.json
+│ │ │ ├── finetune_refcocog_test.json
+```
+
+Then we need to convert it to the required ODVG format. Please use the [refcoco2odvg.py](../../tools/dataset_converters/refcoco2odvg.py) script to perform the conversion.
+
+```shell
+python tools/dataset_converters/refcoco2odvg.py data/coco/mdetr_annotations
+```
+
+The converted dataset structure will include 4 new JSON files in the `data/coco/mdetr_annotations` directory. Here is the structure of the converted dataset:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── finetune_refcoco_testB.json
+│ │ │ ├── finetune_refcoco+_testA.json
+│ │ │ ├── finetune_refcoco+_testB.json
+│ │ │ ├── finetune_refcocog_test.json
+│ │ │ ├── finetune_refcoco_train_vg.json
+│ │ │ ├── finetune_refcoco+_train_vg.json
+│ │ │ ├── finetune_refcocog_train_vg.json
+│ │ │ ├── finetune_grefcoco_train_vg.json
+```
diff --git a/configs/mm_grounding_dino/dataset_prepare_zh-CN.md b/configs/mm_grounding_dino/dataset_prepare_zh-CN.md
new file mode 100644
index 00000000000..10520b02fe5
--- /dev/null
+++ b/configs/mm_grounding_dino/dataset_prepare_zh-CN.md
@@ -0,0 +1,1194 @@
+# 数据准备和处理
+
+## MM-GDINO-T 预训练数据准备和处理
+
+MM-GDINO-T 模型中我们一共提供了 5 种不同数据组合的预训练配置,数据采用逐步累加的方式进行训练,因此用户可以根据自己的实际需求准备数据。
+
+### 1 Objects365 v1
+
+对应的训练配置为 [grounding_dino_swin-t_pretrain_obj365](./grounding_dino_swin-t_pretrain_obj365.py)
+
+Objects365_v1 可以从 [opendatalab](https://opendatalab.com/OpenDataLab/Objects365_v1) 下载,其提供了 CLI 和 SDK 两者下载方式。
+
+下载并解压后,将其放置或者软链接到 `data/objects365v1` 目录下,目录结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── objects365v1
+│ │ ├── objects365_train.json
+│ │ ├── objects365_val.json
+│ │ ├── train
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── test
+```
+
+然后使用 [coco2odvg.py](../../tools/dataset_converters/coco2odvg.py) 转换为训练所需的 ODVG 格式:
+
+```shell
+python tools/dataset_converters/coco2odvg.py data/objects365v1/objects365_train.json -d o365v1
+```
+
+程序运行完成后会在 `data/objects365v1` 目录下创建 `o365v1_train_od.json` 和 `o365v1_label_map.json` 两个新文件,完整结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── objects365v1
+│ │ ├── objects365_train.json
+│ │ ├── objects365_val.json
+│ │ ├── o365v1_train_od.json
+│ │ ├── o365v1_label_map.json
+│ │ ├── train
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── test
+```
+
+### 2 COCO 2017
+
+上述配置在训练过程中会评估 COCO 2017 数据集的性能,因此需要准备 COCO 2017 数据集。你可以从 [COCO](https://cocodataset.org/) 官网下载或者从 [opendatalab](https://opendatalab.com/OpenDataLab/COCO_2017) 下载
+
+下载并解压后,将其放置或者软链接到 `data/coco` 目录下,目录结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+### 3 GoldG
+
+下载该数据集后就可以训练 [grounding_dino_swin-t_pretrain_obj365_goldg](./grounding_dino_swin-t_pretrain_obj365_goldg.py) 配置了。
+
+GoldG 数据集包括 `GQA` 和 `Flickr30k` 两个数据集,来自 GLIP 论文中提到的 MixedGrounding 数据集,其排除了 COCO 数据集。下载链接为 [mdetr_annotations](https://huggingface.co/GLIPModel/GLIP/tree/main/mdetr_annotations),我们目前需要的是 `mdetr_annotations/final_mixed_train_no_coco.json` 和 `mdetr_annotations/final_flickr_separateGT_train.json` 文件。
+
+然后下载 [GQA images](https://nlp.stanford.edu/data/gqa/images.zip) 图片。下载并解压后,将其放置或者软链接到 `data/gqa` 目录下,目录结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── gqa
+| | ├── final_mixed_train_no_coco.json
+│ │ ├── images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+然后下载 [Flickr30k images](http://shannon.cs.illinois.edu/DenotationGraph/) 图片。这个数据下载需要先申请,再获得下载链接后才可以下载。下载并解压后,将其放置或者软链接到 `data/flickr30k_entities` 目录下,目录结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── flickr30k_entities
+│ │ ├── final_flickr_separateGT_train.json
+│ │ ├── flickr30k_images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+对于 GQA 数据集,你需要使用 [goldg2odvg.py](../../tools/dataset_converters/goldg2odvg.py) 转换为训练所需的 ODVG 格式:
+
+```shell
+python tools/dataset_converters/goldg2odvg.py data/gqa/final_mixed_train_no_coco.json
+```
+
+程序运行完成后会在 `data/gqa` 目录下创建 `final_mixed_train_no_coco_vg.json` 新文件,完整结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── gqa
+| | ├── final_mixed_train_no_coco.json
+| | ├── final_mixed_train_no_coco_vg.json
+│ │ ├── images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+对于 Flickr30k 数据集,你需要使用 [goldg2odvg.py](../../tools/dataset_converters/goldg2odvg.py) 转换为训练所需的 ODVG 格式:
+
+```shell
+python tools/dataset_converters/goldg2odvg.py data/flickr30k_entities/final_flickr_separateGT_train.json
+```
+
+程序运行完成后会在 `data/flickr30k_entities` 目录下创建 `final_flickr_separateGT_train_vg.json` 新文件,完整结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── flickr30k_entities
+│ │ ├── final_flickr_separateGT_train.json
+│ │ ├── final_flickr_separateGT_train_vg.json
+│ │ ├── flickr30k_images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 4 GRIT-20M
+
+对应的训练配置为 [grounding_dino_swin-t_pretrain_obj365_goldg_grit9m](./grounding_dino_swin-t_pretrain_obj365_goldg_grit9m.py)
+
+GRIT数据集可以从 [GRIT](https://huggingface.co/datasets/zzliang/GRIT#download-image) 中使用 img2dataset 包下载,默认指令下载后数据集大小为 1.1T,下载和处理预估需要至少 2T 硬盘空间,可根据硬盘容量酌情下载。下载后原始格式为:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── grit_raw
+│ │ ├── 00000_stats.json
+│ │ ├── 00000.parquet
+│ │ ├── 00000.tar
+│ │ ├── 00001_stats.json
+│ │ ├── 00001.parquet
+│ │ ├── 00001.tar
+│ │ ├── ...
+```
+
+下载后需要对格式进行进一步处理:
+
+```shell
+python tools/dataset_converters/grit_processing.py data/grit_raw data/grit_processed
+```
+
+处理后的格式为:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── grit_processed
+│ │ ├── annotations
+│ │ │ ├── 00000.json
+│ │ │ ├── 00001.json
+│ │ │ ├── ...
+│ │ ├── images
+│ │ │ ├── 00000
+│ │ │ │ ├── 000000000.jpg
+│ │ │ │ ├── 000000003.jpg
+│ │ │ │ ├── 000000004.jpg
+│ │ │ │ ├── ...
+│ │ │ ├── 00001
+│ │ │ ├── ...
+```
+
+对于 GRIT 数据集,你需要使用 [grit2odvg.py](../../tools/dataset_converters/grit2odvg.py) 转化成需要的 ODVG 格式:
+
+```shell
+python tools/dataset_converters/grit2odvg.py data/grit_processed/
+```
+
+程序运行完成后会在 `data/grit_processed` 目录下创建 `grit20m_vg.json` 新文件,大概包含 9M 条数据,完整结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── grit_processed
+| | ├── grit20m_vg.json
+│ │ ├── annotations
+│ │ │ ├── 00000.json
+│ │ │ ├── 00001.json
+│ │ │ ├── ...
+│ │ ├── images
+│ │ │ ├── 00000
+│ │ │ │ ├── 000000000.jpg
+│ │ │ │ ├── 000000003.jpg
+│ │ │ │ ├── 000000004.jpg
+│ │ │ │ ├── ...
+│ │ │ ├── 00001
+│ │ │ ├── ...
+```
+
+### 5 V3Det
+
+对应的训练配置为
+
+- [grounding_dino_swin-t_pretrain_obj365_goldg_v3det](./grounding_dino_swin-t_pretrain_obj365_goldg_v3det.py)
+- [grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det](./grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py)
+
+V3Det 数据集下载可以从 [opendatalab](https://opendatalab.com/V3Det/V3Det) 下载,下载并解压后,将其放置或者软链接到 `data/v3det` 目录下,目录结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── v3det
+│ │ ├── annotations
+│ │ | ├── v3det_2023_v1_train.json
+│ │ ├── images
+│ │ │ ├── a00000066
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+然后使用 [coco2odvg.py](../../tools/dataset_converters/coco2odvg.py) 转换为训练所需的 ODVG 格式:
+
+```shell
+python tools/dataset_converters/coco2odvg.py data/v3det/annotations/v3det_2023_v1_train.json -d v3det
+```
+
+程序运行完成后会在 `data/v3det/annotations` 目录下创建目录下创建 `v3det_2023_v1_train_od.json` 和 `v3det_2023_v1_label_map.json` 两个新文件,完整结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── v3det
+│ │ ├── annotations
+│ │ | ├── v3det_2023_v1_train.json
+│ │ | ├── v3det_2023_v1_train_od.json
+│ │ | ├── v3det_2023_v1_label_map.json
+│ │ ├── images
+│ │ │ ├── a00000066
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 6 数据切分和可视化
+
+考虑到用户需要准备的数据集过多,不方便对图片和标注进行训练前确认,因此我们提供了一个数据切分和可视化的工具,可以将数据集切分为 tiny 版本,然后使用可视化脚本查看图片和标签正确性。
+
+1. 切分数据集
+
+脚本位于 [这里](../../tools/misc/split_odvg.py), 以 `Object365 v1` 为例,切分数据集的命令如下:
+
+```shell
+python tools/misc/split_odvg.py data/object365_v1/ o365v1_train_od.json train your_output_dir --label-map-file o365v1_label_map.json -n 200
+```
+
+上述脚本运行后会在 `your_output_dir` 目录下创建和 `data/object365_v1/` 一样的文件夹结构,但是只会保存 200 张训练图片和对应的 json,方便用户查看。
+
+2. 可视化原始数据集
+
+脚本位于 [这里](../../tools/analysis_tools/browse_grounding_raw.py), 以 `Object365 v1` 为例,可视化数据集的命令如下:
+
+```shell
+python tools/analysis_tools/browse_grounding_raw.py data/object365_v1/ o365v1_train_od.json train --label-map-file o365v1_label_map.json -o your_output_dir --not-show
+```
+
+上述脚本运行后会在 `your_output_dir` 目录下生成同时包括图片和标签的图片,方便用户查看。
+
+3. 可视化 dataset 输出的数据集
+
+脚本位于 [这里](../../tools/analysis_tools/browse_grounding_dataset.py), 用户可以通过该脚本查看 dataset 输出的结果即包括了数据增强的结果。 以 `Object365 v1` 为例,可视化数据集的命令如下:
+
+```shell
+python tools/analysis_tools/browse_grounding_dataset.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py -o your_output_dir --not-show
+```
+
+上述脚本运行后会在 `your_output_dir` 目录下生成同时包括图片和标签的图片,方便用户查看。
+
+## MM-GDINO-L 预训练数据准备和处理
+
+### 1 Object365 v2
+
+Objects365_v2 可以从 [opendatalab](https://opendatalab.com/OpenDataLab/Objects365) 下载,其提供了 CLI 和 SDK 两者下载方式。
+
+下载并解压后,将其放置或者软链接到 `data/objects365v2` 目录下,目录结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── objects365v2
+│ │ ├── annotations
+│ │ │ ├── zhiyuan_objv2_train.json
+│ │ ├── train
+│ │ │ ├── patch0
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+由于 objects365v2 类别中有部分类名是错误的,因此需要先进行修正。
+
+```shell
+python tools/dataset_converters/fix_o365_names.py
+```
+
+会在 `data/objects365v2/annotations` 下生成新的标注文件 `zhiyuan_objv2_train_fixname.json`。
+
+然后使用 [coco2odvg.py](../../tools/dataset_converters/coco2odvg.py) 转换为训练所需的 ODVG 格式:
+
+```shell
+python tools/dataset_converters/coco2odvg.py data/objects365v2/annotations/zhiyuan_objv2_train_fixname.json -d o365v2
+```
+
+程序运行完成后会在 `data/objects365v2` 目录下创建 `zhiyuan_objv2_train_fixname_od.json` 和 `o365v2_label_map.json` 两个新文件,完整结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── objects365v2
+│ │ ├── annotations
+│ │ │ ├── zhiyuan_objv2_train.json
+│ │ │ ├── zhiyuan_objv2_train_fixname.json
+│ │ │ ├── zhiyuan_objv2_train_fixname_od.json
+│ │ │ ├── o365v2_label_map.json
+│ │ ├── train
+│ │ │ ├── patch0
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 2 OpenImages v6
+
+OpenImages v6 可以从 [官网](https://storage.googleapis.com/openimages/web/download_v6.html) 下载,由于数据集比较大,需要花费一定的时间,下载完成后文件结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── OpenImages
+│ │ ├── annotations
+| │ │ ├── oidv6-train-annotations-bbox.csv
+| │ │ ├── class-descriptions-boxable.csv
+│ │ ├── OpenImages
+│ │ │ ├── train
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+然后使用 [openimages2odvg.py](../../tools/dataset_converters/openimages2odvg.py) 转换为训练所需的 ODVG 格式:
+
+```shell
+python tools/dataset_converters/openimages2odvg.py data/OpenImages/annotations
+```
+
+程序运行完成后会在 `data/OpenImages/annotations` 目录下创建 `oidv6-train-annotation_od.json` 和 `openimages_label_map.json` 两个新文件,完整结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── OpenImages
+│ │ ├── annotations
+| │ │ ├── oidv6-train-annotations-bbox.csv
+| │ │ ├── class-descriptions-boxable.csv
+| │ │ ├── oidv6-train-annotations_od.json
+| │ │ ├── openimages_label_map.json
+│ │ ├── OpenImages
+│ │ │ ├── train
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 3 V3Det
+
+参见前面的 MM-GDINO-T 预训练数据准备和处理 数据准备部分,完整数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── v3det
+│ │ ├── annotations
+│ │ | ├── v3det_2023_v1_train.json
+│ │ | ├── v3det_2023_v1_train_od.json
+│ │ | ├── v3det_2023_v1_label_map.json
+│ │ ├── images
+│ │ │ ├── a00000066
+│ │ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 4 LVIS 1.0
+
+参见后面的 `微调数据集准备` 的 `2 LVIS 1.0` 部分。完整数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── lvis_v1_train.json
+│ │ │ ├── lvis_v1_val.json
+│ │ │ ├── lvis_v1_train_od.json
+│ │ │ ├── lvis_v1_label_map.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── lvis_v1_minival_inserted_image_name.json
+│ │ │ ├── lvis_od_val.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+### 5 COCO2017 OD
+
+数据准备可以参考前面的 `MM-GDINO-T 预训练数据准备和处理` 部分。为了方便后续处理,请将下载的 [mdetr_annotations](https://huggingface.co/GLIPModel/GLIP/tree/main/mdetr_annotations) 文件夹软链接或者移动到 `data/coco` 路径下
+完整数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── ...
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+由于 COCO2017 train 和 RefCOCO/RefCOCO+/RefCOCOg/gRefCOCO val 中存在部分重叠,如果不提前移除,在评测 RefExp 时候会存在数据泄露。
+
+```shell
+python tools/dataset_converters/remove_cocotrain2017_from_refcoco.py data/coco/mdetr_annotations data/coco/annotations/instances_train2017.json
+```
+
+会在 `data/coco/annotations` 目录下创建 `instances_train2017_norefval.json` 新文件。最后使用 [coco2odvg.py](../../tools/dataset_converters/coco2odvg.py) 转换为训练所需的 ODVG 格式:
+
+```shell
+python tools/dataset_converters/coco2odvg.py data/coco/annotations/instances_train2017_norefval.json -d coco
+```
+
+会在 `data/coco/annotations` 目录下创建 `instances_train2017_norefval_od.json` 和 `coco_label_map.json` 两个新文件,完整结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2017_norefval_od.json
+│ │ │ ├── coco_label_map.json
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── ...
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+注意: COCO2017 train 和 LVIS 1.0 val 数据集有 15000 张图片重复,因此一旦在训练中使用了 COCO2017 train,那么 LVIS 1.0 val 的评测结果就存在数据泄露问题,LVIS 1.0 minival 没有这个问题。
+
+### 6 GoldG
+
+参见 MM-GDINO-T 预训练数据准备和处理 部分
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── flickr30k_entities
+│ │ ├── final_flickr_separateGT_train.json
+│ │ ├── final_flickr_separateGT_train_vg.json
+│ │ ├── flickr30k_images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ ├── gqa
+| | ├── final_mixed_train_no_coco.json
+| | ├── final_mixed_train_no_coco_vg.json
+│ │ ├── images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 7 COCO2014 VG
+
+MDetr 中提供了 COCO2014 train 的 Phrase Grounding 版本标注, 最原始标注文件为 `final_mixed_train.json`,和之前类似,文件结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_mixed_train.json
+│ │ │ ├── ...
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+我们可以从 `final_mixed_train.json` 中提取出 COCO 部分数据
+
+```shell
+python tools/dataset_converters/extract_coco_from_mixed.py data/coco/mdetr_annotations/final_mixed_train.json
+```
+
+会在 `data/coco/mdetr_annotations` 目录下创建 `final_mixed_train_only_coco.json` 新文件,最后使用 [goldg2odvg.py](../../tools/dataset_converters/goldg2odvg.py) 转换为训练所需的 ODVG 格式:
+
+```shell
+python tools/dataset_converters/goldg2odvg.py data/coco/mdetr_annotations/final_mixed_train_only_coco.json
+```
+
+会在 `data/coco/mdetr_annotations` 目录下创建 `final_mixed_train_only_coco_vg.json` 新文件,完整结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_mixed_train.json
+│ │ │ ├── final_mixed_train_only_coco.json
+│ │ │ ├── final_mixed_train_only_coco_vg.json
+│ │ │ ├── ...
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+注意: COCO2014 train 和 COCO2017 val 没有重复图片,因此不用担心 COCO 评测的数据泄露问题。
+
+### 8 Referring Expression Comprehension
+
+其一共包括 4 个数据集。数据准备部分请参见 微调数据集准备 部分。
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── finetune_refcoco_testB.json
+│ │ │ ├── finetune_refcoco+_testA.json
+│ │ │ ├── finetune_refcoco+_testB.json
+│ │ │ ├── finetune_refcocog_test.json
+│ │ │ ├── finetune_refcoco_train_vg.json
+│ │ │ ├── finetune_refcoco+_train_vg.json
+│ │ │ ├── finetune_refcocog_train_vg.json
+│ │ │ ├── finetune_grefcoco_train_vg.json
+```
+
+### 9 GRIT-20M
+
+参见 MM-GDINO-T 预训练数据准备和处理 部分
+
+## 评测数据集准备
+
+### 1 COCO 2017
+
+数据准备流程和前面描述一致,最终结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+### 2 LVIS 1.0
+
+LVIS 1.0 val 数据集包括 mini 和全量两个版本,mini 版本存在的意义是:
+
+1. LVIS val 全量评测数据集比较大,评测一次需要比较久的时间
+2. LVIS val 全量数据集中包括了 15000 张 COCO2017 train, 如果用户使用了 COCO2017 数据进行训练,那么将存在数据泄露问题
+
+LVIS 1.0 图片和 COCO2017 数据集图片完全一样,只是提供了新的标注而已,minival 标注文件可以从 [这里](https://huggingface.co/GLIPModel/GLIP/blob/main/lvis_v1_minival_inserted_image_name.json)下载, val 1.0 标注文件可以从 [这里](https://huggingface.co/GLIPModel/GLIP/blob/main/lvis_od_val.json) 下载。 最终结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── lvis_v1_minival_inserted_image_name.json
+│ │ │ ├── lvis_od_val.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+### 3 ODinW
+
+ODinw 全称为 Object Detection in the Wild,是用于验证 grounding 预训练模型在不同实际场景中的泛化能力的数据集,其包括两个子集,分别是 ODinW13 和 ODinW35,代表是由 13 和 35 个数据集组成的。你可以从 [这里](https://huggingface.co/GLIPModel/GLIP/tree/main/odinw_35)下载,然后对每个文件进行解压,最终结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── odinw
+│ │ ├── AerialMaritimeDrone
+│ │ | |── large
+│ │ | | ├── test
+│ │ | | ├── train
+│ │ | | ├── valid
+│ │ | |── tiled
+│ │ ├── AmericanSignLanguageLetters
+│ │ ├── Aquarium
+│ │ ├── BCCD
+│ │ ├── ...
+```
+
+在评测 ODinW3535 时候由于需要自定义 prompt,因此需要提前对标注的 json 文件进行处理,你可以使用 [override_category.py](./odinw/override_category.py) 脚本进行处理,处理后会生成新的标注文件,不会覆盖原先的标注文件。
+
+```shell
+python configs/mm_grounding_dino/odinw/override_category.py data/odinw/
+```
+
+### 4 DOD
+
+DOD 来自 [Described Object Detection: Liberating Object Detection with Flexible Expressions](https://arxiv.org/abs/2307.12813)。其数据集可以从 [这里](https://github.com/shikras/d-cube?tab=readme-ov-file#download)下载,最终的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── d3
+│ │ ├── d3_images
+│ │ ├── d3_json
+│ │ ├── d3_pkl
+```
+
+### 5 Flickr30k Entities
+
+在前面 GoldG 数据准备章节中我们已经下载了 Flickr30k 训练所需文件,评估所需的文件是 2 个 json 文件,你可以从 [这里](https://huggingface.co/GLIPModel/GLIP/blob/main/mdetr_annotations/final_flickr_separateGT_val.json) 和 [这里](https://huggingface.co/GLIPModel/GLIP/blob/main/mdetr_annotations/final_flickr_separateGT_test.json)下载,最终的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── flickr30k_entities
+│ │ ├── final_flickr_separateGT_train.json
+│ │ ├── final_flickr_separateGT_val.json
+│ │ ├── final_flickr_separateGT_test.json
+│ │ ├── final_flickr_separateGT_train_vg.json
+│ │ ├── flickr30k_images
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+```
+
+### 6 Referring Expression Comprehension
+
+指代性表达式理解包括 4 个数据集: RefCOCO, RefCOCO+, RefCOCOg, gRefCOCO。这 4 个数据集所采用的图片都来自于 COCO2014 train,和 COCO2017 类似,你可以从 COCO 官方或者 opendatalab 中下载,而标注可以直接从 [这里](https://huggingface.co/GLIPModel/GLIP/tree/main/mdetr_annotations) 下载,mdetr_annotations 文件夹里面包括了其他大量的标注,你如果觉得数量过多,可以只下载所需要的几个 json 文件即可。最终的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── finetune_refcoco_testB.json
+│ │ │ ├── finetune_refcoco+_testA.json
+│ │ │ ├── finetune_refcoco+_testB.json
+│ │ │ ├── finetune_refcocog_test.json
+│ │ │ ├── finetune_refcocog_test.json
+```
+
+注意 gRefCOCO 是在 [GREC: Generalized Referring Expression Comprehension](https://arxiv.org/abs/2308.16182) 被提出,并不在 `mdetr_annotations` 文件夹中,需要自行处理。具体步骤为:
+
+1. 下载 [gRefCOCO](https://github.com/henghuiding/gRefCOCO?tab=readme-ov-file#grefcoco-dataset-download),并解压到 data/coco/ 文件夹中
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ ├── grefs
+│ │ │ ├── grefs(unc).json
+│ │ │ ├── instances.json
+```
+
+2. 转换为 coco 格式
+
+你可以使用 gRefCOCO 官方提供的[转换脚本](https://github.com/henghuiding/gRefCOCO/blob/b4b1e55b4d3a41df26d6b7d843ea011d581127d4/mdetr/scripts/fine-tuning/grefexp_coco_format.py)。注意需要将被注释的 161 行打开,并注释 160 行才可以得到全量的 json 文件。
+
+```shell
+# 需要克隆官方 repo
+git clone https://github.com/henghuiding/gRefCOCO.git
+cd gRefCOCO/mdetr
+python scripts/fine-tuning/grefexp_coco_format.py --data_path ../../data/coco/grefs --out_path ../../data/coco/mdetr_annotations/ --coco_path ../../data/coco
+```
+
+会在 `data/coco/mdetr_annotations/` 文件夹中生成 4 个 json 文件,完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── finetune_refcoco_testB.json
+│ │ │ ├── finetune_grefcoco_train.json
+│ │ │ ├── finetune_grefcoco_val.json
+│ │ │ ├── finetune_grefcoco_testA.json
+│ │ │ ├── finetune_grefcoco_testB.json
+```
+
+## 微调数据集准备
+
+### 1 COCO 2017
+
+COCO 是检测领域最常用的数据集,我们希望能够更充分探索其微调模式。从目前发展来看,一共有 3 种微调方式:
+
+1. 闭集微调,即微调后文本端将无法修改描述,转变为闭集算法,在 COCO 上性能能够最大化,但是失去了通用性。
+2. 开集继续预训练微调,即对 COCO 数据集采用和预训练一致的预训练手段。此时有两种做法,第一种是降低学习率并固定某些模块,仅仅在 COCO 数据上预训练,第二种是将 COCO 数据和部分预训练数据混合一起训练,两种方式的目的都是在尽可能不降低泛化性时提高 COCO 数据集性能
+3. 开放词汇微调,即采用 OVD 领域常用做法,将 COCO 类别分成 base 类和 novel 类,训练时候仅仅在 base 类上进行,评测在 base 和 novel 类上进行。这种方式可以验证 COCO OVD 能力,目的也是在尽可能不降低泛化性时提高 COCO 数据集性能
+
+**(1) 闭集微调**
+
+这个部分无需准备数据,直接用之前的数据即可。
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+**(2) 开集继续预训练微调**
+这种方式需要将 COCO 训练数据转换为 ODVG 格式,你可以使用如下命令转换:
+
+```shell
+python tools/dataset_converters/coco2odvg.py data/coco/annotations/instances_train2017.json -d coco
+```
+
+会在 `data/coco/annotations/` 下生成新的 `instances_train2017_od.json` 和 `coco2017_label_map.json`,完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_train2017_od.json
+│ │ │ ├── coco2017_label_map.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+在得到数据后,你可以自行选择单独预习还是混合预训练方式。
+
+**(3) 开放词汇微调**
+这种方式需要将 COCO 训练数据转换为 OVD 格式,你可以使用如下命令转换:
+
+```shell
+python tools/dataset_converters/coco2ovd.py data/coco/
+```
+
+会在 `data/coco/annotations/` 下生成新的 `instances_val2017_all_2.json` 和 `instances_val2017_seen_2.json`,完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_train2017_od.json
+│ │ │ ├── instances_val2017_all_2.json
+│ │ │ ├── instances_val2017_seen_2.json
+│ │ │ ├── coco2017_label_map.json
+│ │ │ ├── instances_val2017.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+然后可以直接使用 [配置](coco/grounding_dino_swin-t_finetune_16xb4_1x_coco_48_17.py) 进行训练和测试。
+
+### 2 LVIS 1.0
+
+LVIS 是一个包括 1203 类的数据集,同时也是一个长尾联邦数据集,对其进行微调很有意义。 由于其类别过多,我们无法对其进行闭集微调,因此只能采用开集继续预训练微调和开放词汇微调。
+
+你需要先准备好 LVIS 训练 JSON 文件,你可以从 [这里](https://www.lvisdataset.org/dataset) 下载,我们只需要 `lvis_v1_train.json` 和 `lvis_v1_val.json`,然后将其放到 `data/coco/annotations/` 下,然后运行如下命令:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── lvis_v1_train.json
+│ │ │ ├── lvis_v1_val.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── lvis_v1_minival_inserted_image_name.json
+│ │ │ ├── lvis_od_val.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+(1) 开集继续预训练微调
+
+使用如下命令转换为 ODVG 格式:
+
+```shell
+python tools/dataset_converters/lvis2odvg.py data/coco/annotations/lvis_v1_train.json
+```
+
+会在 `data/coco/annotations/` 下生成新的 `lvis_v1_train_od.json` 和 `lvis_v1_label_map.json`,完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── lvis_v1_train.json
+│ │ │ ├── lvis_v1_val.json
+│ │ │ ├── lvis_v1_train_od.json
+│ │ │ ├── lvis_v1_label_map.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── lvis_v1_minival_inserted_image_name.json
+│ │ │ ├── lvis_od_val.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+然后可以直接使用 [配置](lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis.py) 进行训练测试,或者你修改配置将其和部分预训练数据集混合使用。
+
+**(2) 开放词汇微调**
+
+使用如下命令转换为 OVD 格式:
+
+```shell
+python tools/dataset_converters/lvis2ovd.py data/coco/
+```
+
+会在 `data/coco/annotations/` 下生成新的 `lvis_v1_train_od_norare.json` 和 `lvis_v1_label_map_norare.json`,完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── lvis_v1_train.json
+│ │ │ ├── lvis_v1_val.json
+│ │ │ ├── lvis_v1_train_od.json
+│ │ │ ├── lvis_v1_label_map.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── lvis_v1_minival_inserted_image_name.json
+│ │ │ ├── lvis_od_val.json
+│ │ │ ├── lvis_v1_train_od_norare.json
+│ │ │ ├── lvis_v1_label_map_norare.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+```
+
+然后可以直接使用 [配置](lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis_866_337.py) 进行训练测试
+
+### 3 RTTS
+
+RTTS 是一个浓雾天气数据集,该数据集包含 4,322 张雾天图像,包含五个类:自行车 (bicycle)、公共汽车 (bus)、汽车 (car)、摩托车 (motorbike) 和人 (person)。可以从 [这里](https://drive.google.com/file/d/15Ei1cHGVqR1mXFep43BO7nkHq1IEGh1e/view)下载, 然后解压到 `data/RTTS/` 文件夹中。完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── RTTS
+│ │ ├── annotations_json
+│ │ ├── annotations_xml
+│ │ ├── ImageSets
+│ │ ├── JPEGImages
+```
+
+### 4 RUOD
+
+RUOD 是一个水下目标检测数据集,你可以从 [这里](https://drive.google.com/file/d/1hxtbdgfVveUm_DJk5QXkNLokSCTa_E5o/view)下载, 然后解压到 `data/RUOD/` 文件夹中。完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── RUOD
+│ │ ├── Environment_pic
+│ │ ├── Environmet_ANN
+│ │ ├── RUOD_ANN
+│ │ ├── RUOD_pic
+```
+
+### 5 Brain Tumor
+
+Brain Tumor 是一个医学领域的 2d 检测数据集,你可以从 [这里](https://universe.roboflow.com/roboflow-100/brain-tumor-m2pbp/dataset/2)下载, 请注意选择 `COCO JSON` 格式。然后解压到 `data/brain_tumor_v2/` 文件夹中。完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── brain_tumor_v2
+│ │ ├── test
+│ │ ├── train
+│ │ ├── valid
+```
+
+### 6 Cityscapes
+
+Cityscapes 是一个城市街景数据集,你可以从 [这里](https://www.cityscapes-dataset.com/) 或者 opendatalab 中下载, 然后解压到 `data/cityscapes/` 文件夹中。完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── cityscapes
+│ │ ├── annotations
+│ │ ├── leftImg8bit
+│ │ │ ├── train
+│ │ │ ├── val
+│ │ ├── gtFine
+│ │ │ ├── train
+│ │ │ ├── val
+```
+
+在下载后,然后使用 [cityscapes.py](../../tools/dataset_converters/cityscapes.py) 脚本生成我们所需要的 json 格式
+
+```shell
+python tools/dataset_converters/cityscapes.py data/cityscapes/
+```
+
+会在 annotations 中生成 3 个新的 json 文件。完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── cityscapes
+│ │ ├── annotations
+│ │ │ ├── instancesonly_filtered_gtFine_train.json
+│ │ │ ├── instancesonly_filtered_gtFine_val.json
+│ │ │ ├── instancesonly_filtered_gtFine_test.json
+│ │ ├── leftImg8bit
+│ │ │ ├── train
+│ │ │ ├── val
+│ │ ├── gtFine
+│ │ │ ├── train
+│ │ │ ├── val
+```
+
+### 7 People in Painting
+
+People in Painting 是一个油画数据集,你可以从 [这里](https://universe.roboflow.com/roboflow-100/people-in-paintings/dataset/2), 请注意选择 `COCO JSON` 格式。然后解压到 `data/people_in_painting_v2/` 文件夹中。完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── people_in_painting_v2
+│ │ ├── test
+│ │ ├── train
+│ │ ├── valid
+```
+
+### 8 Referring Expression Comprehension
+
+指代性表达式理解的微调和前面一样,也是包括 4 个数据集,在评测数据准备阶段已经全部整理好了,完整的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── finetune_refcoco_testB.json
+│ │ │ ├── finetune_refcoco+_testA.json
+│ │ │ ├── finetune_refcoco+_testB.json
+│ │ │ ├── finetune_refcocog_test.json
+│ │ │ ├── finetune_refcocog_test.json
+```
+
+然后我们需要将其转换为所需的 ODVG 格式,请使用 [refcoco2odvg.py](../../tools/dataset_converters/refcoco2odvg.py) 脚本转换,
+
+```shell
+python tools/dataset_converters/refcoco2odvg.py data/coco/mdetr_annotations
+```
+
+会在 `data/coco/mdetr_annotations` 中生成新的 4 个 json 文件。 转换后的数据集结构如下:
+
+```text
+mmdetection
+├── configs
+├── data
+│ ├── coco
+│ │ ├── annotations
+│ │ │ ├── instances_train2017.json
+│ │ │ ├── instances_val2017.json
+│ │ │ ├── instances_train2014.json
+│ │ ├── train2017
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── val2017
+│ │ │ ├── xxxx.jpg
+│ │ │ ├── ...
+│ │ ├── train2014
+│ │ │ ├── xxx.jpg
+│ │ │ ├── ...
+│ │ ├── mdetr_annotations
+│ │ │ ├── final_refexp_val.json
+│ │ │ ├── finetune_refcoco_testA.json
+│ │ │ ├── finetune_refcoco_testB.json
+│ │ │ ├── finetune_refcoco+_testA.json
+│ │ │ ├── finetune_refcoco+_testB.json
+│ │ │ ├── finetune_refcocog_test.json
+│ │ │ ├── finetune_refcoco_train_vg.json
+│ │ │ ├── finetune_refcoco+_train_vg.json
+│ │ │ ├── finetune_refcocog_train_vg.json
+│ │ │ ├── finetune_grefcoco_train_vg.json
+```
diff --git a/configs/mm_grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py b/configs/mm_grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py
new file mode 100644
index 00000000000..e59a0a52518
--- /dev/null
+++ b/configs/mm_grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py
@@ -0,0 +1,78 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/d3/'
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities', 'sent_ids'))
+]
+
+# -------------------------------------------------#
+val_dataset_full = dict(
+ type='DODDataset',
+ data_root=data_root,
+ ann_file='d3_json/d3_full_annotations.json',
+ data_prefix=dict(img='d3_images/', anno='d3_pkl'),
+ pipeline=test_pipeline,
+ test_mode=True,
+ backend_args=None,
+ return_classes=True)
+
+val_evaluator_full = dict(
+ type='DODCocoMetric',
+ ann_file=data_root + 'd3_json/d3_full_annotations.json')
+
+# -------------------------------------------------#
+val_dataset_pres = dict(
+ type='DODDataset',
+ data_root=data_root,
+ ann_file='d3_json/d3_pres_annotations.json',
+ data_prefix=dict(img='d3_images/', anno='d3_pkl'),
+ pipeline=test_pipeline,
+ test_mode=True,
+ backend_args=None,
+ return_classes=True)
+val_evaluator_pres = dict(
+ type='DODCocoMetric',
+ ann_file=data_root + 'd3_json/d3_pres_annotations.json')
+
+# -------------------------------------------------#
+val_dataset_abs = dict(
+ type='DODDataset',
+ data_root=data_root,
+ ann_file='d3_json/d3_abs_annotations.json',
+ data_prefix=dict(img='d3_images/', anno='d3_pkl'),
+ pipeline=test_pipeline,
+ test_mode=True,
+ backend_args=None,
+ return_classes=True)
+val_evaluator_abs = dict(
+ type='DODCocoMetric',
+ ann_file=data_root + 'd3_json/d3_abs_annotations.json')
+
+# -------------------------------------------------#
+datasets = [val_dataset_full, val_dataset_pres, val_dataset_abs]
+dataset_prefixes = ['FULL', 'PRES', 'ABS']
+metrics = [val_evaluator_full, val_evaluator_pres, val_evaluator_abs]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/mm_grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_parallel_dod.py b/configs/mm_grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_parallel_dod.py
new file mode 100644
index 00000000000..3d680091162
--- /dev/null
+++ b/configs/mm_grounding_dino/dod/grounding_dino_swin-t_pretrain_zeroshot_parallel_dod.py
@@ -0,0 +1,3 @@
+_base_ = 'grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py'
+
+model = dict(test_cfg=dict(chunked_size=1))
diff --git a/configs/mm_grounding_dino/flickr30k/grounding_dino_swin-t-pretrain_flickr30k.py b/configs/mm_grounding_dino/flickr30k/grounding_dino_swin-t-pretrain_flickr30k.py
new file mode 100644
index 00000000000..e9eb783da97
--- /dev/null
+++ b/configs/mm_grounding_dino/flickr30k/grounding_dino_swin-t-pretrain_flickr30k.py
@@ -0,0 +1,57 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+dataset_type = 'Flickr30kDataset'
+data_root = 'data/flickr30k_entities/'
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities',
+ 'tokens_positive', 'phrase_ids', 'phrases'))
+]
+
+dataset_Flickr30k_val = dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='final_flickr_separateGT_val.json',
+ data_prefix=dict(img='flickr30k_images/'),
+ pipeline=test_pipeline,
+)
+
+dataset_Flickr30k_test = dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='final_flickr_separateGT_test.json',
+ data_prefix=dict(img='flickr30k_images/'),
+ pipeline=test_pipeline,
+)
+
+val_evaluator_Flickr30k = dict(type='Flickr30kMetric')
+
+test_evaluator_Flickr30k = dict(type='Flickr30kMetric')
+
+# ----------Config---------- #
+dataset_prefixes = ['Flickr30kVal', 'Flickr30kTest']
+datasets = [dataset_Flickr30k_val, dataset_Flickr30k_test]
+metrics = [val_evaluator_Flickr30k, test_evaluator_Flickr30k]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/mm_grounding_dino/grounding_dino_swin-l_pretrain_all.py b/configs/mm_grounding_dino/grounding_dino_swin-l_pretrain_all.py
new file mode 100644
index 00000000000..46241e2e03b
--- /dev/null
+++ b/configs/mm_grounding_dino/grounding_dino_swin-l_pretrain_all.py
@@ -0,0 +1,542 @@
+_base_ = 'grounding_dino_swin-t_pretrain_obj365.py'
+
+pretrained = 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window12_384_22k.pth' # noqa
+num_levels = 5
+model = dict(
+ num_feature_levels=num_levels,
+ backbone=dict(
+ _delete_=True,
+ type='SwinTransformer',
+ pretrain_img_size=384,
+ embed_dims=192,
+ depths=[2, 2, 18, 2],
+ num_heads=[6, 12, 24, 48],
+ window_size=12,
+ mlp_ratio=4,
+ qkv_bias=True,
+ qk_scale=None,
+ drop_rate=0.,
+ attn_drop_rate=0.,
+ drop_path_rate=0.2,
+ patch_norm=True,
+ out_indices=(0, 1, 2, 3),
+ # Please only add indices that would be used
+ # in FPN, otherwise some parameter will not be used
+ with_cp=True,
+ convert_weights=True,
+ frozen_stages=-1,
+ init_cfg=dict(type='Pretrained', checkpoint=pretrained)),
+ neck=dict(in_channels=[192, 384, 768, 1536], num_outs=num_levels),
+ encoder=dict(layer_cfg=dict(self_attn_cfg=dict(num_levels=num_levels))),
+ decoder=dict(layer_cfg=dict(cross_attn_cfg=dict(num_levels=num_levels))))
+
+# --------------------------- object365v2 od dataset---------------------------
+# objv2_backend_args = dict(
+# backend='petrel',
+# path_mapping=dict({
+# './data/objects365v2/': 'yudong:s3://wangyudong/obj365_v2/',
+# 'data/objects365v2/': 'yudong:s3://wangyudong/obj365_v2/'
+# }))
+objv2_backend_args = None
+
+objv2_train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=objv2_backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ # change this
+ label_map_file='data/objects365v2/annotations/o365v2_label_map.json',
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+o365v2_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/objects365v2/',
+ ann_file='annotations/zhiyuan_objv2_train_od.json',
+ label_map_file='annotations/o365v2_label_map.json',
+ data_prefix=dict(img='train/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=objv2_train_pipeline,
+ return_classes=True,
+ need_text=False,
+ backend_args=None,
+)
+
+# --------------------------- openimagev6 od dataset---------------------------
+# oi_backend_args = dict(
+# backend='petrel',
+# path_mapping=dict({
+# './data/': 's3://openmmlab/datasets/detection/',
+# 'data/': 's3://openmmlab/datasets/detection/'
+# }))
+oi_backend_args = None
+
+oi_train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=oi_backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ # change this
+ label_map_file='data/OpenImages/annotations/openimages_label_map.json',
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+oiv6_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/OpenImages/',
+ ann_file='annotations/oidv6-train-annotations_od.json',
+ label_map_file='annotations/openimages_label_map.json',
+ data_prefix=dict(img='OpenImages/train/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ need_text=False,
+ pipeline=oi_train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+# --------------------------- v3det od dataset---------------------------
+v3d_train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ # change this
+ label_map_file='data/V3Det/annotations/v3det_2023_v1_label_map.json',
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+v3det_dataset = dict(
+ type='RepeatDataset',
+ times=2,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root='data/V3Det/',
+ ann_file='annotations/v3det_2023_v1_train_od.json',
+ label_map_file='annotations/v3det_2023_v1_label_map.json',
+ data_prefix=dict(img=''),
+ filter_cfg=dict(filter_empty_gt=False),
+ need_text=False,
+ pipeline=v3d_train_pipeline,
+ return_classes=True,
+ backend_args=None))
+
+# --------------------------- lvis od dataset---------------------------
+lvis_train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ # change this
+ label_map_file='data/coco/annotations/lvis_v1_label_map.json',
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+lvis_dataset = dict(
+ type='ClassBalancedDataset',
+ oversample_thr=1e-3,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root='data/coco/',
+ ann_file='annotations/lvis_v1_train_od.json',
+ label_map_file='annotations/lvis_v1_label_map.json',
+ data_prefix=dict(img=''),
+ filter_cfg=dict(filter_empty_gt=False),
+ need_text=False, # change this
+ pipeline=lvis_train_pipeline,
+ return_classes=True,
+ backend_args=None))
+
+# --------------------------- coco2017 od dataset---------------------------
+coco2017_train_dataset = dict(
+ type='RepeatDataset',
+ times=2,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root='data/coco/',
+ ann_file='annotations/instance_train2017_norefval_od.json',
+ label_map_file='annotations/coco2017_label_map.json',
+ data_prefix=dict(img='train2017'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None))
+
+# --------------------------- flickr30k vg dataset---------------------------
+flickr30k_dataset = dict(
+ type='RepeatDataset',
+ times=2,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root='data/flickr30k_entities/',
+ ann_file='final_flickr_separateGT_train_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='flickr30k_images/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None))
+
+# --------------------------- gqa vg dataset---------------------------
+gqa_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/gqa/',
+ ann_file='final_mixed_train_no_coco_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='images/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+# --------------------------- coco2014 vg dataset---------------------------
+coco2014_vg_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/coco/',
+ ann_file='mdetr_annotations/final_mixed_train_only_coco_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='train2014/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+# --------------------------- refcoco vg dataset---------------------------
+refcoco_dataset = dict(
+ type='RepeatDataset',
+ times=2,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root='data/coco/',
+ ann_file='mdetr_annotations/finetune_refcoco_train_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='train2014'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None))
+
+# --------------------------- refcoco+ vg dataset---------------------------
+refcoco_plus_dataset = dict(
+ type='RepeatDataset',
+ times=2,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root='data/coco/',
+ ann_file='mdetr_annotations/finetune_refcoco+_train_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='train2014'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None))
+
+# --------------------------- refcocog vg dataset---------------------------
+refcocog_dataset = dict(
+ type='RepeatDataset',
+ times=3,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root='data/coco/',
+ ann_file='mdetr_annotations/finetune_refcocog_train_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='train2014'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None))
+
+# --------------------------- grefcoco vg dataset---------------------------
+grefcoco_dataset = dict(
+ type='RepeatDataset',
+ times=2,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root='data/coco/',
+ ann_file='mdetr_annotations/finetune_grefcoco_train_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='train2014'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None))
+
+# --------------------------- grit vg dataset---------------------------
+# grit_backend_args = dict(
+# backend='petrel',
+# path_mapping=dict({
+# './data/grit/': 'yichen:s3://chenyicheng/grit/',
+# 'data/grit/': 'yichen:s3://chenyicheng/grit/'
+# }))
+grit_backend_args = None
+
+grit_train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=grit_backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+grit_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/grit/',
+ ann_file='grit20m_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img=''),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=grit_train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+# --------------------------- dataloader---------------------------
+train_dataloader = dict(
+ batch_size=4,
+ num_workers=4,
+ sampler=dict(
+ _delete_=True,
+ type='CustomSampleSizeSampler',
+ ratio_mode=True,
+ # OD ~ 1.74+1.67*0.5+0.18*2+0.12*2+0.1=3.2
+ # vg ~ 0.15*2+0.62*1+0.49*1+0.12*2+0.12*2+0.08*3+0.19*2+9*0.09=3.3
+ dataset_size=[-1, 0.5, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0.09]),
+ dataset=dict(datasets=[
+ o365v2_dataset, # 1.74M
+ oiv6_dataset, # 1.67M
+ v3det_dataset, # 0.18M
+ coco2017_train_dataset, # 0.12M
+ lvis_dataset, # 0.1M
+ flickr30k_dataset, # 0.15M
+ gqa_dataset, # 0.62M
+ coco2014_vg_dataset, # 0.49M
+ refcoco_dataset, # 0.12M
+ refcoco_plus_dataset, # 0.12M
+ refcocog_dataset, # 0.08M
+ grefcoco_dataset, # 0.19M
+ grit_dataset # 9M
+ ]))
+
+# bs=256
+optim_wrapper = dict(optimizer=dict(lr=0.0008))
+
+# one epoch = (3.2+3.3)M/256 = 25390 iter
+# 24e=609360 iter
+# 16e=406240 iter
+# 20e=507800 iter
+max_iter = 609360
+train_cfg = dict(
+ _delete_=True,
+ type='IterBasedTrainLoop',
+ max_iters=max_iter,
+ val_interval=13000)
+
+param_scheduler = [
+ dict(type='LinearLR', start_factor=0.1, by_epoch=False, begin=0, end=1000),
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_iter,
+ by_epoch=False,
+ milestones=[406240, 507800],
+ gamma=0.1)
+]
+
+default_hooks = dict(
+ checkpoint=dict(by_epoch=False, interval=13000, max_keep_ckpts=30))
+log_processor = dict(by_epoch=False)
diff --git a/configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py b/configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py
new file mode 100644
index 00000000000..bf3b35894eb
--- /dev/null
+++ b/configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py
@@ -0,0 +1,102 @@
+_base_ = 'grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/cat/'
+class_name = ('cat', )
+num_classes = len(class_name)
+metainfo = dict(classes=class_name, palette=[(220, 20, 60)])
+
+model = dict(bbox_head=dict(num_classes=num_classes))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities'))
+]
+
+train_dataloader = dict(
+ dataset=dict(
+ _delete_=True,
+ type='CocoDataset',
+ data_root=data_root,
+ metainfo=metainfo,
+ return_classes=True,
+ pipeline=train_pipeline,
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ ann_file='annotations/trainval.json',
+ data_prefix=dict(img='images/')))
+
+val_dataloader = dict(
+ dataset=dict(
+ metainfo=metainfo,
+ data_root=data_root,
+ ann_file='annotations/test.json',
+ data_prefix=dict(img='images/')))
+
+test_dataloader = val_dataloader
+
+val_evaluator = dict(ann_file=data_root + 'annotations/test.json')
+test_evaluator = val_evaluator
+
+max_epoch = 20
+
+default_hooks = dict(
+ checkpoint=dict(interval=1, max_keep_ckpts=1, save_best='auto'),
+ logger=dict(type='LoggerHook', interval=5))
+train_cfg = dict(max_epochs=max_epoch, val_interval=1)
+
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epoch,
+ by_epoch=True,
+ milestones=[15],
+ gamma=0.1)
+]
+
+optim_wrapper = dict(
+ optimizer=dict(lr=0.0001),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.0),
+ 'language_model': dict(lr_mult=0.0)
+ }))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py
new file mode 100644
index 00000000000..66060f45ea7
--- /dev/null
+++ b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py
@@ -0,0 +1,247 @@
+_base_ = [
+ '../_base_/datasets/coco_detection.py',
+ '../_base_/schedules/schedule_1x.py', '../_base_/default_runtime.py'
+]
+pretrained = 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth' # noqa
+lang_model_name = 'bert-base-uncased'
+
+model = dict(
+ type='GroundingDINO',
+ num_queries=900,
+ with_box_refine=True,
+ as_two_stage=True,
+ data_preprocessor=dict(
+ type='DetDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True,
+ pad_mask=False,
+ ),
+ language_model=dict(
+ type='BertModel',
+ name=lang_model_name,
+ max_tokens=256,
+ pad_to_max=False,
+ use_sub_sentence_represent=True,
+ special_tokens_list=['[CLS]', '[SEP]', '.', '?'],
+ add_pooling_layer=False,
+ ),
+ backbone=dict(
+ type='SwinTransformer',
+ embed_dims=96,
+ depths=[2, 2, 6, 2],
+ num_heads=[3, 6, 12, 24],
+ window_size=7,
+ mlp_ratio=4,
+ qkv_bias=True,
+ qk_scale=None,
+ drop_rate=0.,
+ attn_drop_rate=0.,
+ drop_path_rate=0.2,
+ patch_norm=True,
+ out_indices=(1, 2, 3),
+ with_cp=True,
+ convert_weights=True,
+ frozen_stages=-1,
+ init_cfg=dict(type='Pretrained', checkpoint=pretrained)),
+ neck=dict(
+ type='ChannelMapper',
+ in_channels=[192, 384, 768],
+ kernel_size=1,
+ out_channels=256,
+ act_cfg=None,
+ bias=True,
+ norm_cfg=dict(type='GN', num_groups=32),
+ num_outs=4),
+ encoder=dict(
+ num_layers=6,
+ num_cp=6,
+ # visual layer config
+ layer_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_levels=4, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256, feedforward_channels=2048, ffn_drop=0.0)),
+ # text layer config
+ text_layer_cfg=dict(
+ self_attn_cfg=dict(num_heads=4, embed_dims=256, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256, feedforward_channels=1024, ffn_drop=0.0)),
+ # fusion layer config
+ fusion_layer_cfg=dict(
+ v_dim=256,
+ l_dim=256,
+ embed_dim=1024,
+ num_heads=4,
+ init_values=1e-4),
+ ),
+ decoder=dict(
+ num_layers=6,
+ return_intermediate=True,
+ layer_cfg=dict(
+ # query self attention layer
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ # cross attention layer query to text
+ cross_attn_text_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ # cross attention layer query to image
+ cross_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256, feedforward_channels=2048, ffn_drop=0.0)),
+ post_norm_cfg=None),
+ positional_encoding=dict(
+ num_feats=128, normalize=True, offset=0.0, temperature=20),
+ bbox_head=dict(
+ type='GroundingDINOHead',
+ num_classes=256,
+ sync_cls_avg_factor=True,
+ contrastive_cfg=dict(max_text_len=256, log_scale='auto', bias=True),
+ loss_cls=dict(
+ type='FocalLoss',
+ use_sigmoid=True,
+ gamma=2.0,
+ alpha=0.25,
+ loss_weight=1.0), # 2.0 in DeformDETR
+ loss_bbox=dict(type='L1Loss', loss_weight=5.0)),
+ dn_cfg=dict( # TODO: Move to model.train_cfg ?
+ label_noise_scale=0.5,
+ box_noise_scale=1.0, # 0.4 for DN-DETR
+ group_cfg=dict(dynamic=True, num_groups=None,
+ num_dn_queries=100)), # TODO: half num_dn_queries
+ # training and testing settings
+ train_cfg=dict(
+ assigner=dict(
+ type='HungarianAssigner',
+ match_costs=[
+ dict(type='BinaryFocalLossCost', weight=2.0),
+ dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
+ dict(type='IoUCost', iou_mode='giou', weight=2.0)
+ ])),
+ test_cfg=dict(max_per_img=300))
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=lang_model_name,
+ num_sample_negative=85,
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities',
+ 'tokens_positive'))
+]
+
+dataset_type = 'ODVGDataset'
+data_root = 'data/objects365v1/'
+
+coco_od_dataset = dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='o365v1_train_odvg.json',
+ label_map_file='o365v1_label_map.json',
+ data_prefix=dict(img='train/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+train_dataloader = dict(
+ _delete_=True,
+ batch_size=4,
+ num_workers=4,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ batch_sampler=dict(type='AspectRatioBatchSampler'),
+ dataset=dict(type='ConcatDataset', datasets=[coco_od_dataset]))
+
+val_dataloader = dict(
+ dataset=dict(pipeline=test_pipeline, return_classes=True))
+test_dataloader = val_dataloader
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0004,
+ weight_decay=0.0001), # bs=16 0.0001
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1),
+ 'language_model': dict(lr_mult=0.1),
+ }))
+
+# learning policy
+max_epochs = 30
+param_scheduler = [
+ dict(type='LinearLR', start_factor=0.1, by_epoch=False, begin=0, end=1000),
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[19, 26],
+ gamma=0.1)
+]
+
+train_cfg = dict(
+ type='EpochBasedTrainLoop', max_epochs=max_epochs, val_interval=1)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# USER SHOULD NOT CHANGE ITS VALUES.
+# base_batch_size = (16 GPUs) x (2 samples per GPU)
+auto_scale_lr = dict(base_batch_size=64)
+
+default_hooks = dict(visualization=dict(type='GroundingVisualizationHook'))
diff --git a/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg.py b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg.py
new file mode 100644
index 00000000000..b7f388bdd4e
--- /dev/null
+++ b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg.py
@@ -0,0 +1,38 @@
+_base_ = 'grounding_dino_swin-t_pretrain_obj365.py'
+
+o365v1_od_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/objects365v1/',
+ ann_file='o365v1_train_odvg.json',
+ label_map_file='o365v1_label_map.json',
+ data_prefix=dict(img='train/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None,
+)
+
+flickr30k_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/flickr30k_entities/',
+ ann_file='final_flickr_separateGT_train_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='flickr30k_images/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+gqa_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/gqa/',
+ ann_file='final_mixed_train_no_coco_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='images/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+train_dataloader = dict(
+ dataset=dict(datasets=[o365v1_od_dataset, flickr30k_dataset, gqa_dataset]))
diff --git a/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m.py b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m.py
new file mode 100644
index 00000000000..8e9f5ca4aab
--- /dev/null
+++ b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m.py
@@ -0,0 +1,55 @@
+_base_ = 'grounding_dino_swin-t_pretrain_obj365.py'
+
+o365v1_od_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/objects365v1/',
+ ann_file='o365v1_train_odvg.json',
+ label_map_file='o365v1_label_map.json',
+ data_prefix=dict(img='train/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None,
+)
+
+flickr30k_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/flickr30k_entities/',
+ ann_file='final_flickr_separateGT_train_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='flickr30k_images/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+gqa_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/gqa/',
+ ann_file='final_mixed_train_no_coco_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='images/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+grit_dataset = dict(
+ type='ODVGDataset',
+ data_root='grit_processed/',
+ ann_file='grit20m_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img=''),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+train_dataloader = dict(
+ sampler=dict(
+ _delete_=True,
+ type='CustomSampleSizeSampler',
+ dataset_size=[-1, -1, -1, 500000]),
+ dataset=dict(datasets=[
+ o365v1_od_dataset, flickr30k_dataset, gqa_dataset, grit_dataset
+ ]))
diff --git a/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py
new file mode 100644
index 00000000000..56e500c8693
--- /dev/null
+++ b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py
@@ -0,0 +1,117 @@
+_base_ = 'grounding_dino_swin-t_pretrain_obj365.py'
+
+o365v1_od_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/objects365v1/',
+ ann_file='o365v1_train_odvg.json',
+ label_map_file='o365v1_label_map.json',
+ data_prefix=dict(img='train/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None,
+)
+
+flickr30k_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/flickr30k_entities/',
+ ann_file='final_flickr_separateGT_train_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='flickr30k_images/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+gqa_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/gqa/',
+ ann_file='final_mixed_train_no_coco_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='images/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+v3d_train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ # change this
+ label_map_file='data/V3Det/annotations/v3det_2023_v1_label_map.json',
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+v3det_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/V3Det/',
+ ann_file='annotations/v3det_2023_v1_train_od.json',
+ label_map_file='annotations/v3det_2023_v1_label_map.json',
+ data_prefix=dict(img=''),
+ filter_cfg=dict(filter_empty_gt=False),
+ need_text=False, # change this
+ pipeline=v3d_train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+grit_dataset = dict(
+ type='ODVGDataset',
+ data_root='grit_processed/',
+ ann_file='grit20m_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img=''),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+train_dataloader = dict(
+ sampler=dict(
+ _delete_=True,
+ type='CustomSampleSizeSampler',
+ dataset_size=[-1, -1, -1, -1, 500000]),
+ dataset=dict(datasets=[
+ o365v1_od_dataset, flickr30k_dataset, gqa_dataset, v3det_dataset,
+ grit_dataset
+ ]))
diff --git a/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det.py b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det.py
new file mode 100644
index 00000000000..c89014fbbe4
--- /dev/null
+++ b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det.py
@@ -0,0 +1,101 @@
+_base_ = 'grounding_dino_swin-t_pretrain_obj365.py'
+
+o365v1_od_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/objects365v1/',
+ ann_file='o365v1_train_odvg.json',
+ label_map_file='o365v1_label_map.json',
+ data_prefix=dict(img='train/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None,
+)
+
+flickr30k_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/flickr30k_entities/',
+ ann_file='final_flickr_separateGT_train_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='flickr30k_images/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+gqa_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/gqa/',
+ ann_file='final_mixed_train_no_coco_vg.json',
+ label_map_file=None,
+ data_prefix=dict(img='images/'),
+ filter_cfg=dict(filter_empty_gt=False),
+ pipeline=_base_.train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+v3d_train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ # change this
+ label_map_file='data/V3Det/annotations/v3det_2023_v1_label_map.json',
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+v3det_dataset = dict(
+ type='ODVGDataset',
+ data_root='data/V3Det/',
+ ann_file='annotations/v3det_2023_v1_train_od.json',
+ label_map_file='annotations/v3det_2023_v1_label_map.json',
+ data_prefix=dict(img=''),
+ filter_cfg=dict(filter_empty_gt=False),
+ need_text=False, # change this
+ pipeline=v3d_train_pipeline,
+ return_classes=True,
+ backend_args=None)
+
+train_dataloader = dict(
+ dataset=dict(datasets=[
+ o365v1_od_dataset, flickr30k_dataset, gqa_dataset, v3det_dataset
+ ]))
diff --git a/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_cat.py b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_cat.py
new file mode 100644
index 00000000000..6dc8dcd8df4
--- /dev/null
+++ b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_cat.py
@@ -0,0 +1,43 @@
+_base_ = 'grounding_dino_swin-t_pretrain_obj365.py'
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadTextAnnotations'),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities',
+ 'tokens_positive'))
+]
+
+data_root = 'data/cat/'
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=False,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root=data_root,
+ label_map_file='cat_label_map.json',
+ ann_file='cat_train_od.json',
+ data_prefix=dict(img='images/'),
+ pipeline=test_pipeline,
+ return_classes=True))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ outfile_path=data_root + 'cat_train_od_v1.json',
+ img_prefix=data_root + 'images/',
+ score_thr=0.7,
+ nms_thr=0.5,
+ type='DumpODVGResults')
+test_evaluator = val_evaluator
diff --git a/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_flickr30k.py b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_flickr30k.py
new file mode 100644
index 00000000000..78bf1c344bf
--- /dev/null
+++ b/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_flickr30k.py
@@ -0,0 +1,42 @@
+_base_ = 'grounding_dino_swin-t_pretrain_obj365.py'
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadTextAnnotations'),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities',
+ 'tokens_positive'))
+]
+
+data_root = 'data/flickr30k_entities/'
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=False,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root=data_root,
+ ann_file='flickr_simple_train_vg.json',
+ data_prefix=dict(img='flickr30k_images/'),
+ pipeline=test_pipeline,
+ return_classes=True))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ outfile_path=data_root + 'flickr_simple_train_vg_v1.json',
+ img_prefix=data_root + 'flickr30k_images/',
+ score_thr=0.4,
+ nms_thr=0.5,
+ type='DumpODVGResults')
+test_evaluator = val_evaluator
diff --git a/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis.py b/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis.py
new file mode 100644
index 00000000000..3ba12c90675
--- /dev/null
+++ b/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis.py
@@ -0,0 +1,120 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/coco/'
+
+model = dict(test_cfg=dict(
+ max_per_img=300,
+ chunked_size=40,
+))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ # change this
+ label_map_file='data/coco/annotations/lvis_v1_label_map.json',
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+train_dataloader = dict(
+ dataset=dict(
+ _delete_=True,
+ type='ClassBalancedDataset',
+ oversample_thr=1e-3,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root=data_root,
+ need_text=False,
+ label_map_file='annotations/lvis_v1_label_map.json',
+ ann_file='annotations/lvis_v1_train_od.json',
+ data_prefix=dict(img=''),
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ return_classes=True,
+ pipeline=train_pipeline)))
+
+val_dataloader = dict(
+ dataset=dict(
+ data_root=data_root,
+ type='LVISV1Dataset',
+ ann_file='annotations/lvis_v1_minival_inserted_image_name.json',
+ data_prefix=dict(img='')))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='LVISFixedAPMetric',
+ ann_file=data_root +
+ 'annotations/lvis_v1_minival_inserted_image_name.json')
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1),
+ # 'language_model': dict(lr_mult=0),
+ }))
+
+# learning policy
+max_epochs = 12
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[11],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=3)
+
+default_hooks = dict(
+ checkpoint=dict(
+ max_keep_ckpts=1, save_best='lvis_fixed_ap/AP', rule='greater'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis_866_337.py b/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis_866_337.py
new file mode 100644
index 00000000000..28d0141d3e2
--- /dev/null
+++ b/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis_866_337.py
@@ -0,0 +1,120 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/coco/'
+
+model = dict(test_cfg=dict(
+ max_per_img=300,
+ chunked_size=40,
+))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ # change this
+ label_map_file='data/coco/annotations/lvis_v1_label_map_norare.json',
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+train_dataloader = dict(
+ dataset=dict(
+ _delete_=True,
+ type='ClassBalancedDataset',
+ oversample_thr=1e-3,
+ dataset=dict(
+ type='ODVGDataset',
+ data_root=data_root,
+ need_text=False,
+ label_map_file='annotations/lvis_v1_label_map_norare.json',
+ ann_file='annotations/lvis_v1_train_od_norare.json',
+ data_prefix=dict(img=''),
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ return_classes=True,
+ pipeline=train_pipeline)))
+
+val_dataloader = dict(
+ dataset=dict(
+ data_root=data_root,
+ type='LVISV1Dataset',
+ ann_file='annotations/lvis_v1_minival_inserted_image_name.json',
+ data_prefix=dict(img='')))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='LVISFixedAPMetric',
+ ann_file=data_root +
+ 'annotations/lvis_v1_minival_inserted_image_name.json')
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.00005, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1),
+ # 'language_model': dict(lr_mult=0),
+ }))
+
+# learning policy
+max_epochs = 12
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[8, 11],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=3)
+
+default_hooks = dict(
+ checkpoint=dict(
+ max_keep_ckpts=3, save_best='lvis_fixed_ap/AP', rule='greater'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_lvis.py b/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_lvis.py
new file mode 100644
index 00000000000..fb4ed438e0b
--- /dev/null
+++ b/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_lvis.py
@@ -0,0 +1,24 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+model = dict(test_cfg=dict(
+ max_per_img=300,
+ chunked_size=40,
+))
+
+dataset_type = 'LVISV1Dataset'
+data_root = 'data/coco/'
+
+val_dataloader = dict(
+ dataset=dict(
+ data_root=data_root,
+ type=dataset_type,
+ ann_file='annotations/lvis_od_val.json',
+ data_prefix=dict(img='')))
+test_dataloader = val_dataloader
+
+# numpy < 1.24.0
+val_evaluator = dict(
+ _delete_=True,
+ type='LVISFixedAPMetric',
+ ann_file=data_root + 'annotations/lvis_od_val.json')
+test_evaluator = val_evaluator
diff --git a/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_mini-lvis.py b/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_mini-lvis.py
new file mode 100644
index 00000000000..406a39a4264
--- /dev/null
+++ b/configs/mm_grounding_dino/lvis/grounding_dino_swin-t_pretrain_zeroshot_mini-lvis.py
@@ -0,0 +1,25 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+model = dict(test_cfg=dict(
+ max_per_img=300,
+ chunked_size=40,
+))
+
+dataset_type = 'LVISV1Dataset'
+data_root = 'data/coco/'
+
+val_dataloader = dict(
+ dataset=dict(
+ data_root=data_root,
+ type=dataset_type,
+ ann_file='annotations/lvis_v1_minival_inserted_image_name.json',
+ data_prefix=dict(img='')))
+test_dataloader = val_dataloader
+
+# numpy < 1.24.0
+val_evaluator = dict(
+ _delete_=True,
+ type='LVISFixedAPMetric',
+ ann_file=data_root +
+ 'annotations/lvis_v1_minival_inserted_image_name.json')
+test_evaluator = val_evaluator
diff --git a/configs/mm_grounding_dino/metafile.yml b/configs/mm_grounding_dino/metafile.yml
new file mode 100644
index 00000000000..3071686e7ac
--- /dev/null
+++ b/configs/mm_grounding_dino/metafile.yml
@@ -0,0 +1,54 @@
+Collections:
+ - Name: MM Grounding DINO
+ Metadata:
+ Training Data: Objects365, GoldG, GRIT and V3Det
+ Training Techniques:
+ - AdamW
+ - Multi Scale Train
+ - Gradient Clip
+ Training Resources: 3090 GPUs
+ Architecture:
+ - Swin Transformer
+ - BERT
+ README: configs/mm_grounding_dino/README.md
+ Code:
+ URL:
+ Version: v3.0.0
+
+Models:
+ - Name: grounding_dino_swin-t_pretrain_obj365_goldg
+ In Collection: MM Grounding DINO
+ Config: configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg.py
+ Results:
+ - Task: Object Detection
+ Dataset: COCO
+ Metrics:
+ box AP: 50.4
+ Weights: https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg/grounding_dino_swin-t_pretrain_obj365_goldg_20231122_132602-4ea751ce.pth
+ - Name: grounding_dino_swin-t_pretrain_obj365_goldg_grit9m
+ In Collection: MM Grounding DINO
+ Config: configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m.py
+ Results:
+ - Task: Object Detection
+ Dataset: COCO
+ Metrics:
+ box AP: 50.5
+ Weights: https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_20231128_200818-169cc352.pth
+ - Name: grounding_dino_swin-t_pretrain_obj365_goldg_v3det
+ In Collection: MM Grounding DINO
+ Config: configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det.py
+ Results:
+ - Task: Object Detection
+ Dataset: COCO
+ Metrics:
+ box AP: 50.6
+ Weights: https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_v3det_20231218_095741-e316e297.pth
+ - Name: grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det
+ In Collection: MM Grounding DINO
+ Config: configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py
+ Results:
+ - Task: Object Detection
+ Dataset: COCO
+ Metrics:
+ box AP: 50.4
+ Weights: https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
diff --git a/configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py b/configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py
new file mode 100644
index 00000000000..d87ca7ca1ea
--- /dev/null
+++ b/configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py
@@ -0,0 +1,338 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py' # noqa
+
+dataset_type = 'CocoDataset'
+data_root = 'data/odinw/'
+
+base_test_pipeline = _base_.test_pipeline
+base_test_pipeline[-1]['meta_keys'] = ('img_id', 'img_path', 'ori_shape',
+ 'img_shape', 'scale_factor', 'text',
+ 'custom_entities', 'caption_prompt')
+
+# ---------------------1 AerialMaritimeDrone---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/large/'
+dataset_AerialMaritimeDrone = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ test_mode=True,
+ pipeline=base_test_pipeline,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------2 Aquarium---------------------#
+class_name = ('fish', 'jellyfish', 'penguin', 'puffin', 'shark', 'starfish',
+ 'stingray')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Aquarium/Aquarium Combined.v2-raw-1024.coco/'
+
+caption_prompt = None
+# caption_prompt = {
+# 'penguin': {
+# 'suffix': ', which is black and white'
+# },
+# 'puffin': {
+# 'suffix': ' with orange beaks'
+# },
+# 'stingray': {
+# 'suffix': ' which is flat and round'
+# },
+# }
+dataset_Aquarium = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Aquarium = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------3 CottontailRabbits---------------------#
+class_name = ('Cottontail-Rabbit', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'CottontailRabbits/'
+
+# caption_prompt = None
+caption_prompt = {'Cottontail-Rabbit': {'name': 'rabbit'}}
+
+dataset_CottontailRabbits = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_CottontailRabbits = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------4 EgoHands---------------------#
+class_name = ('hand', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/generic/'
+
+# caption_prompt = None
+caption_prompt = {'hand': {'suffix': ' of a person'}}
+
+dataset_EgoHands = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------5 NorthAmericaMushrooms---------------------#
+class_name = ('CoW', 'chanterelle')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+
+# caption_prompt = None
+caption_prompt = {
+ 'CoW': {
+ 'name': 'flat mushroom'
+ },
+ 'chanterelle': {
+ 'name': 'yellow mushroom'
+ }
+}
+
+dataset_NorthAmericaMushrooms = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_NorthAmericaMushrooms = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------6 Packages---------------------#
+class_name = ('package', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Packages/Raw/'
+
+# caption_prompt = None
+caption_prompt = {
+ 'package': {
+ 'prefix': 'there is a ',
+ 'suffix': ' on the porch'
+ }
+}
+
+dataset_Packages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Packages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------7 PascalVOC---------------------#
+class_name = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
+ 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
+ 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train',
+ 'tvmonitor')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PascalVOC/'
+dataset_PascalVOC = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PascalVOC = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------8 pistols---------------------#
+class_name = ('pistol', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pistols/export/'
+dataset_pistols = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pistols = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------9 pothole---------------------#
+class_name = ('pothole', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pothole/'
+
+# caption_prompt = None
+caption_prompt = {
+ 'pothole': {
+ 'prefix': 'there are some ',
+ 'name': 'holes',
+ 'suffix': ' on the road'
+ }
+}
+
+dataset_pothole = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pothole = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------10 Raccoon---------------------#
+class_name = ('raccoon', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Raccoon/Raccoon.v2-raw.coco/'
+dataset_Raccoon = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Raccoon = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------11 ShellfishOpenImages---------------------#
+class_name = ('Crab', 'Lobster', 'Shrimp')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ShellfishOpenImages/raw/'
+dataset_ShellfishOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ShellfishOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------12 thermalDogsAndPeople---------------------#
+class_name = ('dog', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'thermalDogsAndPeople/'
+dataset_thermalDogsAndPeople = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_thermalDogsAndPeople = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------13 VehiclesOpenImages---------------------#
+class_name = ('Ambulance', 'Bus', 'Car', 'Motorcycle', 'Truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'VehiclesOpenImages/416x416/'
+dataset_VehiclesOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_VehiclesOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# --------------------- Config---------------------#
+dataset_prefixes = [
+ 'AerialMaritimeDrone', 'Aquarium', 'CottontailRabbits', 'EgoHands',
+ 'NorthAmericaMushrooms', 'Packages', 'PascalVOC', 'pistols', 'pothole',
+ 'Raccoon', 'ShellfishOpenImages', 'thermalDogsAndPeople',
+ 'VehiclesOpenImages'
+]
+datasets = [
+ dataset_AerialMaritimeDrone, dataset_Aquarium, dataset_CottontailRabbits,
+ dataset_EgoHands, dataset_NorthAmericaMushrooms, dataset_Packages,
+ dataset_PascalVOC, dataset_pistols, dataset_pothole, dataset_Raccoon,
+ dataset_ShellfishOpenImages, dataset_thermalDogsAndPeople,
+ dataset_VehiclesOpenImages
+]
+metrics = [
+ val_evaluator_AerialMaritimeDrone, val_evaluator_Aquarium,
+ val_evaluator_CottontailRabbits, val_evaluator_EgoHands,
+ val_evaluator_NorthAmericaMushrooms, val_evaluator_Packages,
+ val_evaluator_PascalVOC, val_evaluator_pistols, val_evaluator_pothole,
+ val_evaluator_Raccoon, val_evaluator_ShellfishOpenImages,
+ val_evaluator_thermalDogsAndPeople, val_evaluator_VehiclesOpenImages
+]
+
+# -------------------------------------------------#
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw35.py b/configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw35.py
new file mode 100644
index 00000000000..a6b8566aed4
--- /dev/null
+++ b/configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw35.py
@@ -0,0 +1,794 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py' # noqa
+
+dataset_type = 'CocoDataset'
+data_root = 'data/odinw/'
+
+base_test_pipeline = _base_.test_pipeline
+base_test_pipeline[-1]['meta_keys'] = ('img_id', 'img_path', 'ori_shape',
+ 'img_shape', 'scale_factor', 'text',
+ 'custom_entities', 'caption_prompt')
+
+# ---------------------1 AerialMaritimeDrone_large---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/large/'
+dataset_AerialMaritimeDrone_large = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone_large = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------2 AerialMaritimeDrone_tiled---------------------#
+class_name = ('boat', 'car', 'dock', 'jetski', 'lift')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AerialMaritimeDrone/tiled/'
+dataset_AerialMaritimeDrone_tiled = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AerialMaritimeDrone_tiled = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------3 AmericanSignLanguageLetters---------------------#
+class_name = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
+ 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'AmericanSignLanguageLetters/American Sign Language Letters.v1-v1.coco/' # noqa
+dataset_AmericanSignLanguageLetters = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_AmericanSignLanguageLetters = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------4 Aquarium---------------------#
+class_name = ('fish', 'jellyfish', 'penguin', 'puffin', 'shark', 'starfish',
+ 'stingray')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Aquarium/Aquarium Combined.v2-raw-1024.coco/'
+dataset_Aquarium = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Aquarium = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------5 BCCD---------------------#
+class_name = ('Platelets', 'RBC', 'WBC')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'BCCD/BCCD.v3-raw.coco/'
+dataset_BCCD = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_BCCD = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------6 boggleBoards---------------------#
+class_name = ('Q', 'a', 'an', 'b', 'c', 'd', 'e', 'er', 'f', 'g', 'h', 'he',
+ 'i', 'in', 'j', 'k', 'l', 'm', 'n', 'o', 'o ', 'p', 'q', 'qu',
+ 'r', 's', 't', 't\\', 'th', 'u', 'v', 'w', 'wild', 'x', 'y', 'z')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'boggleBoards/416x416AutoOrient/export/'
+dataset_boggleBoards = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_boggleBoards = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------7 brackishUnderwater---------------------#
+class_name = ('crab', 'fish', 'jellyfish', 'shrimp', 'small_fish', 'starfish')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'brackishUnderwater/960x540/'
+dataset_brackishUnderwater = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_brackishUnderwater = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------8 ChessPieces---------------------#
+class_name = (' ', 'black bishop', 'black king', 'black knight', 'black pawn',
+ 'black queen', 'black rook', 'white bishop', 'white king',
+ 'white knight', 'white pawn', 'white queen', 'white rook')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ChessPieces/Chess Pieces.v23-raw.coco/'
+dataset_ChessPieces = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ChessPieces = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------9 CottontailRabbits---------------------#
+class_name = ('rabbit', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'CottontailRabbits/'
+dataset_CottontailRabbits = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_CottontailRabbits = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------10 dice---------------------#
+class_name = ('1', '2', '3', '4', '5', '6')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'dice/mediumColor/export/'
+dataset_dice = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_dice = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------11 DroneControl---------------------#
+class_name = ('follow', 'follow_hand', 'land', 'land_hand', 'null', 'object',
+ 'takeoff', 'takeoff-hand')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'DroneControl/Drone Control.v3-raw.coco/'
+dataset_DroneControl = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_DroneControl = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------12 EgoHands_generic---------------------#
+class_name = ('hand', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/generic/'
+caption_prompt = {'hand': {'suffix': ' of a person'}}
+dataset_EgoHands_generic = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands_generic = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------13 EgoHands_specific---------------------#
+class_name = ('myleft', 'myright', 'yourleft', 'yourright')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'EgoHands/specific/'
+dataset_EgoHands_specific = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_EgoHands_specific = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------14 HardHatWorkers---------------------#
+class_name = ('head', 'helmet', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'HardHatWorkers/raw/'
+dataset_HardHatWorkers = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_HardHatWorkers = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------15 MaskWearing---------------------#
+class_name = ('mask', 'no-mask')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'MaskWearing/raw/'
+dataset_MaskWearing = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_MaskWearing = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------16 MountainDewCommercial---------------------#
+class_name = ('bottle', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'MountainDewCommercial/'
+dataset_MountainDewCommercial = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_MountainDewCommercial = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------17 NorthAmericaMushrooms---------------------#
+class_name = ('flat mushroom', 'yellow mushroom')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+dataset_NorthAmericaMushrooms = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/new_annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_NorthAmericaMushrooms = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/new_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------18 openPoetryVision---------------------#
+class_name = ('American Typewriter', 'Andale Mono', 'Apple Chancery', 'Arial',
+ 'Avenir', 'Baskerville', 'Big Caslon', 'Bradley Hand',
+ 'Brush Script MT', 'Chalkboard', 'Comic Sans MS', 'Copperplate',
+ 'Courier', 'Didot', 'Futura', 'Geneva', 'Georgia', 'Gill Sans',
+ 'Helvetica', 'Herculanum', 'Impact', 'Kefa', 'Lucida Grande',
+ 'Luminari', 'Marker Felt', 'Menlo', 'Monaco', 'Noteworthy',
+ 'Optima', 'PT Sans', 'PT Serif', 'Palatino', 'Papyrus',
+ 'Phosphate', 'Rockwell', 'SF Pro', 'SignPainter', 'Skia',
+ 'Snell Roundhand', 'Tahoma', 'Times New Roman', 'Trebuchet MS',
+ 'Verdana')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'openPoetryVision/512x512/'
+dataset_openPoetryVision = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_openPoetryVision = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------19 OxfordPets_by_breed---------------------#
+class_name = ('cat-Abyssinian', 'cat-Bengal', 'cat-Birman', 'cat-Bombay',
+ 'cat-British_Shorthair', 'cat-Egyptian_Mau', 'cat-Maine_Coon',
+ 'cat-Persian', 'cat-Ragdoll', 'cat-Russian_Blue', 'cat-Siamese',
+ 'cat-Sphynx', 'dog-american_bulldog',
+ 'dog-american_pit_bull_terrier', 'dog-basset_hound',
+ 'dog-beagle', 'dog-boxer', 'dog-chihuahua',
+ 'dog-english_cocker_spaniel', 'dog-english_setter',
+ 'dog-german_shorthaired', 'dog-great_pyrenees', 'dog-havanese',
+ 'dog-japanese_chin', 'dog-keeshond', 'dog-leonberger',
+ 'dog-miniature_pinscher', 'dog-newfoundland', 'dog-pomeranian',
+ 'dog-pug', 'dog-saint_bernard', 'dog-samoyed',
+ 'dog-scottish_terrier', 'dog-shiba_inu',
+ 'dog-staffordshire_bull_terrier', 'dog-wheaten_terrier',
+ 'dog-yorkshire_terrier')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'OxfordPets/by-breed/' # noqa
+dataset_OxfordPets_by_breed = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_OxfordPets_by_breed = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------20 OxfordPets_by_species---------------------#
+class_name = ('cat', 'dog')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'OxfordPets/by-species/' # noqa
+dataset_OxfordPets_by_species = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_OxfordPets_by_species = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------21 PKLot---------------------#
+class_name = ('space-empty', 'space-occupied')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PKLot/640/' # noqa
+dataset_PKLot = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PKLot = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------22 Packages---------------------#
+class_name = ('package', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Packages/Raw/'
+caption_prompt = {
+ 'package': {
+ 'prefix': 'there is a ',
+ 'suffix': ' on the porch'
+ }
+}
+dataset_Packages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=base_test_pipeline,
+ caption_prompt=caption_prompt,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Packages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------23 PascalVOC---------------------#
+class_name = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
+ 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
+ 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train',
+ 'tvmonitor')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'PascalVOC/'
+dataset_PascalVOC = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_PascalVOC = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------24 pistols---------------------#
+class_name = ('pistol', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pistols/export/'
+dataset_pistols = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pistols = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------25 plantdoc---------------------#
+class_name = ('Apple Scab Leaf', 'Apple leaf', 'Apple rust leaf',
+ 'Bell_pepper leaf', 'Bell_pepper leaf spot', 'Blueberry leaf',
+ 'Cherry leaf', 'Corn Gray leaf spot', 'Corn leaf blight',
+ 'Corn rust leaf', 'Peach leaf', 'Potato leaf',
+ 'Potato leaf early blight', 'Potato leaf late blight',
+ 'Raspberry leaf', 'Soyabean leaf', 'Soybean leaf',
+ 'Squash Powdery mildew leaf', 'Strawberry leaf',
+ 'Tomato Early blight leaf', 'Tomato Septoria leaf spot',
+ 'Tomato leaf', 'Tomato leaf bacterial spot',
+ 'Tomato leaf late blight', 'Tomato leaf mosaic virus',
+ 'Tomato leaf yellow virus', 'Tomato mold leaf',
+ 'Tomato two spotted spider mites leaf', 'grape leaf',
+ 'grape leaf black rot')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'plantdoc/416x416/'
+dataset_plantdoc = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_plantdoc = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------26 pothole---------------------#
+class_name = ('pothole', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'pothole/'
+caption_prompt = {
+ 'pothole': {
+ 'name': 'holes',
+ 'prefix': 'there are some ',
+ 'suffix': ' on the road'
+ }
+}
+dataset_pothole = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ caption_prompt=caption_prompt,
+ pipeline=base_test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_pothole = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------27 Raccoon---------------------#
+class_name = ('raccoon', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'Raccoon/Raccoon.v2-raw.coco/'
+dataset_Raccoon = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_Raccoon = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------28 selfdrivingCar---------------------#
+class_name = ('biker', 'car', 'pedestrian', 'trafficLight',
+ 'trafficLight-Green', 'trafficLight-GreenLeft',
+ 'trafficLight-Red', 'trafficLight-RedLeft',
+ 'trafficLight-Yellow', 'trafficLight-YellowLeft', 'truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'selfdrivingCar/fixedLarge/export/'
+dataset_selfdrivingCar = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='val_annotations_without_background.json',
+ data_prefix=dict(img=''),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_selfdrivingCar = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'val_annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------29 ShellfishOpenImages---------------------#
+class_name = ('Crab', 'Lobster', 'Shrimp')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ShellfishOpenImages/raw/'
+dataset_ShellfishOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ShellfishOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------30 ThermalCheetah---------------------#
+class_name = ('cheetah', 'human')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'ThermalCheetah/'
+dataset_ThermalCheetah = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_ThermalCheetah = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------31 thermalDogsAndPeople---------------------#
+class_name = ('dog', 'person')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'thermalDogsAndPeople/'
+dataset_thermalDogsAndPeople = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_thermalDogsAndPeople = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------32 UnoCards---------------------#
+class_name = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11',
+ '12', '13', '14')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'UnoCards/raw/'
+dataset_UnoCards = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_UnoCards = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------33 VehiclesOpenImages---------------------#
+class_name = ('Ambulance', 'Bus', 'Car', 'Motorcycle', 'Truck')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'VehiclesOpenImages/416x416/'
+dataset_VehiclesOpenImages = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_VehiclesOpenImages = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------34 WildfireSmoke---------------------#
+class_name = ('smoke', )
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'WildfireSmoke/'
+dataset_WildfireSmoke = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_WildfireSmoke = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# ---------------------35 websiteScreenshots---------------------#
+class_name = ('button', 'field', 'heading', 'iframe', 'image', 'label', 'link',
+ 'text')
+metainfo = dict(classes=class_name)
+_data_root = data_root + 'websiteScreenshots/'
+dataset_websiteScreenshots = dict(
+ type=dataset_type,
+ metainfo=metainfo,
+ data_root=_data_root,
+ ann_file='valid/annotations_without_background.json',
+ data_prefix=dict(img='valid/'),
+ pipeline=_base_.test_pipeline,
+ test_mode=True,
+ return_classes=True)
+val_evaluator_websiteScreenshots = dict(
+ type='CocoMetric',
+ ann_file=_data_root + 'valid/annotations_without_background.json',
+ metric='bbox')
+
+# --------------------- Config---------------------#
+
+dataset_prefixes = [
+ 'AerialMaritimeDrone_large',
+ 'AerialMaritimeDrone_tiled',
+ 'AmericanSignLanguageLetters',
+ 'Aquarium',
+ 'BCCD',
+ 'boggleBoards',
+ 'brackishUnderwater',
+ 'ChessPieces',
+ 'CottontailRabbits',
+ 'dice',
+ 'DroneControl',
+ 'EgoHands_generic',
+ 'EgoHands_specific',
+ 'HardHatWorkers',
+ 'MaskWearing',
+ 'MountainDewCommercial',
+ 'NorthAmericaMushrooms',
+ 'openPoetryVision',
+ 'OxfordPets_by_breed',
+ 'OxfordPets_by_species',
+ 'PKLot',
+ 'Packages',
+ 'PascalVOC',
+ 'pistols',
+ 'plantdoc',
+ 'pothole',
+ 'Raccoons',
+ 'selfdrivingCar',
+ 'ShellfishOpenImages',
+ 'ThermalCheetah',
+ 'thermalDogsAndPeople',
+ 'UnoCards',
+ 'VehiclesOpenImages',
+ 'WildfireSmoke',
+ 'websiteScreenshots',
+]
+
+datasets = [
+ dataset_AerialMaritimeDrone_large, dataset_AerialMaritimeDrone_tiled,
+ dataset_AmericanSignLanguageLetters, dataset_Aquarium, dataset_BCCD,
+ dataset_boggleBoards, dataset_brackishUnderwater, dataset_ChessPieces,
+ dataset_CottontailRabbits, dataset_dice, dataset_DroneControl,
+ dataset_EgoHands_generic, dataset_EgoHands_specific,
+ dataset_HardHatWorkers, dataset_MaskWearing, dataset_MountainDewCommercial,
+ dataset_NorthAmericaMushrooms, dataset_openPoetryVision,
+ dataset_OxfordPets_by_breed, dataset_OxfordPets_by_species, dataset_PKLot,
+ dataset_Packages, dataset_PascalVOC, dataset_pistols, dataset_plantdoc,
+ dataset_pothole, dataset_Raccoon, dataset_selfdrivingCar,
+ dataset_ShellfishOpenImages, dataset_ThermalCheetah,
+ dataset_thermalDogsAndPeople, dataset_UnoCards, dataset_VehiclesOpenImages,
+ dataset_WildfireSmoke, dataset_websiteScreenshots
+]
+
+metrics = [
+ val_evaluator_AerialMaritimeDrone_large,
+ val_evaluator_AerialMaritimeDrone_tiled,
+ val_evaluator_AmericanSignLanguageLetters, val_evaluator_Aquarium,
+ val_evaluator_BCCD, val_evaluator_boggleBoards,
+ val_evaluator_brackishUnderwater, val_evaluator_ChessPieces,
+ val_evaluator_CottontailRabbits, val_evaluator_dice,
+ val_evaluator_DroneControl, val_evaluator_EgoHands_generic,
+ val_evaluator_EgoHands_specific, val_evaluator_HardHatWorkers,
+ val_evaluator_MaskWearing, val_evaluator_MountainDewCommercial,
+ val_evaluator_NorthAmericaMushrooms, val_evaluator_openPoetryVision,
+ val_evaluator_OxfordPets_by_breed, val_evaluator_OxfordPets_by_species,
+ val_evaluator_PKLot, val_evaluator_Packages, val_evaluator_PascalVOC,
+ val_evaluator_pistols, val_evaluator_plantdoc, val_evaluator_pothole,
+ val_evaluator_Raccoon, val_evaluator_selfdrivingCar,
+ val_evaluator_ShellfishOpenImages, val_evaluator_ThermalCheetah,
+ val_evaluator_thermalDogsAndPeople, val_evaluator_UnoCards,
+ val_evaluator_VehiclesOpenImages, val_evaluator_WildfireSmoke,
+ val_evaluator_websiteScreenshots
+]
+
+# -------------------------------------------------#
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/mm_grounding_dino/odinw/override_category.py b/configs/mm_grounding_dino/odinw/override_category.py
new file mode 100644
index 00000000000..9ff05fc6e5e
--- /dev/null
+++ b/configs/mm_grounding_dino/odinw/override_category.py
@@ -0,0 +1,109 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import argparse
+
+import mmengine
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='Override Category')
+ parser.add_argument('data_root')
+ return parser.parse_args()
+
+
+def main():
+ args = parse_args()
+
+ ChessPieces = [{
+ 'id': 1,
+ 'name': ' ',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 2,
+ 'name': 'black bishop',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 3,
+ 'name': 'black king',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 4,
+ 'name': 'black knight',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 5,
+ 'name': 'black pawn',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 6,
+ 'name': 'black queen',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 7,
+ 'name': 'black rook',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 8,
+ 'name': 'white bishop',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 9,
+ 'name': 'white king',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 10,
+ 'name': 'white knight',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 11,
+ 'name': 'white pawn',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 12,
+ 'name': 'white queen',
+ 'supercategory': 'pieces'
+ }, {
+ 'id': 13,
+ 'name': 'white rook',
+ 'supercategory': 'pieces'
+ }]
+
+ _data_root = args.data_root + 'ChessPieces/Chess Pieces.v23-raw.coco/'
+ json_data = mmengine.load(_data_root +
+ 'valid/annotations_without_background.json')
+ json_data['categories'] = ChessPieces
+ mmengine.dump(json_data,
+ _data_root + 'valid/new_annotations_without_background.json')
+
+ CottontailRabbits = [{
+ 'id': 1,
+ 'name': 'rabbit',
+ 'supercategory': 'Cottontail-Rabbit'
+ }]
+
+ _data_root = args.data_root + 'CottontailRabbits/'
+ json_data = mmengine.load(_data_root +
+ 'valid/annotations_without_background.json')
+ json_data['categories'] = CottontailRabbits
+ mmengine.dump(json_data,
+ _data_root + 'valid/new_annotations_without_background.json')
+
+ NorthAmericaMushrooms = [{
+ 'id': 1,
+ 'name': 'flat mushroom',
+ 'supercategory': 'mushroom'
+ }, {
+ 'id': 2,
+ 'name': 'yellow mushroom',
+ 'supercategory': 'mushroom'
+ }]
+
+ _data_root = args.data_root + 'NorthAmericaMushrooms/North American Mushrooms.v1-416x416.coco/' # noqa
+ json_data = mmengine.load(_data_root +
+ 'valid/annotations_without_background.json')
+ json_data['categories'] = NorthAmericaMushrooms
+ mmengine.dump(json_data,
+ _data_root + 'valid/new_annotations_without_background.json')
+
+
+if __name__ == '__main__':
+ main()
diff --git a/configs/mm_grounding_dino/people_in_painting/grounding_dino_swin-t_finetune_8xb4_50e_people_in_painting.py b/configs/mm_grounding_dino/people_in_painting/grounding_dino_swin-t_finetune_8xb4_50e_people_in_painting.py
new file mode 100644
index 00000000000..449d8682f89
--- /dev/null
+++ b/configs/mm_grounding_dino/people_in_painting/grounding_dino_swin-t_finetune_8xb4_50e_people_in_painting.py
@@ -0,0 +1,109 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+# https://universe.roboflow.com/roboflow-100/people-in-paintings/dataset/2
+data_root = 'data/people_in_painting_v2/'
+class_name = ('Human', )
+palette = [(220, 20, 60)]
+
+metainfo = dict(classes=class_name, palette=palette)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities'))
+]
+
+train_dataloader = dict(
+ sampler=dict(_delete_=True, type='DefaultSampler', shuffle=True),
+ batch_sampler=dict(type='AspectRatioBatchSampler'),
+ dataset=dict(
+ _delete_=True,
+ type='RepeatDataset',
+ times=10,
+ dataset=dict(
+ type='CocoDataset',
+ data_root=data_root,
+ metainfo=metainfo,
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ pipeline=train_pipeline,
+ return_classes=True,
+ data_prefix=dict(img='train/'),
+ ann_file='train/_annotations.coco.json')))
+
+val_dataloader = dict(
+ dataset=dict(
+ metainfo=metainfo,
+ data_root=data_root,
+ return_classes=True,
+ ann_file='valid/_annotations.coco.json',
+ data_prefix=dict(img='valid/')))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'valid/_annotations.coco.json',
+ metric='bbox',
+ format_only=False)
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0001, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1)
+ }))
+
+# learning policy
+max_epochs = 5
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[4],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=1, save_best='auto'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_grefcoco.py b/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_grefcoco.py
new file mode 100644
index 00000000000..983ffe5c6f3
--- /dev/null
+++ b/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_grefcoco.py
@@ -0,0 +1,170 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/coco/'
+
+train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ # change this
+ dict(type='RandomFlip', prob=0.0),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+train_dataloader = dict(
+ dataset=dict(
+ _delete_=True,
+ type='ODVGDataset',
+ data_root=data_root,
+ ann_file='mdetr_annotations/finetune_grefcoco_train_vg.json',
+ data_prefix=dict(img='train2014/'),
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ return_classes=True,
+ pipeline=train_pipeline))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_grefcoco_val.json'
+val_dataset_all_val = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+val_evaluator_all_val = dict(
+ type='gRefCOCOMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ thresh_score=0.7,
+ thresh_f1=1.0)
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_grefcoco_testA.json'
+val_dataset_refcoco_testA = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_testA = dict(
+ type='gRefCOCOMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ thresh_score=0.7,
+ thresh_f1=1.0)
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_grefcoco_testB.json'
+val_dataset_refcoco_testB = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_testB = dict(
+ type='gRefCOCOMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ thresh_score=0.7,
+ thresh_f1=1.0)
+
+# -------------------------------------------------#
+datasets = [
+ val_dataset_all_val, val_dataset_refcoco_testA, val_dataset_refcoco_testB
+]
+dataset_prefixes = ['grefcoco_val', 'grefcoco_testA', 'grefcoco_testB']
+metrics = [
+ val_evaluator_all_val, val_evaluator_refcoco_testA,
+ val_evaluator_refcoco_testB
+]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1),
+ # 'language_model': dict(lr_mult=0),
+ }))
+
+# learning policy
+max_epochs = 5
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[3],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcoco.py b/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcoco.py
new file mode 100644
index 00000000000..d91af473a23
--- /dev/null
+++ b/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcoco.py
@@ -0,0 +1,167 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/coco/'
+
+train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ # change this
+ dict(type='RandomFlip', prob=0.0),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+train_dataloader = dict(
+ dataset=dict(
+ _delete_=True,
+ type='ODVGDataset',
+ data_root=data_root,
+ ann_file='mdetr_annotations/finetune_refcoco_train_vg.json',
+ data_prefix=dict(img='train2014/'),
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ return_classes=True,
+ pipeline=train_pipeline))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco_val.json'
+val_dataset_all_val = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+val_evaluator_all_val = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco_testA.json'
+val_dataset_refcoco_testA = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_testA = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco_testB.json'
+val_dataset_refcoco_testB = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_testB = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+datasets = [
+ val_dataset_all_val, val_dataset_refcoco_testA, val_dataset_refcoco_testB
+]
+dataset_prefixes = ['refcoco_val', 'refcoco_testA', 'refcoco_testB']
+metrics = [
+ val_evaluator_all_val, val_evaluator_refcoco_testA,
+ val_evaluator_refcoco_testB
+]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1),
+ # 'language_model': dict(lr_mult=0),
+ }))
+
+# learning policy
+max_epochs = 5
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[3],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcoco_plus.py b/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcoco_plus.py
new file mode 100644
index 00000000000..871adc8efb4
--- /dev/null
+++ b/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcoco_plus.py
@@ -0,0 +1,167 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/coco/'
+
+train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ # change this
+ dict(type='RandomFlip', prob=0.0),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+train_dataloader = dict(
+ dataset=dict(
+ _delete_=True,
+ type='ODVGDataset',
+ data_root=data_root,
+ ann_file='mdetr_annotations/finetune_refcoco+_train_vg.json',
+ data_prefix=dict(img='train2014/'),
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ return_classes=True,
+ pipeline=train_pipeline))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco+_val.json'
+val_dataset_all_val = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+val_evaluator_all_val = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco+_testA.json'
+val_dataset_refcoco_testA = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_testA = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco+_testB.json'
+val_dataset_refcoco_testB = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_testB = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+datasets = [
+ val_dataset_all_val, val_dataset_refcoco_testA, val_dataset_refcoco_testB
+]
+dataset_prefixes = ['refcoco+_val', 'refcoco+_testA', 'refcoco+_testB']
+metrics = [
+ val_evaluator_all_val, val_evaluator_refcoco_testA,
+ val_evaluator_refcoco_testB
+]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1),
+ # 'language_model': dict(lr_mult=0),
+ }))
+
+# learning policy
+max_epochs = 5
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[3],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcocog.py b/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcocog.py
new file mode 100644
index 00000000000..a351d6f9d12
--- /dev/null
+++ b/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcocog.py
@@ -0,0 +1,145 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/coco/'
+
+train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
+ dict(type='LoadAnnotations', with_bbox=True),
+ # change this
+ dict(type='RandomFlip', prob=0.0),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
+ dict(
+ type='RandomSamplingNegPos',
+ tokenizer_name=_base_.lang_model_name,
+ num_sample_negative=85,
+ max_tokens=256),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities', 'tokens_positive', 'dataset_mode'))
+]
+
+train_dataloader = dict(
+ dataset=dict(
+ _delete_=True,
+ type='ODVGDataset',
+ data_root=data_root,
+ ann_file='mdetr_annotations/finetune_refcocog_train_vg.json',
+ data_prefix=dict(img='train2014/'),
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ return_classes=True,
+ pipeline=train_pipeline))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcocog_val.json'
+val_dataset_all_val = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+val_evaluator_all_val = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcocog_test.json'
+val_dataset_refcoco_test = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=_base_.test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_test = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+datasets = [val_dataset_all_val, val_dataset_refcoco_test]
+dataset_prefixes = ['refcocog_val', 'refcocog_test']
+metrics = [val_evaluator_all_val, val_evaluator_refcoco_test]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1),
+ # 'language_model': dict(lr_mult=0),
+ }))
+
+# learning policy
+max_epochs = 5
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[3],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=1, save_best='auto'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp.py b/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp.py
new file mode 100644
index 00000000000..437d71c6b35
--- /dev/null
+++ b/configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp.py
@@ -0,0 +1,228 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+# 30 is an empirical value, just set it to the maximum value
+# without affecting the evaluation result
+model = dict(test_cfg=dict(max_per_img=30))
+
+data_root = 'data/coco/'
+
+test_pipeline = [
+ dict(
+ type='LoadImageFromFile', backend_args=None,
+ imdecode_backend='pillow'),
+ dict(
+ type='FixScaleResize',
+ scale=(800, 1333),
+ keep_ratio=True,
+ backend='pillow'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'text', 'custom_entities',
+ 'tokens_positive'))
+]
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/final_refexp_val.json'
+val_dataset_all_val = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+val_evaluator_all_val = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco_testA.json'
+val_dataset_refcoco_testA = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_testA = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco_testB.json'
+val_dataset_refcoco_testB = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_testB = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco+_testA.json'
+val_dataset_refcoco_plus_testA = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_plus_testA = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcoco+_testB.json'
+val_dataset_refcoco_plus_testB = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcoco_plus_testB = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_refcocog_test.json'
+val_dataset_refcocog_test = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_refcocog_test = dict(
+ type='RefExpMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ topk=(1, 5, 10))
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_grefcoco_val.json'
+val_dataset_grefcoco_val = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_grefcoco_val = dict(
+ type='gRefCOCOMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ thresh_score=0.7,
+ thresh_f1=1.0)
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_grefcoco_testA.json'
+val_dataset_grefcoco_testA = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_grefcoco_testA = dict(
+ type='gRefCOCOMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ thresh_score=0.7,
+ thresh_f1=1.0)
+
+# -------------------------------------------------#
+ann_file = 'mdetr_annotations/finetune_grefcoco_testB.json'
+val_dataset_grefcoco_testB = dict(
+ type='MDETRStyleRefCocoDataset',
+ data_root=data_root,
+ ann_file=ann_file,
+ data_prefix=dict(img='train2014/'),
+ test_mode=True,
+ return_classes=True,
+ pipeline=test_pipeline,
+ backend_args=None)
+
+val_evaluator_grefcoco_testB = dict(
+ type='gRefCOCOMetric',
+ ann_file=data_root + ann_file,
+ metric='bbox',
+ iou_thrs=0.5,
+ thresh_score=0.7,
+ thresh_f1=1.0)
+
+# -------------------------------------------------#
+datasets = [
+ val_dataset_all_val, val_dataset_refcoco_testA, val_dataset_refcoco_testB,
+ val_dataset_refcoco_plus_testA, val_dataset_refcoco_plus_testB,
+ val_dataset_refcocog_test, val_dataset_grefcoco_val,
+ val_dataset_grefcoco_testA, val_dataset_grefcoco_testB
+]
+dataset_prefixes = [
+ 'val', 'refcoco_testA', 'refcoco_testB', 'refcoco+_testA',
+ 'refcoco+_testB', 'refcocog_test', 'grefcoco_val', 'grefcoco_testA',
+ 'grefcoco_testB'
+]
+metrics = [
+ val_evaluator_all_val, val_evaluator_refcoco_testA,
+ val_evaluator_refcoco_testB, val_evaluator_refcoco_plus_testA,
+ val_evaluator_refcoco_plus_testB, val_evaluator_refcocog_test,
+ val_evaluator_grefcoco_val, val_evaluator_grefcoco_testA,
+ val_evaluator_grefcoco_testB
+]
+
+val_dataloader = dict(
+ dataset=dict(_delete_=True, type='ConcatDataset', datasets=datasets))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ _delete_=True,
+ type='MultiDatasetsEvaluator',
+ metrics=metrics,
+ dataset_prefixes=dataset_prefixes)
+test_evaluator = val_evaluator
diff --git a/configs/mm_grounding_dino/rtts/grounding_dino_swin-t_finetune_8xb4_1x_rtts.py b/configs/mm_grounding_dino/rtts/grounding_dino_swin-t_finetune_8xb4_1x_rtts.py
new file mode 100644
index 00000000000..95c2be058e2
--- /dev/null
+++ b/configs/mm_grounding_dino/rtts/grounding_dino_swin-t_finetune_8xb4_1x_rtts.py
@@ -0,0 +1,106 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/RTTS/'
+class_name = ('bicycle', 'bus', 'car', 'motorbike', 'person')
+palette = [(255, 97, 0), (0, 201, 87), (176, 23, 31), (138, 43, 226),
+ (30, 144, 255)]
+
+metainfo = dict(classes=class_name, palette=palette)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities'))
+]
+
+train_dataloader = dict(
+ sampler=dict(_delete_=True, type='DefaultSampler', shuffle=True),
+ batch_sampler=dict(type='AspectRatioBatchSampler'),
+ dataset=dict(
+ _delete_=True,
+ type='CocoDataset',
+ data_root=data_root,
+ metainfo=metainfo,
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ pipeline=train_pipeline,
+ return_classes=True,
+ ann_file='annotations_json/rtts_train.json',
+ data_prefix=dict(img='')))
+
+val_dataloader = dict(
+ dataset=dict(
+ metainfo=metainfo,
+ data_root=data_root,
+ return_classes=True,
+ ann_file='annotations_json/rtts_val.json',
+ data_prefix=dict(img='')))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'annotations_json/rtts_val.json',
+ metric='bbox',
+ format_only=False)
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0001, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1)
+ }))
+
+# learning policy
+max_epochs = 12
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[11],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=1, save_best='auto'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/ruod/grounding_dino_swin-t_finetune_8xb4_1x_ruod.py b/configs/mm_grounding_dino/ruod/grounding_dino_swin-t_finetune_8xb4_1x_ruod.py
new file mode 100644
index 00000000000..f57682b29d9
--- /dev/null
+++ b/configs/mm_grounding_dino/ruod/grounding_dino_swin-t_finetune_8xb4_1x_ruod.py
@@ -0,0 +1,108 @@
+_base_ = '../grounding_dino_swin-t_pretrain_obj365.py'
+
+data_root = 'data/RUOD/'
+class_name = ('holothurian', 'echinus', 'scallop', 'starfish', 'fish',
+ 'corals', 'diver', 'cuttlefish', 'turtle', 'jellyfish')
+palette = [(235, 211, 70), (106, 90, 205), (160, 32, 240), (176, 23, 31),
+ (142, 0, 0), (230, 0, 0), (106, 0, 228), (60, 100, 0), (80, 100, 0),
+ (70, 0, 0)]
+
+metainfo = dict(classes=class_name, palette=palette)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomChoice',
+ transforms=[
+ [
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ],
+ [
+ dict(
+ type='RandomChoiceResize',
+ # The radio of all image in train dataset < 7
+ # follow the original implement
+ scales=[(400, 4200), (500, 4200), (600, 4200)],
+ keep_ratio=True),
+ dict(
+ type='RandomCrop',
+ crop_type='absolute_range',
+ crop_size=(384, 600),
+ allow_negative_crop=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
+ (608, 1333), (640, 1333), (672, 1333), (704, 1333),
+ (736, 1333), (768, 1333), (800, 1333)],
+ keep_ratio=True)
+ ]
+ ]),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction', 'text',
+ 'custom_entities'))
+]
+
+train_dataloader = dict(
+ sampler=dict(_delete_=True, type='DefaultSampler', shuffle=True),
+ batch_sampler=dict(type='AspectRatioBatchSampler'),
+ dataset=dict(
+ _delete_=True,
+ type='CocoDataset',
+ data_root=data_root,
+ metainfo=metainfo,
+ filter_cfg=dict(filter_empty_gt=False, min_size=32),
+ pipeline=train_pipeline,
+ return_classes=True,
+ ann_file='RUOD_ANN/instances_train.json',
+ data_prefix=dict(img='RUOD_pic/train/')))
+
+val_dataloader = dict(
+ dataset=dict(
+ metainfo=metainfo,
+ data_root=data_root,
+ return_classes=True,
+ ann_file='RUOD_ANN/instances_test.json',
+ data_prefix=dict(img='RUOD_pic/test/')))
+test_dataloader = val_dataloader
+
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'RUOD_ANN/instances_test.json',
+ metric='bbox',
+ format_only=False)
+test_evaluator = val_evaluator
+
+optim_wrapper = dict(
+ _delete_=True,
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=0.0001, weight_decay=0.0001),
+ clip_grad=dict(max_norm=0.1, norm_type=2),
+ paramwise_cfg=dict(custom_keys={
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'backbone': dict(lr_mult=0.1)
+ }))
+
+# learning policy
+max_epochs = 12
+param_scheduler = [
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=max_epochs,
+ by_epoch=True,
+ milestones=[11],
+ gamma=0.1)
+]
+train_cfg = dict(max_epochs=max_epochs, val_interval=1)
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=1, save_best='auto'))
+
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
diff --git a/configs/mm_grounding_dino/usage.md b/configs/mm_grounding_dino/usage.md
new file mode 100644
index 00000000000..123c6638cbe
--- /dev/null
+++ b/configs/mm_grounding_dino/usage.md
@@ -0,0 +1,491 @@
+# Usage
+
+## Install
+
+After installing MMDet according to the instructions in the [get_started](../../docs/zh_cn/get_started.md) section, you need to install additional dependency packages:
+
+```shell
+cd $MMDETROOT
+
+pip install -r requirements/multimodal.txt
+pip install emoji ddd-dataset
+pip install git+https://github.com/lvis-dataset/lvis-api.git"
+```
+
+Please note that since the LVIS third-party library does not currently support numpy 1.24, ensure that your numpy version meets the requirements. It is recommended to install numpy version 1.23.
+
+## Instructions
+
+### Download BERT Weight
+
+MM Grounding DINO uses BERT as its language model and requires access to https://huggingface.co/. If you encounter connection errors due to network access issues, you can download the necessary files on a computer with network access and save them locally. Finally, modify the `lang_model_name` field in the configuration file to the local path. For specific instructions, please refer to the following code:
+
+```python
+from transformers import BertConfig, BertModel
+from transformers import AutoTokenizer
+
+config = BertConfig.from_pretrained("bert-base-uncased")
+model = BertModel.from_pretrained("bert-base-uncased", add_pooling_layer=False, config=config)
+tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+
+config.save_pretrained("your path/bert-base-uncased")
+model.save_pretrained("your path/bert-base-uncased")
+tokenizer.save_pretrained("your path/bert-base-uncased")
+```
+
+### Download NLTK Weight
+
+When MM Grounding DINO performs Phrase Grounding inference, it may extract noun phrases. Although it downloads specific models at runtime, considering that some users' running environments cannot connect to the internet, it is possible to download them in advance to the `~/nltk_data` path.
+
+```python
+import nltk
+nltk.download('punkt', download_dir='~/nltk_data')
+nltk.download('averaged_perceptron_tagger', download_dir='~/nltk_data')
+```
+
+### Download MM Grounding DINO-T Weight
+
+For convenience in demonstration, you can download the MM Grounding DINO-T model weights in advance to the current path.
+
+```shell
+wget load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
+```
+
+## Inference
+
+Before inference, for a better experience of the inference effects on different images, it is recommended that you first download [these images](https://github.com/microsoft/X-Decoder/tree/main/inference_demo/images) to the current path.
+
+MM Grounding DINO supports four types of inference methods: Closed-Set Object Detection, Open Vocabulary Object Detection, Phrase Grounding, and Referential Expression Comprehension. The details are explained below.
+
+**(1) Closed-Set Object Detection**
+
+Since MM Grounding DINO is a pretrained model, it can theoretically be applied to any closed-set detection dataset. Currently, we support commonly used datasets such as coco/voc/cityscapes/objects365v1/lvis, etc. Below, we will use coco as an example.
+
+```shell
+python demo/image_demo.py images/animals.png \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts '$: coco'
+```
+
+The predictions for `outputs/vis/animals.png` will be generated in the current directory, as shown in the following image.
+
+
+
+
+
+Since ostrich is not one of the 80 classes in COCO, it will not be detected.
+
+It's important to note that Objects365v1 and LVIS have a large number of categories. If you try to input all category names directly into the network, it may exceed 256 tokens, leading to poor model predictions. In such cases, you can use the `--chunked-size` parameter to perform chunked predictions. However, please be aware that chunked predictions may take longer to complete due to the large number of categories.
+
+```shell
+python demo/image_demo.py images/animals.png \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts '$: lvis' --chunked-size 70 \
+ --palette random
+```
+
+
+
+
+
+Different `--chunked-size` values can lead to different prediction results. You can experiment with different chunked sizes to find the one that works best for your specific task and dataset.
+
+**(2) Open Vocabulary Object Detection**
+
+Open vocabulary object detection refers to the ability to input arbitrary class names during inference.
+
+```shell
+python demo/image_demo.py images/animals.png \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts 'zebra. giraffe' -c
+```
+
+
+
+
+
+**(3) Phrase Grounding**
+
+Phrase Grounding refers to the process where a user inputs a natural language description, and the model automatically detects the corresponding bounding boxes for the mentioned noun phrases. It can be used in two ways:
+
+1. Automatically extracting noun phrases using the NLTK library and then performing detection.
+
+```shell
+python demo/image_demo.py images/apples.jpg \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts 'There are many apples here.'
+```
+
+
+
+
+
+The program will automatically split `many apples` as a noun phrase and then detect the corresponding objects. Different input descriptions can have a significant impact on the prediction results.
+
+2. Users can manually specify which parts of the sentence are noun phrases to avoid errors in NLTK extraction.
+
+```shell
+python demo/image_demo.py images/fruit.jpg \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts 'The picture contains watermelon, flower, and a white bottle.' \
+ --tokens-positive "[[[21,31]], [[45,59]]]" --pred-score-thr 0.12
+```
+
+The noun phrase corresponding to positions 21-31 is `watermelon`, and the noun phrase corresponding to positions 45-59 is `a white bottle`.
+
+
+
+
+
+**(4) Referential Expression Comprehension**
+
+Referential expression understanding refers to the model automatically comprehending the referential expressions involved in a user's language description without the need for noun phrase extraction.
+
+```shell
+python demo/image_demo.py images/apples.jpg \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts 'red apple.' \
+ --tokens-positive -1
+```
+
+
+
+
+
+## Evaluation
+
+Our provided evaluation scripts are unified, and you only need to prepare the data in advance and then run the relevant configuration.
+
+(1) Zero-Shot COCO2017 val
+
+```shell
+# single GPU
+python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
+
+# 8 GPUs
+./tools/dist_test.sh configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth 8
+```
+
+(2) Zero-Shot ODinW13
+
+```shell
+# single GPU
+python tools/test.py configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
+
+# 8 GPUs
+./tools/dist_test.sh configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth 8
+```
+
+## Visualization of Evaluation Results
+
+For the convenience of visualizing and analyzing model prediction results, we provide support for visualizing evaluation dataset prediction results. Taking referential expression understanding as an example, the usage is as follows:
+
+```shell
+python tools/test.py configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth --work-dir refcoco_result --show-dir save_path
+```
+
+During the inference process, it will save the visualization results to the `refcoco_result/{current_timestamp}/save_path` directory. For other evaluation dataset visualizations, you only need to replace the configuration file.
+
+Here are some visualization results for various datasets. The left image represents the Ground Truth (GT). The right image represents the Predicted Result.
+
+1. COCO2017 val Results:
+
+
+
+
+
+2. Flickr30k Entities Results:
+
+
+
+
+
+3. DOD Results:
+
+
+
+
+
+4. RefCOCO val Results:
+
+
+
+
+
+5. RefCOCO testA Results:
+
+
+
+
+
+6. gRefCOCO val Results:
+
+
+
+
+
+## Training
+
+If you want to reproduce our results, you can train the model by using the following command after preparing the dataset:
+
+```shell
+# Training on a single machine with 8 GPUs for obj365v1 dataset
+./tools/dist_train.sh configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py 8
+# Training on a single machine with 8 GPUs for datasets like obj365v1, goldg, grit, v3det, and other datasets is similar.
+./tools/dist_train.sh configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py 8
+```
+
+For multi-machine training, please refer to [train.md](../../docs/zh_cn/user_guides/train.md). The MM-Grounding-DINO T model is designed to work with 32 GPUs (specifically, 3090Ti GPUs). If your total batch size is not 32x4=128, you will need to manually adjust the learning rate accordingly.
+
+### Pretraining Custom Format Explanation
+
+In order to standardize the pretraining formats for different datasets, we refer to the format design proposed by [Open-GroundingDino](https://github.com/longzw1997/Open-GroundingDino). Specifically, it is divided into two formats.
+
+**(1) Object Detection Format (OD)**
+
+```text
+{"filename": "obj365_train_000000734304.jpg",
+ "height": 512,
+ "width": 769,
+ "detection": {
+ "instances": [
+ {"bbox": [109.4768676992, 346.0190429696, 135.1918335098, 365.3641967616], "label": 2, "category": "chair"},
+ {"bbox": [58.612365705900004, 323.2281494016, 242.6005859067, 451.4166870016], "label": 8, "category": "car"}
+ ]
+ }
+}
+```
+
+The numerical values corresponding to labels in the label dictionary should match the respective label_map. Each item in the instances list corresponds to a bounding box (in the format x1y1x2y2).
+
+**(2) Phrase Grounding Format (VG)**
+
+```text
+{"filename": "2405116.jpg",
+ "height": 375,
+ "width": 500,
+ "grounding":
+ {"caption": "Two surfers walking down the shore. sand on the beach.",
+ "regions": [
+ {"bbox": [206, 156, 282, 248], "phrase": "Two surfers", "tokens_positive": [[0, 3], [4, 11]]},
+ {"bbox": [303, 338, 443, 343], "phrase": "sand", "tokens_positive": [[36, 40]]},
+ {"bbox": [[327, 223, 421, 282], [300, 200, 400, 210]], "phrase": "beach", "tokens_positive": [[48, 53]]}
+ ]
+ }
+```
+
+The `tokens_positive` field indicates the character positions of the current phrase within the caption.
+
+## Example of Fine-tuning Custom Dataset
+
+In order to facilitate downstream fine-tuning on custom datasets, we have provided a fine-tuning example using the simple "cat" dataset as an illustration.
+
+### 1 Data Preparation
+
+```shell
+cd mmdetection
+wget https://download.openmmlab.com/mmyolo/data/cat_dataset.zip
+unzip cat_dataset.zip -d data/cat/
+```
+
+The "cat" dataset is a single-category dataset consisting of 144 images, already converted to the COCO format.
+
+
+
+
+
+### 2 Configuration Preparation
+
+Due to the simplicity and small size of the "cat" dataset, we trained it for 20 epochs using 8 GPUs, with corresponding learning rate scaling. We did not train the language model, only the visual model.
+
+Detailed configuration information can be found in [grounding_dino_swin-t_finetune_8xb4_20e_cat](grounding_dino_swin-t_finetune_8xb4_20e_cat.py).
+
+### 3 Visualization and Evaluation of Zero-Shot Results
+
+Due to MM Grounding DINO being an open-set detection model, you can perform detection and evaluation even if it was not trained on the cat dataset.
+
+Visualization of a single image:
+
+```shell
+cd mmdetection
+python demo/image_demo.py data/cat/images/IMG_20211205_120756.jpg configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth --texts cat.
+```
+
+Evaluation results of Zero-shot on test dataset:
+
+```shell
+python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
+```
+
+```text
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.881
+ Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 1.000
+ Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.929
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
+ Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.881
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.913
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.913
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.913
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
+ Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.913
+```
+
+### 4 Fine-tuning
+
+```shell
+./tools/dist_train.sh configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py 8 --work-dir cat_work_dir
+```
+
+The model will save the best-performing checkpoint. It achieved its best performance at the 16th epoch, with the following results:
+
+```text
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.901
+ Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 1.000
+ Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.930
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
+ Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.901
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.967
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.967
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.967
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
+ Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.967
+```
+
+We can observe that after fine-tuning, the training performance on the cat dataset improved from 88.1 to 90.1. However, due to the small dataset size, the evaluation metrics show some fluctuations.
+
+## Iterative Generation and Optimization Pipeline of Model Self-training Pseduo Label
+
+To facilitate users in creating their own datasets from scratch or those who want to leverage the model's inference capabilities for iterative pseudo-label generation and optimization, continuously modifying pseudo-labels to improve model performance, we have provided relevant pipelines.
+
+Since we have defined two data formats, we will provide separate explanations for demonstration purposes.
+
+### 1 Object Detection Format
+
+Here, we continue to use the aforementioned cat dataset as an example. Let's assume that we currently have a series of images and predefined categories but no annotations.
+
+1. Generate initial `odvg` format file
+
+```python
+import os
+import cv2
+import json
+import jsonlines
+
+data_root = 'data/cat'
+images_path = os.path.join(data_root, 'images')
+out_path = os.path.join(data_root, 'cat_train_od.json')
+metas = []
+for files in os.listdir(images_path):
+ img = cv2.imread(os.path.join(images_path, files))
+ height, width, _ = img.shape
+ metas.append({"filename": files, "height": height, "width": width})
+
+with jsonlines.open(out_path, mode='w') as writer:
+ writer.write_all(metas)
+
+# 生成 label_map.json,由于只有一个类别,所以只需要写一个 cat 即可
+label_map_path = os.path.join(data_root, 'cat_label_map.json')
+with open(label_map_path, 'w') as f:
+ json.dump({'0': 'cat'}, f)
+```
+
+Two files, `cat_train_od.json` and `cat_label_map.json`, will be generated in the `data/cat` directory.
+
+2. Inference with pre-trained model and save the results
+
+We provide a readily usable [configuration](grounding_dino_swin-t_pretrain_pseudo-labeling_cat.py). If you are using a different dataset, you can refer to this configuration for modifications.
+
+```shell
+python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_cat.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
+```
+
+A new file `cat_train_od_v1.json` will be generated in the `data/cat` directory. You can manually open it to confirm or use the provided [script](../../tools/analysis_tools/browse_grounding_raw.py) to visualize the results.
+
+```shell
+python tools/analysis_tools/browse_grounding_raw.py data/cat/ cat_train_od_v1.json images --label-map-file cat_label_map.json -o your_output_dir --not-show
+```
+
+The visualization results will be generated in the `your_output_dir` directory.
+
+3. Continue training to boost performance
+
+After obtaining pseudo-labels, you can mix them with some pre-training data for further pre-training to improve the model's performance on the current dataset. Then, you can repeat step 2 to obtain more accurate pseudo-labels, and continue this iterative process.
+
+### 2 Phrase Grounding Format
+
+1. Generate initial `odvg` format file
+
+The bootstrapping process of Phrase Grounding requires providing captions corresponding to each image and pre-segmented phrase information initially. Taking flickr30k entities images as an example, the generated typical file should look like this:
+
+```text
+[
+{"filename": "3028766968.jpg",
+ "height": 375,
+ "width": 500,
+ "grounding":
+ {"caption": "Man with a black shirt on sit behind a desk sorting threw a giant stack of people work with a smirk on his face .",
+ "regions": [
+ {"bbox": [0, 0, 1, 1], "phrase": "a giant stack of people", "tokens_positive": [[58, 81]]},
+ {"bbox": [0, 0, 1, 1], "phrase": "a black shirt", "tokens_positive": [[9, 22]]},
+ {"bbox": [0, 0, 1, 1], "phrase": "a desk", "tokens_positive": [[37, 43]]},
+ {"bbox": [0, 0, 1, 1], "phrase": "his face", "tokens_positive": [[103, 111]]},
+ {"bbox": [0, 0, 1, 1], "phrase": "Man", "tokens_positive": [[0, 3]]}]}}
+{"filename": "6944134083.jpg",
+ "height": 319,
+ "width": 500,
+ "grounding":
+ {"caption": "Two men are competing in a horse race .",
+ "regions": [
+ {"bbox": [0, 0, 1, 1], "phrase": "Two men", "tokens_positive": [[0, 7]]}]}}
+]
+```
+
+Bbox needs to be set to `[0, 0, 1, 1]` for initialization to make sure the programme could run, but this value would not be utilized.
+
+```text
+{"filename": "3028766968.jpg", "height": 375, "width": 500, "grounding": {"caption": "Man with a black shirt on sit behind a desk sorting threw a giant stack of people work with a smirk on his face .", "regions": [{"bbox": [0, 0, 1, 1], "phrase": "a giant stack of people", "tokens_positive": [[58, 81]]}, {"bbox": [0, 0, 1, 1], "phrase": "a black shirt", "tokens_positive": [[9, 22]]}, {"bbox": [0, 0, 1, 1], "phrase": "a desk", "tokens_positive": [[37, 43]]}, {"bbox": [0, 0, 1, 1], "phrase": "his face", "tokens_positive": [[103, 111]]}, {"bbox": [0, 0, 1, 1], "phrase": "Man", "tokens_positive": [[0, 3]]}]}}
+{"filename": "6944134083.jpg", "height": 319, "width": 500, "grounding": {"caption": "Two men are competing in a horse race .", "regions": [{"bbox": [0, 0, 1, 1], "phrase": "Two men", "tokens_positive": [[0, 7]]}]}}
+```
+
+You can directly copy the text above, and assume that the text content is pasted into a file named `flickr_simple_train_vg.json`, which is placed in the pre-prepared `data/flickr30k_entities` dataset directory, as detailed in the data preparation document.
+
+2. Inference with pre-trained model and save the results
+
+We provide a directly usable [configuration](https://chat.openai.com/c/grounding_dino_swin-t_pretrain_pseudo-labeling_flickr30k.py). If you are using a different dataset, you can refer to this configuration for modifications.
+
+```shell
+python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_flickr30k.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
+```
+
+The translation of your text from Chinese to English is: "A new file `flickr_simple_train_vg_v1.json` will be generated in the `data/flickr30k_entities` directory. You can manually open it to confirm or use the [script](../../tools/analysis_tools/browse_grounding_raw.py) to visualize the effects
+
+```shell
+python tools/analysis_tools/browse_grounding_raw.py data/flickr30k_entities/ flickr_simple_train_vg_v1.json flickr30k_images -o your_output_dir --not-show
+```
+
+The visualization results will be generated in the `your_output_dir` directory, as shown in the following image:
+
+
+
+
+
+3. Continue training to boost performance
+
+After obtaining the pseudo-labels, you can mix some pre-training data to continue pre-training jointly, which enhances the model's performance on the current dataset. Then, rerun step 2 to obtain more accurate pseudo-labels, and repeat this cycle iteratively.
diff --git a/configs/mm_grounding_dino/usage_zh-CN.md b/configs/mm_grounding_dino/usage_zh-CN.md
new file mode 100644
index 00000000000..5f625ea6ca8
--- /dev/null
+++ b/configs/mm_grounding_dino/usage_zh-CN.md
@@ -0,0 +1,491 @@
+# 用法说明
+
+## 安装
+
+在按照 [get_started](../../docs/zh_cn/get_started.md) 一节的说明安装好 MMDet 之后,需要安装额外的依赖包:
+
+```shell
+cd $MMDETROOT
+
+pip install -r requirements/multimodal.txt
+pip install emoji ddd-dataset
+pip install git+https://github.com/lvis-dataset/lvis-api.git"
+```
+
+请注意由于 LVIS 第三方库暂时不支持 numpy 1.24,因此请确保您的 numpy 版本符合要求。建议安装 numpy 1.23 版本。
+
+## 说明
+
+### BERT 权重下载
+
+MM Grounding DINO 采用了 BERT 作为语言模型,需要访问 https://huggingface.co/, 如果您因为网络访问问题遇到连接错误,可以在有网络访问权限的电脑上下载所需文件并保存在本地。最后,修改配置文件中的 `lang_model_name` 字段为本地路径即可。具体请参考以下代码:
+
+```python
+from transformers import BertConfig, BertModel
+from transformers import AutoTokenizer
+
+config = BertConfig.from_pretrained("bert-base-uncased")
+model = BertModel.from_pretrained("bert-base-uncased", add_pooling_layer=False, config=config)
+tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+
+config.save_pretrained("your path/bert-base-uncased")
+model.save_pretrained("your path/bert-base-uncased")
+tokenizer.save_pretrained("your path/bert-base-uncased")
+```
+
+### NLTK 权重下载
+
+MM Grounding DINO 在进行 Phrase Grounding 推理时候可能会进行名词短语提取,虽然会在运行时候下载特定的模型,但是考虑到有些用户运行环境无法联网,因此可以提前下载到 `~/nltk_data` 路径下
+
+```python
+import nltk
+nltk.download('punkt', download_dir='~/nltk_data')
+nltk.download('averaged_perceptron_tagger', download_dir='~/nltk_data')
+```
+
+### MM Grounding DINO-T 模型权重下载
+
+为了方便演示,您可以提前下载 MM Grounding DINO-T 模型权重到当前路径下
+
+```shell
+wget load_from = 'https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth' # noqa
+```
+
+## 推理
+
+在推理前,为了更好的体验不同图片的推理效果,建议您先下载 [这些图片](https://github.com/microsoft/X-Decoder/tree/main/inference_demo/images) 到当前路径下
+
+MM Grounding DINO 支持了闭集目标检测,开放词汇目标检测,Phrase Grounding 和指代性表达式理解 4 种推理方式,下面详细说明。
+
+**(1) 闭集目标检测**
+
+由于 MM Grounding DINO 是预训练模型,理论上可以应用于任何闭集检测数据集,目前我们支持了常用的 coco/voc/cityscapes/objects365v1/lvis 等,下面以 coco 为例
+
+```shell
+python demo/image_demo.py images/animals.png \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts '$: coco'
+```
+
+会在当前路径下生成 `outputs/vis/animals.png` 的预测结果,如下图所示
+
+
+
+
+
+由于鸵鸟并不在 COCO 80 类中, 因此不会检测出来。
+
+需要注意,由于 objects365v1 和 lvis 类别很多,如果直接将类别名全部输入到网络中,会超过 256 个 token 导致模型预测效果极差,此时我们需要通过 `--chunked-size` 参数进行截断预测, 同时预测时间会比较长。
+
+```shell
+python demo/image_demo.py images/animals.png \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts '$: lvis' --chunked-size 70 \
+ --palette random
+```
+
+
+
+
+
+不同的 `--chunked-size` 会导致不同的预测效果,您可以自行尝试。
+
+**(2) 开放词汇目标检测**
+
+开放词汇目标检测是指在推理时候,可以输入任意的类别名
+
+```shell
+python demo/image_demo.py images/animals.png \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts 'zebra. giraffe' -c
+```
+
+
+
+
+
+**(3) Phrase Grounding**
+
+Phrase Grounding 是指的用户输入一句语言描述,模型自动对其涉及到的名词短语想对应的 bbox 进行检测,有两种用法
+
+1. 通过 NLTK 库自动提取名词短语,然后进行检测
+
+```shell
+python demo/image_demo.py images/apples.jpg \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts 'There are many apples here.'
+```
+
+
+
+
+
+程序内部会自动切分出 `many apples` 作为名词短语,然后检测出对应物体。不同的输入描述对预测结果影响很大。
+
+2. 用户自己指定句子中哪些为名词短语,避免 NLTK 提取错误的情况
+
+```shell
+python demo/image_demo.py images/fruit.jpg \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts 'The picture contains watermelon, flower, and a white bottle.' \
+ --tokens-positive "[[[21,31]], [[45,59]]]" --pred-score-thr 0.12
+```
+
+21,31 对应的名词短语为 `watermelon`,45,59 对应的名词短语为 `a white bottle`。
+
+
+
+
+
+**(4) 指代性表达式理解**
+
+指代性表达式理解是指的用户输入一句语言描述,模型自动对其涉及到的指代性表达式进行理解, 不需要进行名词短语提取。
+
+```shell
+python demo/image_demo.py images/apples.jpg \
+ configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
+ --texts 'red apple.' \
+ --tokens-positive -1
+```
+
+
+
+
+
+## 评测
+
+我们所提供的评测脚本都是统一的,你只需要提前准备好数据,然后运行相关配置就可以了
+
+(1) Zero-Shot COCO2017 val
+
+```shell
+# 单卡
+python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
+
+# 8 卡
+./tools/dist_test.sh configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth 8
+```
+
+(2) Zero-Shot ODinW13
+
+```shell
+# 单卡
+python tools/test.py configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
+
+# 8 卡
+./tools/dist_test.sh configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth 8
+```
+
+## 评测数据集结果可视化
+
+为了方便大家对模型预测结果进行可视化和分析,我们支持了评测数据集预测结果可视化,以指代性表达式理解为例用法如下:
+
+```shell
+python tools/test.py configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth --work-dir refcoco_result --show-dir save_path
+```
+
+模型在推理过程中会将可视化结果保存到 `refcoco_result/{当前时间戳}/save_path` 路径下。其余评测数据集可视化只需要替换配置文件即可。
+
+下面展示一些数据集的可视化结果: 左图为 GT,右图为预测结果
+
+1. COCO2017 val 结果:
+
+
+
+
+
+2. Flickr30k Entities 结果:
+
+
+
+
+
+3. DOD 结果:
+
+
+
+
+
+4. RefCOCO val 结果:
+
+
+
+
+
+5. RefCOCO testA 结果:
+
+
+
+
+
+6. gRefCOCO val 结果:
+
+
+
+
+
+## 模型训练
+
+如果想复现我们的结果,你可以在准备好数据集后,直接通过如下命令进行训练
+
+```shell
+# 单机 8 卡训练仅包括 obj365v1 数据集
+./tools/dist_train.sh configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py 8
+# 单机 8 卡训练包括 obj365v1/goldg/grit/v3det 数据集,其余数据集类似
+./tools/dist_train.sh configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py 8
+```
+
+多机训练的用法请参考 [train.md](../../docs/zh_cn/user_guides/train.md)。MM-Grounding-DINO T 模型默认采用的是 32 张 3090Ti,如果你的总 bs 数不是 32x4=128,那么你需要手动的线性调整学习率。
+
+### 预训练自定义格式说明
+
+为了统一不同数据集的预训练格式,我们参考 [Open-GroundingDino](https://github.com/longzw1997/Open-GroundingDino) 所设计的格式。具体来说分成 2 种格式
+
+**(1) 目标检测数据格式 OD**
+
+```text
+{"filename": "obj365_train_000000734304.jpg",
+ "height": 512,
+ "width": 769,
+ "detection": {
+ "instances": [
+ {"bbox": [109.4768676992, 346.0190429696, 135.1918335098, 365.3641967616], "label": 2, "category": "chair"},
+ {"bbox": [58.612365705900004, 323.2281494016, 242.6005859067, 451.4166870016], "label": 8, "category": "car"}
+ ]
+ }
+}
+```
+
+label字典中所对应的数值需要和相应的 label_map 一致。 instances 列表中的每一项都对应一个 bbox (x1y1x2y2 格式)。
+
+**(2) phrase grounding 数据格式 VG**
+
+```text
+{"filename": "2405116.jpg",
+ "height": 375,
+ "width": 500,
+ "grounding":
+ {"caption": "Two surfers walking down the shore. sand on the beach.",
+ "regions": [
+ {"bbox": [206, 156, 282, 248], "phrase": "Two surfers", "tokens_positive": [[0, 3], [4, 11]]},
+ {"bbox": [303, 338, 443, 343], "phrase": "sand", "tokens_positive": [[36, 40]]},
+ {"bbox": [[327, 223, 421, 282], [300, 200, 400, 210]], "phrase": "beach", "tokens_positive": [[48, 53]]}
+ ]
+ }
+```
+
+tokens_positive 表示当前 phrase 在 caption 中的字符位置。
+
+## 自定义数据集微调训练案例
+
+为了方便用户针对自定义数据集进行下游微调,我们特意提供了以简单的 cat 数据集为例的微调训练案例。
+
+### 1 数据准备
+
+```shell
+cd mmdetection
+wget https://download.openmmlab.com/mmyolo/data/cat_dataset.zip
+unzip cat_dataset.zip -d data/cat/
+```
+
+cat 数据集是一个单类别数据集,包含 144 张图片,已经转换为 coco 格式。
+
+
+
+
+
+### 2 配置准备
+
+由于 cat 数据集的简单性和数量较少,我们使用 8 卡训练 20 个 epoch,相应的缩放学习率,不训练语言模型,只训练视觉模型。
+
+详细的配置信息可以在 [grounding_dino_swin-t_finetune_8xb4_20e_cat](grounding_dino_swin-t_finetune_8xb4_20e_cat.py) 中找到。
+
+### 3 可视化和 Zero-Shot 评估
+
+由于 MM Grounding DINO 是一个开放的检测模型,所以即使没有在 cat 数据集上训练,也可以进行检测和评估。
+
+单张图片的可视化结果如下:
+
+```shell
+cd mmdetection
+python demo/image_demo.py data/cat/images/IMG_20211205_120756.jpg configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth --texts cat.
+```
+
+测试集上的 Zero-Shot 评估结果如下:
+
+```shell
+python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
+```
+
+```text
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.881
+ Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 1.000
+ Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.929
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
+ Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.881
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.913
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.913
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.913
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
+ Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.913
+```
+
+### 4 模型训练
+
+```shell
+./tools/dist_train.sh configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py 8 --work-dir cat_work_dir
+```
+
+模型将会保存性能最佳的模型。在第 16 epoch 时候达到最佳,性能如下所示:
+
+```text
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.901
+ Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 1.000
+ Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.930
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
+ Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.901
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.967
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.967
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.967
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
+ Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.967
+```
+
+我们可以发现,经过微调训练后,cat 数据集的训练性能从 88.1 提升到了 90.1。同时由于数据集比较小,评估指标波动比较大。
+
+## 模型自训练伪标签迭代生成和优化 pipeline
+
+为了方便用户从头构建自己的数据集或者希望利用模型推理能力进行自举式伪标签迭代生成和优化,不断修改伪标签来提升模型性能,我们特意提供了相关的 pipeline。
+
+由于我们定义了两种数据格式,为了演示我们也将分别进行说明。
+
+### 1 目标检测格式
+
+此处我们依然采用上述的 cat 数据集为例,假设我们目前只有一系列图片和预定义的类别,并不存在标注。
+
+1. 生成初始 odvg 格式文件
+
+```python
+import os
+import cv2
+import json
+import jsonlines
+
+data_root = 'data/cat'
+images_path = os.path.join(data_root, 'images')
+out_path = os.path.join(data_root, 'cat_train_od.json')
+metas = []
+for files in os.listdir(images_path):
+ img = cv2.imread(os.path.join(images_path, files))
+ height, width, _ = img.shape
+ metas.append({"filename": files, "height": height, "width": width})
+
+with jsonlines.open(out_path, mode='w') as writer:
+ writer.write_all(metas)
+
+# 生成 label_map.json,由于只有一个类别,所以只需要写一个 cat 即可
+label_map_path = os.path.join(data_root, 'cat_label_map.json')
+with open(label_map_path, 'w') as f:
+ json.dump({'0': 'cat'}, f)
+```
+
+会在 `data/cat` 目录下生成 `cat_train_od.json` 和 `cat_label_map.json` 两个文件。
+
+2. 使用预训练模型进行推理,并保存结果
+
+我们提供了直接可用的 [配置](grounding_dino_swin-t_pretrain_pseudo-labeling_cat.py), 如果你是其他数据集可以参考这个配置进行修改。
+
+```shell
+python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_cat.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
+```
+
+会在 `data/cat` 目录下新生成 `cat_train_od_v1.json` 文件,你可以手动打开确认或者使用 [脚本](../../tools/analysis_tools/browse_grounding_raw.py) 可视化效果
+
+```shell
+python tools/analysis_tools/browse_grounding_raw.py data/cat/ cat_train_od_v1.json images --label-map-file cat_label_map.json -o your_output_dir --not-show
+```
+
+会在 your_output_dir 目录下生成可视化结果
+
+3. 继续训练提高性能
+
+在得到伪标签后,你可以混合一些预训练数据联合进行继续预训练,提升模型在当前数据集上的性能,然后重新运行 2 步骤,得到更准确的伪标签,如此循环迭代即可。
+
+### 2 Phrase Grounding 格式
+
+1. 生成初始 odvg 格式文件
+
+Phrase Grounding 的自举流程要求初始时候提供每张图片对应的 caption 和提前切割好的 phrase 信息。以 flickr30k entities 图片为例,生成的典型的文件应该如下所示:
+
+```text
+[
+{"filename": "3028766968.jpg",
+ "height": 375,
+ "width": 500,
+ "grounding":
+ {"caption": "Man with a black shirt on sit behind a desk sorting threw a giant stack of people work with a smirk on his face .",
+ "regions": [
+ {"bbox": [0, 0, 1, 1], "phrase": "a giant stack of people", "tokens_positive": [[58, 81]]},
+ {"bbox": [0, 0, 1, 1], "phrase": "a black shirt", "tokens_positive": [[9, 22]]},
+ {"bbox": [0, 0, 1, 1], "phrase": "a desk", "tokens_positive": [[37, 43]]},
+ {"bbox": [0, 0, 1, 1], "phrase": "his face", "tokens_positive": [[103, 111]]},
+ {"bbox": [0, 0, 1, 1], "phrase": "Man", "tokens_positive": [[0, 3]]}]}}
+{"filename": "6944134083.jpg",
+ "height": 319,
+ "width": 500,
+ "grounding":
+ {"caption": "Two men are competing in a horse race .",
+ "regions": [
+ {"bbox": [0, 0, 1, 1], "phrase": "Two men", "tokens_positive": [[0, 7]]}]}}
+]
+```
+
+初始时候 bbox 必须要设置为 `[0, 0, 1, 1]`,因为这能确保程序正常运行,但是 bbox 的值并不会被使用。
+
+```text
+{"filename": "3028766968.jpg", "height": 375, "width": 500, "grounding": {"caption": "Man with a black shirt on sit behind a desk sorting threw a giant stack of people work with a smirk on his face .", "regions": [{"bbox": [0, 0, 1, 1], "phrase": "a giant stack of people", "tokens_positive": [[58, 81]]}, {"bbox": [0, 0, 1, 1], "phrase": "a black shirt", "tokens_positive": [[9, 22]]}, {"bbox": [0, 0, 1, 1], "phrase": "a desk", "tokens_positive": [[37, 43]]}, {"bbox": [0, 0, 1, 1], "phrase": "his face", "tokens_positive": [[103, 111]]}, {"bbox": [0, 0, 1, 1], "phrase": "Man", "tokens_positive": [[0, 3]]}]}}
+{"filename": "6944134083.jpg", "height": 319, "width": 500, "grounding": {"caption": "Two men are competing in a horse race .", "regions": [{"bbox": [0, 0, 1, 1], "phrase": "Two men", "tokens_positive": [[0, 7]]}]}}
+```
+
+你可直接复制上面的文本,并假设将文本内容粘贴到命名为 `flickr_simple_train_vg.json` 文件中,并放置于提前准备好的 `data/flickr30k_entities` 数据集目录下,具体见数据准备文档。
+
+2. 使用预训练模型进行推理,并保存结果
+
+我们提供了直接可用的 [配置](grounding_dino_swin-t_pretrain_pseudo-labeling_flickr30k.py), 如果你是其他数据集可以参考这个配置进行修改。
+
+```shell
+python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_flickr30k.py \
+ grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
+```
+
+会在 `data/flickr30k_entities` 目录下新生成 `flickr_simple_train_vg_v1.json` 文件,你可以手动打开确认或者使用 [脚本](../../tools/analysis_tools/browse_grounding_raw.py) 可视化效果
+
+```shell
+python tools/analysis_tools/browse_grounding_raw.py data/flickr30k_entities/ flickr_simple_train_vg_v1.json flickr30k_images -o your_output_dir --not-show
+```
+
+会在 `your_output_dir` 目录下生成可视化结果,如下图所示:
+
+
+
+
+
+3. 继续训练提高性能
+
+在得到伪标签后,你可以混合一些预训练数据联合进行继续预训练,提升模型在当前数据集上的性能,然后重新运行 2 步骤,得到更准确的伪标签,如此循环迭代即可。
diff --git a/configs/rtmdet/README.md b/configs/rtmdet/README.md
index 4574dd613c1..1677184af76 100644
--- a/configs/rtmdet/README.md
+++ b/configs/rtmdet/README.md
@@ -20,14 +20,17 @@ In this paper, we aim to design an efficient real-time object detector that exce
### Object Detection
-| Model | size | box AP | Params(M) | FLOPS(G) | TRT-FP16-Latency(ms) RTX3090 | TRT-FP16-Latency(ms) T4 | Config | Download |
-| :---------: | :--: | :----: | :-------: | :------: | :-----------------------------: | :------------------------: | :----------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
-| RTMDet-tiny | 640 | 41.1 | 4.8 | 8.1 | 0.98 | 2.34 | [config](./rtmdet_tiny_8xb32-300e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_tiny_8xb32-300e_coco/rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_tiny_8xb32-300e_coco/rtmdet_tiny_8xb32-300e_coco_20220902_112414.log.json) |
-| RTMDet-s | 640 | 44.6 | 8.89 | 14.8 | 1.22 | 2.96 | [config](./rtmdet_s_8xb32-300e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_s_8xb32-300e_coco/rtmdet_s_8xb32-300e_coco_20220905_161602-387a891e.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_s_8xb32-300e_coco/rtmdet_s_8xb32-300e_coco_20220905_161602.log.json) |
-| RTMDet-m | 640 | 49.4 | 24.71 | 39.27 | 1.62 | 6.41 | [config](./rtmdet_m_8xb32-300e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_m_8xb32-300e_coco/rtmdet_m_8xb32-300e_coco_20220719_112220-229f527c.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_m_8xb32-300e_coco/rtmdet_m_8xb32-300e_coco_20220719_112220.log.json) |
-| RTMDet-l | 640 | 51.5 | 52.3 | 80.23 | 2.44 | 10.32 | [config](./rtmdet_l_8xb32-300e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_l_8xb32-300e_coco/rtmdet_l_8xb32-300e_coco_20220719_112030-5a0be7c4.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_l_8xb32-300e_coco/rtmdet_l_8xb32-300e_coco_20220719_112030.log.json) |
-| RTMDet-x | 640 | 52.8 | 94.86 | 141.67 | 3.10 | 18.80 | [config](./rtmdet_x_8xb32-300e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_x_8xb32-300e_coco/rtmdet_x_8xb32-300e_coco_20220715_230555-cc79b9ae.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_x_8xb32-300e_coco/rtmdet_x_8xb32-300e_coco_20220715_230555.log.json) |
-| RTMDet-x-P6 | 1280 | 54.9 | | | | | [config](./rtmdet_x_p6_4xb8-300e_coco.py) | [model](https://github.com/orange0-jp/orange-weights/releases/download/v0.1.0rtmdet-p6/rtmdet_x_p6_4xb8-300e_coco-bf32be58.pth) |
+| Model | size | box AP | Params(M) | FLOPS(G) | TRT-FP16-Latency(ms) RTX3090 | TRT-FP16-Latency(ms) T4 | Config | Download |
+| :-----------------: | :--: | :----: | :-------: | :------: | :-----------------------------: | :------------------------: | :------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| RTMDet-tiny | 640 | 41.1 | 4.8 | 8.1 | 0.98 | 2.34 | [config](./rtmdet_tiny_8xb32-300e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_tiny_8xb32-300e_coco/rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_tiny_8xb32-300e_coco/rtmdet_tiny_8xb32-300e_coco_20220902_112414.log.json) |
+| RTMDet-s | 640 | 44.6 | 8.89 | 14.8 | 1.22 | 2.96 | [config](./rtmdet_s_8xb32-300e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_s_8xb32-300e_coco/rtmdet_s_8xb32-300e_coco_20220905_161602-387a891e.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_s_8xb32-300e_coco/rtmdet_s_8xb32-300e_coco_20220905_161602.log.json) |
+| RTMDet-m | 640 | 49.4 | 24.71 | 39.27 | 1.62 | 6.41 | [config](./rtmdet_m_8xb32-300e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_m_8xb32-300e_coco/rtmdet_m_8xb32-300e_coco_20220719_112220-229f527c.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_m_8xb32-300e_coco/rtmdet_m_8xb32-300e_coco_20220719_112220.log.json) |
+| RTMDet-l | 640 | 51.5 | 52.3 | 80.23 | 2.44 | 10.32 | [config](./rtmdet_l_8xb32-300e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_l_8xb32-300e_coco/rtmdet_l_8xb32-300e_coco_20220719_112030-5a0be7c4.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_l_8xb32-300e_coco/rtmdet_l_8xb32-300e_coco_20220719_112030.log.json) |
+| RTMDet-x | 640 | 52.8 | 94.86 | 141.67 | 3.10 | 18.80 | [config](./rtmdet_x_8xb32-300e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_x_8xb32-300e_coco/rtmdet_x_8xb32-300e_coco_20220715_230555-cc79b9ae.pth) \| [log](https://download.openmmlab.com/mmdetection/v3.0/rtmdet/rtmdet_x_8xb32-300e_coco/rtmdet_x_8xb32-300e_coco_20220715_230555.log.json) |
+| RTMDet-x-P6 | 1280 | 54.9 | | | | | [config](./rtmdet_x_p6_4xb8-300e_coco.py) | [model](https://github.com/orange0-jp/orange-weights/releases/download/v0.1.0rtmdet-p6/rtmdet_x_p6_4xb8-300e_coco-bf32be58.pth) |
+| RTMDet-l-ConvNeXt-B | 640 | 53.1 | | | | | [config](./rtmdet_l_convnext_b_4xb32-100e_coco.py) | [model](https://github.com/orange0-jp/orange-weights/releases/download/v0.1.0rtmdet-swin-convnext/rtmdet_l_convnext_b_4xb32-100e_coco-d4731b3d.pth) |
+| RTMDet-l-Swin-B | 640 | 52.4 | | | | | [config](./rtmdet_l_swin_b_4xb32-100e_coco.py) | [model](https://github.com/orange0-jp/orange-weights/releases/download/v0.1.0rtmdet-swin-convnext/rtmdet_l_swin_b_4xb32-100e_coco-0828ce5d.pth) |
+| RTMDet-l-Swin-B-P6 | 1280 | 56.4 | | | | | [config](./rtmdet_l_swin_b_p6_4xb16-100e_coco.py) | [model](https://github.com/orange0-jp/orange-weights/releases/download/v0.1.0rtmdet-swin-convnext/rtmdet_l_swin_b_p6_4xb16-100e_coco-a1486b6f.pth) |
**Note**:
diff --git a/configs/rtmdet/metafile.yml b/configs/rtmdet/metafile.yml
index 7dc72e130be..a62abcb2faa 100644
--- a/configs/rtmdet/metafile.yml
+++ b/configs/rtmdet/metafile.yml
@@ -104,6 +104,48 @@ Models:
box AP: 54.9
Weights: https://github.com/orange0-jp/orange-weights/releases/download/v0.1.0rtmdet-p6/rtmdet_x_p6_4xb8-300e_coco-bf32be58.pth
+ - Name: rtmdet_l_convnext_b_4xb32-100e_coco
+ Alias:
+ - rtmdet-l_convnext_b
+ In Collection: RTMDet
+ Config: configs/rtmdet/rtmdet_l_convnext_b_4xb32-100e_coco.py
+ Metadata:
+ Epochs: 100
+ Results:
+ - Task: Object Detection
+ Dataset: COCO
+ Metrics:
+ box AP: 53.1
+ Weights: https://github.com/orange0-jp/orange-weights/releases/download/v0.1.0rtmdet-swin-convnext/rtmdet_l_convnext_b_4xb32-100e_coco-d4731b3d.pth
+
+ - Name: rtmdet_l_swin_b_4xb32-100e_coco
+ Alias:
+ - rtmdet-l_swin_b
+ In Collection: RTMDet
+ Config: configs/rtmdet/rtmdet_l_swin_b_4xb32-100e_coco.py
+ Metadata:
+ Epochs: 100
+ Results:
+ - Task: Object Detection
+ Dataset: COCO
+ Metrics:
+ box AP: 52.4
+ Weights: https://github.com/orange0-jp/orange-weights/releases/download/v0.1.0rtmdet-swin-convnext/rtmdet_l_swin_b_4xb32-100e_coco-0828ce5d.pth
+
+ - Name: rtmdet_l_swin_b_p6_4xb16-100e_coco
+ Alias:
+ - rtmdet-l_swin_b_p6
+ In Collection: RTMDet
+ Config: configs/rtmdet/rtmdet_l_swin_b_p6_4xb16-100e_coco.py
+ Metadata:
+ Epochs: 100
+ Results:
+ - Task: Object Detection
+ Dataset: COCO
+ Metrics:
+ box AP: 56.4
+ Weights: https://github.com/orange0-jp/orange-weights/releases/download/v0.1.0rtmdet-swin-convnext/rtmdet_l_swin_b_p6_4xb16-100e_coco-a1486b6f.pth
+
- Name: rtmdet-ins_tiny_8xb32-300e_coco
Alias:
- rtmdet-ins-t
diff --git a/configs/rtmdet/rtmdet_l_convnext_b_4xb32-100e_coco.py b/configs/rtmdet/rtmdet_l_convnext_b_4xb32-100e_coco.py
new file mode 100644
index 00000000000..85af292bcab
--- /dev/null
+++ b/configs/rtmdet/rtmdet_l_convnext_b_4xb32-100e_coco.py
@@ -0,0 +1,81 @@
+_base_ = './rtmdet_l_8xb32-300e_coco.py'
+
+custom_imports = dict(
+ imports=['mmpretrain.models'], allow_failed_imports=False)
+
+norm_cfg = dict(type='GN', num_groups=32)
+checkpoint_file = 'https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_in1k-384px_20221219-4570f792.pth' # noqa
+model = dict(
+ type='RTMDet',
+ data_preprocessor=dict(
+ _delete_=True,
+ type='DetDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True,
+ batch_augments=None),
+ backbone=dict(
+ _delete_=True,
+ type='mmpretrain.ConvNeXt',
+ arch='base',
+ out_indices=[1, 2, 3],
+ drop_path_rate=0.7,
+ layer_scale_init_value=1.0,
+ gap_before_final_norm=False,
+ with_cp=True,
+ init_cfg=dict(
+ type='Pretrained', checkpoint=checkpoint_file,
+ prefix='backbone.')),
+ neck=dict(in_channels=[256, 512, 1024], norm_cfg=norm_cfg),
+ bbox_head=dict(norm_cfg=norm_cfg))
+
+max_epochs = 100
+stage2_num_epochs = 10
+interval = 10
+base_lr = 0.001
+
+train_cfg = dict(
+ max_epochs=max_epochs,
+ val_interval=interval,
+ dynamic_intervals=[(max_epochs - stage2_num_epochs, 1)])
+
+optim_wrapper = dict(
+ constructor='LearningRateDecayOptimizerConstructor',
+ paramwise_cfg={
+ 'decay_rate': 0.8,
+ 'decay_type': 'layer_wise',
+ 'num_layers': 12
+ },
+ optimizer=dict(lr=base_lr))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+ # use cosine lr from 50 to 100 epoch
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline={{_base_.train_pipeline_stage2}})
+]
diff --git a/configs/rtmdet/rtmdet_l_swin_b_4xb32-100e_coco.py b/configs/rtmdet/rtmdet_l_swin_b_4xb32-100e_coco.py
new file mode 100644
index 00000000000..84b0e0fa7d1
--- /dev/null
+++ b/configs/rtmdet/rtmdet_l_swin_b_4xb32-100e_coco.py
@@ -0,0 +1,78 @@
+_base_ = './rtmdet_l_8xb32-300e_coco.py'
+
+norm_cfg = dict(type='GN', num_groups=32)
+checkpoint = 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384_22k.pth' # noqa
+model = dict(
+ type='RTMDet',
+ data_preprocessor=dict(
+ _delete_=True,
+ type='DetDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True,
+ batch_augments=None),
+ backbone=dict(
+ _delete_=True,
+ type='SwinTransformer',
+ pretrain_img_size=384,
+ embed_dims=128,
+ depths=[2, 2, 18, 2],
+ num_heads=[4, 8, 16, 32],
+ window_size=12,
+ mlp_ratio=4,
+ qkv_bias=True,
+ qk_scale=None,
+ drop_rate=0.,
+ attn_drop_rate=0.,
+ drop_path_rate=0.3,
+ patch_norm=True,
+ out_indices=(1, 2, 3),
+ with_cp=True,
+ convert_weights=True,
+ init_cfg=dict(type='Pretrained', checkpoint=checkpoint)),
+ neck=dict(in_channels=[256, 512, 1024], norm_cfg=norm_cfg),
+ bbox_head=dict(norm_cfg=norm_cfg))
+
+max_epochs = 100
+stage2_num_epochs = 10
+interval = 10
+base_lr = 0.001
+
+train_cfg = dict(
+ max_epochs=max_epochs,
+ val_interval=interval,
+ dynamic_intervals=[(max_epochs - stage2_num_epochs, 1)])
+
+optim_wrapper = dict(optimizer=dict(lr=base_lr))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+ # use cosine lr from 50 to 100 epoch
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline={{_base_.train_pipeline_stage2}})
+]
diff --git a/configs/rtmdet/rtmdet_l_swin_b_p6_4xb16-100e_coco.py b/configs/rtmdet/rtmdet_l_swin_b_p6_4xb16-100e_coco.py
new file mode 100644
index 00000000000..37d4215c3f0
--- /dev/null
+++ b/configs/rtmdet/rtmdet_l_swin_b_p6_4xb16-100e_coco.py
@@ -0,0 +1,114 @@
+_base_ = './rtmdet_l_swin_b_4xb32-100e_coco.py'
+
+model = dict(
+ backbone=dict(
+ depths=[2, 2, 18, 2, 1],
+ num_heads=[4, 8, 16, 32, 64],
+ strides=(4, 2, 2, 2, 2),
+ out_indices=(1, 2, 3, 4)),
+ neck=dict(in_channels=[256, 512, 1024, 2048]),
+ bbox_head=dict(
+ anchor_generator=dict(
+ type='MlvlPointGenerator', offset=0, strides=[8, 16, 32, 64])))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile', backend_args={{_base_.backend_args}}),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(type='CachedMosaic', img_scale=(1280, 1280), pad_val=114.0),
+ dict(
+ type='RandomResize',
+ scale=(2560, 2560),
+ ratio_range=(0.1, 2.0),
+ keep_ratio=True),
+ dict(type='RandomCrop', crop_size=(1280, 1280)),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='Pad', size=(1280, 1280), pad_val=dict(img=(114, 114, 114))),
+ dict(
+ type='CachedMixUp',
+ img_scale=(1280, 1280),
+ ratio_range=(1.0, 1.0),
+ max_cached_images=20,
+ pad_val=(114, 114, 114)),
+ dict(type='PackDetInputs')
+]
+
+train_pipeline_stage2 = [
+ dict(type='LoadImageFromFile', backend_args={{_base_.backend_args}}),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='RandomResize',
+ scale=(1280, 1280),
+ ratio_range=(0.1, 2.0),
+ keep_ratio=True),
+ dict(type='RandomCrop', crop_size=(1280, 1280)),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='Pad', size=(1280, 1280), pad_val=dict(img=(114, 114, 114))),
+ dict(type='PackDetInputs')
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile', backend_args={{_base_.backend_args}}),
+ dict(type='Resize', scale=(1280, 1280), keep_ratio=True),
+ dict(type='Pad', size=(1280, 1280), pad_val=dict(img=(114, 114, 114))),
+ dict(type='LoadAnnotations', with_bbox=True),
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor'))
+]
+
+train_dataloader = dict(
+ batch_size=16, num_workers=20, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(num_workers=20, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+max_epochs = 100
+stage2_num_epochs = 10
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+img_scales = [(1280, 1280), (640, 640), (1920, 1920)]
+tta_pipeline = [
+ dict(type='LoadImageFromFile', backend_args=None),
+ dict(
+ type='TestTimeAug',
+ transforms=[
+ [
+ dict(type='Resize', scale=s, keep_ratio=True)
+ for s in img_scales
+ ],
+ [
+ # ``RandomFlip`` must be placed before ``Pad``, otherwise
+ # bounding box coordinates after flipping cannot be
+ # recovered correctly.
+ dict(type='RandomFlip', prob=1.),
+ dict(type='RandomFlip', prob=0.)
+ ],
+ [
+ dict(
+ type='Pad',
+ size=(1920, 1920),
+ pad_val=dict(img=(114, 114, 114))),
+ ],
+ [dict(type='LoadAnnotations', with_bbox=True)],
+ [
+ dict(
+ type='PackDetInputs',
+ meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction'))
+ ]
+ ])
+]
diff --git a/configs/solo/solo_r50_fpn_3x_coco.py b/configs/solo/solo_r50_fpn_3x_coco.py
index 98a9505538c..0d5abbd2f4d 100644
--- a/configs/solo/solo_r50_fpn_3x_coco.py
+++ b/configs/solo/solo_r50_fpn_3x_coco.py
@@ -15,7 +15,7 @@
# training schedule for 3x
max_epochs = 36
-train_cfg = dict(by_epoch=True, max_epochs=max_epochs)
+train_cfg = dict(max_epochs=max_epochs)
# learning rate
param_scheduler = [
diff --git a/demo/MMDet_InstanceSeg_Tutorial.ipynb b/demo/MMDet_InstanceSeg_Tutorial.ipynb
index 1cd020e5750..4b63ba340b2 100644
--- a/demo/MMDet_InstanceSeg_Tutorial.ipynb
+++ b/demo/MMDet_InstanceSeg_Tutorial.ipynb
@@ -411,7 +411,7 @@
"outputs": [
{
"data": {
- "image/png": "\n",
+ "image/png": "",
"text/plain": [
""
]
@@ -540,7 +540,7 @@
"outputs": [
{
"data": {
- "image/png": "\n",
+ "image/png": "",
"text/plain": [
""
]
@@ -751,7 +751,7 @@
" annotations = []\n",
" images = []\n",
" obj_count = 0\n",
- " for idx, v in enumerate(mmengine.track_iter_progress(data_infos.values())):\n",
+ " for idx, v in enumerate(mmengine.track_iter_progress(list(data_infos.values()))):\n",
" filename = v['filename']\n",
" img_path = osp.join(image_prefix, filename)\n",
" height, width = mmcv.imread(img_path).shape[:2]\n",
@@ -2088,7 +2088,7 @@
"outputs": [
{
"data": {
- "image/png": "\n",
+ "image/png": "",
"text/plain": [
""
]
diff --git a/demo/image_demo.py b/demo/image_demo.py
index 2e2c27adbf2..1f994cb40ea 100644
--- a/demo/image_demo.py
+++ b/demo/image_demo.py
@@ -28,6 +28,25 @@
glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
--texts 'There are a lot of cars here.'
+ python demo/image_demo.py demo/demo.jpg \
+ glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
+ --texts '$: coco'
+
+ python demo/image_demo.py demo/demo.jpg \
+ glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
+ --texts '$: lvis' --pred-score-thr 0.7 \
+ --palette random --chunked-size 80
+
+ python demo/image_demo.py demo/demo.jpg \
+ grounding_dino_swin-t_pretrain_obj365_goldg_cap4m \
+ --texts '$: lvis' --pred-score-thr 0.4 \
+ --palette random --chunked-size 80
+
+ python demo/image_demo.py demo/demo.jpg \
+ grounding_dino_swin-t_pretrain_obj365_goldg_cap4m \
+ --texts "a red car in the upper right corner" \
+ --tokens-positive -1
+
Visualize prediction results::
python demo/image_demo.py demo/demo.jpg rtmdet-ins-s --show
@@ -36,11 +55,13 @@
--show
"""
+import ast
from argparse import ArgumentParser
from mmengine.logging import print_log
from mmdet.apis import DetInferencer
+from mmdet.evaluation import get_classes
def parse_args():
@@ -60,7 +81,12 @@ def parse_args():
type=str,
default='outputs',
help='Output directory of images or prediction results.')
- parser.add_argument('--texts', help='text prompt')
+ # Once you input a format similar to $: xxx, it indicates that
+ # the prompt is based on the dataset class name.
+ # support $: coco, $: voc, $: cityscapes, $: lvis, $: imagenet_det.
+ # detail to `mmdet/evaluation/functional/class_names.py`
+ parser.add_argument(
+ '--texts', help='text prompt, such as "bench . car .", "$: coco"')
parser.add_argument(
'--device', default='cuda:0', help='Device used for inference')
parser.add_argument(
@@ -91,7 +117,7 @@ def parse_args():
default='none',
choices=['coco', 'voc', 'citys', 'random', 'none'],
help='Color palette used for visualization')
- # only for GLIP
+ # only for GLIP and Grounding DINO
parser.add_argument(
'--custom-entities',
'-c',
@@ -99,6 +125,22 @@ def parse_args():
help='Whether to customize entity names? '
'If so, the input text should be '
'"cls_name1 . cls_name2 . cls_name3 ." format')
+ parser.add_argument(
+ '--chunked-size',
+ '-s',
+ type=int,
+ default=-1,
+ help='If the number of categories is very large, '
+ 'you can specify this parameter to truncate multiple predictions.')
+ # only for Grounding DINO
+ parser.add_argument(
+ '--tokens-positive',
+ '-p',
+ type=str,
+ help='Used to specify which locations in the input text are of '
+ 'interest to the user. -1 indicates that no area is of interest, '
+ 'None indicates ignoring this parameter. '
+ 'The two-dimensional array represents the start and end positions.')
call_args = vars(parser.parse_args())
@@ -111,6 +153,16 @@ def parse_args():
call_args['weights'] = call_args['model']
call_args['model'] = None
+ if call_args['texts'] is not None:
+ if call_args['texts'].startswith('$:'):
+ dataset_name = call_args['texts'][3:].strip()
+ class_names = get_classes(dataset_name)
+ call_args['texts'] = [tuple(class_names)]
+
+ if call_args['tokens_positive'] is not None:
+ call_args['tokens_positive'] = ast.literal_eval(
+ call_args['tokens_positive'])
+
init_kws = ['model', 'weights', 'device', 'palette']
init_args = {}
for init_kw in init_kws:
@@ -125,6 +177,10 @@ def main():
# may consume too much memory if your input folder has a lot of images.
# We will be optimized later.
inferencer = DetInferencer(**init_args)
+
+ chunked_size = call_args.pop('chunked_size')
+ inferencer.model.test_cfg.chunked_size = chunked_size
+
inferencer(**call_args)
if call_args['out_dir'] != '' and not (call_args['no_save_vis']
diff --git a/docker/serve/Dockerfile b/docker/serve/Dockerfile
index 872918972f0..aa307cf6963 100644
--- a/docker/serve/Dockerfile
+++ b/docker/serve/Dockerfile
@@ -4,7 +4,7 @@ ARG CUDNN="8"
FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
ARG MMCV="2.0.0rc4"
-ARG MMDET="3.2.0"
+ARG MMDET="3.3.0"
ENV PYTHONUNBUFFERED TRUE
diff --git a/docker/serve_cn/Dockerfile b/docker/serve_cn/Dockerfile
index 510906432b7..894e15dd714 100644
--- a/docker/serve_cn/Dockerfile
+++ b/docker/serve_cn/Dockerfile
@@ -4,7 +4,7 @@ ARG CUDNN="8"
FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
ARG MMCV="2.0.0rc4"
-ARG MMDET="3.2.0"
+ARG MMDET="3.3.0"
ENV PYTHONUNBUFFERED TRUE
diff --git a/docs/en/notes/changelog.md b/docs/en/notes/changelog.md
index 4d48a0a0d22..00ed8f1c1e4 100644
--- a/docs/en/notes/changelog.md
+++ b/docs/en/notes/changelog.md
@@ -1,6 +1,34 @@
# Changelog of v3.x
-## v3.1.0 (12/10/2023)
+## v3.3.0 (05/01/2024)
+
+### Highlights
+
+Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community.
+
+### New Features
+
+- Add RTMDet Swin / ConvNeXt backbone and results (#11259)
+- Add `odinw` configs and evaluation results of `GLIP` (#11175)
+- Add optional score threshold option to `coco_error_analysis.py` (#11117)
+- Add new configs for `panoptic_fpn` (#11109)
+- Replace partially weighted download links with OpenXLab for the `Faster-RCNN` (#11173)
+
+### Bug Fixes
+
+- Fix `Grounding DINO` nan when class tokens exceeds 256 (#11066)
+- Fix the `CO-DETR` config files error (#11325)
+- Fix `CO-DETR` load_from url in config (#11220)
+- Fixed mask shape after Albu postprocess (#11280)
+- Fix bug in `convert_coco_format` and `youtubevis2coco` (#11251, #11086)
+
+### Contributors
+
+A total of 15 developers contributed to this release.
+
+Thank @adnan-mujagic, @Cycyes, @ilcopione, @returnL, @honeybadger78, @okotaku, @xushilin1, @keyhsw, @guyleaf, @Crescent-Saturn, @LRJKD, @aaronzs, @Divadi, @AwePhD, @hhaAndroid
+
+## v3.2.0 (12/10/2023)
### Highlights
diff --git a/docs/en/notes/faq.md b/docs/en/notes/faq.md
index 9e3c1a7852b..f1a176e4d04 100644
--- a/docs/en/notes/faq.md
+++ b/docs/en/notes/faq.md
@@ -47,6 +47,7 @@ Compatible MMDetection, MMEngine, and MMCV versions are shown as below. Please c
| MMDetection version | MMCV version | MMEngine version |
| :-----------------: | :---------------------: | :----------------------: |
| main | mmcv>=2.0.0, \<2.2.0 | mmengine>=0.7.1, \<1.0.0 |
+| 3.3.0 | mmcv>=2.0.0, \<2.2.0 | mmengine>=0.7.1, \<1.0.0 |
| 3.2.0 | mmcv>=2.0.0, \<2.2.0 | mmengine>=0.7.1, \<1.0.0 |
| 3.1.0 | mmcv>=2.0.0, \<2.1.0 | mmengine>=0.7.1, \<1.0.0 |
| 3.0.0 | mmcv>=2.0.0, \<2.1.0 | mmengine>=0.7.1, \<1.0.0 |
diff --git a/docs/en/user_guides/tracking_dataset_prepare.md b/docs/en/user_guides/tracking_dataset_prepare.md
index 2c38569c9a1..56a4b77fc6e 100644
--- a/docs/en/user_guides/tracking_dataset_prepare.md
+++ b/docs/en/user_guides/tracking_dataset_prepare.md
@@ -92,10 +92,10 @@ python ./tools/dataset_converters/mot2reid.py -i ./data/MOT17/ -o ./data/MOT17/r
python ./tools/dataset_converters/crowdhuman2coco.py -i ./data/crowdhuman -o ./data/crowdhuman/annotations
# YouTube-VIS 2019
-python ./tools/dataset_converters/youtubevis/youtubevis2coco.py -i ./data/youtube_vis_2019 -o ./data/youtube_vis_2019/annotations --version 2019
+python ./tools/dataset_converters/youtubevis2coco.py -i ./data/youtube_vis_2019 -o ./data/youtube_vis_2019/annotations --version 2019
# YouTube-VIS 2021
-python ./tools/dataset_converters/youtubevis/youtubevis2coco.py -i ./data/youtube_vis_2021 -o ./data/youtube_vis_2021/annotations --version 2021
+python ./tools/dataset_converters/youtubevis2coco.py -i ./data/youtube_vis_2021 -o ./data/youtube_vis_2021/annotations --version 2021
```
diff --git a/docs/zh_cn/notes/faq.md b/docs/zh_cn/notes/faq.md
index 8268bd11562..2b4237c7411 100644
--- a/docs/zh_cn/notes/faq.md
+++ b/docs/zh_cn/notes/faq.md
@@ -47,6 +47,7 @@ export DYNAMO_CACHE_SIZE_LIMIT = 4
| MMDetection 版本 | MMCV 版本 | MMEngine 版本 |
| :--------------: | :---------------------: | :----------------------: |
| main | mmcv>=2.0.0, \<2.2.0 | mmengine>=0.7.1, \<1.0.0 |
+ | 3.3.0 | mmcv>=2.0.0, \<2.2.0 | mmengine>=0.7.1, \<1.0.0 |
| 3.2.0 | mmcv>=2.0.0, \<2.2.0 | mmengine>=0.7.1, \<1.0.0 |
| 3.1.0 | mmcv>=2.0.0, \<2.1.0 | mmengine>=0.7.1, \<1.0.0 |
| 3.0.0 | mmcv>=2.0.0, \<2.1.0 | mmengine>=0.7.1, \<1.0.0 |
diff --git a/docs/zh_cn/user_guides/dataset_prepare.md b/docs/zh_cn/user_guides/dataset_prepare.md
index a8bf32011a7..1caad856af0 100644
--- a/docs/zh_cn/user_guides/dataset_prepare.md
+++ b/docs/zh_cn/user_guides/dataset_prepare.md
@@ -305,3 +305,58 @@ mim download mmdet --dataset voc2012
# download coco2017 and preprocess by MIM
mim download mmdet --dataset coco2017
```
+
+### ODinW 数据集准备
+
+ODinW 数据集来自 GLIP 论文,用于评估预训练模型泛化性能。一共包括 ODinW-13 和 ODinW-35 两个版本,其中 ODinW-35 包括了 ODinW-13 的所有数据。 目前数据托管在 [huggingface](https://huggingface.co/GLIPModel/GLIP)
+
+请确保你提前安装好了 [git lfs](https://git-lfs.com), 然后按照如下命令下载
+
+```shell
+cd mmdetection
+
+git lfs install
+# 我们不需要下载权重
+GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/GLIPModel/GLIP
+
+cd GLIP
+git lfs pull --include="odinw_35"
+```
+
+下载完成后,目录结构如下所示:
+
+```text
+mmdetection
+├── GLIP
+| ├── odinw_35
+| | ├── AerialMaritimeDrone.zip
+| | ├── AmericanSignLanguageLetters.zip
+...
+```
+
+你可以采用如下命令全部解压并移动到 `mmdetection/data` 路径下:
+
+```shell
+#!/bin/bash
+
+folder="./GLIP/odinw_35/"
+
+find "$folder" -type f -name "*.zip" | while read -r file; do
+ unzip "$file" -d "$(dirname "$file")"
+done
+
+mv GLIP/odinw_35 data/
+```
+
+最终结构如下所示:
+
+```text
+mmdetection
+├── tools
+├── configs
+├── data
+| ├── odinw_35
+| | ├── AerialMaritimeDrone
+...
+│ ├── coco
+```
diff --git a/docs/zh_cn/user_guides/tracking_dataset_prepare.md b/docs/zh_cn/user_guides/tracking_dataset_prepare.md
index c99f1885e05..0db495b54c9 100644
--- a/docs/zh_cn/user_guides/tracking_dataset_prepare.md
+++ b/docs/zh_cn/user_guides/tracking_dataset_prepare.md
@@ -90,10 +90,10 @@ python ./tools/dataset_converters/mot2reid.py -i ./data/MOT17/ -o ./data/MOT17/r
python ./tools/dataset_converters/crowdhuman2coco.py -i ./data/crowdhuman -o ./data/crowdhuman/annotations
# YouTube-VIS 2019
-python ./tools/dataset_converters/youtubevis/youtubevis2coco.py -i ./data/youtube_vis_2019 -o ./data/youtube_vis_2019/annotations --version 2019
+python ./tools/dataset_converters/youtubevis2coco.py -i ./data/youtube_vis_2019 -o ./data/youtube_vis_2019/annotations --version 2019
# YouTube-VIS 2021
-python ./tools/dataset_converters/youtubevis/youtubevis2coco.py -i ./data/youtube_vis_2021 -o ./data/youtube_vis_2021/annotations --version 2021
+python ./tools/dataset_converters/youtubevis2coco.py -i ./data/youtube_vis_2021 -o ./data/youtube_vis_2021/annotations --version 2021
```
diff --git a/mmdet/apis/det_inferencer.py b/mmdet/apis/det_inferencer.py
index 9efbb00cbe9..ce8532eb786 100644
--- a/mmdet/apis/det_inferencer.py
+++ b/mmdet/apis/det_inferencer.py
@@ -313,8 +313,10 @@ def __call__(
texts: Optional[Union[str, list]] = None,
# by open panoptic task
stuff_texts: Optional[Union[str, list]] = None,
- # by GLIP
+ # by GLIP and Grounding DINO
custom_entities: bool = False,
+ # by Grounding DINO
+ tokens_positive: Optional[Union[int, list]] = None,
**kwargs) -> dict:
"""Call the inferencer.
@@ -343,7 +345,7 @@ def __call__(
stuff_texts (str | list[str]): Stuff text prompts of open
panoptic task. Defaults to None.
custom_entities (bool): Whether to use custom entities.
- Defaults to False. Only used in GLIP.
+ Defaults to False. Only used in GLIP and Grounding DINO.
**kwargs: Other keyword arguments passed to :meth:`preprocess`,
:meth:`forward`, :meth:`visualize` and :meth:`postprocess`.
Each key in kwargs should be in the corresponding set of
@@ -366,6 +368,10 @@ def __call__(
texts = [texts] * len(ori_inputs)
if stuff_texts is not None and isinstance(stuff_texts, str):
stuff_texts = [stuff_texts] * len(ori_inputs)
+
+ # Currently only supports bs=1
+ tokens_positive = [tokens_positive] * len(ori_inputs)
+
if texts is not None:
assert len(texts) == len(ori_inputs)
for i in range(len(texts)):
@@ -373,13 +379,15 @@ def __call__(
ori_inputs[i] = {
'text': texts[i],
'img_path': ori_inputs[i],
- 'custom_entities': custom_entities
+ 'custom_entities': custom_entities,
+ 'tokens_positive': tokens_positive[i]
}
else:
ori_inputs[i] = {
'text': texts[i],
'img': ori_inputs[i],
- 'custom_entities': custom_entities
+ 'custom_entities': custom_entities,
+ 'tokens_positive': tokens_positive[i]
}
if stuff_texts is not None:
assert len(stuff_texts) == len(ori_inputs)
diff --git a/mmdet/configs/panoptic_fpn/panoptic_fpn_r101_fpn_1x_coco.py b/mmdet/configs/panoptic_fpn/panoptic_fpn_r101_fpn_1x_coco.py
new file mode 100644
index 00000000000..c6059780da1
--- /dev/null
+++ b/mmdet/configs/panoptic_fpn/panoptic_fpn_r101_fpn_1x_coco.py
@@ -0,0 +1,13 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from mmengine.config import read_base
+from mmengine.model.weight_init import PretrainedInit
+
+with read_base():
+ from .panoptic_fpn_r50_fpn_1x_coco import *
+
+model.update(
+ dict(
+ backbone=dict(
+ depth=101,
+ init_cfg=dict(
+ type=PretrainedInit, checkpoint='torchvision://resnet101'))))
diff --git a/mmdet/configs/panoptic_fpn/panoptic_fpn_r101_fpn_ms_3x_coco.py b/mmdet/configs/panoptic_fpn/panoptic_fpn_r101_fpn_ms_3x_coco.py
new file mode 100644
index 00000000000..c02c3237f81
--- /dev/null
+++ b/mmdet/configs/panoptic_fpn/panoptic_fpn_r101_fpn_ms_3x_coco.py
@@ -0,0 +1,13 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from mmengine.config import read_base
+from mmengine.model.weight_init import PretrainedInit
+
+with read_base():
+ from .panoptic_fpn_r50_fpn_ms_3x_coco import *
+
+model.update(
+ dict(
+ backbone=dict(
+ depth=101,
+ init_cfg=dict(
+ type=PretrainedInit, checkpoint='torchvision://resnet101'))))
diff --git a/mmdet/configs/panoptic_fpn/panoptic_fpn_r50_fpn_ms_3x_coco.py b/mmdet/configs/panoptic_fpn/panoptic_fpn_r50_fpn_ms_3x_coco.py
new file mode 100644
index 00000000000..25ebe5d67c4
--- /dev/null
+++ b/mmdet/configs/panoptic_fpn/panoptic_fpn_r50_fpn_ms_3x_coco.py
@@ -0,0 +1,45 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from mmengine.config import read_base
+from mmengine.optim.scheduler.lr_scheduler import LinearLR, MultiStepLR
+
+with read_base():
+ from .panoptic_fpn_r50_fpn_1x_coco import *
+
+from mmcv.transforms import RandomResize
+from mmcv.transforms.loading import LoadImageFromFile
+
+from mmdet.datasets.transforms.formatting import PackDetInputs
+from mmdet.datasets.transforms.loading import LoadPanopticAnnotations
+from mmdet.datasets.transforms.transforms import RandomFlip
+
+# In mstrain 3x config, img_scale=[(1333, 640), (1333, 800)],
+# multiscale_mode='range'
+train_pipeline = [
+ dict(type=LoadImageFromFile),
+ dict(
+ type=LoadPanopticAnnotations,
+ with_bbox=True,
+ with_mask=True,
+ with_seg=True),
+ dict(type=RandomResize, scale=[(1333, 640), (1333, 800)], keep_ratio=True),
+ dict(type=RandomFlip, prob=0.5),
+ dict(type=PackDetInputs)
+]
+
+train_dataloader.update(dict(dataset=dict(pipeline=train_pipeline)))
+
+# TODO: Use RepeatDataset to speed up training
+# training schedule for 3x
+train_cfg.update(dict(max_epochs=36, val_interval=3))
+
+# learning rate
+param_scheduler = [
+ dict(type=LinearLR, start_factor=0.001, by_epoch=False, begin=0, end=500),
+ dict(
+ type=MultiStepLR,
+ begin=0,
+ end=36,
+ by_epoch=True,
+ milestones=[24, 33],
+ gamma=0.1)
+]
diff --git a/mmdet/datasets/__init__.py b/mmdet/datasets/__init__.py
index 044efe4cad7..670c207cacf 100644
--- a/mmdet/datasets/__init__.py
+++ b/mmdet/datasets/__init__.py
@@ -12,17 +12,22 @@
from .crowdhuman import CrowdHumanDataset
from .dataset_wrappers import ConcatDataset, MultiImageMixDataset
from .deepfashion import DeepFashionDataset
+from .dod import DODDataset
from .dsdl import DSDLDetDataset
+from .flickr30k import Flickr30kDataset
from .isaid import iSAIDDataset
from .lvis import LVISDataset, LVISV1Dataset, LVISV05Dataset
+from .mdetr_style_refcoco import MDETRStyleRefCocoDataset
from .mot_challenge_dataset import MOTChallengeDataset
from .objects365 import Objects365V1Dataset, Objects365V2Dataset
+from .odvg import ODVGDataset
from .openimages import OpenImagesChallengeDataset, OpenImagesDataset
from .refcoco import RefCocoDataset
from .reid_dataset import ReIDDataset
from .samplers import (AspectRatioBatchSampler, ClassAwareSampler,
- GroupMultiSourceSampler, MultiSourceSampler,
- TrackAspectRatioBatchSampler, TrackImgSampler)
+ CustomSampleSizeSampler, GroupMultiSourceSampler,
+ MultiSourceSampler, TrackAspectRatioBatchSampler,
+ TrackImgSampler)
from .utils import get_loading_pipeline
from .v3det import V3DetDataset
from .voc import VOCDataset
@@ -42,5 +47,7 @@
'ReIDDataset', 'YouTubeVISDataset', 'TrackAspectRatioBatchSampler',
'ADE20KPanopticDataset', 'CocoCaptionDataset', 'RefCocoDataset',
'BaseSegDataset', 'ADE20KSegDataset', 'CocoSegDataset',
- 'ADE20KInstanceDataset', 'iSAIDDataset', 'V3DetDataset', 'ConcatDataset'
+ 'ADE20KInstanceDataset', 'iSAIDDataset', 'V3DetDataset', 'ConcatDataset',
+ 'ODVGDataset', 'MDETRStyleRefCocoDataset', 'DODDataset',
+ 'CustomSampleSizeSampler', 'Flickr30kDataset'
]
diff --git a/mmdet/datasets/api_wrappers/coco_api.py b/mmdet/datasets/api_wrappers/coco_api.py
index 40f7f2c9b93..b2d11a122e1 100644
--- a/mmdet/datasets/api_wrappers/coco_api.py
+++ b/mmdet/datasets/api_wrappers/coco_api.py
@@ -92,7 +92,7 @@ def createIndex(self) -> None:
if 'images' in self.dataset:
for img_info in self.dataset['images']:
img_info['segm_file'] = img_info['file_name'].replace(
- 'jpg', 'png')
+ '.jpg', '.png')
imgs[img_info['id']] = img_info
if 'categories' in self.dataset:
diff --git a/mmdet/datasets/base_det_dataset.py b/mmdet/datasets/base_det_dataset.py
index 57bc7098387..8b3876d5c06 100644
--- a/mmdet/datasets/base_det_dataset.py
+++ b/mmdet/datasets/base_det_dataset.py
@@ -21,6 +21,8 @@ class BaseDetDataset(BaseDataset):
corresponding backend. Defaults to None.
return_classes (bool): Whether to return class information
for open vocabulary-based algorithms. Defaults to False.
+ caption_prompt (dict, optional): Prompt for captioning.
+ Defaults to None.
"""
def __init__(self,
@@ -30,11 +32,16 @@ def __init__(self,
file_client_args: dict = None,
backend_args: dict = None,
return_classes: bool = False,
+ caption_prompt: Optional[dict] = None,
**kwargs) -> None:
self.seg_map_suffix = seg_map_suffix
self.proposal_file = proposal_file
self.backend_args = backend_args
self.return_classes = return_classes
+ self.caption_prompt = caption_prompt
+ if self.caption_prompt is not None:
+ assert self.return_classes, \
+ 'return_classes must be True when using caption_prompt'
if file_client_args is not None:
raise RuntimeError(
'The `file_client_args` is deprecated, '
diff --git a/mmdet/datasets/coco.py b/mmdet/datasets/coco.py
index 277b75988da..1cf21c4e667 100644
--- a/mmdet/datasets/coco.py
+++ b/mmdet/datasets/coco.py
@@ -129,6 +129,7 @@ def parse_data_info(self, raw_data_info: dict) -> Union[dict, List[dict]]:
if self.return_classes:
data_info['text'] = self.metainfo['classes']
+ data_info['caption_prompt'] = self.caption_prompt
data_info['custom_entities'] = True
instances = []
diff --git a/mmdet/datasets/coco_panoptic.py b/mmdet/datasets/coco_panoptic.py
index d5ca7855509..b7a200e01d3 100644
--- a/mmdet/datasets/coco_panoptic.py
+++ b/mmdet/datasets/coco_panoptic.py
@@ -208,7 +208,7 @@ def parse_data_info(self, raw_data_info: dict) -> dict:
if self.data_prefix.get('seg', None):
seg_map_path = osp.join(
self.data_prefix['seg'],
- img_info['file_name'].replace('jpg', 'png'))
+ img_info['file_name'].replace('.jpg', '.png'))
else:
seg_map_path = None
data_info['img_path'] = img_path
diff --git a/mmdet/datasets/dataset_wrappers.py b/mmdet/datasets/dataset_wrappers.py
index e651e2b9902..d4e26e07c0f 100644
--- a/mmdet/datasets/dataset_wrappers.py
+++ b/mmdet/datasets/dataset_wrappers.py
@@ -247,6 +247,14 @@ def __init__(self,
if not lazy_init:
self.full_init()
+ if is_all_same:
+ self._metainfo.update(
+ dict(cumulative_sizes=self.cumulative_sizes))
+ else:
+ for i, dataset in enumerate(self.datasets):
+ self._metainfo[i].update(
+ dict(cumulative_sizes=self.cumulative_sizes))
+
def get_dataset_source(self, idx: int) -> int:
dataset_idx, _ = self._get_ori_dataset_idx(idx)
return dataset_idx
diff --git a/mmdet/datasets/dod.py b/mmdet/datasets/dod.py
new file mode 100644
index 00000000000..152d32aaf70
--- /dev/null
+++ b/mmdet/datasets/dod.py
@@ -0,0 +1,78 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import os.path as osp
+from typing import List, Optional
+
+import numpy as np
+
+from mmdet.registry import DATASETS
+from .base_det_dataset import BaseDetDataset
+
+try:
+ from d_cube import D3
+except ImportError:
+ D3 = None
+from .api_wrappers import COCO
+
+
+@DATASETS.register_module()
+class DODDataset(BaseDetDataset):
+
+ def __init__(self,
+ *args,
+ data_root: Optional[str] = '',
+ data_prefix: dict = dict(img_path=''),
+ **kwargs) -> None:
+ if D3 is None:
+ raise ImportError(
+ 'Please install d3 by `pip install ddd-dataset`.')
+ pkl_anno_path = osp.join(data_root, data_prefix['anno'])
+ self.img_root = osp.join(data_root, data_prefix['img'])
+ self.d3 = D3(self.img_root, pkl_anno_path)
+
+ sent_infos = self.d3.load_sents()
+ classes = tuple([sent_info['raw_sent'] for sent_info in sent_infos])
+ super().__init__(
+ *args,
+ data_root=data_root,
+ data_prefix=data_prefix,
+ metainfo={'classes': classes},
+ **kwargs)
+
+ def load_data_list(self) -> List[dict]:
+ coco = COCO(self.ann_file)
+ data_list = []
+ img_ids = self.d3.get_img_ids()
+ for img_id in img_ids:
+ data_info = {}
+
+ img_info = self.d3.load_imgs(img_id)[0]
+ file_name = img_info['file_name']
+ img_path = osp.join(self.img_root, file_name)
+ data_info['img_path'] = img_path
+ data_info['img_id'] = img_id
+ data_info['height'] = img_info['height']
+ data_info['width'] = img_info['width']
+
+ group_ids = self.d3.get_group_ids(img_ids=[img_id])
+ sent_ids = self.d3.get_sent_ids(group_ids=group_ids)
+ sent_list = self.d3.load_sents(sent_ids=sent_ids)
+ text_list = [sent['raw_sent'] for sent in sent_list]
+ ann_ids = coco.get_ann_ids(img_ids=[img_id])
+ anno = coco.load_anns(ann_ids)
+
+ data_info['text'] = text_list
+ data_info['sent_ids'] = np.array([s for s in sent_ids])
+ data_info['custom_entities'] = True
+
+ instances = []
+ for i, ann in enumerate(anno):
+ instance = {}
+ x1, y1, w, h = ann['bbox']
+ bbox = [x1, y1, x1 + w, y1 + h]
+ instance['ignore_flag'] = 0
+ instance['bbox'] = bbox
+ instance['bbox_label'] = ann['category_id'] - 1
+ instances.append(instance)
+ data_info['instances'] = instances
+ data_list.append(data_info)
+ return data_list
diff --git a/mmdet/datasets/flickr30k.py b/mmdet/datasets/flickr30k.py
new file mode 100644
index 00000000000..0c76a41bc96
--- /dev/null
+++ b/mmdet/datasets/flickr30k.py
@@ -0,0 +1,81 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import os.path as osp
+from typing import List
+
+from pycocotools.coco import COCO
+
+from mmdet.registry import DATASETS
+from .base_det_dataset import BaseDetDataset
+
+
+def convert_phrase_ids(phrase_ids: list) -> list:
+ unique_elements = sorted(set(phrase_ids))
+ element_to_new_label = {
+ element: label
+ for label, element in enumerate(unique_elements)
+ }
+ phrase_ids = [element_to_new_label[element] for element in phrase_ids]
+ return phrase_ids
+
+
+@DATASETS.register_module()
+class Flickr30kDataset(BaseDetDataset):
+ """Flickr30K Dataset."""
+
+ def load_data_list(self) -> List[dict]:
+
+ self.coco = COCO(self.ann_file)
+
+ self.ids = sorted(list(self.coco.imgs.keys()))
+
+ data_list = []
+ for img_id in self.ids:
+ if isinstance(img_id, str):
+ ann_ids = self.coco.getAnnIds(imgIds=[img_id], iscrowd=None)
+ else:
+ ann_ids = self.coco.getAnnIds(imgIds=img_id, iscrowd=None)
+
+ coco_img = self.coco.loadImgs(img_id)[0]
+
+ caption = coco_img['caption']
+ file_name = coco_img['file_name']
+ img_path = osp.join(self.data_prefix['img'], file_name)
+ width = coco_img['width']
+ height = coco_img['height']
+ tokens_positive = coco_img['tokens_positive_eval']
+ phrases = [caption[i[0][0]:i[0][1]] for i in tokens_positive]
+ phrase_ids = []
+
+ instances = []
+ annos = self.coco.loadAnns(ann_ids)
+ for anno in annos:
+ instance = {
+ 'bbox': [
+ anno['bbox'][0], anno['bbox'][1],
+ anno['bbox'][0] + anno['bbox'][2],
+ anno['bbox'][1] + anno['bbox'][3]
+ ],
+ 'bbox_label':
+ anno['category_id'],
+ 'ignore_flag':
+ anno['iscrowd']
+ }
+ phrase_ids.append(anno['phrase_ids'])
+ instances.append(instance)
+
+ phrase_ids = convert_phrase_ids(phrase_ids)
+
+ data_list.append(
+ dict(
+ img_path=img_path,
+ img_id=img_id,
+ height=height,
+ width=width,
+ instances=instances,
+ text=caption,
+ phrase_ids=phrase_ids,
+ tokens_positive=tokens_positive,
+ phrases=phrases,
+ ))
+
+ return data_list
diff --git a/mmdet/datasets/mdetr_style_refcoco.py b/mmdet/datasets/mdetr_style_refcoco.py
new file mode 100644
index 00000000000..cc56dec49db
--- /dev/null
+++ b/mmdet/datasets/mdetr_style_refcoco.py
@@ -0,0 +1,57 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import os.path as osp
+from typing import List
+
+from mmengine.fileio import get_local_path
+
+from mmdet.datasets import BaseDetDataset
+from mmdet.registry import DATASETS
+from .api_wrappers import COCO
+
+
+@DATASETS.register_module()
+class MDETRStyleRefCocoDataset(BaseDetDataset):
+ """RefCOCO dataset.
+
+ Only support evaluation now.
+ """
+
+ def load_data_list(self) -> List[dict]:
+ with get_local_path(
+ self.ann_file, backend_args=self.backend_args) as local_path:
+ coco = COCO(local_path)
+
+ img_ids = coco.get_img_ids()
+
+ data_infos = []
+ for img_id in img_ids:
+ raw_img_info = coco.load_imgs([img_id])[0]
+ ann_ids = coco.get_ann_ids(img_ids=[img_id])
+ raw_ann_info = coco.load_anns(ann_ids)
+
+ data_info = {}
+ img_path = osp.join(self.data_prefix['img'],
+ raw_img_info['file_name'])
+ data_info['img_path'] = img_path
+ data_info['img_id'] = img_id
+ data_info['height'] = raw_img_info['height']
+ data_info['width'] = raw_img_info['width']
+ data_info['dataset_mode'] = raw_img_info['dataset_name']
+
+ data_info['text'] = raw_img_info['caption']
+ data_info['custom_entities'] = False
+ data_info['tokens_positive'] = -1
+
+ instances = []
+ for i, ann in enumerate(raw_ann_info):
+ instance = {}
+ x1, y1, w, h = ann['bbox']
+ bbox = [x1, y1, x1 + w, y1 + h]
+ instance['bbox'] = bbox
+ instance['bbox_label'] = ann['category_id']
+ instance['ignore_flag'] = 0
+ instances.append(instance)
+
+ data_info['instances'] = instances
+ data_infos.append(data_info)
+ return data_infos
diff --git a/mmdet/datasets/odvg.py b/mmdet/datasets/odvg.py
new file mode 100644
index 00000000000..c73865f2ea7
--- /dev/null
+++ b/mmdet/datasets/odvg.py
@@ -0,0 +1,106 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import json
+import os.path as osp
+from typing import List, Optional
+
+from mmengine.fileio import get_local_path
+
+from mmdet.registry import DATASETS
+from .base_det_dataset import BaseDetDataset
+
+
+@DATASETS.register_module()
+class ODVGDataset(BaseDetDataset):
+ """object detection and visual grounding dataset."""
+
+ def __init__(self,
+ *args,
+ data_root: str = '',
+ label_map_file: Optional[str] = None,
+ need_text: bool = True,
+ **kwargs) -> None:
+ self.dataset_mode = 'VG'
+ self.need_text = need_text
+ if label_map_file:
+ label_map_file = osp.join(data_root, label_map_file)
+ with open(label_map_file, 'r') as file:
+ self.label_map = json.load(file)
+ self.dataset_mode = 'OD'
+ super().__init__(*args, data_root=data_root, **kwargs)
+ assert self.return_classes is True
+
+ def load_data_list(self) -> List[dict]:
+ with get_local_path(
+ self.ann_file, backend_args=self.backend_args) as local_path:
+ with open(local_path, 'r') as f:
+ data_list = [json.loads(line) for line in f]
+
+ out_data_list = []
+ for data in data_list:
+ data_info = {}
+ img_path = osp.join(self.data_prefix['img'], data['filename'])
+ data_info['img_path'] = img_path
+ data_info['height'] = data['height']
+ data_info['width'] = data['width']
+ if self.dataset_mode == 'OD':
+ if self.need_text:
+ data_info['text'] = self.label_map
+ anno = data.get('detection', {})
+ instances = [obj for obj in anno.get('instances', [])]
+ bboxes = [obj['bbox'] for obj in instances]
+ bbox_labels = [str(obj['label']) for obj in instances]
+
+ instances = []
+ for bbox, label in zip(bboxes, bbox_labels):
+ instance = {}
+ x1, y1, x2, y2 = bbox
+ inter_w = max(0, min(x2, data['width']) - max(x1, 0))
+ inter_h = max(0, min(y2, data['height']) - max(y1, 0))
+ if inter_w * inter_h == 0:
+ continue
+ if (x2 - x1) < 1 or (y2 - y1) < 1:
+ continue
+ instance['ignore_flag'] = 0
+ instance['bbox'] = bbox
+ instance['bbox_label'] = int(label)
+ instances.append(instance)
+ data_info['instances'] = instances
+ data_info['dataset_mode'] = self.dataset_mode
+ out_data_list.append(data_info)
+ else:
+ anno = data['grounding']
+ data_info['text'] = anno['caption']
+ regions = anno['regions']
+
+ instances = []
+ phrases = {}
+ for i, region in enumerate(regions):
+ bbox = region['bbox']
+ phrase = region['phrase']
+ tokens_positive = region['tokens_positive']
+ if not isinstance(bbox[0], list):
+ bbox = [bbox]
+ for box in bbox:
+ instance = {}
+ x1, y1, x2, y2 = box
+ inter_w = max(0, min(x2, data['width']) - max(x1, 0))
+ inter_h = max(0, min(y2, data['height']) - max(y1, 0))
+ if inter_w * inter_h == 0:
+ continue
+ if (x2 - x1) < 1 or (y2 - y1) < 1:
+ continue
+ instance['ignore_flag'] = 0
+ instance['bbox'] = box
+ instance['bbox_label'] = i
+ phrases[i] = {
+ 'phrase': phrase,
+ 'tokens_positive': tokens_positive
+ }
+ instances.append(instance)
+ data_info['instances'] = instances
+ data_info['phrases'] = phrases
+ data_info['dataset_mode'] = self.dataset_mode
+ out_data_list.append(data_info)
+
+ del data_list
+ return out_data_list
diff --git a/mmdet/datasets/samplers/__init__.py b/mmdet/datasets/samplers/__init__.py
index a942ff2199c..9ea0e4cb062 100644
--- a/mmdet/datasets/samplers/__init__.py
+++ b/mmdet/datasets/samplers/__init__.py
@@ -3,6 +3,7 @@
MultiDataAspectRatioBatchSampler,
TrackAspectRatioBatchSampler)
from .class_aware_sampler import ClassAwareSampler
+from .custom_sample_size_sampler import CustomSampleSizeSampler
from .multi_data_sampler import MultiDataSampler
from .multi_source_sampler import GroupMultiSourceSampler, MultiSourceSampler
from .track_img_sampler import TrackImgSampler
@@ -11,5 +12,5 @@
'ClassAwareSampler', 'AspectRatioBatchSampler', 'MultiSourceSampler',
'GroupMultiSourceSampler', 'TrackImgSampler',
'TrackAspectRatioBatchSampler', 'MultiDataSampler',
- 'MultiDataAspectRatioBatchSampler'
+ 'MultiDataAspectRatioBatchSampler', 'CustomSampleSizeSampler'
]
diff --git a/mmdet/datasets/samplers/custom_sample_size_sampler.py b/mmdet/datasets/samplers/custom_sample_size_sampler.py
new file mode 100644
index 00000000000..6bedf6c66be
--- /dev/null
+++ b/mmdet/datasets/samplers/custom_sample_size_sampler.py
@@ -0,0 +1,111 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import math
+from typing import Iterator, Optional, Sequence, Sized
+
+import torch
+from mmengine.dist import get_dist_info, sync_random_seed
+from torch.utils.data import Sampler
+
+from mmdet.registry import DATA_SAMPLERS
+from .class_aware_sampler import RandomCycleIter
+
+
+@DATA_SAMPLERS.register_module()
+class CustomSampleSizeSampler(Sampler):
+
+ def __init__(self,
+ dataset: Sized,
+ dataset_size: Sequence[int],
+ ratio_mode: bool = False,
+ seed: Optional[int] = None,
+ round_up: bool = True) -> None:
+ assert len(dataset.datasets) == len(dataset_size)
+ rank, world_size = get_dist_info()
+ self.rank = rank
+ self.world_size = world_size
+
+ self.dataset = dataset
+ if seed is None:
+ seed = sync_random_seed()
+ self.seed = seed
+ self.epoch = 0
+ self.round_up = round_up
+
+ total_size = 0
+ total_size_fake = 0
+ self.dataset_index = []
+ self.dataset_cycle_iter = []
+ new_dataset_size = []
+ for dataset, size in zip(dataset.datasets, dataset_size):
+ self.dataset_index.append(
+ list(range(total_size_fake,
+ len(dataset) + total_size_fake)))
+ total_size_fake += len(dataset)
+ if size == -1:
+ total_size += len(dataset)
+ self.dataset_cycle_iter.append(None)
+ new_dataset_size.append(-1)
+ else:
+ if ratio_mode:
+ size = int(size * len(dataset))
+ assert size <= len(
+ dataset
+ ), f'dataset size {size} is larger than ' \
+ f'dataset length {len(dataset)}'
+ total_size += size
+ new_dataset_size.append(size)
+
+ g = torch.Generator()
+ g.manual_seed(self.seed)
+ self.dataset_cycle_iter.append(
+ RandomCycleIter(self.dataset_index[-1], generator=g))
+ self.dataset_size = new_dataset_size
+
+ if self.round_up:
+ self.num_samples = math.ceil(total_size / world_size)
+ self.total_size = self.num_samples * self.world_size
+ else:
+ self.num_samples = math.ceil((total_size - rank) / world_size)
+ self.total_size = total_size
+
+ def __iter__(self) -> Iterator[int]:
+ """Iterate the indices."""
+ # deterministically shuffle based on epoch and seed
+ g = torch.Generator()
+ g.manual_seed(self.seed + self.epoch)
+
+ out_index = []
+ for data_size, data_index, cycle_iter in zip(self.dataset_size,
+ self.dataset_index,
+ self.dataset_cycle_iter):
+ if data_size == -1:
+ out_index += data_index
+ else:
+ index = [next(cycle_iter) for _ in range(data_size)]
+ out_index += index
+
+ index = torch.randperm(len(out_index), generator=g).numpy().tolist()
+ indices = [out_index[i] for i in index]
+
+ if self.round_up:
+ indices = (
+ indices *
+ int(self.total_size / len(indices) + 1))[:self.total_size]
+ indices = indices[self.rank:self.total_size:self.world_size]
+ return iter(indices)
+
+ def __len__(self) -> int:
+ """The number of samples in this rank."""
+ return self.num_samples
+
+ def set_epoch(self, epoch: int) -> None:
+ """Sets the epoch for this sampler.
+
+ When :attr:`shuffle=True`, this ensures all replicas use a different
+ random ordering for each epoch. Otherwise, the next iteration of this
+ sampler will yield the same ordering.
+
+ Args:
+ epoch (int): Epoch number.
+ """
+ self.epoch = epoch
diff --git a/mmdet/datasets/transforms/__init__.py b/mmdet/datasets/transforms/__init__.py
index 1f30d6c1352..ab3478feb00 100644
--- a/mmdet/datasets/transforms/__init__.py
+++ b/mmdet/datasets/transforms/__init__.py
@@ -13,6 +13,7 @@
LoadEmptyAnnotations, LoadImageFromNDArray,
LoadMultiChannelImageFromFiles, LoadPanopticAnnotations,
LoadProposals, LoadTrackAnnotations)
+from .text_transformers import LoadTextAnnotations, RandomSamplingNegPos
from .transformers_glip import GTBoxSubOne_GLIP, RandomFlip_GLIP
from .transforms import (Albu, CachedMixUp, CachedMosaic, CopyPaste, CutOut,
Expand, FixScaleResize, FixShapeResize,
@@ -39,5 +40,6 @@
'FixShapeResize', 'ProposalBroadcaster', 'InferencerLoader',
'LoadTrackAnnotations', 'BaseFrameSample', 'UniformRefFrameSample',
'PackTrackInputs', 'PackReIDInputs', 'FixScaleResize',
- 'ResizeShortestEdge', 'GTBoxSubOne_GLIP', 'RandomFlip_GLIP'
+ 'ResizeShortestEdge', 'GTBoxSubOne_GLIP', 'RandomFlip_GLIP',
+ 'RandomSamplingNegPos', 'LoadTextAnnotations'
]
diff --git a/mmdet/datasets/transforms/text_transformers.py b/mmdet/datasets/transforms/text_transformers.py
new file mode 100644
index 00000000000..25304d5fe45
--- /dev/null
+++ b/mmdet/datasets/transforms/text_transformers.py
@@ -0,0 +1,255 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import json
+
+from mmcv.transforms import BaseTransform
+
+from mmdet.registry import TRANSFORMS
+from mmdet.structures.bbox import BaseBoxes
+
+try:
+ from transformers import AutoTokenizer
+ from transformers import BertModel as HFBertModel
+except ImportError:
+ AutoTokenizer = None
+ HFBertModel = None
+
+import random
+import re
+
+import numpy as np
+
+
+def clean_name(name):
+ name = re.sub(r'\(.*\)', '', name)
+ name = re.sub(r'_', ' ', name)
+ name = re.sub(r' ', ' ', name)
+ name = name.lower()
+ return name
+
+
+def check_for_positive_overflow(gt_bboxes, gt_labels, text, tokenizer,
+ max_tokens):
+ # Check if we have too many positive labels
+ # generate a caption by appending the positive labels
+ positive_label_list = np.unique(gt_labels).tolist()
+ # random shuffule so we can sample different annotations
+ # at different epochs
+ random.shuffle(positive_label_list)
+
+ kept_lables = []
+ length = 0
+
+ for index, label in enumerate(positive_label_list):
+
+ label_text = clean_name(text[str(label)]) + '. '
+
+ tokenized = tokenizer.tokenize(label_text)
+
+ length += len(tokenized)
+
+ if length > max_tokens:
+ break
+ else:
+ kept_lables.append(label)
+
+ keep_box_index = []
+ keep_gt_labels = []
+ for i in range(len(gt_labels)):
+ if gt_labels[i] in kept_lables:
+ keep_box_index.append(i)
+ keep_gt_labels.append(gt_labels[i])
+
+ return gt_bboxes[keep_box_index], np.array(
+ keep_gt_labels, dtype=np.long), length
+
+
+def generate_senetence_given_labels(positive_label_list, negative_label_list,
+ text):
+ label_to_positions = {}
+
+ label_list = negative_label_list + positive_label_list
+
+ random.shuffle(label_list)
+
+ pheso_caption = ''
+
+ label_remap_dict = {}
+ for index, label in enumerate(label_list):
+
+ start_index = len(pheso_caption)
+
+ pheso_caption += clean_name(text[str(label)])
+
+ end_index = len(pheso_caption)
+
+ if label in positive_label_list:
+ label_to_positions[index] = [[start_index, end_index]]
+ label_remap_dict[int(label)] = index
+
+ # if index != len(label_list) - 1:
+ # pheso_caption += '. '
+ pheso_caption += '. '
+
+ return label_to_positions, pheso_caption, label_remap_dict
+
+
+@TRANSFORMS.register_module()
+class RandomSamplingNegPos(BaseTransform):
+
+ def __init__(self,
+ tokenizer_name,
+ num_sample_negative=85,
+ max_tokens=256,
+ full_sampling_prob=0.5,
+ label_map_file=None):
+ if AutoTokenizer is None:
+ raise RuntimeError(
+ 'transformers is not installed, please install it by: '
+ 'pip install transformers.')
+
+ self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+ self.num_sample_negative = num_sample_negative
+ self.full_sampling_prob = full_sampling_prob
+ self.max_tokens = max_tokens
+ self.label_map = None
+ if label_map_file:
+ with open(label_map_file, 'r') as file:
+ self.label_map = json.load(file)
+
+ def transform(self, results: dict) -> dict:
+ if 'phrases' in results:
+ return self.vg_aug(results)
+ else:
+ return self.od_aug(results)
+
+ def vg_aug(self, results):
+ gt_bboxes = results['gt_bboxes']
+ if isinstance(gt_bboxes, BaseBoxes):
+ gt_bboxes = gt_bboxes.tensor
+ gt_labels = results['gt_bboxes_labels']
+ text = results['text'].lower().strip()
+ if not text.endswith('.'):
+ text = text + '. '
+
+ phrases = results['phrases']
+ # TODO: add neg
+ positive_label_list = np.unique(gt_labels).tolist()
+ label_to_positions = {}
+ for label in positive_label_list:
+ label_to_positions[label] = phrases[label]['tokens_positive']
+
+ results['gt_bboxes'] = gt_bboxes
+ results['gt_bboxes_labels'] = gt_labels
+
+ results['text'] = text
+ results['tokens_positive'] = label_to_positions
+ return results
+
+ def od_aug(self, results):
+ gt_bboxes = results['gt_bboxes']
+ if isinstance(gt_bboxes, BaseBoxes):
+ gt_bboxes = gt_bboxes.tensor
+ gt_labels = results['gt_bboxes_labels']
+
+ if 'text' not in results:
+ assert self.label_map is not None
+ text = self.label_map
+ else:
+ text = results['text']
+
+ original_box_num = len(gt_labels)
+ # If the category name is in the format of 'a/b' (in object365),
+ # we randomly select one of them.
+ for key, value in text.items():
+ if '/' in value:
+ text[key] = random.choice(value.split('/')).strip()
+
+ gt_bboxes, gt_labels, positive_caption_length = \
+ check_for_positive_overflow(gt_bboxes, gt_labels,
+ text, self.tokenizer, self.max_tokens)
+
+ if len(gt_bboxes) < original_box_num:
+ print('WARNING: removed {} boxes due to positive caption overflow'.
+ format(original_box_num - len(gt_bboxes)))
+
+ valid_negative_indexes = list(text.keys())
+
+ positive_label_list = np.unique(gt_labels).tolist()
+ full_negative = self.num_sample_negative
+
+ if full_negative > len(valid_negative_indexes):
+ full_negative = len(valid_negative_indexes)
+
+ outer_prob = random.random()
+
+ if outer_prob < self.full_sampling_prob:
+ # c. probability_full: add both all positive and all negatives
+ num_negatives = full_negative
+ else:
+ if random.random() < 1.0:
+ num_negatives = np.random.choice(max(1, full_negative)) + 1
+ else:
+ num_negatives = full_negative
+
+ # Keep some negatives
+ negative_label_list = set()
+ if num_negatives != -1:
+ if num_negatives > len(valid_negative_indexes):
+ num_negatives = len(valid_negative_indexes)
+
+ for i in np.random.choice(
+ valid_negative_indexes, size=num_negatives, replace=False):
+ if i not in positive_label_list:
+ negative_label_list.add(i)
+
+ random.shuffle(positive_label_list)
+
+ negative_label_list = list(negative_label_list)
+ random.shuffle(negative_label_list)
+
+ negative_max_length = self.max_tokens - positive_caption_length
+ screened_negative_label_list = []
+
+ for negative_label in negative_label_list:
+ label_text = clean_name(text[str(negative_label)]) + '. '
+
+ tokenized = self.tokenizer.tokenize(label_text)
+
+ negative_max_length -= len(tokenized)
+
+ if negative_max_length > 0:
+ screened_negative_label_list.append(negative_label)
+ else:
+ break
+ negative_label_list = screened_negative_label_list
+ label_to_positions, pheso_caption, label_remap_dict = \
+ generate_senetence_given_labels(positive_label_list,
+ negative_label_list, text)
+
+ # label remap
+ if len(gt_labels) > 0:
+ gt_labels = np.vectorize(lambda x: label_remap_dict[x])(gt_labels)
+
+ results['gt_bboxes'] = gt_bboxes
+ results['gt_bboxes_labels'] = gt_labels
+
+ results['text'] = pheso_caption
+ results['tokens_positive'] = label_to_positions
+
+ return results
+
+
+@TRANSFORMS.register_module()
+class LoadTextAnnotations(BaseTransform):
+
+ def transform(self, results: dict) -> dict:
+ if 'phrases' in results:
+ tokens_positive = [
+ phrase['tokens_positive']
+ for phrase in results['phrases'].values()
+ ]
+ results['tokens_positive'] = tokens_positive
+ else:
+ text = results['text']
+ results['text'] = list(text.values())
+ return results
diff --git a/mmdet/datasets/transforms/transforms.py b/mmdet/datasets/transforms/transforms.py
index 4ac2bf75b54..c50b987db33 100644
--- a/mmdet/datasets/transforms/transforms.py
+++ b/mmdet/datasets/transforms/transforms.py
@@ -1766,8 +1766,10 @@ def _postprocess_results(
results['masks'] = np.array(
[results['masks'][i] for i in results['idx_mapper']])
results['masks'] = ori_masks.__class__(
- results['masks'], ori_masks.height, ori_masks.width)
-
+ results['masks'],
+ results['masks'][0].shape[0],
+ results['masks'][0].shape[1],
+ )
if (not len(results['idx_mapper'])
and self.skip_img_without_anno):
return None
diff --git a/mmdet/engine/hooks/__init__.py b/mmdet/engine/hooks/__init__.py
index bfc03693b24..889fa557ade 100644
--- a/mmdet/engine/hooks/__init__.py
+++ b/mmdet/engine/hooks/__init__.py
@@ -7,12 +7,15 @@
from .set_epoch_info_hook import SetEpochInfoHook
from .sync_norm_hook import SyncNormHook
from .utils import trigger_visualization_hook
-from .visualization_hook import DetVisualizationHook, TrackVisualizationHook
+from .visualization_hook import (DetVisualizationHook,
+ GroundingVisualizationHook,
+ TrackVisualizationHook)
from .yolox_mode_switch_hook import YOLOXModeSwitchHook
__all__ = [
'YOLOXModeSwitchHook', 'SyncNormHook', 'CheckInvalidLossHook',
'SetEpochInfoHook', 'MemoryProfilerHook', 'DetVisualizationHook',
'NumClassCheckHook', 'MeanTeacherHook', 'trigger_visualization_hook',
- 'PipelineSwitchHook', 'TrackVisualizationHook'
+ 'PipelineSwitchHook', 'TrackVisualizationHook',
+ 'GroundingVisualizationHook'
]
diff --git a/mmdet/engine/hooks/visualization_hook.py b/mmdet/engine/hooks/visualization_hook.py
index fad0f907ebc..3408186b6ef 100644
--- a/mmdet/engine/hooks/visualization_hook.py
+++ b/mmdet/engine/hooks/visualization_hook.py
@@ -4,6 +4,7 @@
from typing import Optional, Sequence
import mmcv
+import numpy as np
from mmengine.fileio import get
from mmengine.hooks import Hook
from mmengine.runner import Runner
@@ -13,6 +14,8 @@
from mmdet.datasets.samplers import TrackImgSampler
from mmdet.registry import HOOKS
from mmdet.structures import DetDataSample, TrackDataSample
+from mmdet.structures.bbox import BaseBoxes
+from mmdet.visualization.palette import _get_adaptive_scales
@HOOKS.register_module()
@@ -219,7 +222,7 @@ def after_val_iter(self, runner: Runner, batch_idx: int, data_batch: dict,
if self.draw is False:
return
- assert len(outputs) == 1,\
+ assert len(outputs) == 1, \
'only batch_size=1 is supported while validating.'
sampler = runner.val_dataloader.sampler
@@ -310,3 +313,203 @@ def visualize_single_image(self, img_data_sample: DetDataSample,
pred_score_thr=self.score_thr,
out_file=out_file,
step=step)
+
+
+def draw_all_character(visualizer, characters, w):
+ start_index = 2
+ y_index = 5
+ for char in characters:
+ if isinstance(char, str):
+ visualizer.draw_texts(
+ str(char),
+ positions=np.array([start_index, y_index]),
+ colors=(0, 0, 0),
+ font_families='monospace')
+ start_index += len(char) * 8
+ else:
+ visualizer.draw_texts(
+ str(char[0]),
+ positions=np.array([start_index, y_index]),
+ colors=char[1],
+ font_families='monospace')
+ start_index += len(char[0]) * 8
+
+ if start_index > w - 10:
+ start_index = 2
+ y_index += 15
+
+ drawn_text = visualizer.get_image()
+ return drawn_text
+
+
+@HOOKS.register_module()
+class GroundingVisualizationHook(DetVisualizationHook):
+
+ def after_test_iter(self, runner: Runner, batch_idx: int, data_batch: dict,
+ outputs: Sequence[DetDataSample]) -> None:
+ """Run after every testing iterations.
+
+ Args:
+ runner (:obj:`Runner`): The runner of the testing process.
+ batch_idx (int): The index of the current batch in the val loop.
+ data_batch (dict): Data from dataloader.
+ outputs (Sequence[:obj:`DetDataSample`]): A batch of data samples
+ that contain annotations and predictions.
+ """
+ if self.draw is False:
+ return
+
+ if self.test_out_dir is not None:
+ self.test_out_dir = osp.join(runner.work_dir, runner.timestamp,
+ self.test_out_dir)
+ mkdir_or_exist(self.test_out_dir)
+
+ for data_sample in outputs:
+ data_sample = data_sample.cpu()
+
+ self._test_index += 1
+
+ img_path = data_sample.img_path
+ img_bytes = get(img_path, backend_args=self.backend_args)
+ img = mmcv.imfrombytes(img_bytes, channel_order='rgb')
+
+ out_file = None
+ if self.test_out_dir is not None:
+ out_file = osp.basename(img_path)
+ out_file = osp.join(self.test_out_dir, out_file)
+
+ text = data_sample.text
+ if isinstance(text, str): # VG
+ gt_instances = data_sample.gt_instances
+ tokens_positive = data_sample.tokens_positive
+ if 'phrase_ids' in data_sample:
+ # flickr30k
+ gt_labels = data_sample.phrase_ids
+ else:
+ gt_labels = gt_instances.labels
+ gt_bboxes = gt_instances.get('bboxes', None)
+ if gt_bboxes is not None and isinstance(gt_bboxes, BaseBoxes):
+ gt_instances.bboxes = gt_bboxes.tensor
+ print(gt_labels, tokens_positive, gt_bboxes, img_path)
+ pred_instances = data_sample.pred_instances
+ pred_instances = pred_instances[
+ pred_instances.scores > self.score_thr]
+ pred_labels = pred_instances.labels
+ pred_bboxes = pred_instances.bboxes
+ pred_scores = pred_instances.scores
+
+ max_label = 0
+ if len(gt_labels) > 0:
+ max_label = max(gt_labels)
+ if len(pred_labels) > 0:
+ max_label = max(max(pred_labels), max_label)
+
+ max_label = int(max(max_label, 0))
+ palette = np.random.randint(0, 256, size=(max_label + 1, 3))
+ bbox_palette = [tuple(c) for c in palette]
+ # bbox_palette = get_palette('random', max_label + 1)
+ if len(gt_labels) >= len(pred_labels):
+ colors = [bbox_palette[label] for label in gt_labels]
+ else:
+ colors = [bbox_palette[label] for label in pred_labels]
+
+ self._visualizer.set_image(img)
+
+ for label, bbox, color in zip(gt_labels, gt_bboxes, colors):
+ self._visualizer.draw_bboxes(
+ bbox, edge_colors=color, face_colors=color, alpha=0.3)
+ self._visualizer.draw_bboxes(
+ bbox, edge_colors=color, alpha=1)
+
+ drawn_img = self._visualizer.get_image()
+
+ new_image = np.ones(
+ (100, img.shape[1], 3), dtype=np.uint8) * 255
+ self._visualizer.set_image(new_image)
+
+ if tokens_positive == -1: # REC
+ gt_tokens_positive = [[]]
+ else: # Phrase Grounding
+ gt_tokens_positive = [
+ tokens_positive[label] for label in gt_labels
+ ]
+ split_by_character = [char for char in text]
+ characters = []
+ start_index = 0
+ end_index = 0
+ for w in split_by_character:
+ end_index += len(w)
+ is_find = False
+ for i, positive in enumerate(gt_tokens_positive):
+ for p in positive:
+ if start_index >= p[0] and end_index <= p[1]:
+ characters.append([w, colors[i]])
+ is_find = True
+ break
+ if is_find:
+ break
+ if not is_find:
+ characters.append([w, (0, 0, 0)])
+ start_index = end_index
+
+ drawn_text = draw_all_character(self._visualizer, characters,
+ img.shape[1])
+ drawn_gt_img = np.concatenate((drawn_img, drawn_text), axis=0)
+
+ self._visualizer.set_image(img)
+
+ for label, bbox, color in zip(pred_labels, pred_bboxes,
+ colors):
+ self._visualizer.draw_bboxes(
+ bbox, edge_colors=color, face_colors=color, alpha=0.3)
+ self._visualizer.draw_bboxes(
+ bbox, edge_colors=color, alpha=1)
+ print(pred_labels, pred_bboxes, pred_scores, colors)
+ areas = (pred_bboxes[:, 3] - pred_bboxes[:, 1]) * (
+ pred_bboxes[:, 2] - pred_bboxes[:, 0])
+ scales = _get_adaptive_scales(areas)
+ score = [str(round(s.item(), 2)) for s in pred_scores]
+ font_sizes = [int(13 * scales[i]) for i in range(len(scales))]
+ self._visualizer.draw_texts(
+ score,
+ pred_bboxes[:, :2].int(),
+ colors=(255, 255, 255),
+ font_sizes=font_sizes,
+ bboxes=[{
+ 'facecolor': 'black',
+ 'alpha': 0.8,
+ 'pad': 0.7,
+ 'edgecolor': 'none'
+ }] * len(pred_bboxes))
+
+ drawn_img = self._visualizer.get_image()
+
+ new_image = np.ones(
+ (100, img.shape[1], 3), dtype=np.uint8) * 255
+ self._visualizer.set_image(new_image)
+ drawn_text = draw_all_character(self._visualizer, characters,
+ img.shape[1])
+ drawn_pred_img = np.concatenate((drawn_img, drawn_text),
+ axis=0)
+ drawn_img = np.concatenate((drawn_gt_img, drawn_pred_img),
+ axis=1)
+
+ if self.show:
+ self._visualizer.show(
+ drawn_img,
+ win_name=osp.basename(img_path),
+ wait_time=self.wait_time)
+ if out_file is not None:
+ mmcv.imwrite(drawn_img[..., ::-1], out_file)
+ else:
+ self.add_image('test_img', drawn_img, self._test_index)
+ else: # OD
+ self._visualizer.add_datasample(
+ osp.basename(img_path) if self.show else 'test_img',
+ img,
+ data_sample=data_sample,
+ show=self.show,
+ wait_time=self.wait_time,
+ pred_score_thr=self.score_thr,
+ out_file=out_file,
+ step=self._test_index)
diff --git a/mmdet/evaluation/__init__.py b/mmdet/evaluation/__init__.py
index f70dc226d30..126dea092eb 100644
--- a/mmdet/evaluation/__init__.py
+++ b/mmdet/evaluation/__init__.py
@@ -1,3 +1,4 @@
# Copyright (c) OpenMMLab. All rights reserved.
+from .evaluator import * # noqa: F401,F403
from .functional import * # noqa: F401,F403
from .metrics import * # noqa: F401,F403
diff --git a/mmdet/evaluation/evaluator/__init__.py b/mmdet/evaluation/evaluator/__init__.py
new file mode 100644
index 00000000000..6b13fe99548
--- /dev/null
+++ b/mmdet/evaluation/evaluator/__init__.py
@@ -0,0 +1,4 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from .multi_datasets_evaluator import MultiDatasetsEvaluator
+
+__all__ = ['MultiDatasetsEvaluator']
diff --git a/mmdet/evaluation/evaluator/multi_datasets_evaluator.py b/mmdet/evaluation/evaluator/multi_datasets_evaluator.py
new file mode 100644
index 00000000000..5cff1cf210e
--- /dev/null
+++ b/mmdet/evaluation/evaluator/multi_datasets_evaluator.py
@@ -0,0 +1,111 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import warnings
+from collections import OrderedDict
+from typing import Sequence, Union
+
+from mmengine.dist import (broadcast_object_list, collect_results,
+ is_main_process)
+from mmengine.evaluator import BaseMetric, Evaluator
+from mmengine.evaluator.metric import _to_cpu
+from mmengine.registry import EVALUATOR
+
+from mmdet.utils import ConfigType
+
+
+@EVALUATOR.register_module()
+class MultiDatasetsEvaluator(Evaluator):
+ """Wrapper class to compose class: `ConcatDataset` and multiple
+ :class:`BaseMetric` instances.
+ The metrics will be evaluated on each dataset slice separately. The name of
+ the each metric is the concatenation of the dataset prefix, the metric
+ prefix and the key of metric - e.g.
+ `dataset_prefix/metric_prefix/accuracy`.
+
+ Args:
+ metrics (dict or BaseMetric or Sequence): The config of metrics.
+ dataset_prefixes (Sequence[str]): The prefix of each dataset. The
+ length of this sequence should be the same as the length of the
+ datasets.
+ """
+
+ def __init__(self, metrics: Union[ConfigType, BaseMetric, Sequence],
+ dataset_prefixes: Sequence[str]) -> None:
+ super().__init__(metrics)
+ self.dataset_prefixes = dataset_prefixes
+ self._setups = False
+
+ def _get_cumulative_sizes(self):
+ # ConcatDataset have a property `cumulative_sizes`
+ if isinstance(self.dataset_meta, Sequence):
+ dataset_slices = self.dataset_meta[0]['cumulative_sizes']
+ if not self._setups:
+ self._setups = True
+ for dataset_meta, metric in zip(self.dataset_meta,
+ self.metrics):
+ metric.dataset_meta = dataset_meta
+ else:
+ dataset_slices = self.dataset_meta['cumulative_sizes']
+ return dataset_slices
+
+ def evaluate(self, size: int) -> dict:
+ """Invoke ``evaluate`` method of each metric and collect the metrics
+ dictionary.
+
+ Args:
+ size (int): Length of the entire validation dataset. When batch
+ size > 1, the dataloader may pad some data samples to make
+ sure all ranks have the same length of dataset slice. The
+ ``collect_results`` function will drop the padded data based on
+ this size.
+
+ Returns:
+ dict: Evaluation results of all metrics. The keys are the names
+ of the metrics, and the values are corresponding results.
+ """
+ metrics_results = OrderedDict()
+ dataset_slices = self._get_cumulative_sizes()
+ assert len(dataset_slices) == len(self.dataset_prefixes)
+
+ for dataset_prefix, start, end, metric in zip(
+ self.dataset_prefixes, [0] + dataset_slices[:-1],
+ dataset_slices, self.metrics):
+ if len(metric.results) == 0:
+ warnings.warn(
+ f'{metric.__class__.__name__} got empty `self.results`.'
+ 'Please ensure that the processed results are properly '
+ 'added into `self.results` in `process` method.')
+
+ results = collect_results(metric.results, size,
+ metric.collect_device)
+
+ if is_main_process():
+ # cast all tensors in results list to cpu
+ results = _to_cpu(results)
+ _metrics = metric.compute_metrics(
+ results[start:end]) # type: ignore
+
+ if metric.prefix:
+ final_prefix = '/'.join((dataset_prefix, metric.prefix))
+ else:
+ final_prefix = dataset_prefix
+ print(f'================{final_prefix}================')
+ metric_results = {
+ '/'.join((final_prefix, k)): v
+ for k, v in _metrics.items()
+ }
+
+ # Check metric name conflicts
+ for name in metric_results.keys():
+ if name in metrics_results:
+ raise ValueError(
+ 'There are multiple evaluation results with '
+ f'the same metric name {name}. Please make '
+ 'sure all metrics have different prefixes.')
+ metrics_results.update(metric_results)
+ metric.results.clear()
+ if is_main_process():
+ metrics_results = [metrics_results]
+ else:
+ metrics_results = [None] # type: ignore
+ broadcast_object_list(metrics_results)
+ return metrics_results[0]
diff --git a/mmdet/evaluation/functional/class_names.py b/mmdet/evaluation/functional/class_names.py
index d0ea7094685..623a89cfdc0 100644
--- a/mmdet/evaluation/functional/class_names.py
+++ b/mmdet/evaluation/functional/class_names.py
@@ -485,6 +485,250 @@ def objects365v2_classes() -> list:
]
+def lvis_classes() -> list:
+ """Class names of LVIS."""
+ return [
+ 'aerosol_can', 'air_conditioner', 'airplane', 'alarm_clock', 'alcohol',
+ 'alligator', 'almond', 'ambulance', 'amplifier', 'anklet', 'antenna',
+ 'apple', 'applesauce', 'apricot', 'apron', 'aquarium',
+ 'arctic_(type_of_shoe)', 'armband', 'armchair', 'armoire', 'armor',
+ 'artichoke', 'trash_can', 'ashtray', 'asparagus', 'atomizer',
+ 'avocado', 'award', 'awning', 'ax', 'baboon', 'baby_buggy',
+ 'basketball_backboard', 'backpack', 'handbag', 'suitcase', 'bagel',
+ 'bagpipe', 'baguet', 'bait', 'ball', 'ballet_skirt', 'balloon',
+ 'bamboo', 'banana', 'Band_Aid', 'bandage', 'bandanna', 'banjo',
+ 'banner', 'barbell', 'barge', 'barrel', 'barrette', 'barrow',
+ 'baseball_base', 'baseball', 'baseball_bat', 'baseball_cap',
+ 'baseball_glove', 'basket', 'basketball', 'bass_horn', 'bat_(animal)',
+ 'bath_mat', 'bath_towel', 'bathrobe', 'bathtub', 'batter_(food)',
+ 'battery', 'beachball', 'bead', 'bean_curd', 'beanbag', 'beanie',
+ 'bear', 'bed', 'bedpan', 'bedspread', 'cow', 'beef_(food)', 'beeper',
+ 'beer_bottle', 'beer_can', 'beetle', 'bell', 'bell_pepper', 'belt',
+ 'belt_buckle', 'bench', 'beret', 'bib', 'Bible', 'bicycle', 'visor',
+ 'billboard', 'binder', 'binoculars', 'bird', 'birdfeeder', 'birdbath',
+ 'birdcage', 'birdhouse', 'birthday_cake', 'birthday_card',
+ 'pirate_flag', 'black_sheep', 'blackberry', 'blackboard', 'blanket',
+ 'blazer', 'blender', 'blimp', 'blinker', 'blouse', 'blueberry',
+ 'gameboard', 'boat', 'bob', 'bobbin', 'bobby_pin', 'boiled_egg',
+ 'bolo_tie', 'deadbolt', 'bolt', 'bonnet', 'book', 'bookcase',
+ 'booklet', 'bookmark', 'boom_microphone', 'boot', 'bottle',
+ 'bottle_opener', 'bouquet', 'bow_(weapon)', 'bow_(decorative_ribbons)',
+ 'bow-tie', 'bowl', 'pipe_bowl', 'bowler_hat', 'bowling_ball', 'box',
+ 'boxing_glove', 'suspenders', 'bracelet', 'brass_plaque', 'brassiere',
+ 'bread-bin', 'bread', 'breechcloth', 'bridal_gown', 'briefcase',
+ 'broccoli', 'broach', 'broom', 'brownie', 'brussels_sprouts',
+ 'bubble_gum', 'bucket', 'horse_buggy', 'bull', 'bulldog', 'bulldozer',
+ 'bullet_train', 'bulletin_board', 'bulletproof_vest', 'bullhorn',
+ 'bun', 'bunk_bed', 'buoy', 'burrito', 'bus_(vehicle)', 'business_card',
+ 'butter', 'butterfly', 'button', 'cab_(taxi)', 'cabana', 'cabin_car',
+ 'cabinet', 'locker', 'cake', 'calculator', 'calendar', 'calf',
+ 'camcorder', 'camel', 'camera', 'camera_lens', 'camper_(vehicle)',
+ 'can', 'can_opener', 'candle', 'candle_holder', 'candy_bar',
+ 'candy_cane', 'walking_cane', 'canister', 'canoe', 'cantaloup',
+ 'canteen', 'cap_(headwear)', 'bottle_cap', 'cape', 'cappuccino',
+ 'car_(automobile)', 'railcar_(part_of_a_train)', 'elevator_car',
+ 'car_battery', 'identity_card', 'card', 'cardigan', 'cargo_ship',
+ 'carnation', 'horse_carriage', 'carrot', 'tote_bag', 'cart', 'carton',
+ 'cash_register', 'casserole', 'cassette', 'cast', 'cat', 'cauliflower',
+ 'cayenne_(spice)', 'CD_player', 'celery', 'cellular_telephone',
+ 'chain_mail', 'chair', 'chaise_longue', 'chalice', 'chandelier',
+ 'chap', 'checkbook', 'checkerboard', 'cherry', 'chessboard',
+ 'chicken_(animal)', 'chickpea', 'chili_(vegetable)', 'chime',
+ 'chinaware', 'crisp_(potato_chip)', 'poker_chip', 'chocolate_bar',
+ 'chocolate_cake', 'chocolate_milk', 'chocolate_mousse', 'choker',
+ 'chopping_board', 'chopstick', 'Christmas_tree', 'slide', 'cider',
+ 'cigar_box', 'cigarette', 'cigarette_case', 'cistern', 'clarinet',
+ 'clasp', 'cleansing_agent', 'cleat_(for_securing_rope)', 'clementine',
+ 'clip', 'clipboard', 'clippers_(for_plants)', 'cloak', 'clock',
+ 'clock_tower', 'clothes_hamper', 'clothespin', 'clutch_bag', 'coaster',
+ 'coat', 'coat_hanger', 'coatrack', 'cock', 'cockroach',
+ 'cocoa_(beverage)', 'coconut', 'coffee_maker', 'coffee_table',
+ 'coffeepot', 'coil', 'coin', 'colander', 'coleslaw',
+ 'coloring_material', 'combination_lock', 'pacifier', 'comic_book',
+ 'compass', 'computer_keyboard', 'condiment', 'cone', 'control',
+ 'convertible_(automobile)', 'sofa_bed', 'cooker', 'cookie',
+ 'cooking_utensil', 'cooler_(for_food)', 'cork_(bottle_plug)',
+ 'corkboard', 'corkscrew', 'edible_corn', 'cornbread', 'cornet',
+ 'cornice', 'cornmeal', 'corset', 'costume', 'cougar', 'coverall',
+ 'cowbell', 'cowboy_hat', 'crab_(animal)', 'crabmeat', 'cracker',
+ 'crape', 'crate', 'crayon', 'cream_pitcher', 'crescent_roll', 'crib',
+ 'crock_pot', 'crossbar', 'crouton', 'crow', 'crowbar', 'crown',
+ 'crucifix', 'cruise_ship', 'police_cruiser', 'crumb', 'crutch',
+ 'cub_(animal)', 'cube', 'cucumber', 'cufflink', 'cup', 'trophy_cup',
+ 'cupboard', 'cupcake', 'hair_curler', 'curling_iron', 'curtain',
+ 'cushion', 'cylinder', 'cymbal', 'dagger', 'dalmatian', 'dartboard',
+ 'date_(fruit)', 'deck_chair', 'deer', 'dental_floss', 'desk',
+ 'detergent', 'diaper', 'diary', 'die', 'dinghy', 'dining_table', 'tux',
+ 'dish', 'dish_antenna', 'dishrag', 'dishtowel', 'dishwasher',
+ 'dishwasher_detergent', 'dispenser', 'diving_board', 'Dixie_cup',
+ 'dog', 'dog_collar', 'doll', 'dollar', 'dollhouse', 'dolphin',
+ 'domestic_ass', 'doorknob', 'doormat', 'doughnut', 'dove', 'dragonfly',
+ 'drawer', 'underdrawers', 'dress', 'dress_hat', 'dress_suit',
+ 'dresser', 'drill', 'drone', 'dropper', 'drum_(musical_instrument)',
+ 'drumstick', 'duck', 'duckling', 'duct_tape', 'duffel_bag', 'dumbbell',
+ 'dumpster', 'dustpan', 'eagle', 'earphone', 'earplug', 'earring',
+ 'easel', 'eclair', 'eel', 'egg', 'egg_roll', 'egg_yolk', 'eggbeater',
+ 'eggplant', 'electric_chair', 'refrigerator', 'elephant', 'elk',
+ 'envelope', 'eraser', 'escargot', 'eyepatch', 'falcon', 'fan',
+ 'faucet', 'fedora', 'ferret', 'Ferris_wheel', 'ferry', 'fig_(fruit)',
+ 'fighter_jet', 'figurine', 'file_cabinet', 'file_(tool)', 'fire_alarm',
+ 'fire_engine', 'fire_extinguisher', 'fire_hose', 'fireplace',
+ 'fireplug', 'first-aid_kit', 'fish', 'fish_(food)', 'fishbowl',
+ 'fishing_rod', 'flag', 'flagpole', 'flamingo', 'flannel', 'flap',
+ 'flash', 'flashlight', 'fleece', 'flip-flop_(sandal)',
+ 'flipper_(footwear)', 'flower_arrangement', 'flute_glass', 'foal',
+ 'folding_chair', 'food_processor', 'football_(American)',
+ 'football_helmet', 'footstool', 'fork', 'forklift', 'freight_car',
+ 'French_toast', 'freshener', 'frisbee', 'frog', 'fruit_juice',
+ 'frying_pan', 'fudge', 'funnel', 'futon', 'gag', 'garbage',
+ 'garbage_truck', 'garden_hose', 'gargle', 'gargoyle', 'garlic',
+ 'gasmask', 'gazelle', 'gelatin', 'gemstone', 'generator',
+ 'giant_panda', 'gift_wrap', 'ginger', 'giraffe', 'cincture',
+ 'glass_(drink_container)', 'globe', 'glove', 'goat', 'goggles',
+ 'goldfish', 'golf_club', 'golfcart', 'gondola_(boat)', 'goose',
+ 'gorilla', 'gourd', 'grape', 'grater', 'gravestone', 'gravy_boat',
+ 'green_bean', 'green_onion', 'griddle', 'grill', 'grits', 'grizzly',
+ 'grocery_bag', 'guitar', 'gull', 'gun', 'hairbrush', 'hairnet',
+ 'hairpin', 'halter_top', 'ham', 'hamburger', 'hammer', 'hammock',
+ 'hamper', 'hamster', 'hair_dryer', 'hand_glass', 'hand_towel',
+ 'handcart', 'handcuff', 'handkerchief', 'handle', 'handsaw',
+ 'hardback_book', 'harmonium', 'hat', 'hatbox', 'veil', 'headband',
+ 'headboard', 'headlight', 'headscarf', 'headset',
+ 'headstall_(for_horses)', 'heart', 'heater', 'helicopter', 'helmet',
+ 'heron', 'highchair', 'hinge', 'hippopotamus', 'hockey_stick', 'hog',
+ 'home_plate_(baseball)', 'honey', 'fume_hood', 'hook', 'hookah',
+ 'hornet', 'horse', 'hose', 'hot-air_balloon', 'hotplate', 'hot_sauce',
+ 'hourglass', 'houseboat', 'hummingbird', 'hummus', 'polar_bear',
+ 'icecream', 'popsicle', 'ice_maker', 'ice_pack', 'ice_skate',
+ 'igniter', 'inhaler', 'iPod', 'iron_(for_clothing)', 'ironing_board',
+ 'jacket', 'jam', 'jar', 'jean', 'jeep', 'jelly_bean', 'jersey',
+ 'jet_plane', 'jewel', 'jewelry', 'joystick', 'jumpsuit', 'kayak',
+ 'keg', 'kennel', 'kettle', 'key', 'keycard', 'kilt', 'kimono',
+ 'kitchen_sink', 'kitchen_table', 'kite', 'kitten', 'kiwi_fruit',
+ 'knee_pad', 'knife', 'knitting_needle', 'knob', 'knocker_(on_a_door)',
+ 'koala', 'lab_coat', 'ladder', 'ladle', 'ladybug', 'lamb_(animal)',
+ 'lamb-chop', 'lamp', 'lamppost', 'lampshade', 'lantern', 'lanyard',
+ 'laptop_computer', 'lasagna', 'latch', 'lawn_mower', 'leather',
+ 'legging_(clothing)', 'Lego', 'legume', 'lemon', 'lemonade', 'lettuce',
+ 'license_plate', 'life_buoy', 'life_jacket', 'lightbulb',
+ 'lightning_rod', 'lime', 'limousine', 'lion', 'lip_balm', 'liquor',
+ 'lizard', 'log', 'lollipop', 'speaker_(stereo_equipment)', 'loveseat',
+ 'machine_gun', 'magazine', 'magnet', 'mail_slot', 'mailbox_(at_home)',
+ 'mallard', 'mallet', 'mammoth', 'manatee', 'mandarin_orange', 'manger',
+ 'manhole', 'map', 'marker', 'martini', 'mascot', 'mashed_potato',
+ 'masher', 'mask', 'mast', 'mat_(gym_equipment)', 'matchbox',
+ 'mattress', 'measuring_cup', 'measuring_stick', 'meatball', 'medicine',
+ 'melon', 'microphone', 'microscope', 'microwave_oven', 'milestone',
+ 'milk', 'milk_can', 'milkshake', 'minivan', 'mint_candy', 'mirror',
+ 'mitten', 'mixer_(kitchen_tool)', 'money',
+ 'monitor_(computer_equipment) computer_monitor', 'monkey', 'motor',
+ 'motor_scooter', 'motor_vehicle', 'motorcycle', 'mound_(baseball)',
+ 'mouse_(computer_equipment)', 'mousepad', 'muffin', 'mug', 'mushroom',
+ 'music_stool', 'musical_instrument', 'nailfile', 'napkin',
+ 'neckerchief', 'necklace', 'necktie', 'needle', 'nest', 'newspaper',
+ 'newsstand', 'nightshirt', 'nosebag_(for_animals)',
+ 'noseband_(for_animals)', 'notebook', 'notepad', 'nut', 'nutcracker',
+ 'oar', 'octopus_(food)', 'octopus_(animal)', 'oil_lamp', 'olive_oil',
+ 'omelet', 'onion', 'orange_(fruit)', 'orange_juice', 'ostrich',
+ 'ottoman', 'oven', 'overalls_(clothing)', 'owl', 'packet', 'inkpad',
+ 'pad', 'paddle', 'padlock', 'paintbrush', 'painting', 'pajamas',
+ 'palette', 'pan_(for_cooking)', 'pan_(metal_container)', 'pancake',
+ 'pantyhose', 'papaya', 'paper_plate', 'paper_towel', 'paperback_book',
+ 'paperweight', 'parachute', 'parakeet', 'parasail_(sports)', 'parasol',
+ 'parchment', 'parka', 'parking_meter', 'parrot',
+ 'passenger_car_(part_of_a_train)', 'passenger_ship', 'passport',
+ 'pastry', 'patty_(food)', 'pea_(food)', 'peach', 'peanut_butter',
+ 'pear', 'peeler_(tool_for_fruit_and_vegetables)', 'wooden_leg',
+ 'pegboard', 'pelican', 'pen', 'pencil', 'pencil_box',
+ 'pencil_sharpener', 'pendulum', 'penguin', 'pennant', 'penny_(coin)',
+ 'pepper', 'pepper_mill', 'perfume', 'persimmon', 'person', 'pet',
+ 'pew_(church_bench)', 'phonebook', 'phonograph_record', 'piano',
+ 'pickle', 'pickup_truck', 'pie', 'pigeon', 'piggy_bank', 'pillow',
+ 'pin_(non_jewelry)', 'pineapple', 'pinecone', 'ping-pong_ball',
+ 'pinwheel', 'tobacco_pipe', 'pipe', 'pistol', 'pita_(bread)',
+ 'pitcher_(vessel_for_liquid)', 'pitchfork', 'pizza', 'place_mat',
+ 'plate', 'platter', 'playpen', 'pliers', 'plow_(farm_equipment)',
+ 'plume', 'pocket_watch', 'pocketknife', 'poker_(fire_stirring_tool)',
+ 'pole', 'polo_shirt', 'poncho', 'pony', 'pool_table', 'pop_(soda)',
+ 'postbox_(public)', 'postcard', 'poster', 'pot', 'flowerpot', 'potato',
+ 'potholder', 'pottery', 'pouch', 'power_shovel', 'prawn', 'pretzel',
+ 'printer', 'projectile_(weapon)', 'projector', 'propeller', 'prune',
+ 'pudding', 'puffer_(fish)', 'puffin', 'pug-dog', 'pumpkin', 'puncher',
+ 'puppet', 'puppy', 'quesadilla', 'quiche', 'quilt', 'rabbit',
+ 'race_car', 'racket', 'radar', 'radiator', 'radio_receiver', 'radish',
+ 'raft', 'rag_doll', 'raincoat', 'ram_(animal)', 'raspberry', 'rat',
+ 'razorblade', 'reamer_(juicer)', 'rearview_mirror', 'receipt',
+ 'recliner', 'record_player', 'reflector', 'remote_control',
+ 'rhinoceros', 'rib_(food)', 'rifle', 'ring', 'river_boat', 'road_map',
+ 'robe', 'rocking_chair', 'rodent', 'roller_skate', 'Rollerblade',
+ 'rolling_pin', 'root_beer', 'router_(computer_equipment)',
+ 'rubber_band', 'runner_(carpet)', 'plastic_bag',
+ 'saddle_(on_an_animal)', 'saddle_blanket', 'saddlebag', 'safety_pin',
+ 'sail', 'salad', 'salad_plate', 'salami', 'salmon_(fish)',
+ 'salmon_(food)', 'salsa', 'saltshaker', 'sandal_(type_of_shoe)',
+ 'sandwich', 'satchel', 'saucepan', 'saucer', 'sausage', 'sawhorse',
+ 'saxophone', 'scale_(measuring_instrument)', 'scarecrow', 'scarf',
+ 'school_bus', 'scissors', 'scoreboard', 'scraper', 'screwdriver',
+ 'scrubbing_brush', 'sculpture', 'seabird', 'seahorse', 'seaplane',
+ 'seashell', 'sewing_machine', 'shaker', 'shampoo', 'shark',
+ 'sharpener', 'Sharpie', 'shaver_(electric)', 'shaving_cream', 'shawl',
+ 'shears', 'sheep', 'shepherd_dog', 'sherbert', 'shield', 'shirt',
+ 'shoe', 'shopping_bag', 'shopping_cart', 'short_pants', 'shot_glass',
+ 'shoulder_bag', 'shovel', 'shower_head', 'shower_cap',
+ 'shower_curtain', 'shredder_(for_paper)', 'signboard', 'silo', 'sink',
+ 'skateboard', 'skewer', 'ski', 'ski_boot', 'ski_parka', 'ski_pole',
+ 'skirt', 'skullcap', 'sled', 'sleeping_bag', 'sling_(bandage)',
+ 'slipper_(footwear)', 'smoothie', 'snake', 'snowboard', 'snowman',
+ 'snowmobile', 'soap', 'soccer_ball', 'sock', 'sofa', 'softball',
+ 'solar_array', 'sombrero', 'soup', 'soup_bowl', 'soupspoon',
+ 'sour_cream', 'soya_milk', 'space_shuttle', 'sparkler_(fireworks)',
+ 'spatula', 'spear', 'spectacles', 'spice_rack', 'spider', 'crawfish',
+ 'sponge', 'spoon', 'sportswear', 'spotlight', 'squid_(food)',
+ 'squirrel', 'stagecoach', 'stapler_(stapling_machine)', 'starfish',
+ 'statue_(sculpture)', 'steak_(food)', 'steak_knife', 'steering_wheel',
+ 'stepladder', 'step_stool', 'stereo_(sound_system)', 'stew', 'stirrer',
+ 'stirrup', 'stool', 'stop_sign', 'brake_light', 'stove', 'strainer',
+ 'strap', 'straw_(for_drinking)', 'strawberry', 'street_sign',
+ 'streetlight', 'string_cheese', 'stylus', 'subwoofer', 'sugar_bowl',
+ 'sugarcane_(plant)', 'suit_(clothing)', 'sunflower', 'sunglasses',
+ 'sunhat', 'surfboard', 'sushi', 'mop', 'sweat_pants', 'sweatband',
+ 'sweater', 'sweatshirt', 'sweet_potato', 'swimsuit', 'sword',
+ 'syringe', 'Tabasco_sauce', 'table-tennis_table', 'table',
+ 'table_lamp', 'tablecloth', 'tachometer', 'taco', 'tag', 'taillight',
+ 'tambourine', 'army_tank', 'tank_(storage_vessel)',
+ 'tank_top_(clothing)', 'tape_(sticky_cloth_or_paper)', 'tape_measure',
+ 'tapestry', 'tarp', 'tartan', 'tassel', 'tea_bag', 'teacup',
+ 'teakettle', 'teapot', 'teddy_bear', 'telephone', 'telephone_booth',
+ 'telephone_pole', 'telephoto_lens', 'television_camera',
+ 'television_set', 'tennis_ball', 'tennis_racket', 'tequila',
+ 'thermometer', 'thermos_bottle', 'thermostat', 'thimble', 'thread',
+ 'thumbtack', 'tiara', 'tiger', 'tights_(clothing)', 'timer', 'tinfoil',
+ 'tinsel', 'tissue_paper', 'toast_(food)', 'toaster', 'toaster_oven',
+ 'toilet', 'toilet_tissue', 'tomato', 'tongs', 'toolbox', 'toothbrush',
+ 'toothpaste', 'toothpick', 'cover', 'tortilla', 'tow_truck', 'towel',
+ 'towel_rack', 'toy', 'tractor_(farm_equipment)', 'traffic_light',
+ 'dirt_bike', 'trailer_truck', 'train_(railroad_vehicle)', 'trampoline',
+ 'tray', 'trench_coat', 'triangle_(musical_instrument)', 'tricycle',
+ 'tripod', 'trousers', 'truck', 'truffle_(chocolate)', 'trunk', 'vat',
+ 'turban', 'turkey_(food)', 'turnip', 'turtle', 'turtleneck_(clothing)',
+ 'typewriter', 'umbrella', 'underwear', 'unicycle', 'urinal', 'urn',
+ 'vacuum_cleaner', 'vase', 'vending_machine', 'vent', 'vest',
+ 'videotape', 'vinegar', 'violin', 'vodka', 'volleyball', 'vulture',
+ 'waffle', 'waffle_iron', 'wagon', 'wagon_wheel', 'walking_stick',
+ 'wall_clock', 'wall_socket', 'wallet', 'walrus', 'wardrobe',
+ 'washbasin', 'automatic_washer', 'watch', 'water_bottle',
+ 'water_cooler', 'water_faucet', 'water_heater', 'water_jug',
+ 'water_gun', 'water_scooter', 'water_ski', 'water_tower',
+ 'watering_can', 'watermelon', 'weathervane', 'webcam', 'wedding_cake',
+ 'wedding_ring', 'wet_suit', 'wheel', 'wheelchair', 'whipped_cream',
+ 'whistle', 'wig', 'wind_chime', 'windmill', 'window_box_(for_plants)',
+ 'windshield_wiper', 'windsock', 'wine_bottle', 'wine_bucket',
+ 'wineglass', 'blinder_(for_horses)', 'wok', 'wolf', 'wooden_spoon',
+ 'wreath', 'wrench', 'wristband', 'wristlet', 'yacht', 'yogurt',
+ 'yoke_(animal_equipment)', 'zebra', 'zucchini'
+ ]
+
+
dataset_aliases = {
'voc': ['voc', 'pascal_voc', 'voc07', 'voc12'],
'imagenet_det': ['det', 'imagenet_det', 'ilsvrc_det'],
@@ -496,7 +740,8 @@ def objects365v2_classes() -> list:
'oid_challenge': ['oid_challenge', 'openimages_challenge'],
'oid_v6': ['oid_v6', 'openimages_v6'],
'objects365v1': ['objects365v1', 'obj365v1'],
- 'objects365v2': ['objects365v2', 'obj365v2']
+ 'objects365v2': ['objects365v2', 'obj365v2'],
+ 'lvis': ['lvis', 'lvis_v1'],
}
diff --git a/mmdet/evaluation/metrics/__init__.py b/mmdet/evaluation/metrics/__init__.py
index e1ec0e46250..8ad040cf6ff 100644
--- a/mmdet/evaluation/metrics/__init__.py
+++ b/mmdet/evaluation/metrics/__init__.py
@@ -7,11 +7,17 @@
from .coco_panoptic_metric import CocoPanopticMetric
from .coco_video_metric import CocoVideoMetric
from .crowdhuman_metric import CrowdHumanMetric
+from .dod_metric import DODCocoMetric
from .dump_det_results import DumpDetResults
+from .dump_odvg_results import DumpODVGResults
from .dump_proposals_metric import DumpProposals
+from .flickr30k_metric import Flickr30kMetric
+from .grefcoco_metric import gRefCOCOMetric
from .lvis_metric import LVISMetric
from .mot_challenge_metric import MOTChallengeMetric
from .openimages_metric import OpenImagesMetric
+from .ov_coco_metric import OVCocoMetric
+from .refexp_metric import RefExpMetric
from .refseg_metric import RefSegMetric
from .reid_metric import ReIDMetrics
from .semseg_metric import SemSegMetric
@@ -23,5 +29,7 @@
'VOCMetric', 'LVISMetric', 'CrowdHumanMetric', 'DumpProposals',
'CocoOccludedSeparatedMetric', 'DumpDetResults', 'BaseVideoMetric',
'MOTChallengeMetric', 'CocoVideoMetric', 'ReIDMetrics', 'YouTubeVISMetric',
- 'COCOCaptionMetric', 'SemSegMetric', 'RefSegMetric'
+ 'COCOCaptionMetric', 'SemSegMetric', 'RefSegMetric', 'RefExpMetric',
+ 'gRefCOCOMetric', 'DODCocoMetric', 'DumpODVGResults', 'Flickr30kMetric',
+ 'OVCocoMetric'
]
diff --git a/mmdet/evaluation/metrics/coco_panoptic_metric.py b/mmdet/evaluation/metrics/coco_panoptic_metric.py
index 1554c0908d1..f86be916f9c 100644
--- a/mmdet/evaluation/metrics/coco_panoptic_metric.py
+++ b/mmdet/evaluation/metrics/coco_panoptic_metric.py
@@ -190,7 +190,7 @@ def gt_to_coco_json(self, gt_dicts: Sequence[dict],
}
segments_info.append(new_segment_info)
- segm_file = image_info['file_name'].replace('jpg', 'png')
+ segm_file = image_info['file_name'].replace('.jpg', '.png')
annotation = dict(
image_id=img_id,
segments_info=segments_info,
@@ -330,7 +330,7 @@ def _compute_batch_pq_stats(self, data_samples: Sequence[dict]):
# parse pred
img_id = data_sample['img_id']
segm_file = osp.basename(data_sample['img_path']).replace(
- 'jpg', 'png')
+ '.jpg', '.png')
result = self._parse_predictions(
pred=data_sample,
img_id=img_id,
@@ -397,7 +397,7 @@ def _process_gt_and_predictions(self, data_samples: Sequence[dict]):
# parse pred
img_id = data_sample['img_id']
segm_file = osp.basename(data_sample['img_path']).replace(
- 'jpg', 'png')
+ '.jpg', '.png')
result = self._parse_predictions(
pred=data_sample, img_id=img_id, segm_file=segm_file)
diff --git a/mmdet/evaluation/metrics/dod_metric.py b/mmdet/evaluation/metrics/dod_metric.py
new file mode 100644
index 00000000000..b47d07219da
--- /dev/null
+++ b/mmdet/evaluation/metrics/dod_metric.py
@@ -0,0 +1,169 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from collections import defaultdict
+from typing import List, Optional, Sequence
+
+import numpy as np
+from mmengine.evaluator import BaseMetric
+from mmengine.fileio import get_local_path
+from mmengine.logging import MMLogger
+
+from mmdet.datasets.api_wrappers import COCO, COCOeval
+from mmdet.registry import METRICS
+
+
+@METRICS.register_module()
+class DODCocoMetric(BaseMetric):
+
+ default_prefix: Optional[str] = 'dod'
+
+ def __init__(self,
+ ann_file: Optional[str] = None,
+ collect_device: str = 'cpu',
+ outfile_prefix: Optional[str] = None,
+ backend_args: dict = None,
+ prefix: Optional[str] = None) -> None:
+ super().__init__(collect_device=collect_device, prefix=prefix)
+ self.outfile_prefix = outfile_prefix
+ with get_local_path(ann_file, backend_args=backend_args) as local_path:
+ self._coco_api = COCO(local_path)
+
+ def process(self, data_batch: dict, data_samples: Sequence[dict]) -> None:
+ for data_sample in data_samples:
+ result = dict()
+ pred = data_sample['pred_instances']
+ result['img_id'] = data_sample['img_id']
+ result['bboxes'] = pred['bboxes'].cpu().numpy()
+ result['scores'] = pred['scores'].cpu().numpy()
+
+ result['labels'] = pred['labels'].cpu().numpy()
+ result['labels'] = data_sample['sent_ids'][result['labels']]
+ self.results.append(result)
+
+ def xyxy2xywh(self, bbox: np.ndarray) -> list:
+ """Convert ``xyxy`` style bounding boxes to ``xywh`` style for COCO
+ evaluation.
+
+ Args:
+ bbox (numpy.ndarray): The bounding boxes, shape (4, ), in
+ ``xyxy`` order.
+
+ Returns:
+ list[float]: The converted bounding boxes, in ``xywh`` order.
+ """
+
+ _bbox: List = bbox.tolist()
+ return [
+ _bbox[0],
+ _bbox[1],
+ _bbox[2] - _bbox[0],
+ _bbox[3] - _bbox[1],
+ ]
+
+ def results2json(self, results: Sequence[dict]) -> list:
+ """Dump the detection results to a COCO style json file.
+
+ There are 3 types of results: proposals, bbox predictions, mask
+ predictions, and they have different data types. This method will
+ automatically recognize the type, and dump them to json files.
+
+ Args:
+ results (Sequence[dict]): Testing results of the
+ dataset.
+
+ Returns:
+ dict: Possible keys are "bbox", "segm", "proposal", and
+ values are corresponding filenames.
+ """
+ bbox_json_results = []
+ for idx, result in enumerate(results):
+ image_id = result.get('img_id', idx)
+ labels = result['labels']
+ bboxes = result['bboxes']
+ scores = result['scores']
+ for i, label in enumerate(labels):
+ data = dict()
+ data['image_id'] = image_id
+ data['bbox'] = self.xyxy2xywh(bboxes[i])
+ data['score'] = float(scores[i])
+ data['category_id'] = label
+ bbox_json_results.append(data)
+ return bbox_json_results
+
+ def compute_metrics(self, results: list) -> dict:
+ logger: MMLogger = MMLogger.get_current_instance()
+ result_files = self.results2json(results)
+ d3_res = self._coco_api.loadRes(result_files)
+ cocoEval = COCOeval(self._coco_api, d3_res, 'bbox')
+ cocoEval.evaluate()
+ cocoEval.accumulate()
+ cocoEval.summarize()
+
+ aps = cocoEval.eval['precision'][:, :, :, 0, -1]
+ category_ids = self._coco_api.getCatIds()
+ category_names = [
+ cat['name'] for cat in self._coco_api.loadCats(category_ids)
+ ]
+
+ aps_lens = defaultdict(list)
+ counter_lens = defaultdict(int)
+ for i in range(len(category_names)):
+ ap = aps[:, :, i]
+ ap_value = ap[ap > -1].mean()
+ if not np.isnan(ap_value):
+ len_ref = len(category_names[i].split(' '))
+ aps_lens[len_ref].append(ap_value)
+ counter_lens[len_ref] += 1
+
+ ap_sum_short = sum([sum(aps_lens[i]) for i in range(0, 4)])
+ ap_sum_mid = sum([sum(aps_lens[i]) for i in range(4, 7)])
+ ap_sum_long = sum([sum(aps_lens[i]) for i in range(7, 10)])
+ ap_sum_very_long = sum([
+ sum(aps_lens[i]) for i in range(10,
+ max(counter_lens.keys()) + 1)
+ ])
+ c_sum_short = sum([counter_lens[i] for i in range(1, 4)])
+ c_sum_mid = sum([counter_lens[i] for i in range(4, 7)])
+ c_sum_long = sum([counter_lens[i] for i in range(7, 10)])
+ c_sum_very_long = sum(
+ [counter_lens[i] for i in range(10,
+ max(counter_lens.keys()) + 1)])
+ map_short = ap_sum_short / c_sum_short
+ map_mid = ap_sum_mid / c_sum_mid
+ map_long = ap_sum_long / c_sum_long
+ map_very_long = ap_sum_very_long / c_sum_very_long
+
+ coco_metric_names = {
+ 'mAP': 0,
+ 'mAP_50': 1,
+ 'mAP_75': 2,
+ 'mAP_s': 3,
+ 'mAP_m': 4,
+ 'mAP_l': 5,
+ 'AR@100': 6,
+ 'AR@300': 7,
+ 'AR@1000': 8,
+ 'AR_s@1000': 9,
+ 'AR_m@1000': 10,
+ 'AR_l@1000': 11
+ }
+ metric_items = ['mAP', 'mAP_50', 'mAP_75', 'mAP_s', 'mAP_m', 'mAP_l']
+
+ eval_results = {}
+ for metric_item in metric_items:
+ key = f'{metric_item}'
+ val = cocoEval.stats[coco_metric_names[metric_item]]
+ eval_results[key] = float(f'{round(val, 3)}')
+
+ ap = cocoEval.stats[:6]
+ logger.info(f'mAP_copypaste: {ap[0]:.3f} '
+ f'{ap[1]:.3f} {ap[2]:.3f} {ap[3]:.3f} '
+ f'{ap[4]:.3f} {ap[5]:.3f}')
+
+ logger.info(f'mAP over reference length: short - {map_short:.4f}, '
+ f'mid - {map_mid:.4f}, long - {map_long:.4f}, '
+ f'very long - {map_very_long:.4f}')
+ eval_results['mAP_short'] = float(f'{round(map_short, 3)}')
+ eval_results['mAP_mid'] = float(f'{round(map_mid, 3)}')
+ eval_results['mAP_long'] = float(f'{round(map_long, 3)}')
+ eval_results['mAP_very_long'] = float(f'{round(map_very_long, 3)}')
+ return eval_results
diff --git a/mmdet/evaluation/metrics/dump_odvg_results.py b/mmdet/evaluation/metrics/dump_odvg_results.py
new file mode 100644
index 00000000000..a1446b05380
--- /dev/null
+++ b/mmdet/evaluation/metrics/dump_odvg_results.py
@@ -0,0 +1,138 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from typing import Any, Optional, Sequence
+
+from mmcv.ops import batched_nms
+from mmengine.evaluator import BaseMetric
+from mmengine.logging import print_log
+
+from mmdet.registry import METRICS
+
+try:
+ import jsonlines
+except ImportError:
+ jsonlines = None
+
+
+@METRICS.register_module()
+class DumpODVGResults(BaseMetric):
+ default_prefix: Optional[str] = 'pl_odvg'
+
+ def __init__(self,
+ outfile_path,
+ img_prefix: str,
+ score_thr: float = 0.1,
+ collect_device: str = 'cpu',
+ nms_thr: float = 0.5,
+ prefix: Optional[str] = None) -> None:
+ super().__init__(collect_device=collect_device, prefix=prefix)
+ self.outfile_path = outfile_path
+ self.score_thr = score_thr
+ self.img_prefix = img_prefix
+ self.nms_thr = nms_thr
+
+ if jsonlines is None:
+ raise ImportError('Please run "pip install jsonlines" to install '
+ 'this package.')
+
+ def process(self, data_batch: Any, data_samples: Sequence[dict]) -> None:
+ for data_sample in data_samples:
+ result = {}
+
+ filename = data_sample['img_path']
+ filename = filename.replace(self.img_prefix, '')
+ if filename.startswith('/'):
+ filename = filename[1:]
+ result['filename'] = filename
+
+ height = data_sample['ori_shape'][0]
+ width = data_sample['ori_shape'][1]
+ result['height'] = height
+ result['width'] = width
+
+ pred_instances = data_sample['pred_instances']
+
+ bboxes = pred_instances['bboxes'].cpu()
+ scores = pred_instances['scores'].cpu()
+ labels = pred_instances['labels'].cpu()
+
+ bboxes = bboxes[scores > self.score_thr]
+ labels = labels[scores > self.score_thr]
+ scores = scores[scores > self.score_thr]
+
+ if 'tokens_positive' in data_sample:
+ task = 'vg'
+ else:
+ task = 'od'
+
+ if task == 'od':
+ classes_name = data_sample['text']
+ result['detection'] = {}
+
+ if len(bboxes) > 0:
+ det_bboxes, keep = batched_nms(
+ bboxes, scores, labels,
+ dict(type='nms', iou_threshold=self.nms_thr))
+ _scores = det_bboxes[:, -1]
+ _bboxes = det_bboxes[:, :-1]
+ _labels = labels[keep]
+
+ instances = []
+ _bboxes = _bboxes.numpy().tolist()
+ _scores = _scores.numpy().tolist()
+ _labels = _labels.numpy().tolist()
+ for bbox, score, label in zip(_bboxes, _scores, _labels):
+ round_bbox = [round(b, 2) for b in bbox]
+ round_score = round(score, 2)
+ instances.append({
+ 'bbox': round_bbox,
+ 'score': round_score,
+ 'label': label,
+ 'category': classes_name[label]
+ })
+ result['detection']['instances'] = instances
+ else:
+ result['detection']['instances'] = []
+ self.results.append(result)
+ else:
+ caption = data_sample['text']
+ result['grounding'] = {}
+ result['grounding']['caption'] = caption
+
+ tokens_positive = data_sample['tokens_positive']
+
+ region_list = []
+ for label, positive in enumerate(tokens_positive):
+ phrase = [caption[pos[0]:pos[1]] for pos in positive]
+
+ _bboxes = bboxes[labels == label]
+ _scores = scores[labels == label]
+ det_bboxes, _ = batched_nms(
+ _bboxes,
+ _scores,
+ None,
+ dict(type='nms', iou_threshold=self.nms_thr),
+ class_agnostic=True)
+ _scores = det_bboxes[:, -1].numpy().tolist()
+ _bboxes = det_bboxes[:, :-1].numpy().tolist()
+
+ round_bboxes = []
+ for bbox in _bboxes:
+ round_bboxes.append([round(b, 2) for b in bbox])
+ _scores = [[round(s, 2) for s in _scores]]
+ region = {
+ 'phrase': phrase,
+ 'bbox': round_bboxes,
+ 'score': _scores,
+ 'tokens_positive': positive
+ }
+ region_list.append(region)
+ result['grounding']['regions'] = region_list
+ self.results.append(result)
+
+ def compute_metrics(self, results: list) -> dict:
+ with jsonlines.open(self.outfile_path, mode='w') as writer:
+ writer.write_all(results)
+ print_log(
+ f'Results has been saved to {self.outfile_path}.',
+ logger='current')
+ return {}
diff --git a/mmdet/evaluation/metrics/flickr30k_metric.py b/mmdet/evaluation/metrics/flickr30k_metric.py
new file mode 100644
index 00000000000..f8b64bfda46
--- /dev/null
+++ b/mmdet/evaluation/metrics/flickr30k_metric.py
@@ -0,0 +1,165 @@
+# Copyright (c) OpenMMLab. All rights reserved
+from collections import defaultdict
+from typing import Dict, List, Optional, Sequence
+
+import numpy as np
+from mmengine.evaluator import BaseMetric
+from mmengine.logging import MMLogger
+
+from mmdet.registry import METRICS
+from ..functional import bbox_overlaps
+
+
+class RecallTracker:
+ """Utility class to track recall@k for various k, split by categories."""
+
+ def __init__(self, topk: Sequence[int]):
+ """
+ Parameters:
+ - topk : tuple of ints corresponding to the recalls being
+ tracked (eg, recall@1, recall@10, ...)
+ """
+
+ self.total_byk_bycat: Dict[int, Dict[str, int]] = {
+ k: defaultdict(int)
+ for k in topk
+ }
+ self.positives_byk_bycat: Dict[int, Dict[str, int]] = {
+ k: defaultdict(int)
+ for k in topk
+ }
+
+ def add_positive(self, k: int, category: str):
+ """Log a positive hit @k for given category."""
+ if k not in self.total_byk_bycat:
+ raise RuntimeError(f'{k} is not a valid recall threshold')
+ self.total_byk_bycat[k][category] += 1
+ self.positives_byk_bycat[k][category] += 1
+
+ def add_negative(self, k: int, category: str):
+ """Log a negative hit @k for given category."""
+ if k not in self.total_byk_bycat:
+ raise RuntimeError(f'{k} is not a valid recall threshold')
+ self.total_byk_bycat[k][category] += 1
+
+ def report(self) -> Dict[str, Dict[str, float]]:
+ """Return a condensed report of the results as a dict of dict.
+
+ report[k][cat] is the recall@k for the given category
+ """
+ report: Dict[str, Dict[str, float]] = {}
+ for k in self.total_byk_bycat:
+ assert k in self.positives_byk_bycat
+ report[str(k)] = {
+ cat:
+ self.positives_byk_bycat[k][cat] / self.total_byk_bycat[k][cat]
+ for cat in self.total_byk_bycat[k]
+ }
+ return report
+
+
+@METRICS.register_module()
+class Flickr30kMetric(BaseMetric):
+ """Phrase Grounding Metric."""
+
+ def __init__(
+ self,
+ topk: Sequence[int] = (1, 5, 10, -1),
+ iou_thrs: float = 0.5,
+ merge_boxes: bool = False,
+ collect_device: str = 'cpu',
+ prefix: Optional[str] = None,
+ ) -> None:
+ super().__init__(collect_device=collect_device, prefix=prefix)
+
+ self.iou_thrs = iou_thrs
+ self.topk = topk
+ self.merge = merge_boxes
+
+ def merge_boxes(self, boxes: List[List[int]]) -> List[List[int]]:
+ """Return the boxes corresponding to the smallest enclosing box
+ containing all the provided boxes The boxes are expected in [x1, y1,
+ x2, y2] format."""
+ if len(boxes) == 1:
+ return boxes
+
+ np_boxes = np.asarray(boxes)
+
+ return [[
+ np.boxes[:, 0].min(), np_boxes[:, 1].min(), np_boxes[:, 2].max(),
+ np_boxes[:, 3].max()
+ ]]
+
+ def process(self, data_batch: dict, data_samples: Sequence[dict]) -> None:
+ """Process one batch of data samples and predictions.
+
+ The processed
+ results should be stored in ``self.results``, which will be used to
+ compute the metrics when all batches have been processed.
+ Args:
+ data_batch (dict): A batch of data from the dataloader.
+ data_samples (Sequence[dict]): A batch of data samples that
+ contain annotations and predictions.
+ """
+ for data_sample in data_samples:
+ pred = data_sample['pred_instances']
+ gt = data_sample['gt_instances']['bboxes']
+ gt_label = data_sample['phrase_ids']
+ phrases = data_sample['phrases']
+ assert len(gt) == len(gt_label)
+
+ self.results.append((pred, gt, gt_label, phrases))
+
+ def compute_metrics(self, results: list) -> Dict[str, float]:
+ """Compute the metrics from processed results.
+
+ Args:
+ results (list): The processed results of each batch.
+ Returns:
+ Dict[str, float]: The computed metrics. The keys are the names of
+ the metrics, and the values are corresponding results.
+ """
+ logger: MMLogger = MMLogger.get_current_instance()
+
+ pred_list, gt_list, gt_label_list, phrase_list = zip(*results)
+
+ recall_tracker = RecallTracker(self.topk)
+
+ for pred, gt_boxes, gt_labels, phrases in zip(pred_list, gt_list,
+ gt_label_list,
+ phrase_list):
+ pred_boxes = pred['bboxes'].cpu().numpy()
+ pred_labels = pred['labels'].cpu().numpy()
+ for i, phrase in enumerate(phrases):
+ cur_index = pred_labels == i
+ cur_boxes = pred_boxes[cur_index]
+ tar_index = [
+ index for index, value in enumerate(gt_labels)
+ if value == i
+ ]
+ tar_boxes = gt_boxes[tar_index]
+ if self.merge:
+ tar_boxes = self.merge_boxes(tar_boxes)
+ if len(cur_boxes) == 0:
+ cur_boxes = [[0., 0., 0., 0.]]
+ ious = bbox_overlaps(
+ np.asarray(cur_boxes), np.asarray(tar_boxes))
+ for k in self.topk:
+ if k == -1:
+ maxi = ious.max()
+ else:
+ assert k > 0
+ maxi = ious[:k].max()
+ if maxi >= self.iou_thrs:
+ recall_tracker.add_positive(k, 'all')
+ # TODO: do not support class-wise evaluation yet
+ # for phrase_type in phrase['phrase_type']:
+ # recall_tracker.add_positive(k, phrase_type)
+ else:
+ recall_tracker.add_negative(k, 'all')
+ # for phrase_type in phrase['phrase_type']:
+ # recall_tracker.add_negative(k, phrase_type)
+
+ results = recall_tracker.report()
+ logger.info(results)
+ return results
diff --git a/mmdet/evaluation/metrics/grefcoco_metric.py b/mmdet/evaluation/metrics/grefcoco_metric.py
new file mode 100644
index 00000000000..55cc638c5e4
--- /dev/null
+++ b/mmdet/evaluation/metrics/grefcoco_metric.py
@@ -0,0 +1,122 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from typing import Dict, Optional, Sequence
+
+import numpy as np
+import torch
+from mmengine.evaluator import BaseMetric
+from mmengine.fileio import get_local_path
+from mmengine.logging import MMLogger
+
+from mmdet.datasets.api_wrappers import COCO
+from mmdet.registry import METRICS
+from ..functional import bbox_overlaps
+
+
+# refer from https://github.com/henghuiding/gRefCOCO/blob/main/mdetr/datasets/refexp.py # noqa
+@METRICS.register_module()
+class gRefCOCOMetric(BaseMetric):
+ default_prefix: Optional[str] = 'grefcoco'
+
+ def __init__(self,
+ ann_file: Optional[str] = None,
+ metric: str = 'bbox',
+ iou_thrs: float = 0.5,
+ thresh_score: float = 0.7,
+ thresh_f1: float = 1.0,
+ **kwargs) -> None:
+ super().__init__(**kwargs)
+ self.metric = metric
+ self.iou_thrs = iou_thrs
+ self.thresh_score = thresh_score
+ self.thresh_f1 = thresh_f1
+
+ with get_local_path(ann_file) as local_path:
+ self.coco = COCO(local_path)
+
+ def process(self, data_batch: dict, data_samples: Sequence[dict]) -> None:
+ for data_sample in data_samples:
+ result = dict()
+ pred = data_sample['pred_instances']
+ result['img_id'] = data_sample['img_id']
+ result['bboxes'] = pred['bboxes'].cpu()
+ result['scores'] = pred['scores'].cpu()
+ self.results.append(result)
+
+ def compute_metrics(self, results: list) -> Dict[str, float]:
+ logger: MMLogger = MMLogger.get_current_instance()
+
+ correct_image = 0
+ num_image = 0
+ nt = {'TP': 0, 'TN': 0, 'FP': 0, 'FN': 0}
+
+ for result in results:
+ img_id = result['img_id']
+ TP = 0
+
+ ann_ids = self.coco.getAnnIds(imgIds=img_id)
+ target = self.coco.loadAnns(ann_ids[0])
+
+ converted_bbox_all = []
+ no_target_flag = False
+ for one_target in target:
+ if one_target['category_id'] == -1:
+ no_target_flag = True
+ target_bbox = one_target['bbox']
+ converted_bbox = [
+ target_bbox[0],
+ target_bbox[1],
+ target_bbox[2] + target_bbox[0],
+ target_bbox[3] + target_bbox[1],
+ ]
+ converted_bbox_all.append(
+ np.array(converted_bbox).reshape(-1, 4))
+ gt_bbox_all = np.concatenate(converted_bbox_all, axis=0)
+
+ idx = result['scores'] >= self.thresh_score
+ filtered_boxes = result['bboxes'][idx]
+
+ iou = bbox_overlaps(filtered_boxes.numpy(), gt_bbox_all)
+ iou = torch.from_numpy(iou)
+
+ num_prediction = filtered_boxes.shape[0]
+ num_gt = gt_bbox_all.shape[0]
+ if no_target_flag:
+ if num_prediction >= 1:
+ nt['FN'] += 1
+ else:
+ nt['TP'] += 1
+ if num_prediction >= 1:
+ f_1 = 0.
+ else:
+ f_1 = 1.0
+ else:
+ if num_prediction >= 1:
+ nt['TN'] += 1
+ else:
+ nt['FP'] += 1
+ for i in range(min(num_prediction, num_gt)):
+ top_value, top_index = torch.topk(iou.flatten(0, 1), 1)
+ if top_value < self.iou_thrs:
+ break
+ else:
+ top_index_x = top_index // num_gt
+ top_index_y = top_index % num_gt
+ TP += 1
+ iou[top_index_x[0], :] = 0.0
+ iou[:, top_index_y[0]] = 0.0
+ FP = num_prediction - TP
+ FN = num_gt - TP
+ f_1 = 2 * TP / (2 * TP + FP + FN)
+
+ if f_1 >= self.thresh_f1:
+ correct_image += 1
+ num_image += 1
+
+ score = correct_image / max(num_image, 1)
+ results = {
+ 'F1_score': score,
+ 'T_acc': nt['TN'] / (nt['TN'] + nt['FP']),
+ 'N_acc': nt['TP'] / (nt['TP'] + nt['FN'])
+ }
+ logger.info(results)
+ return results
diff --git a/mmdet/evaluation/metrics/lvis_metric.py b/mmdet/evaluation/metrics/lvis_metric.py
index e4dd6141c0e..a861c6ee7b4 100644
--- a/mmdet/evaluation/metrics/lvis_metric.py
+++ b/mmdet/evaluation/metrics/lvis_metric.py
@@ -1,14 +1,20 @@
# Copyright (c) OpenMMLab. All rights reserved.
import itertools
+import logging
import os.path as osp
import tempfile
import warnings
-from collections import OrderedDict
+from collections import OrderedDict, defaultdict
from typing import Dict, List, Optional, Sequence, Union
import numpy as np
+import torch
+from mmengine.dist import (all_gather_object, broadcast_object_list,
+ is_main_process)
+from mmengine.evaluator import BaseMetric
+from mmengine.evaluator.metric import _to_cpu
from mmengine.fileio import get_local_path
-from mmengine.logging import MMLogger
+from mmengine.logging import MMLogger, print_log
from terminaltables import AsciiTable
from mmdet.registry import METRICS
@@ -18,6 +24,7 @@
try:
import lvis
+
if getattr(lvis, '__version__', '0') >= '10.5.3':
warnings.warn(
'mmlvis is deprecated, please install official lvis-api by "pip install git+https://github.com/lvis-dataset/lvis-api.git"', # noqa: E501
@@ -362,3 +369,166 @@ def compute_metrics(self, results: list) -> Dict[str, float]:
if tmp_dir is not None:
tmp_dir.cleanup()
return eval_results
+
+
+def _merge_lists(listA, listB, maxN, key):
+ result = []
+ indA, indB = 0, 0
+ while (indA < len(listA) or indB < len(listB)) and len(result) < maxN:
+ if (indB < len(listB)) and (indA >= len(listA)
+ or key(listA[indA]) < key(listB[indB])):
+ result.append(listB[indB])
+ indB += 1
+ else:
+ result.append(listA[indA])
+ indA += 1
+ return result
+
+
+@METRICS.register_module()
+class LVISFixedAPMetric(BaseMetric):
+ default_prefix: Optional[str] = 'lvis_fixed_ap'
+
+ def __init__(self,
+ ann_file: str,
+ topk: int = 10000,
+ format_only: bool = False,
+ outfile_prefix: Optional[str] = None,
+ collect_device: str = 'cpu',
+ prefix: Optional[str] = None,
+ backend_args: dict = None) -> None:
+
+ if lvis is None:
+ raise RuntimeError(
+ 'Package lvis is not installed. Please run "pip install '
+ 'git+https://github.com/lvis-dataset/lvis-api.git".')
+ super().__init__(collect_device=collect_device, prefix=prefix)
+
+ self.format_only = format_only
+ if self.format_only:
+ assert outfile_prefix is not None, 'outfile_prefix must be not'
+ 'None when format_only is True, otherwise the result files will'
+ 'be saved to a temp directory which will be cleaned up at the end.'
+
+ self.outfile_prefix = outfile_prefix
+ self.backend_args = backend_args
+
+ with get_local_path(
+ ann_file, backend_args=self.backend_args) as local_path:
+ self._lvis_api = LVIS(local_path)
+
+ self.cat_ids = self._lvis_api.get_cat_ids()
+
+ self.results = {}
+ self.topk = topk
+
+ def process(self, data_batch: dict, data_samples: Sequence[dict]) -> None:
+ """Process one batch of data samples and predictions. The processed
+ results should be stored in ``self.results``, which will be used to
+ compute the metrics when all batches have been processed.
+
+ Args:
+ data_batch (dict): A batch of data from the dataloader.
+ data_samples (Sequence[dict]): A batch of data samples that
+ contain annotations and predictions.
+ """
+ cur_results = []
+ for data_sample in data_samples:
+ pred = data_sample['pred_instances']
+ xmin, ymin, xmax, ymax = pred['bboxes'].cpu().unbind(1)
+ boxes = torch.stack((xmin, ymin, xmax - xmin, ymax - ymin),
+ dim=1).tolist()
+
+ scores = pred['scores'].cpu().numpy()
+ labels = pred['labels'].cpu().numpy()
+
+ if len(boxes) == 0:
+ continue
+
+ cur_results.extend([{
+ 'image_id': data_sample['img_id'],
+ 'category_id': self.cat_ids[labels[k]],
+ 'bbox': box,
+ 'score': scores[k],
+ } for k, box in enumerate(boxes)])
+
+ by_cat = defaultdict(list)
+ for ann in cur_results:
+ by_cat[ann['category_id']].append(ann)
+
+ for cat, cat_anns in by_cat.items():
+ if cat not in self.results:
+ self.results[cat] = []
+
+ cur = sorted(
+ cat_anns, key=lambda x: x['score'], reverse=True)[:self.topk]
+ self.results[cat] = _merge_lists(
+ self.results[cat], cur, self.topk, key=lambda x: x['score'])
+
+ def compute_metrics(self, results: dict) -> dict:
+ logger: MMLogger = MMLogger.get_current_instance()
+
+ new_results = []
+
+ missing_dets_cats = set()
+ for cat, cat_anns in results.items():
+ if len(cat_anns) < self.topk:
+ missing_dets_cats.add(cat)
+ new_results.extend(
+ sorted(cat_anns, key=lambda x: x['score'],
+ reverse=True)[:self.topk])
+
+ if missing_dets_cats:
+ logger.info(
+ f'\n===\n'
+ f'{len(missing_dets_cats)} classes had less than {self.topk} '
+ f'detections!\n Outputting {self.topk} detections for each '
+ f'class will improve AP further.\n ===')
+
+ new_results = LVISResults(self._lvis_api, new_results, max_dets=-1)
+ lvis_eval = LVISEval(self._lvis_api, new_results, iou_type='bbox')
+ params = lvis_eval.params
+ params.max_dets = -1 # No limit on detections per image.
+ lvis_eval.run()
+ lvis_eval.print_results()
+ metrics = {
+ k: v
+ for k, v in lvis_eval.results.items() if k.startswith('AP')
+ }
+ logger.info(f'mAP_copypaste: {metrics}')
+ return metrics
+
+ def evaluate(self, size: int) -> dict:
+ if len(self.results) == 0:
+ print_log(
+ f'{self.__class__.__name__} got empty `self.results`. Please '
+ 'ensure that the processed results are properly added into '
+ '`self.results` in `process` method.',
+ logger='current',
+ level=logging.WARNING)
+
+ all_cats = all_gather_object(self.results)
+ results = defaultdict(list)
+ for cats in all_cats:
+ for cat, cat_anns in cats.items():
+ results[cat].extend(cat_anns)
+
+ if is_main_process():
+ # cast all tensors in results list to cpu
+ results = _to_cpu(results)
+ _metrics = self.compute_metrics(results) # type: ignore
+ # Add prefix to metric names
+ if self.prefix:
+ _metrics = {
+ '/'.join((self.prefix, k)): v
+ for k, v in _metrics.items()
+ }
+ metrics = [_metrics]
+ else:
+ metrics = [None] # type: ignore
+
+ broadcast_object_list(metrics)
+
+ # reset the results
+ self.results = {}
+ return metrics[0]
diff --git a/mmdet/evaluation/metrics/ov_coco_metric.py b/mmdet/evaluation/metrics/ov_coco_metric.py
new file mode 100644
index 00000000000..08cb9025149
--- /dev/null
+++ b/mmdet/evaluation/metrics/ov_coco_metric.py
@@ -0,0 +1,266 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import itertools
+import os.path as osp
+import tempfile
+from collections import OrderedDict
+from typing import Dict
+
+import numpy as np
+from mmengine.fileio import load
+from mmengine.logging import MMLogger
+from terminaltables import AsciiTable
+
+from mmdet.datasets.api_wrappers import COCO, COCOeval, COCOevalMP
+from mmdet.registry import METRICS
+from .coco_metric import CocoMetric
+
+
+@METRICS.register_module()
+class OVCocoMetric(CocoMetric):
+
+ def compute_metrics(self, results: list) -> Dict[str, float]:
+ """Compute the metrics from processed results.
+
+ Args:
+ results (list): The processed results of each batch.
+
+ Returns:
+ Dict[str, float]: The computed metrics. The keys are the names of
+ the metrics, and the values are corresponding results.
+ """
+ logger: MMLogger = MMLogger.get_current_instance()
+
+ # split gt and prediction list
+ gts, preds = zip(*results)
+
+ tmp_dir = None
+ if self.outfile_prefix is None:
+ tmp_dir = tempfile.TemporaryDirectory()
+ outfile_prefix = osp.join(tmp_dir.name, 'results')
+ else:
+ outfile_prefix = self.outfile_prefix
+
+ if self._coco_api is None:
+ # use converted gt json file to initialize coco api
+ logger.info('Converting ground truth to coco format...')
+ coco_json_path = self.gt_to_coco_json(
+ gt_dicts=gts, outfile_prefix=outfile_prefix)
+ self._coco_api = COCO(coco_json_path)
+
+ # handle lazy init
+ if self.cat_ids is None:
+ self.cat_ids = self._coco_api.get_cat_ids(
+ cat_names=self.dataset_meta['classes'])
+ self.base_cat_ids = self._coco_api.get_cat_ids(
+ cat_names=self.dataset_meta['base_classes'])
+ self.novel_cat_ids = self._coco_api.get_cat_ids(
+ cat_names=self.dataset_meta['novel_classes'])
+
+ if self.img_ids is None:
+ self.img_ids = self._coco_api.get_img_ids()
+
+ # convert predictions to coco format and dump to json file
+ result_files = self.results2json(preds, outfile_prefix)
+
+ eval_results = OrderedDict()
+ if self.format_only:
+ logger.info('results are saved in '
+ f'{osp.dirname(outfile_prefix)}')
+ return eval_results
+
+ for metric in self.metrics:
+ logger.info(f'Evaluating {metric}...')
+
+ # TODO: May refactor fast_eval_recall to an independent metric?
+ # fast eval recall
+ if metric == 'proposal_fast':
+ ar = self.fast_eval_recall(
+ preds, self.proposal_nums, self.iou_thrs, logger=logger)
+ log_msg = []
+ for i, num in enumerate(self.proposal_nums):
+ eval_results[f'AR@{num}'] = ar[i]
+ log_msg.append(f'\nAR@{num}\t{ar[i]:.4f}')
+ log_msg = ''.join(log_msg)
+ logger.info(log_msg)
+ continue
+
+ # evaluate proposal, bbox and segm
+ iou_type = 'bbox' if metric == 'proposal' else metric
+ if metric not in result_files:
+ raise KeyError(f'{metric} is not in results')
+ try:
+ predictions = load(result_files[metric])
+ if iou_type == 'segm':
+ # Refer to https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/coco.py#L331 # noqa
+ # When evaluating mask AP, if the results contain bbox,
+ # cocoapi will use the box area instead of the mask area
+ # for calculating the instance area. Though the overall AP
+ # is not affected, this leads to different
+ # small/medium/large mask AP results.
+ for x in predictions:
+ x.pop('bbox')
+ coco_dt = self._coco_api.loadRes(predictions)
+
+ except IndexError:
+ logger.error(
+ 'The testing results of the whole dataset is empty.')
+ break
+
+ if self.use_mp_eval:
+ coco_eval = COCOevalMP(self._coco_api, coco_dt, iou_type)
+ else:
+ coco_eval = COCOeval(self._coco_api, coco_dt, iou_type)
+
+ coco_eval.params.catIds = self.cat_ids
+ coco_eval.params.imgIds = self.img_ids
+ coco_eval.params.maxDets = list(self.proposal_nums)
+ coco_eval.params.iouThrs = self.iou_thrs
+
+ # mapping of cocoEval.stats
+ coco_metric_names = {
+ 'mAP': 0,
+ 'mAP_50': 1,
+ 'mAP_75': 2,
+ 'mAP_s': 3,
+ 'mAP_m': 4,
+ 'mAP_l': 5,
+ 'AR@100': 6,
+ 'AR@300': 7,
+ 'AR@1000': 8,
+ 'AR_s@1000': 9,
+ 'AR_m@1000': 10,
+ 'AR_l@1000': 11
+ }
+ metric_items = self.metric_items
+ if metric_items is not None:
+ for metric_item in metric_items:
+ if metric_item not in coco_metric_names:
+ raise KeyError(
+ f'metric item "{metric_item}" is not supported')
+
+ if metric == 'proposal':
+ coco_eval.params.useCats = 0
+ coco_eval.evaluate()
+ coco_eval.accumulate()
+ coco_eval.summarize()
+ if metric_items is None:
+ metric_items = [
+ 'AR@100', 'AR@300', 'AR@1000', 'AR_s@1000',
+ 'AR_m@1000', 'AR_l@1000'
+ ]
+
+ for item in metric_items:
+ val = float(
+ f'{coco_eval.stats[coco_metric_names[item]]:.3f}')
+ eval_results[item] = val
+ else:
+ coco_eval.evaluate()
+ coco_eval.accumulate()
+ coco_eval.summarize()
+ if self.classwise: # Compute per-category AP
+ # Compute per-category AP
+ # from https://github.com/facebookresearch/detectron2/
+ precisions = coco_eval.eval['precision']
+ # precision: (iou, recall, cls, area range, max dets)
+ assert len(self.cat_ids) == precisions.shape[2]
+
+ results_per_category = []
+ for idx, cat_id in enumerate(self.cat_ids):
+ t = []
+ # area range index 0: all area ranges
+ # max dets index -1: typically 100 per image
+ nm = self._coco_api.loadCats(cat_id)[0]
+ precision = precisions[:, :, idx, 0, -1]
+ precision = precision[precision > -1]
+ if precision.size:
+ ap = np.mean(precision)
+ else:
+ ap = float('nan')
+ t.append(f'{nm["name"]}')
+ t.append(f'{round(ap, 3)}')
+ eval_results[f'{nm["name"]}_precision'] = round(ap, 3)
+
+ # indexes of IoU @50 and @75
+ for iou in [0, 5]:
+ precision = precisions[iou, :, idx, 0, -1]
+ precision = precision[precision > -1]
+ if precision.size:
+ ap = np.mean(precision)
+ else:
+ ap = float('nan')
+ t.append(f'{round(ap, 3)}')
+
+ # indexes of area of small, median and large
+ for area in [1, 2, 3]:
+ precision = precisions[:, :, idx, area, -1]
+ precision = precision[precision > -1]
+ if precision.size:
+ ap = np.mean(precision)
+ else:
+ ap = float('nan')
+ t.append(f'{round(ap, 3)}')
+ results_per_category.append(tuple(t))
+
+ num_columns = len(results_per_category[0])
+ results_flatten = list(
+ itertools.chain(*results_per_category))
+ headers = [
+ 'category', 'mAP', 'mAP_50', 'mAP_75', 'mAP_s',
+ 'mAP_m', 'mAP_l'
+ ]
+ results_2d = itertools.zip_longest(*[
+ results_flatten[i::num_columns]
+ for i in range(num_columns)
+ ])
+ table_data = [headers]
+ table_data += [result for result in results_2d]
+ table = AsciiTable(table_data)
+ logger.info('\n' + table.table)
+
+ # ------------get novel_ap50 and base_ap50---------
+ precisions = coco_eval.eval['precision']
+ assert len(self.cat_ids) == precisions.shape[2]
+ base_inds, novel_inds = [], []
+
+ for idx, catId in enumerate(self.cat_ids):
+ if catId in self.base_cat_ids:
+ base_inds.append(idx)
+ if catId in self.novel_cat_ids:
+ novel_inds.append(idx)
+
+ base_ap = precisions[:, :, base_inds, 0, -1]
+ novel_ap = precisions[:, :, novel_inds, 0, -1]
+ base_ap50 = precisions[0, :, base_inds, 0, -1]
+ novel_ap50 = precisions[0, :, novel_inds, 0, -1]
+
+ eval_results['base_ap'] = np.mean(
+ base_ap[base_ap > -1]) if len(
+ base_ap[base_ap > -1]) else -1
+ eval_results['novel_ap'] = np.mean(
+ novel_ap[novel_ap > -1]) if len(
+ novel_ap[novel_ap > -1]) else -1
+ eval_results['base_ap50'] = np.mean(
+ base_ap50[base_ap50 > -1]) if len(
+ base_ap50[base_ap50 > -1]) else -1
+ eval_results['novel_ap50'] = np.mean(
+ novel_ap50[novel_ap50 > -1]) if len(
+ novel_ap50[novel_ap50 > -1]) else -1
+ # ------------get novel_ap50 and base_ap50---------
+ if metric_items is None:
+ metric_items = [
+ 'mAP', 'mAP_50', 'mAP_75', 'mAP_s', 'mAP_m', 'mAP_l'
+ ]
+
+ for metric_item in metric_items:
+ key = f'{metric}_{metric_item}'
+ val = coco_eval.stats[coco_metric_names[metric_item]]
+ eval_results[key] = float(f'{round(val, 3)}')
+
+ ap = coco_eval.stats[:6]
+ logger.info(f'{metric}_mAP_copypaste: {ap[0]:.3f} '
+ f'{ap[1]:.3f} {ap[2]:.3f} {ap[3]:.3f} '
+ f'{ap[4]:.3f} {ap[5]:.3f}')
+
+ if tmp_dir is not None:
+ tmp_dir.cleanup()
+ return eval_results
diff --git a/mmdet/evaluation/metrics/refexp_metric.py b/mmdet/evaluation/metrics/refexp_metric.py
new file mode 100644
index 00000000000..8bcdf1629b9
--- /dev/null
+++ b/mmdet/evaluation/metrics/refexp_metric.py
@@ -0,0 +1,100 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from typing import Dict, Optional, Sequence
+
+import numpy as np
+from mmengine.evaluator import BaseMetric
+from mmengine.fileio import get_local_path
+from mmengine.logging import MMLogger
+
+from mmdet.datasets.api_wrappers import COCO
+from mmdet.registry import METRICS
+from ..functional import bbox_overlaps
+
+
+@METRICS.register_module()
+class RefExpMetric(BaseMetric):
+ default_prefix: Optional[str] = 'refexp'
+
+ def __init__(self,
+ ann_file: Optional[str] = None,
+ metric: str = 'bbox',
+ topk=(1, 5, 10),
+ iou_thrs: float = 0.5,
+ **kwargs) -> None:
+ super().__init__(**kwargs)
+ self.metric = metric
+ self.topk = topk
+ self.iou_thrs = iou_thrs
+
+ with get_local_path(ann_file) as local_path:
+ self.coco = COCO(local_path)
+
+ def process(self, data_batch: dict, data_samples: Sequence[dict]) -> None:
+ for data_sample in data_samples:
+ result = dict()
+ pred = data_sample['pred_instances']
+ result['img_id'] = data_sample['img_id']
+ result['bboxes'] = pred['bboxes'].cpu().numpy()
+ result['scores'] = pred['scores'].cpu().numpy()
+ self.results.append(result)
+
+ def compute_metrics(self, results: list) -> Dict[str, float]:
+ logger: MMLogger = MMLogger.get_current_instance()
+
+ dataset2score = {
+ 'refcoco': {k: 0.0
+ for k in self.topk},
+ 'refcoco+': {k: 0.0
+ for k in self.topk},
+ 'refcocog': {k: 0.0
+ for k in self.topk},
+ }
+ dataset2count = {'refcoco': 0.0, 'refcoco+': 0.0, 'refcocog': 0.0}
+
+ for result in results:
+ img_id = result['img_id']
+
+ ann_ids = self.coco.getAnnIds(imgIds=img_id)
+ assert len(ann_ids) == 1
+ img_info = self.coco.loadImgs(img_id)[0]
+ target = self.coco.loadAnns(ann_ids[0])
+
+ target_bbox = target[0]['bbox']
+ converted_bbox = [
+ target_bbox[0],
+ target_bbox[1],
+ target_bbox[2] + target_bbox[0],
+ target_bbox[3] + target_bbox[1],
+ ]
+ iou = bbox_overlaps(result['bboxes'],
+ np.array(converted_bbox).reshape(-1, 4))
+ for k in self.topk:
+ if max(iou[:k]) >= self.iou_thrs:
+ dataset2score[img_info['dataset_name']][k] += 1.0
+ dataset2count[img_info['dataset_name']] += 1.0
+
+ for key, value in dataset2score.items():
+ for k in self.topk:
+ try:
+ value[k] /= dataset2count[key]
+ except Exception as e:
+ print(e)
+
+ results = {}
+ mean_precision = 0.0
+ for key, value in dataset2score.items():
+ results[key] = sorted([v for k, v in value.items()])
+ mean_precision += sum(results[key])
+ logger.info(
+ f' Dataset: {key} - Precision @ 1, 5, 10: {results[key]}')
+
+ # `mean_precision` key is used for saving the best checkpoint
+ out_results = {'mean_precision': mean_precision / 9.0}
+
+ for i, k in enumerate(self.topk):
+ out_results[f'refcoco_precision@{k}'] = results['refcoco'][i]
+ for i, k in enumerate(self.topk):
+ out_results[f'refcoco+_precision@{k}'] = results['refcoco+'][i]
+ for i, k in enumerate(self.topk):
+ out_results[f'refcocog_precision@{k}'] = results['refcocog'][i]
+ return out_results
diff --git a/mmdet/models/dense_heads/grounding_dino_head.py b/mmdet/models/dense_heads/grounding_dino_head.py
index 3aced626555..8088322546f 100644
--- a/mmdet/models/dense_heads/grounding_dino_head.py
+++ b/mmdet/models/dense_heads/grounding_dino_head.py
@@ -417,14 +417,21 @@ def _predict_by_feat_single(self,
max_per_img = self.test_cfg.get('max_per_img', len(cls_score))
img_shape = img_meta['img_shape']
- cls_score = convert_grounding_to_cls_scores(
- logits=cls_score.sigmoid()[None],
- positive_maps=[token_positive_maps])[0]
- scores, indexes = cls_score.view(-1).topk(max_per_img)
- num_classes = cls_score.shape[-1]
- det_labels = indexes % num_classes
- bbox_index = indexes // num_classes
- bbox_pred = bbox_pred[bbox_index]
+ if token_positive_maps is not None:
+ cls_score = convert_grounding_to_cls_scores(
+ logits=cls_score.sigmoid()[None],
+ positive_maps=[token_positive_maps])[0]
+ scores, indexes = cls_score.view(-1).topk(max_per_img)
+ num_classes = cls_score.shape[-1]
+ det_labels = indexes % num_classes
+ bbox_index = indexes // num_classes
+ bbox_pred = bbox_pred[bbox_index]
+ else:
+ cls_score = cls_score.sigmoid()
+ scores, _ = cls_score.max(-1)
+ scores, indexes = scores.topk(max_per_img)
+ bbox_pred = bbox_pred[indexes]
+ det_labels = scores.new_zeros(scores.shape, dtype=torch.long)
det_bboxes = bbox_cxcywh_to_xyxy(bbox_pred)
det_bboxes[:, 0::2] = det_bboxes[:, 0::2] * img_shape[1]
diff --git a/mmdet/models/detectors/glip.py b/mmdet/models/detectors/glip.py
index e076a55fe20..45cfe7d39fd 100644
--- a/mmdet/models/detectors/glip.py
+++ b/mmdet/models/detectors/glip.py
@@ -1,7 +1,8 @@
# Copyright (c) OpenMMLab. All rights reserved.
+import copy
import re
import warnings
-from typing import Tuple, Union
+from typing import Optional, Tuple, Union
import torch
from torch import Tensor
@@ -26,8 +27,8 @@ def find_noun_phrases(caption: str) -> list:
"""
try:
import nltk
- nltk.download('punkt')
- nltk.download('averaged_perceptron_tagger')
+ nltk.download('punkt', download_dir='~/nltk_data')
+ nltk.download('averaged_perceptron_tagger', download_dir='~/nltk_data')
except ImportError:
raise RuntimeError('nltk is not installed, please install it by: '
'pip install nltk.')
@@ -78,6 +79,7 @@ def run_ner(caption: str) -> Tuple[list, list]:
noun_phrases = find_noun_phrases(caption)
noun_phrases = [remove_punctuation(phrase) for phrase in noun_phrases]
noun_phrases = [phrase for phrase in noun_phrases if phrase != '']
+ print('noun_phrases:', noun_phrases)
relevant_phrases = noun_phrases
labels = noun_phrases
@@ -166,6 +168,27 @@ def create_positive_map_label_to_token(positive_map: Tensor,
return positive_map_label_to_token
+def clean_label_name(name: str) -> str:
+ name = re.sub(r'\(.*\)', '', name)
+ name = re.sub(r'_', ' ', name)
+ name = re.sub(r' ', ' ', name)
+ return name
+
+
+def chunks(lst: list, n: int) -> list:
+ """Yield successive n-sized chunks from lst."""
+ all_ = []
+ for i in range(0, len(lst), n):
+ data_index = lst[i:i + n]
+ all_.append(data_index)
+ counter = 0
+ for i in all_:
+ counter += len(i)
+ assert (counter == len(lst))
+
+ return all_
+
+
@MODELS.register_module()
class GLIP(SingleStageDetector):
"""Implementation of `GLIP `_
@@ -207,10 +230,52 @@ def __init__(self,
self._special_tokens = '. '
+ def to_enhance_text_prompts(self, original_caption, enhanced_text_prompts):
+ caption_string = ''
+ tokens_positive = []
+ for idx, word in enumerate(original_caption):
+ if word in enhanced_text_prompts:
+ enhanced_text_dict = enhanced_text_prompts[word]
+ if 'prefix' in enhanced_text_dict:
+ caption_string += enhanced_text_dict['prefix']
+ start_i = len(caption_string)
+ if 'name' in enhanced_text_dict:
+ caption_string += enhanced_text_dict['name']
+ else:
+ caption_string += word
+ end_i = len(caption_string)
+ tokens_positive.append([[start_i, end_i]])
+
+ if 'suffix' in enhanced_text_dict:
+ caption_string += enhanced_text_dict['suffix']
+ else:
+ tokens_positive.append(
+ [[len(caption_string),
+ len(caption_string) + len(word)]])
+ caption_string += word
+
+ if idx != len(original_caption) - 1:
+ caption_string += self._special_tokens
+ return caption_string, tokens_positive
+
+ def to_plain_text_prompts(self, original_caption):
+ caption_string = ''
+ tokens_positive = []
+ for idx, word in enumerate(original_caption):
+ tokens_positive.append(
+ [[len(caption_string),
+ len(caption_string) + len(word)]])
+ caption_string += word
+ if idx != len(original_caption) - 1:
+ caption_string += self._special_tokens
+ return caption_string, tokens_positive
+
def get_tokens_and_prompts(
- self,
- original_caption: Union[str, list, tuple],
- custom_entities: bool = False) -> Tuple[dict, str, list, list]:
+ self,
+ original_caption: Union[str, list, tuple],
+ custom_entities: bool = False,
+ enhanced_text_prompts: Optional[ConfigType] = None
+ ) -> Tuple[dict, str, list, list]:
"""Get the tokens positive and prompts for the caption."""
if isinstance(original_caption, (list, tuple)) or custom_entities:
if custom_entities and isinstance(original_caption, str):
@@ -219,15 +284,15 @@ def get_tokens_and_prompts(
original_caption = list(
filter(lambda x: len(x) > 0, original_caption))
- caption_string = ''
- tokens_positive = []
- for idx, word in enumerate(original_caption):
- tokens_positive.append(
- [[len(caption_string),
- len(caption_string) + len(word)]])
- caption_string += word
- if idx != len(original_caption) - 1:
- caption_string += self._special_tokens
+ original_caption = [clean_label_name(i) for i in original_caption]
+
+ if custom_entities and enhanced_text_prompts is not None:
+ caption_string, tokens_positive = self.to_enhance_text_prompts(
+ original_caption, enhanced_text_prompts)
+ else:
+ caption_string, tokens_positive = self.to_plain_text_prompts(
+ original_caption)
+
tokenized = self.language_model.tokenizer([caption_string],
return_tensors='pt')
entities = original_caption
@@ -248,17 +313,101 @@ def get_positive_map(self, tokenized, tokens_positive):
return positive_map_label_to_token, positive_map
def get_tokens_positive_and_prompts(
- self,
- original_caption: Union[str, list, tuple],
- custom_entities: bool = False) -> Tuple[dict, str, Tensor, list]:
- tokenized, caption_string, tokens_positive, entities = \
- self.get_tokens_and_prompts(
- original_caption, custom_entities)
- positive_map_label_to_token, positive_map = self.get_positive_map(
- tokenized, tokens_positive)
+ self,
+ original_caption: Union[str, list, tuple],
+ custom_entities: bool = False,
+ enhanced_text_prompt: Optional[ConfigType] = None,
+ tokens_positive: Optional[list] = None,
+ ) -> Tuple[dict, str, Tensor, list]:
+ if tokens_positive is not None:
+ if tokens_positive == -1:
+ if not original_caption.endswith('.'):
+ original_caption = original_caption + self._special_tokens
+ return None, original_caption, None, original_caption
+ else:
+ if not original_caption.endswith('.'):
+ original_caption = original_caption + self._special_tokens
+ tokenized = self.language_model.tokenizer([original_caption],
+ return_tensors='pt')
+ positive_map_label_to_token, positive_map = \
+ self.get_positive_map(tokenized, tokens_positive)
+
+ entities = []
+ for token_positive in tokens_positive:
+ instance_entities = []
+ for t in token_positive:
+ instance_entities.append(original_caption[t[0]:t[1]])
+ entities.append(' / '.join(instance_entities))
+ return positive_map_label_to_token, original_caption, \
+ positive_map, entities
+
+ chunked_size = self.test_cfg.get('chunked_size', -1)
+ if not self.training and chunked_size > 0:
+ assert isinstance(original_caption,
+ (list, tuple)) or custom_entities is True
+ all_output = self.get_tokens_positive_and_prompts_chunked(
+ original_caption, enhanced_text_prompt)
+ positive_map_label_to_token, \
+ caption_string, \
+ positive_map, \
+ entities = all_output
+ else:
+ tokenized, caption_string, tokens_positive, entities = \
+ self.get_tokens_and_prompts(
+ original_caption, custom_entities, enhanced_text_prompt)
+ positive_map_label_to_token, positive_map = self.get_positive_map(
+ tokenized, tokens_positive)
+ if tokenized.input_ids.shape[1] > self.language_model.max_tokens:
+ warnings.warn('Inputting a text that is too long will result '
+ 'in poor prediction performance. '
+ 'Please reduce the text length.')
return positive_map_label_to_token, caption_string, \
positive_map, entities
+ def get_tokens_positive_and_prompts_chunked(
+ self,
+ original_caption: Union[list, tuple],
+ enhanced_text_prompts: Optional[ConfigType] = None):
+ chunked_size = self.test_cfg.get('chunked_size', -1)
+ original_caption = [clean_label_name(i) for i in original_caption]
+
+ original_caption_chunked = chunks(original_caption, chunked_size)
+ ids_chunked = chunks(
+ list(range(1,
+ len(original_caption) + 1)), chunked_size)
+
+ positive_map_label_to_token_chunked = []
+ caption_string_chunked = []
+ positive_map_chunked = []
+ entities_chunked = []
+
+ for i in range(len(ids_chunked)):
+ if enhanced_text_prompts is not None:
+ caption_string, tokens_positive = self.to_enhance_text_prompts(
+ original_caption_chunked[i], enhanced_text_prompts)
+ else:
+ caption_string, tokens_positive = self.to_plain_text_prompts(
+ original_caption_chunked[i])
+ tokenized = self.language_model.tokenizer([caption_string],
+ return_tensors='pt')
+ if tokenized.input_ids.shape[1] > self.language_model.max_tokens:
+ warnings.warn('Inputting a text that is too long will result '
+ 'in poor prediction performance. '
+ 'Please reduce the --chunked-size.')
+ positive_map_label_to_token, positive_map = self.get_positive_map(
+ tokenized, tokens_positive)
+
+ caption_string_chunked.append(caption_string)
+ positive_map_label_to_token_chunked.append(
+ positive_map_label_to_token)
+ positive_map_chunked.append(positive_map)
+ entities_chunked.append(original_caption_chunked[i])
+
+ return positive_map_label_to_token_chunked, \
+ caption_string_chunked, \
+ positive_map_chunked, \
+ entities_chunked
+
def loss(self, batch_inputs: Tensor,
batch_data_samples: SampleList) -> Union[dict, list]:
# TODO: Only open vocabulary tasks are supported for training now.
@@ -342,9 +491,16 @@ def predict(self,
- bboxes (Tensor): Has a shape (num_instances, 4),
the last dimension 4 arrange as (x1, y1, x2, y2).
"""
- text_prompts = [
- data_samples.text for data_samples in batch_data_samples
- ]
+ text_prompts = []
+ enhanced_text_prompts = []
+ tokens_positives = []
+ for data_samples in batch_data_samples:
+ text_prompts.append(data_samples.text)
+ if 'caption_prompt' in data_samples:
+ enhanced_text_prompts.append(data_samples.caption_prompt)
+ else:
+ enhanced_text_prompts.append(None)
+ tokens_positives.append(data_samples.get('tokens_positive', None))
if 'custom_entities' in batch_data_samples[0]:
# Assuming that the `custom_entities` flag
@@ -357,31 +513,62 @@ def predict(self,
# All the text prompts are the same,
# so there is no need to calculate them multiple times.
_positive_maps_and_prompts = [
- self.get_tokens_positive_and_prompts(text_prompts[0],
- custom_entities)
+ self.get_tokens_positive_and_prompts(
+ text_prompts[0], custom_entities, enhanced_text_prompts[0],
+ tokens_positives[0])
] * len(batch_inputs)
else:
_positive_maps_and_prompts = [
self.get_tokens_positive_and_prompts(text_prompt,
- custom_entities)
- for text_prompt in text_prompts
+ custom_entities,
+ enhanced_text_prompt,
+ tokens_positive)
+ for text_prompt, enhanced_text_prompt, tokens_positive in zip(
+ text_prompts, enhanced_text_prompts, tokens_positives)
]
token_positive_maps, text_prompts, _, entities = zip(
*_positive_maps_and_prompts)
- language_dict_features = self.language_model(list(text_prompts))
+ visual_features = self.extract_feat(batch_inputs)
- for i, data_samples in enumerate(batch_data_samples):
- data_samples.token_positive_map = token_positive_maps[i]
+ if isinstance(text_prompts[0], list):
+ # chunked text prompts, only bs=1 is supported
+ assert len(batch_inputs) == 1
+ count = 0
+ results_list = []
+
+ entities = [[item for lst in entities[0] for item in lst]]
+
+ for b in range(len(text_prompts[0])):
+ text_prompts_once = [text_prompts[0][b]]
+ token_positive_maps_once = token_positive_maps[0][b]
+ language_dict_features = self.language_model(text_prompts_once)
+ batch_data_samples[
+ 0].token_positive_map = token_positive_maps_once
+
+ pred_instances = self.bbox_head.predict(
+ copy.deepcopy(visual_features),
+ language_dict_features,
+ batch_data_samples,
+ rescale=rescale)[0]
+
+ if len(pred_instances) > 0:
+ pred_instances.labels += count
+ count += len(token_positive_maps_once)
+ results_list.append(pred_instances)
+ results_list = [results_list[0].cat(results_list)]
+ else:
+ language_dict_features = self.language_model(list(text_prompts))
- visual_features = self.extract_feat(batch_inputs)
+ for i, data_samples in enumerate(batch_data_samples):
+ data_samples.token_positive_map = token_positive_maps[i]
- results_list = self.bbox_head.predict(
- visual_features,
- language_dict_features,
- batch_data_samples,
- rescale=rescale)
+ results_list = self.bbox_head.predict(
+ visual_features,
+ language_dict_features,
+ batch_data_samples,
+ rescale=rescale)
for data_sample, pred_instances, entity in zip(batch_data_samples,
results_list, entities):
diff --git a/mmdet/models/detectors/grounding_dino.py b/mmdet/models/detectors/grounding_dino.py
index 69d398bec8f..4ec9d14e634 100644
--- a/mmdet/models/detectors/grounding_dino.py
+++ b/mmdet/models/detectors/grounding_dino.py
@@ -1,6 +1,8 @@
# Copyright (c) OpenMMLab. All rights reserved.
+import copy
+import re
import warnings
-from typing import Dict, Tuple, Union
+from typing import Dict, Optional, Tuple, Union
import torch
import torch.nn as nn
@@ -8,6 +10,7 @@
from mmdet.registry import MODELS
from mmdet.structures import OptSampleList, SampleList
+from mmdet.utils import ConfigType
from ..layers import SinePositionalEncoding
from ..layers.transformer.grounding_dino_layers import (
GroundingDinoTransformerDecoder, GroundingDinoTransformerEncoder)
@@ -16,6 +19,27 @@
run_ner)
+def clean_label_name(name: str) -> str:
+ name = re.sub(r'\(.*\)', '', name)
+ name = re.sub(r'_', ' ', name)
+ name = re.sub(r' ', ' ', name)
+ return name
+
+
+def chunks(lst: list, n: int) -> list:
+ """Yield successive n-sized chunks from lst."""
+ all_ = []
+ for i in range(0, len(lst), n):
+ data_index = lst[i:i + n]
+ all_.append(data_index)
+ counter = 0
+ for i in all_:
+ counter += len(i)
+ assert (counter == len(lst))
+
+ return all_
+
+
@MODELS.register_module()
class GroundingDINO(DINO):
"""Implementation of `Grounding DINO: Marrying DINO with Grounded Pre-
@@ -64,10 +88,49 @@ def init_weights(self) -> None:
nn.init.constant_(self.text_feat_map.bias.data, 0)
nn.init.xavier_uniform_(self.text_feat_map.weight.data)
+ def to_enhance_text_prompts(self, original_caption, enhanced_text_prompts):
+ caption_string = ''
+ tokens_positive = []
+ for idx, word in enumerate(original_caption):
+ if word in enhanced_text_prompts:
+ enhanced_text_dict = enhanced_text_prompts[word]
+ if 'prefix' in enhanced_text_dict:
+ caption_string += enhanced_text_dict['prefix']
+ start_i = len(caption_string)
+ if 'name' in enhanced_text_dict:
+ caption_string += enhanced_text_dict['name']
+ else:
+ caption_string += word
+ end_i = len(caption_string)
+ tokens_positive.append([[start_i, end_i]])
+
+ if 'suffix' in enhanced_text_dict:
+ caption_string += enhanced_text_dict['suffix']
+ else:
+ tokens_positive.append(
+ [[len(caption_string),
+ len(caption_string) + len(word)]])
+ caption_string += word
+ caption_string += self._special_tokens
+ return caption_string, tokens_positive
+
+ def to_plain_text_prompts(self, original_caption):
+ caption_string = ''
+ tokens_positive = []
+ for idx, word in enumerate(original_caption):
+ tokens_positive.append(
+ [[len(caption_string),
+ len(caption_string) + len(word)]])
+ caption_string += word
+ caption_string += self._special_tokens
+ return caption_string, tokens_positive
+
def get_tokens_and_prompts(
- self,
- original_caption: Union[str, list, tuple],
- custom_entities: bool = False) -> Tuple[dict, str, list]:
+ self,
+ original_caption: Union[str, list, tuple],
+ custom_entities: bool = False,
+ enhanced_text_prompts: Optional[ConfigType] = None
+ ) -> Tuple[dict, str, list]:
"""Get the tokens positive and prompts for the caption."""
if isinstance(original_caption, (list, tuple)) or custom_entities:
if custom_entities and isinstance(original_caption, str):
@@ -76,14 +139,15 @@ def get_tokens_and_prompts(
original_caption = list(
filter(lambda x: len(x) > 0, original_caption))
- caption_string = ''
- tokens_positive = []
- for idx, word in enumerate(original_caption):
- tokens_positive.append(
- [[len(caption_string),
- len(caption_string) + len(word)]])
- caption_string += word
- caption_string += self._special_tokens
+ original_caption = [clean_label_name(i) for i in original_caption]
+
+ if custom_entities and enhanced_text_prompts is not None:
+ caption_string, tokens_positive = self.to_enhance_text_prompts(
+ original_caption, enhanced_text_prompts)
+ else:
+ caption_string, tokens_positive = self.to_plain_text_prompts(
+ original_caption)
+
# NOTE: Tokenizer in Grounding DINO is different from
# that in GLIP. The tokenizer in GLIP will pad the
# caption_string to max_length, while the tokenizer
@@ -113,15 +177,22 @@ def get_tokens_and_prompts(
return tokenized, caption_string, tokens_positive, entities
def get_positive_map(self, tokenized, tokens_positive):
- positive_map = create_positive_map(tokenized, tokens_positive)
+ positive_map = create_positive_map(
+ tokenized,
+ tokens_positive,
+ max_num_entities=self.bbox_head.cls_branches[
+ self.decoder.num_layers].max_text_len)
positive_map_label_to_token = create_positive_map_label_to_token(
positive_map, plus=1)
return positive_map_label_to_token, positive_map
def get_tokens_positive_and_prompts(
- self,
- original_caption: Union[str, list, tuple],
- custom_entities: bool = False) -> Tuple[dict, str, Tensor, list]:
+ self,
+ original_caption: Union[str, list, tuple],
+ custom_entities: bool = False,
+ enhanced_text_prompt: Optional[ConfigType] = None,
+ tokens_positive: Optional[list] = None,
+ ) -> Tuple[dict, str, Tensor, list]:
"""Get the tokens positive and prompts for the caption.
Args:
@@ -135,14 +206,94 @@ def get_tokens_positive_and_prompts(
id, which is numbered from 1, to its positive token id.
The str represents the prompts.
"""
- tokenized, caption_string, tokens_positive, entities = \
- self.get_tokens_and_prompts(
- original_caption, custom_entities)
- positive_map_label_to_token, positive_map = self.get_positive_map(
- tokenized, tokens_positive)
+ if tokens_positive is not None:
+ if tokens_positive == -1:
+ if not original_caption.endswith('.'):
+ original_caption = original_caption + self._special_tokens
+ return None, original_caption, None, original_caption
+ else:
+ if not original_caption.endswith('.'):
+ original_caption = original_caption + self._special_tokens
+ tokenized = self.language_model.tokenizer(
+ [original_caption],
+ padding='max_length'
+ if self.language_model.pad_to_max else 'longest',
+ return_tensors='pt')
+ positive_map_label_to_token, positive_map = \
+ self.get_positive_map(tokenized, tokens_positive)
+
+ entities = []
+ for token_positive in tokens_positive:
+ instance_entities = []
+ for t in token_positive:
+ instance_entities.append(original_caption[t[0]:t[1]])
+ entities.append(' / '.join(instance_entities))
+ return positive_map_label_to_token, original_caption, \
+ positive_map, entities
+
+ chunked_size = self.test_cfg.get('chunked_size', -1)
+ if not self.training and chunked_size > 0:
+ assert isinstance(original_caption,
+ (list, tuple)) or custom_entities is True
+ all_output = self.get_tokens_positive_and_prompts_chunked(
+ original_caption, enhanced_text_prompt)
+ positive_map_label_to_token, \
+ caption_string, \
+ positive_map, \
+ entities = all_output
+ else:
+ tokenized, caption_string, tokens_positive, entities = \
+ self.get_tokens_and_prompts(
+ original_caption, custom_entities, enhanced_text_prompt)
+ positive_map_label_to_token, positive_map = self.get_positive_map(
+ tokenized, tokens_positive)
return positive_map_label_to_token, caption_string, \
positive_map, entities
+ def get_tokens_positive_and_prompts_chunked(
+ self,
+ original_caption: Union[list, tuple],
+ enhanced_text_prompts: Optional[ConfigType] = None):
+ chunked_size = self.test_cfg.get('chunked_size', -1)
+ original_caption = [clean_label_name(i) for i in original_caption]
+
+ original_caption_chunked = chunks(original_caption, chunked_size)
+ ids_chunked = chunks(
+ list(range(1,
+ len(original_caption) + 1)), chunked_size)
+
+ positive_map_label_to_token_chunked = []
+ caption_string_chunked = []
+ positive_map_chunked = []
+ entities_chunked = []
+
+ for i in range(len(ids_chunked)):
+ if enhanced_text_prompts is not None:
+ caption_string, tokens_positive = self.to_enhance_text_prompts(
+ original_caption_chunked[i], enhanced_text_prompts)
+ else:
+ caption_string, tokens_positive = self.to_plain_text_prompts(
+ original_caption_chunked[i])
+ tokenized = self.language_model.tokenizer([caption_string],
+ return_tensors='pt')
+ if tokenized.input_ids.shape[1] > self.language_model.max_tokens:
+ warnings.warn('Inputting a text that is too long will result '
+ 'in poor prediction performance. '
+ 'Please reduce the --chunked-size.')
+ positive_map_label_to_token, positive_map = self.get_positive_map(
+ tokenized, tokens_positive)
+
+ caption_string_chunked.append(caption_string)
+ positive_map_label_to_token_chunked.append(
+ positive_map_label_to_token)
+ positive_map_chunked.append(positive_map)
+ entities_chunked.append(original_caption_chunked[i])
+
+ return positive_map_label_to_token_chunked, \
+ caption_string_chunked, \
+ positive_map_chunked, \
+ entities_chunked
+
def forward_transformer(
self,
img_feats: Tuple[Tensor],
@@ -261,7 +412,6 @@ def pre_decoder(
def loss(self, batch_inputs: Tensor,
batch_data_samples: SampleList) -> Union[dict, list]:
- # TODO: Only open vocabulary tasks are supported for training now.
text_prompts = [
data_samples.text for data_samples in batch_data_samples
]
@@ -271,34 +421,55 @@ def loss(self, batch_inputs: Tensor,
for data_samples in batch_data_samples
]
- new_text_prompts = []
- positive_maps = []
- if len(set(text_prompts)) == 1:
- # All the text prompts are the same,
- # so there is no need to calculate them multiple times.
- tokenized, caption_string, tokens_positive, _ = \
- self.get_tokens_and_prompts(
- text_prompts[0], True)
- new_text_prompts = [caption_string] * len(batch_inputs)
- for gt_label in gt_labels:
+ if 'tokens_positive' in batch_data_samples[0]:
+ tokens_positive = [
+ data_samples.tokens_positive
+ for data_samples in batch_data_samples
+ ]
+ positive_maps = []
+ for token_positive, text_prompt, gt_label in zip(
+ tokens_positive, text_prompts, gt_labels):
+ tokenized = self.language_model.tokenizer(
+ [text_prompt],
+ padding='max_length'
+ if self.language_model.pad_to_max else 'longest',
+ return_tensors='pt')
new_tokens_positive = [
- tokens_positive[label] for label in gt_label
+ token_positive[label.item()] for label in gt_label
]
_, positive_map = self.get_positive_map(
tokenized, new_tokens_positive)
positive_maps.append(positive_map)
+ new_text_prompts = text_prompts
else:
- for text_prompt, gt_label in zip(text_prompts, gt_labels):
+ new_text_prompts = []
+ positive_maps = []
+ if len(set(text_prompts)) == 1:
+ # All the text prompts are the same,
+ # so there is no need to calculate them multiple times.
tokenized, caption_string, tokens_positive, _ = \
self.get_tokens_and_prompts(
- text_prompt, True)
- new_tokens_positive = [
- tokens_positive[label] for label in gt_label
- ]
- _, positive_map = self.get_positive_map(
- tokenized, new_tokens_positive)
- positive_maps.append(positive_map)
- new_text_prompts.append(caption_string)
+ text_prompts[0], True)
+ new_text_prompts = [caption_string] * len(batch_inputs)
+ for gt_label in gt_labels:
+ new_tokens_positive = [
+ tokens_positive[label] for label in gt_label
+ ]
+ _, positive_map = self.get_positive_map(
+ tokenized, new_tokens_positive)
+ positive_maps.append(positive_map)
+ else:
+ for text_prompt, gt_label in zip(text_prompts, gt_labels):
+ tokenized, caption_string, tokens_positive, _ = \
+ self.get_tokens_and_prompts(
+ text_prompt, True)
+ new_tokens_positive = [
+ tokens_positive[label] for label in gt_label
+ ]
+ _, positive_map = self.get_positive_map(
+ tokenized, new_tokens_positive)
+ positive_maps.append(positive_map)
+ new_text_prompts.append(caption_string)
text_dict = self.language_model(new_text_prompts)
if self.text_feat_map is not None:
@@ -322,9 +493,17 @@ def loss(self, batch_inputs: Tensor,
return losses
def predict(self, batch_inputs, batch_data_samples, rescale: bool = True):
- text_prompts = [
- data_samples.text for data_samples in batch_data_samples
- ]
+ text_prompts = []
+ enhanced_text_prompts = []
+ tokens_positives = []
+ for data_samples in batch_data_samples:
+ text_prompts.append(data_samples.text)
+ if 'caption_prompt' in data_samples:
+ enhanced_text_prompts.append(data_samples.caption_prompt)
+ else:
+ enhanced_text_prompts.append(None)
+ tokens_positives.append(data_samples.get('tokens_positive', None))
+
if 'custom_entities' in batch_data_samples[0]:
# Assuming that the `custom_entities` flag
# inside a batch is always the same. For single image inference
@@ -335,40 +514,89 @@ def predict(self, batch_inputs, batch_data_samples, rescale: bool = True):
# All the text prompts are the same,
# so there is no need to calculate them multiple times.
_positive_maps_and_prompts = [
- self.get_tokens_positive_and_prompts(text_prompts[0],
- custom_entities)
+ self.get_tokens_positive_and_prompts(
+ text_prompts[0], custom_entities, enhanced_text_prompts[0],
+ tokens_positives[0])
] * len(batch_inputs)
else:
_positive_maps_and_prompts = [
self.get_tokens_positive_and_prompts(text_prompt,
- custom_entities)
- for text_prompt in text_prompts
+ custom_entities,
+ enhanced_text_prompt,
+ tokens_positive)
+ for text_prompt, enhanced_text_prompt, tokens_positive in zip(
+ text_prompts, enhanced_text_prompts, tokens_positives)
]
token_positive_maps, text_prompts, _, entities = zip(
*_positive_maps_and_prompts)
- # extract text feats
- text_dict = self.language_model(list(text_prompts))
- # text feature map layer
- if self.text_feat_map is not None:
- text_dict['embedded'] = self.text_feat_map(text_dict['embedded'])
-
- for i, data_samples in enumerate(batch_data_samples):
- data_samples.token_positive_map = token_positive_maps[i]
# image feature extraction
visual_feats = self.extract_feat(batch_inputs)
- head_inputs_dict = self.forward_transformer(visual_feats, text_dict,
- batch_data_samples)
- results_list = self.bbox_head.predict(
- **head_inputs_dict,
- rescale=rescale,
- batch_data_samples=batch_data_samples)
- for data_sample, pred_instances, entity in zip(batch_data_samples,
- results_list, entities):
+ if isinstance(text_prompts[0], list):
+ # chunked text prompts, only bs=1 is supported
+ assert len(batch_inputs) == 1
+ count = 0
+ results_list = []
+
+ entities = [[item for lst in entities[0] for item in lst]]
+
+ for b in range(len(text_prompts[0])):
+ text_prompts_once = [text_prompts[0][b]]
+ token_positive_maps_once = token_positive_maps[0][b]
+ text_dict = self.language_model(text_prompts_once)
+ # text feature map layer
+ if self.text_feat_map is not None:
+ text_dict['embedded'] = self.text_feat_map(
+ text_dict['embedded'])
+
+ batch_data_samples[
+ 0].token_positive_map = token_positive_maps_once
+
+ head_inputs_dict = self.forward_transformer(
+ copy.deepcopy(visual_feats), text_dict, batch_data_samples)
+ pred_instances = self.bbox_head.predict(
+ **head_inputs_dict,
+ rescale=rescale,
+ batch_data_samples=batch_data_samples)[0]
+
+ if len(pred_instances) > 0:
+ pred_instances.labels += count
+ count += len(token_positive_maps_once)
+ results_list.append(pred_instances)
+ results_list = [results_list[0].cat(results_list)]
+ is_rec_tasks = [False] * len(results_list)
+ else:
+ # extract text feats
+ text_dict = self.language_model(list(text_prompts))
+ # text feature map layer
+ if self.text_feat_map is not None:
+ text_dict['embedded'] = self.text_feat_map(
+ text_dict['embedded'])
+
+ is_rec_tasks = []
+ for i, data_samples in enumerate(batch_data_samples):
+ if token_positive_maps[i] is not None:
+ is_rec_tasks.append(False)
+ else:
+ is_rec_tasks.append(True)
+ data_samples.token_positive_map = token_positive_maps[i]
+
+ head_inputs_dict = self.forward_transformer(
+ visual_feats, text_dict, batch_data_samples)
+ results_list = self.bbox_head.predict(
+ **head_inputs_dict,
+ rescale=rescale,
+ batch_data_samples=batch_data_samples)
+
+ for data_sample, pred_instances, entity, is_rec_task in zip(
+ batch_data_samples, results_list, entities, is_rec_tasks):
if len(pred_instances) > 0:
label_names = []
for labels in pred_instances.labels:
+ if is_rec_task:
+ label_names.append(entity)
+ continue
if labels >= len(entity):
warnings.warn(
'The unexpected output indicates an issue with '
diff --git a/mmdet/models/losses/triplet_loss.py b/mmdet/models/losses/triplet_loss.py
index d9c9604b8c7..4528239beb4 100644
--- a/mmdet/models/losses/triplet_loss.py
+++ b/mmdet/models/losses/triplet_loss.py
@@ -40,7 +40,7 @@ def hard_mining_triplet_loss_forward(
inputs (torch.Tensor): feature matrix with shape
(batch_size, feat_dim).
targets (torch.LongTensor): ground truth labels with shape
- (num_classes).
+ (batch_size).
Returns:
torch.Tensor: triplet loss with hard mining.
diff --git a/mmdet/version.py b/mmdet/version.py
index 38ce834e152..47989fc0a31 100644
--- a/mmdet/version.py
+++ b/mmdet/version.py
@@ -1,6 +1,6 @@
# Copyright (c) OpenMMLab. All rights reserved.
-__version__ = '3.2.0'
+__version__ = '3.3.0'
short_version = __version__
diff --git a/model-index.yml b/model-index.yml
index f1704c042cd..d4b4392b422 100644
--- a/model-index.yml
+++ b/model-index.yml
@@ -99,3 +99,4 @@ Import:
- configs/glip/metafile.yml
- configs/ddq/metafile.yml
- configs/grounding_dino/metafile.yml
+ - configs/mm_grounding_dino/metafile.yml
diff --git a/projects/CO-DETR/configs/codino/co_dino_5scale_swin_l_16xb1_16e_o365tococo.py b/projects/CO-DETR/configs/codino/co_dino_5scale_swin_l_16xb1_16e_o365tococo.py
index 8fdb73269ff..77821c380f3 100644
--- a/projects/CO-DETR/configs/codino/co_dino_5scale_swin_l_16xb1_16e_o365tococo.py
+++ b/projects/CO-DETR/configs/codino/co_dino_5scale_swin_l_16xb1_16e_o365tococo.py
@@ -1,7 +1,7 @@
_base_ = ['co_dino_5scale_r50_8xb2_1x_coco.py']
pretrained = 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window12_384_22k.pth' # noqa
-load_from = 'https://download.openmmlab.com/mmdetection/v3.0/codetr/co_dino_5scale_swin_large_22e_o365-0a33e247.pth' # noqa
+load_from = 'https://download.openmmlab.com/mmdetection/v3.0/codetr/co_dino_5scale_swin_large_16e_o365tococo-614254c9.pth' # noqa
# model settings
model = dict(
diff --git a/projects/CO-DETR/configs/codino/co_dino_5scale_swin_l_lsj_16xb1_3x_coco.py b/projects/CO-DETR/configs/codino/co_dino_5scale_swin_l_lsj_16xb1_3x_coco.py
index 0e5c00b2182..bf9cd4f4392 100644
--- a/projects/CO-DETR/configs/codino/co_dino_5scale_swin_l_lsj_16xb1_3x_coco.py
+++ b/projects/CO-DETR/configs/codino/co_dino_5scale_swin_l_lsj_16xb1_3x_coco.py
@@ -2,5 +2,6 @@
model = dict(backbone=dict(drop_path_rate=0.5))
-param_scheduler = [dict(milestones=[30])]
+param_scheduler = [dict(type='MultiStepLR', milestones=[30])]
+
train_cfg = dict(max_epochs=36)
diff --git a/projects/XDecoder/README.md b/projects/XDecoder/README.md
index b739fdfa92d..089934148f5 100644
--- a/projects/XDecoder/README.md
+++ b/projects/XDecoder/README.md
@@ -33,7 +33,7 @@ wget https://download.openmmlab.com/mmdetection/v3.0/xdecoder/xdecoder_focalt_be
The above two weights are directly copied from the official website without any modification. The specific source is https://github.com/microsoft/X-Decoder
-For convenience of demonstration, please download [the folder](https://github.com/microsoft/X-Decoder/tree/main/images) and place it in the root directory of mmdetection.
+For convenience of demonstration, please download [the folder](https://github.com/microsoft/X-Decoder/tree/main/inference_demo/images) and place it in the root directory of mmdetection.
**(1) Open Vocabulary Semantic Segmentation**
diff --git a/requirements/multimodal.txt b/requirements/multimodal.txt
index 03fdb17777e..20924eb3ee1 100644
--- a/requirements/multimodal.txt
+++ b/requirements/multimodal.txt
@@ -1,4 +1,5 @@
fairscale
+jsonlines
nltk
pycocoevalcap
transformers
diff --git a/requirements/optional.txt b/requirements/optional.txt
index 54e5dd647f4..31bdde50bea 100644
--- a/requirements/optional.txt
+++ b/requirements/optional.txt
@@ -1,4 +1,5 @@
cityscapesscripts
+emoji
fairscale
imagecorruptions
scikit-learn
diff --git a/setup.cfg b/setup.cfg
index a3ff3fa46d2..7ecd4b98a70 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -18,7 +18,7 @@ SPLIT_BEFORE_EXPRESSION_AFTER_OPENING_PAREN = true
[codespell]
skip = *.ipynb,configs/v3det/category_name_13204_v3det_2023_v1.txt
quiet-level = 3
-ignore-words-list = patten,nd,ty,mot,hist,formating,winn,gool,datas,wan,confids,TOOD,tood,ba,warmup,nam,DOTA,dota,conveyer,singed,comittee
+ignore-words-list = patten,nd,ty,mot,hist,formating,winn,gool,datas,wan,confids,TOOD,tood,ba,warmup,nam,DOTA,dota,conveyer,singed,comittee,extention,moniter,pres,
[flake8]
per-file-ignores = mmdet/configs/*: F401,F403,F405
diff --git a/tests/test_models/test_detectors/test_glip.py b/tests/test_models/test_detectors/test_glip.py
index 8be3d8d719f..dc38d3142d2 100644
--- a/tests/test_models/test_detectors/test_glip.py
+++ b/tests/test_models/test_detectors/test_glip.py
@@ -61,14 +61,14 @@ def test_glip_forward_predict_mode(self, cfg_file, devices):
self.assertIsInstance(batch_results[0], DetDataSample)
# test custom_entities is False
- packed_inputs = demo_mm_inputs(
- 2, [[3, 128, 128], [3, 125, 130]],
- texts=['a', 'b'],
- custom_entities=False)
- data = detector.data_preprocessor(packed_inputs, False)
- # Test forward test
- detector.eval()
- with torch.no_grad():
- batch_results = detector.forward(**data, mode='predict')
- self.assertEqual(len(batch_results), 2)
- self.assertIsInstance(batch_results[0], DetDataSample)
+ # packed_inputs = demo_mm_inputs(
+ # 2, [[3, 128, 128], [3, 125, 130]],
+ # texts=['a', 'b'],
+ # custom_entities=False)
+ # data = detector.data_preprocessor(packed_inputs, False)
+ # # Test forward test
+ # detector.eval()
+ # with torch.no_grad():
+ # batch_results = detector.forward(**data, mode='predict')
+ # self.assertEqual(len(batch_results), 2)
+ # self.assertIsInstance(batch_results[0], DetDataSample)
diff --git a/tools/analysis_tools/browse_grounding_dataset.py b/tools/analysis_tools/browse_grounding_dataset.py
new file mode 100644
index 00000000000..43261956faa
--- /dev/null
+++ b/tools/analysis_tools/browse_grounding_dataset.py
@@ -0,0 +1,200 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import argparse
+import os.path as osp
+
+import numpy as np
+from mmcv.image import imwrite
+from mmengine.config import Config, DictAction
+from mmengine.registry import init_default_scope
+from mmengine.utils import ProgressBar
+
+from mmdet.registry import DATASETS, VISUALIZERS
+from mmdet.structures.bbox import BaseBoxes
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='Browse a dataset')
+ parser.add_argument('config', help='train config file path')
+ parser.add_argument(
+ '--output-dir',
+ '-o',
+ default=None,
+ type=str,
+ help='If there is no display interface, you can save it')
+ parser.add_argument('--not-show', default=False, action='store_true')
+ parser.add_argument('--show-num', '-n', type=int, default=30)
+ parser.add_argument('--shuffle', default=False, action='store_true')
+ parser.add_argument(
+ '--show-interval',
+ type=float,
+ default=0,
+ help='the interval of show (s)')
+ parser.add_argument(
+ '--cfg-options',
+ nargs='+',
+ action=DictAction,
+ help='override some settings in the used config, the key-value pair '
+ 'in xxx=yyy format will be merged into config file. If the value to '
+ 'be overwritten is a list, it should be like key="[a,b]" or key=a,b '
+ 'It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" '
+ 'Note that the quotation marks are necessary and that no white space '
+ 'is allowed.')
+ args = parser.parse_args()
+ return args
+
+
+def draw_all_character(visualizer, characters, w):
+ start_index = 2
+ y_index = 5
+ for char in characters:
+ if isinstance(char, str):
+ visualizer.draw_texts(
+ str(char),
+ positions=np.array([start_index, y_index]),
+ colors=(0, 0, 0),
+ font_families='monospace')
+ start_index += len(char) * 8
+ else:
+ visualizer.draw_texts(
+ str(char[0]),
+ positions=np.array([start_index, y_index]),
+ colors=char[1],
+ font_families='monospace')
+ start_index += len(char[0]) * 8
+
+ if start_index > w - 10:
+ start_index = 2
+ y_index += 15
+
+ drawn_text = visualizer.get_image()
+ return drawn_text
+
+
+def main():
+ args = parse_args()
+ cfg = Config.fromfile(args.config)
+ if args.cfg_options is not None:
+ cfg.merge_from_dict(args.cfg_options)
+
+ assert args.show_num > 0
+
+ # register all modules in mmdet into the registries
+ init_default_scope(cfg.get('default_scope', 'mmdet'))
+
+ dataset = DATASETS.build(cfg.train_dataloader.dataset)
+ visualizer = VISUALIZERS.build(cfg.visualizer)
+ visualizer.dataset_meta = dataset.metainfo
+
+ dataset_index = list(range(len(dataset)))
+ if args.shuffle:
+ import random
+ random.shuffle(dataset_index)
+
+ progress_bar = ProgressBar(len(dataset))
+ for i in dataset_index[:args.show_num]:
+ item = dataset[i]
+ img = item['inputs'].permute(1, 2, 0).numpy()
+ data_sample = item['data_samples'].numpy()
+ gt_instances = data_sample.gt_instances
+ tokens_positive = data_sample.tokens_positive
+
+ gt_labels = gt_instances.labels
+
+ base_name = osp.basename(item['data_samples'].img_path)
+ name, extension = osp.splitext(base_name)
+
+ out_file = osp.join(args.output_dir, name + '_' + str(i) +
+ extension) if args.output_dir is not None else None
+
+ img = img[..., [2, 1, 0]] # bgr to rgb
+ gt_bboxes = gt_instances.get('bboxes', None)
+ if gt_bboxes is not None and isinstance(gt_bboxes, BaseBoxes):
+ gt_instances.bboxes = gt_bboxes.tensor
+
+ print(data_sample.text)
+
+ dataset_mode = data_sample.dataset_mode
+ if dataset_mode == 'VG':
+ max_label = int(max(gt_labels) if len(gt_labels) > 0 else 0)
+ palette = np.random.randint(0, 256, size=(max_label + 1, 3))
+ bbox_palette = [tuple(c) for c in palette]
+ # bbox_palette = get_palette('random', max_label + 1)
+ colors = [bbox_palette[label] for label in gt_labels]
+
+ visualizer.set_image(img)
+
+ for label, bbox, color in zip(gt_labels, gt_bboxes, colors):
+ visualizer.draw_bboxes(
+ bbox, edge_colors=color, face_colors=color, alpha=0.3)
+ visualizer.draw_bboxes(bbox, edge_colors=color, alpha=1)
+
+ drawn_img = visualizer.get_image()
+
+ new_image = np.ones((100, img.shape[1], 3), dtype=np.uint8) * 255
+ visualizer.set_image(new_image)
+
+ gt_tokens_positive = [
+ tokens_positive[label] for label in gt_labels
+ ]
+ split_by_character = [char for char in data_sample.text]
+ characters = []
+ start_index = 0
+ end_index = 0
+ for w in split_by_character:
+ end_index += len(w)
+ is_find = False
+ for i, positive in enumerate(gt_tokens_positive):
+ for p in positive:
+ if start_index >= p[0] and end_index <= p[1]:
+ characters.append([w, colors[i]])
+ is_find = True
+ break
+ if is_find:
+ break
+ if not is_find:
+ characters.append([w, (0, 0, 0)])
+ start_index = end_index
+
+ drawn_text = draw_all_character(visualizer, characters,
+ img.shape[1])
+ drawn_img = np.concatenate((drawn_img, drawn_text), axis=0)
+ else:
+ gt_labels = gt_instances.labels
+ text = data_sample.text
+ label_names = []
+ for label in gt_labels:
+ label_names.append(text[
+ tokens_positive[label][0][0]:tokens_positive[label][0][1]])
+ gt_instances.label_names = label_names
+ data_sample.gt_instances = gt_instances
+
+ visualizer.add_datasample(
+ base_name,
+ img,
+ data_sample,
+ draw_pred=False,
+ show=False,
+ wait_time=0,
+ out_file=None)
+ drawn_img = visualizer.get_image()
+
+ new_image = np.ones((100, img.shape[1], 3), dtype=np.uint8) * 255
+ visualizer.set_image(new_image)
+
+ characters = [char for char in text]
+ drawn_text = draw_all_character(visualizer, characters,
+ img.shape[1])
+ drawn_img = np.concatenate((drawn_img, drawn_text), axis=0)
+
+ if not args.not_show:
+ visualizer.show(
+ drawn_img, win_name=base_name, wait_time=args.show_interval)
+
+ if out_file is not None:
+ imwrite(drawn_img[..., ::-1], out_file)
+
+ progress_bar.update()
+
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/analysis_tools/browse_grounding_raw.py b/tools/analysis_tools/browse_grounding_raw.py
new file mode 100644
index 00000000000..16fa604cacd
--- /dev/null
+++ b/tools/analysis_tools/browse_grounding_raw.py
@@ -0,0 +1,284 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import argparse
+import json
+import os.path as osp
+
+import cv2
+import numpy as np
+from mmcv.image import imfrombytes, imwrite
+from mmengine.fileio import get
+from mmengine.structures import InstanceData
+from mmengine.utils import mkdir_or_exist
+
+from mmdet.structures import DetDataSample
+from mmdet.visualization import DetLocalVisualizer
+from mmdet.visualization.palette import _get_adaptive_scales
+
+# backend_args = dict(
+# backend='petrel',
+# path_mapping=dict({
+# './data/': 's3://openmmlab/datasets/detection/',
+# 'data/': 's3://openmmlab/datasets/detection/'
+# }))
+backend_args = None
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='Browse a dataset')
+ parser.add_argument('data_root')
+ parser.add_argument('ann_file')
+ parser.add_argument('img_prefix')
+ parser.add_argument('--label-map-file', '-m', default=None)
+ parser.add_argument(
+ '--output-dir',
+ '-o',
+ default=None,
+ type=str,
+ help='If there is no display interface, you can save it')
+ parser.add_argument('--not-show', default=False, action='store_true')
+ parser.add_argument('--show-num', '-n', type=int, default=30)
+ parser.add_argument('--shuffle', default=False, action='store_true')
+ parser.add_argument(
+ '--show-interval',
+ type=float,
+ default=0,
+ help='the interval of show (s)')
+ args = parser.parse_args()
+ return args
+
+
+def draw_all_character(visualizer, characters, w):
+ start_index = 2
+ y_index = 5
+ for char in characters:
+ if isinstance(char, str):
+ visualizer.draw_texts(
+ str(char),
+ positions=np.array([start_index, y_index]),
+ colors=(0, 0, 0),
+ font_families='monospace')
+ start_index += len(char) * 8
+ else:
+ visualizer.draw_texts(
+ str(char[0]),
+ positions=np.array([start_index, y_index]),
+ colors=char[1],
+ font_families='monospace')
+ start_index += len(char[0]) * 8
+
+ if start_index > w - 10:
+ start_index = 2
+ y_index += 15
+
+ drawn_text = visualizer.get_image()
+ return drawn_text
+
+
+def main():
+ args = parse_args()
+ assert args.show_num > 0
+
+ local_path = osp.join(args.data_root, args.ann_file)
+ with open(local_path, 'r') as f:
+ data_list = [json.loads(line) for line in f]
+
+ dataset_index = list(range(len(data_list)))
+ if args.shuffle:
+ import random
+ random.shuffle(dataset_index)
+
+ if args.label_map_file is not None:
+ label_map_file = osp.join(args.data_root, args.label_map_file)
+ with open(label_map_file, 'r') as file:
+ label_map = json.load(file)
+
+ visualizer = DetLocalVisualizer()
+
+ for i in dataset_index[:args.show_num]:
+ item = data_list[i]
+
+ img_path = osp.join(args.data_root, args.img_prefix, item['filename'])
+ if backend_args is not None:
+ img_bytes = get(img_path, backend_args)
+ img = imfrombytes(img_bytes, flag='color')
+ else:
+ img = cv2.imread(img_path)
+ img = img[..., [2, 1, 0]] # bgr to rgb
+
+ base_name, extension = osp.splitext(item['filename'])
+
+ out_file = osp.join(args.output_dir, base_name + '_' + str(i) +
+ extension) if args.output_dir is not None else None
+
+ if args.output_dir is not None:
+ mkdir_or_exist(args.output_dir)
+
+ if 'detection' in item:
+ anno = item['detection']
+
+ instances = [obj for obj in anno['instances']]
+ bboxes = [obj['bbox'] for obj in instances]
+ bbox_labels = [int(obj['label']) for obj in instances]
+ label_names = [label_map[str(label)] for label in bbox_labels]
+
+ data_sample = DetDataSample()
+ gt_instances = InstanceData()
+ if len(instances) > 0 and 'score' in instances[0]:
+ score = [obj['score'] for obj in instances]
+ gt_instances['scores'] = np.array(score)
+
+ gt_instances['bboxes'] = np.array(bboxes).reshape(-1, 4)
+ gt_instances['labels'] = np.array(bbox_labels)
+ gt_instances['label_names'] = label_names
+ data_sample.gt_instances = gt_instances
+
+ visualizer.add_datasample(
+ osp.basename(img_path),
+ img,
+ data_sample,
+ draw_pred=False,
+ show=not args.not_show,
+ wait_time=args.show_interval,
+ out_file=out_file)
+ elif 'grounding' in item:
+ anno = item['grounding']
+ text = anno['caption']
+ regions = anno['regions']
+
+ max_label = len(regions) if len(regions) > 0 else 0
+ palette = np.random.randint(0, 256, size=(max_label + 1, 3))
+ bbox_palette = [tuple(c) for c in palette]
+ # bbox_palette = get_palette('random', max_label + 1)
+ colors = [bbox_palette[label] for label in range(max_label)]
+
+ visualizer.set_image(img)
+
+ gt_tokens_positive = []
+ for i, region in enumerate(regions):
+ bbox = region['bbox']
+ bbox = np.array(bbox).reshape(-1, 4)
+ tokens_positive = region['tokens_positive']
+ gt_tokens_positive.append(tokens_positive)
+ visualizer.draw_bboxes(
+ bbox,
+ edge_colors=colors[i],
+ face_colors=colors[i],
+ alpha=0.3)
+ visualizer.draw_bboxes(bbox, edge_colors=colors[i], alpha=1)
+
+ if 'score' in region:
+ areas = (bbox[:, 3] - bbox[:, 1]) * (
+ bbox[:, 2] - bbox[:, 0])
+ scales = _get_adaptive_scales(areas)
+ score = region['score'][0]
+ score = [str(s) for s in score]
+ font_sizes = [
+ int(13 * scales[i]) for i in range(len(scales))
+ ]
+ visualizer.draw_texts(
+ score,
+ bbox[:, :2].astype(np.int32),
+ colors=(255, 255, 255),
+ font_sizes=font_sizes,
+ bboxes=[{
+ 'facecolor': 'black',
+ 'alpha': 0.8,
+ 'pad': 0.7,
+ 'edgecolor': 'none'
+ }] * len(bbox))
+
+ drawn_img = visualizer.get_image()
+ new_image = np.ones((100, img.shape[1], 3), dtype=np.uint8) * 255
+ visualizer.set_image(new_image)
+
+ split_by_character = [char for char in text]
+ characters = []
+ start_index = 0
+ end_index = 0
+ for w in split_by_character:
+ end_index += len(w)
+ is_find = False
+ for i, positive in enumerate(gt_tokens_positive):
+ for p in positive:
+ if start_index >= p[0] and end_index <= p[1]:
+ characters.append([w, colors[i]])
+ is_find = True
+ break
+ if is_find:
+ break
+ if not is_find:
+ characters.append([w, (0, 0, 0)])
+ start_index = end_index
+
+ drawn_text = draw_all_character(visualizer, characters,
+ img.shape[1])
+ drawn_img = np.concatenate((drawn_img, drawn_text), axis=0)
+
+ if not args.not_show:
+ visualizer.show(
+ drawn_img,
+ win_name=base_name,
+ wait_time=args.show_interval)
+
+ if out_file is not None:
+ imwrite(drawn_img[..., ::-1], out_file)
+
+ elif 'referring' in item:
+ referring = item['referring']
+
+ max_label = len(referring) if len(referring) > 0 else 0
+ palette = np.random.randint(0, 256, size=(max_label + 1, 3))
+ bbox_palette = [tuple(c) for c in palette]
+ # bbox_palette = get_palette('random', max_label + 1)
+ colors = [bbox_palette[label] for label in range(max_label)]
+
+ visualizer.set_image(img)
+ phrases = []
+ for i, ref in enumerate(referring):
+ bbox = ref['bbox']
+ phrase = ref['phrase']
+ phrases.append(' // '.join(phrase))
+ bbox = np.array(bbox).reshape(-1, 4)
+
+ visualizer.draw_bboxes(
+ bbox,
+ edge_colors=colors[i],
+ face_colors=colors[i],
+ alpha=0.3)
+ visualizer.draw_bboxes(bbox, edge_colors=colors[i], alpha=1)
+ drawn_img = visualizer.get_image()
+
+ new_image = np.ones((100, img.shape[1], 3), dtype=np.uint8) * 255
+ visualizer.set_image(new_image)
+
+ start_index = 2
+ y_index = 5
+
+ chunk_size = max(min(img.shape[1] - 400, 70), 50)
+ for i, p in enumerate(phrases):
+ chunk_p = [
+ p[i:i + chunk_size] for i in range(0, len(p), chunk_size)
+ ]
+ for cp in chunk_p:
+ visualizer.draw_texts(
+ cp,
+ positions=np.array([start_index, y_index]),
+ colors=colors[i],
+ font_families='monospace')
+ y_index += 15
+
+ drawn_text = visualizer.get_image()
+ drawn_img = np.concatenate((drawn_img, drawn_text), axis=0)
+
+ if not args.not_show:
+ visualizer.show(
+ drawn_img,
+ win_name=base_name,
+ wait_time=args.show_interval)
+
+ if out_file is not None:
+ imwrite(drawn_img[..., ::-1], out_file)
+
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/analysis_tools/coco_error_analysis.py b/tools/analysis_tools/coco_error_analysis.py
index 102ea4ebb29..ed270144d77 100644
--- a/tools/analysis_tools/coco_error_analysis.py
+++ b/tools/analysis_tools/coco_error_analysis.py
@@ -204,8 +204,12 @@ def analyze_individual_category(k,
cocoEval.params.iouThrs = [0.1]
cocoEval.params.useCats = 1
if areas:
- cocoEval.params.areaRng = [[0**2, areas[2]], [0**2, areas[0]],
- [areas[0], areas[1]], [areas[1], areas[2]]]
+ cocoEval.params.areaRng = [
+ [0**2, areas[2]],
+ [0**2, areas[0]],
+ [areas[0], areas[1]],
+ [areas[1], areas[2]],
+ ]
cocoEval.evaluate()
cocoEval.accumulate()
ps_supercategory = cocoEval.eval['precision'][0, :, k, :, :]
@@ -223,8 +227,12 @@ def analyze_individual_category(k,
cocoEval.params.iouThrs = [0.1]
cocoEval.params.useCats = 1
if areas:
- cocoEval.params.areaRng = [[0**2, areas[2]], [0**2, areas[0]],
- [areas[0], areas[1]], [areas[1], areas[2]]]
+ cocoEval.params.areaRng = [
+ [0**2, areas[2]],
+ [0**2, areas[0]],
+ [areas[0], areas[1]],
+ [areas[1], areas[2]],
+ ]
cocoEval.evaluate()
cocoEval.accumulate()
ps_allcategory = cocoEval.eval['precision'][0, :, k, :, :]
@@ -237,13 +245,17 @@ def analyze_results(res_file,
res_types,
out_dir,
extraplots=None,
- areas=None):
+ areas=None,
+ score_thr=None):
for res_type in res_types:
assert res_type in ['bbox', 'segm']
if areas:
- assert len(areas) == 3, '3 integers should be specified as areas, \
+ assert (len(areas) == 3), '3 integers should be specified as areas, \
representing 3 area regions'
+ if score_thr:
+ assert score_thr >= 0, 'score_thr should be bigger than 0'
+
directory = os.path.dirname(out_dir + '/')
if not os.path.exists(directory):
print(f'-------------create {out_dir}-----------------')
@@ -252,6 +264,13 @@ def analyze_results(res_file,
cocoGt = COCO(ann_file)
cocoDt = cocoGt.loadRes(res_file)
imgIds = cocoGt.getImgIds()
+
+ if score_thr:
+ cocoDt.dataset['annotations'] = list(
+ filter(lambda ann: ann['score'] >= score_thr,
+ cocoDt.dataset['annotations']))
+ cocoDt.createIndex()
+
for res_type in res_types:
res_out_dir = out_dir + '/' + res_type + '/'
res_directory = os.path.dirname(res_out_dir)
@@ -265,9 +284,12 @@ def analyze_results(res_file,
cocoEval.params.iouThrs = [0.75, 0.5, 0.1]
cocoEval.params.maxDets = [100]
if areas:
- cocoEval.params.areaRng = [[0**2, areas[2]], [0**2, areas[0]],
- [areas[0], areas[1]],
- [areas[1], areas[2]]]
+ cocoEval.params.areaRng = [
+ [0**2, areas[2]],
+ [0**2, areas[0]],
+ [areas[0], areas[1]],
+ [areas[1], areas[2]],
+ ]
cocoEval.evaluate()
cocoEval.accumulate()
ps = cocoEval.eval['precision']
@@ -312,19 +334,28 @@ def main():
parser.add_argument(
'--ann',
default='data/coco/annotations/instances_val2017.json',
- help='annotation file path')
+ help='annotation file path',
+ )
parser.add_argument(
'--types', type=str, nargs='+', default=['bbox'], help='result types')
parser.add_argument(
'--extraplots',
action='store_true',
help='export extra bar/stat plots')
+ parser.add_argument(
+ '--score-thr',
+ type=float,
+ default=None,
+ help='score threshold to filter detection bboxes, only applied'
+ 'when users want to change it.',
+ )
parser.add_argument(
'--areas',
type=int,
nargs='+',
default=[1024, 9216, 10000000000],
- help='area regions')
+ help='area regions',
+ )
args = parser.parse_args()
analyze_results(
args.result,
@@ -332,7 +363,9 @@ def main():
args.types,
out_dir=args.out_dir,
extraplots=args.extraplots,
- areas=args.areas)
+ areas=args.areas,
+ score_thr=args.score_thr,
+ )
if __name__ == '__main__':
diff --git a/tools/dataset_converters/coco2odvg.py b/tools/dataset_converters/coco2odvg.py
new file mode 100644
index 00000000000..aa9bc86d6d2
--- /dev/null
+++ b/tools/dataset_converters/coco2odvg.py
@@ -0,0 +1,345 @@
+import argparse
+import json
+import os.path
+
+import jsonlines
+from pycocotools.coco import COCO
+from tqdm import tqdm
+
+id_map = {
+ 0: 1,
+ 1: 2,
+ 2: 3,
+ 3: 4,
+ 4: 5,
+ 5: 6,
+ 6: 7,
+ 7: 8,
+ 8: 9,
+ 9: 10,
+ 10: 11,
+ 11: 13,
+ 12: 14,
+ 13: 15,
+ 14: 16,
+ 15: 17,
+ 16: 18,
+ 17: 19,
+ 18: 20,
+ 19: 21,
+ 20: 22,
+ 21: 23,
+ 22: 24,
+ 23: 25,
+ 24: 27,
+ 25: 28,
+ 26: 31,
+ 27: 32,
+ 28: 33,
+ 29: 34,
+ 30: 35,
+ 31: 36,
+ 32: 37,
+ 33: 38,
+ 34: 39,
+ 35: 40,
+ 36: 41,
+ 37: 42,
+ 38: 43,
+ 39: 44,
+ 40: 46,
+ 41: 47,
+ 42: 48,
+ 43: 49,
+ 44: 50,
+ 45: 51,
+ 46: 52,
+ 47: 53,
+ 48: 54,
+ 49: 55,
+ 50: 56,
+ 51: 57,
+ 52: 58,
+ 53: 59,
+ 54: 60,
+ 55: 61,
+ 56: 62,
+ 57: 63,
+ 58: 64,
+ 59: 65,
+ 60: 67,
+ 61: 70,
+ 62: 72,
+ 63: 73,
+ 64: 74,
+ 65: 75,
+ 66: 76,
+ 67: 77,
+ 68: 78,
+ 69: 79,
+ 70: 80,
+ 71: 81,
+ 72: 82,
+ 73: 84,
+ 74: 85,
+ 75: 86,
+ 76: 87,
+ 77: 88,
+ 78: 89,
+ 79: 90
+}
+key_list_coco = list(id_map.keys())
+val_list_coco = list(id_map.values())
+key_list_o365 = [i for i in range(365)]
+val_list_o365 = [i for i in range(1, 366)]
+key_list_v3det = [i for i in range(13204)]
+val_list_v3det = [i for i in range(1, 13205)]
+
+
+def dump_coco_label_map(args):
+ ori_map = {
+ '1': 'person',
+ '2': 'bicycle',
+ '3': 'car',
+ '4': 'motorcycle',
+ '5': 'airplane',
+ '6': 'bus',
+ '7': 'train',
+ '8': 'truck',
+ '9': 'boat',
+ '10': 'traffic light',
+ '11': 'fire hydrant',
+ '13': 'stop sign',
+ '14': 'parking meter',
+ '15': 'bench',
+ '16': 'bird',
+ '17': 'cat',
+ '18': 'dog',
+ '19': 'horse',
+ '20': 'sheep',
+ '21': 'cow',
+ '22': 'elephant',
+ '23': 'bear',
+ '24': 'zebra',
+ '25': 'giraffe',
+ '27': 'backpack',
+ '28': 'umbrella',
+ '31': 'handbag',
+ '32': 'tie',
+ '33': 'suitcase',
+ '34': 'frisbee',
+ '35': 'skis',
+ '36': 'snowboard',
+ '37': 'sports ball',
+ '38': 'kite',
+ '39': 'baseball bat',
+ '40': 'baseball glove',
+ '41': 'skateboard',
+ '42': 'surfboard',
+ '43': 'tennis racket',
+ '44': 'bottle',
+ '46': 'wine glass',
+ '47': 'cup',
+ '48': 'fork',
+ '49': 'knife',
+ '50': 'spoon',
+ '51': 'bowl',
+ '52': 'banana',
+ '53': 'apple',
+ '54': 'sandwich',
+ '55': 'orange',
+ '56': 'broccoli',
+ '57': 'carrot',
+ '58': 'hot dog',
+ '59': 'pizza',
+ '60': 'donut',
+ '61': 'cake',
+ '62': 'chair',
+ '63': 'couch',
+ '64': 'potted plant',
+ '65': 'bed',
+ '67': 'dining table',
+ '70': 'toilet',
+ '72': 'tv',
+ '73': 'laptop',
+ '74': 'mouse',
+ '75': 'remote',
+ '76': 'keyboard',
+ '77': 'cell phone',
+ '78': 'microwave',
+ '79': 'oven',
+ '80': 'toaster',
+ '81': 'sink',
+ '82': 'refrigerator',
+ '84': 'book',
+ '85': 'clock',
+ '86': 'vase',
+ '87': 'scissors',
+ '88': 'teddy bear',
+ '89': 'hair drier',
+ '90': 'toothbrush'
+ }
+ new_map = {}
+ for key, value in ori_map.items():
+ label = int(key)
+ ind = val_list_coco.index(label)
+ label_trans = key_list_coco[ind]
+ new_map[label_trans] = value
+ if args.output is None:
+ output = os.path.dirname(args.input) + '/coco2017_label_map.json'
+ else:
+ output = os.path.dirname(args.output) + '/coco2017_label_map.json'
+ with open(output, 'w') as f:
+ json.dump(new_map, f)
+
+
+def dump_o365v1_label_map(args):
+ with open(args.input, 'r') as f:
+ j = json.load(f)
+ o_dict = {}
+ for category in j['categories']:
+ index = str(int(category['id']) - 1)
+ name = category['name']
+ o_dict[index] = name
+ if args.output is None:
+ output = os.path.dirname(args.input) + '/o365v1_label_map.json'
+ else:
+ output = os.path.dirname(args.output) + '/o365v1_label_map.json'
+ with open(output, 'w') as f:
+ json.dump(o_dict, f)
+
+
+def dump_o365v2_label_map(args):
+ with open(args.input, 'r') as f:
+ j = json.load(f)
+ o_dict = {}
+ for category in j['categories']:
+ index = str(int(category['id']) - 1)
+ name = category['name']
+ o_dict[index] = name
+ if args.output is None:
+ output = os.path.dirname(args.input) + '/o365v2_label_map.json'
+ else:
+ output = os.path.dirname(args.output) + '/o365v2_label_map.json'
+ with open(output, 'w') as f:
+ json.dump(o_dict, f)
+
+
+def dump_v3det_label_map(args):
+ with open(args.input, 'r') as f:
+ j = json.load(f)
+ o_dict = {}
+ for category in j['categories']:
+ index = str(int(category['id']) - 1)
+ name = category['name']
+ o_dict[index] = name
+ if args.output is None:
+ output = os.path.dirname(args.input) + '/v3det_2023_v1_label_map.json'
+ else:
+ output = os.path.dirname(args.output) + '/v3det_2023_v1_label_map.json'
+ with open(output, 'w') as f:
+ json.dump(o_dict, f)
+
+
+def coco2odvg(args):
+ coco = COCO(args.input)
+ cats = coco.loadCats(coco.getCatIds())
+ nms = {cat['id']: cat['name'] for cat in cats}
+ metas = []
+ if args.output is None:
+ out_path = args.input[:-5] + '_od.json'
+ else:
+ out_path = args.output
+
+ if args.dataset == 'coco':
+ key_list = key_list_coco
+ val_list = val_list_coco
+ dump_coco_label_map(args)
+ elif args.dataset == 'o365v1':
+ key_list = key_list_o365
+ val_list = val_list_o365
+ dump_o365v1_label_map(args)
+ elif args.dataset == 'o365v2':
+ key_list = key_list_o365
+ val_list = val_list_o365
+ dump_o365v2_label_map(args)
+ elif args.dataset == 'v3det':
+ key_list = key_list_v3det
+ val_list = val_list_v3det
+ dump_v3det_label_map(args)
+
+ for img_id, img_info in tqdm(coco.imgs.items()):
+ # missing images
+ if args.dataset == 'o365v2' and img_id in [908726, 320532, 320534]:
+ print(img_info['file_name'])
+ continue
+ if args.dataset == 'o365v1' and img_id in [6, 19, 23]:
+ print(img_info['file_name'])
+ continue
+
+ if args.dataset == 'o365v2':
+ file_name = img_info['file_name']
+ if file_name.startswith('images/v2/'):
+ file_name = file_name.replace('images/v2/', '')
+ elif file_name.startswith('images/v1/'):
+ file_name = file_name.replace('images/v1/', '')
+ img_info['file_name'] = file_name
+
+ ann_ids = coco.getAnnIds(imgIds=img_id)
+ instance_list = []
+ for ann_id in ann_ids:
+ ann = coco.anns[ann_id]
+
+ if ann.get('ignore', False):
+ continue
+ x1, y1, w, h = ann['bbox']
+ inter_w = max(0, min(x1 + w, img_info['width']) - max(x1, 0))
+ inter_h = max(0, min(y1 + h, img_info['height']) - max(y1, 0))
+ if inter_w * inter_h == 0:
+ continue
+ if ann['area'] <= 0 or w < 1 or h < 1:
+ continue
+
+ if ann.get('iscrowd', False):
+ continue
+
+ bbox_xyxy = [x1, y1, x1 + w, y1 + h]
+ label = ann['category_id']
+ category = nms[label]
+ ind = val_list.index(label)
+ label_trans = key_list[ind]
+ instance_list.append({
+ 'bbox': bbox_xyxy,
+ 'label': label_trans,
+ 'category': category
+ })
+ metas.append({
+ 'filename': img_info['file_name'],
+ 'height': img_info['height'],
+ 'width': img_info['width'],
+ 'detection': {
+ 'instances': instance_list
+ }
+ })
+
+ with jsonlines.open(out_path, mode='w') as writer:
+ writer.write_all(metas)
+
+ print('save to {}'.format(out_path))
+
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser('coco to odvg format.', add_help=True)
+ parser.add_argument('input', type=str, help='input json file name')
+ parser.add_argument(
+ '--output', '-o', type=str, help='output json file name')
+ parser.add_argument(
+ '--dataset',
+ '-d',
+ required=True,
+ type=str,
+ choices=['coco', 'o365v1', 'o365v2', 'v3det'],
+ )
+ args = parser.parse_args()
+
+ coco2odvg(args)
diff --git a/tools/dataset_converters/coco2ovd.py b/tools/dataset_converters/coco2ovd.py
new file mode 100644
index 00000000000..fc70145f9aa
--- /dev/null
+++ b/tools/dataset_converters/coco2ovd.py
@@ -0,0 +1,70 @@
+import argparse
+import json
+import os.path
+
+base_classes = ('person', 'bicycle', 'car', 'motorcycle', 'train', 'truck',
+ 'boat', 'bench', 'bird', 'horse', 'sheep', 'bear', 'zebra',
+ 'giraffe', 'backpack', 'handbag', 'suitcase', 'frisbee',
+ 'skis', 'kite', 'surfboard', 'bottle', 'fork', 'spoon', 'bowl',
+ 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
+ 'pizza', 'donut', 'chair', 'bed', 'toilet', 'tv', 'laptop',
+ 'mouse', 'remote', 'microwave', 'oven', 'toaster',
+ 'refrigerator', 'book', 'clock', 'vase', 'toothbrush')
+
+novel_classes = ('airplane', 'bus', 'cat', 'dog', 'cow', 'elephant',
+ 'umbrella', 'tie', 'snowboard', 'skateboard', 'cup', 'knife',
+ 'cake', 'couch', 'keyboard', 'sink', 'scissors')
+
+
+def filter_annotation(anno_dict, split_name_list, class_id_to_split):
+ filtered_categories = []
+ for item in anno_dict['categories']:
+ if class_id_to_split.get(item['id']) in split_name_list:
+ item['split'] = class_id_to_split.get(item['id'])
+ filtered_categories.append(item)
+ anno_dict['categories'] = filtered_categories
+
+ filtered_images = []
+ filtered_annotations = []
+ useful_image_ids = set()
+ for item in anno_dict['annotations']:
+ if class_id_to_split.get(item['category_id']) in split_name_list:
+ filtered_annotations.append(item)
+ useful_image_ids.add(item['image_id'])
+ for item in anno_dict['images']:
+ if item['id'] in useful_image_ids:
+ filtered_images.append(item)
+ anno_dict['annotations'] = filtered_annotations
+ anno_dict['images'] = filtered_images
+
+
+def coco2ovd(args):
+ ann_path = os.path.join(args.data_root, 'annotations/')
+ with open(ann_path + 'instances_train2017.json', 'r') as fin:
+ coco_train_anno_all = json.load(fin)
+
+ class_id_to_split = {}
+ for item in coco_train_anno_all['categories']:
+ if item['name'] in base_classes:
+ class_id_to_split[item['id']] = 'seen'
+ elif item['name'] in novel_classes:
+ class_id_to_split[item['id']] = 'unseen'
+
+ filter_annotation(coco_train_anno_all, ['seen'], class_id_to_split)
+ with open(ann_path + 'instances_train2017_seen_2.json', 'w') as fout:
+ json.dump(coco_train_anno_all, fout)
+
+ with open(ann_path + 'instances_val2017.json', 'r') as fin:
+ coco_val_anno_all = json.load(fin)
+
+ filter_annotation(coco_val_anno_all, ['seen', 'unseen'], class_id_to_split)
+ with open(ann_path + 'instances_val2017_all_2.json', 'w') as fout:
+ json.dump(coco_val_anno_all, fout)
+
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser('coco to ovd format.', add_help=True)
+ parser.add_argument('data_root', type=str, help='coco root path')
+ args = parser.parse_args()
+
+ coco2ovd(args)
diff --git a/tools/dataset_converters/extract_coco_from_mixed.py b/tools/dataset_converters/extract_coco_from_mixed.py
new file mode 100644
index 00000000000..d4777b0fd07
--- /dev/null
+++ b/tools/dataset_converters/extract_coco_from_mixed.py
@@ -0,0 +1,45 @@
+import argparse
+import os.path as osp
+
+import mmengine
+from pycocotools.coco import COCO
+
+
+def extract_coco(args):
+ coco = COCO(args.mixed_ann)
+
+ json_data = mmengine.load(args.mixed_ann)
+ new_json_data = {
+ 'info': json_data['info'],
+ 'licenses': json_data['licenses'],
+ 'categories': json_data['categories'],
+ 'images': [],
+ 'annotations': []
+ }
+ del json_data
+
+ img_ids = coco.getImgIds()
+ for img_id in img_ids:
+ img_info = coco.loadImgs([img_id])[0]
+ if img_info['data_source'] == 'coco':
+ new_json_data['images'].append(img_info)
+ ann_ids = coco.getAnnIds(imgIds=[img_id])
+ img_ann_info = coco.loadAnns(ann_ids)
+ new_json_data['annotations'].extend(img_ann_info)
+ if args.out_ann is None:
+ out_ann = osp.dirname(
+ args.mixed_ann) + '/final_mixed_train_only_coco.json'
+ mmengine.dump(new_json_data, out_ann)
+ print('save new json to {}'.format(out_ann))
+ else:
+ mmengine.dump(new_json_data, args.out_ann)
+
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser(
+ 'split mixed goldg to coco.', add_help=True)
+ parser.add_argument('mixed_ann', type=str)
+ parser.add_argument('--out-ann', '-o', type=str)
+ args = parser.parse_args()
+
+ extract_coco(args)
diff --git a/tools/dataset_converters/fix_o365_names.py b/tools/dataset_converters/fix_o365_names.py
new file mode 100644
index 00000000000..3bb4a62843c
--- /dev/null
+++ b/tools/dataset_converters/fix_o365_names.py
@@ -0,0 +1,35 @@
+# Reference: https://github.com/shenyunhang/APE/blob/main/datasets/tools/objects3652coco/fix_o365_names.py # noqa
+import argparse
+import copy
+import json
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser()
+ parser.add_argument(
+ '--ann',
+ default='data/objects365v2/annotations/zhiyuan_objv2_train.json')
+ parser.add_argument(
+ '--fix_name_map',
+ default='tools/dataset_converters/zhiyuan_objv2_train_names_fix.csv')
+ args = parser.parse_args()
+
+ new_names = {}
+ old_names = {}
+ with open(args.fix_name_map, 'r') as f:
+ for line in f:
+ tmp = line.strip().split(',')
+ old_names[int(tmp[0])] = tmp[1]
+ new_names[int(tmp[0])] = tmp[2]
+ data = json.load(open(args.ann, 'r'))
+
+ cat_info = copy.deepcopy(data['categories'])
+
+ for x in cat_info:
+ if old_names[x['id']] != new_names[x['id']]:
+ print('Renaming', x['id'], x['name'], new_names[x['id']])
+ x['name'] = new_names[x['id']]
+
+ data['categories'] = cat_info
+ out_name = args.ann[:-5] + '_fixname.json'
+ print('Saving to', out_name)
+ json.dump(data, open(out_name, 'w'))
diff --git a/tools/dataset_converters/goldg2odvg.py b/tools/dataset_converters/goldg2odvg.py
new file mode 100644
index 00000000000..5267553da01
--- /dev/null
+++ b/tools/dataset_converters/goldg2odvg.py
@@ -0,0 +1,136 @@
+import argparse
+
+import jsonlines
+from pycocotools.coco import COCO
+from tqdm import tqdm
+
+
+def _has_only_empty_bbox(anno):
+ return all(any(o <= 1 for o in obj['bbox'][2:]) for obj in anno)
+
+
+def has_valid_annotation(anno):
+ # if it's empty, there is no annotation
+ if len(anno) == 0:
+ return False
+ # if all boxes have close to zero area, there is no annotation
+ if _has_only_empty_bbox(anno):
+ return False
+ return True
+
+
+def goldg2odvg(args):
+ coco = COCO(args.input)
+ ids = list(sorted(coco.imgs.keys()))
+
+ out_results = []
+ for img_id in tqdm(ids):
+ if isinstance(img_id, str):
+ ann_ids = coco.getAnnIds(imgIds=[img_id], iscrowd=0)
+ else:
+ ann_ids = coco.getAnnIds(imgIds=img_id, iscrowd=0)
+ annos = coco.loadAnns(ann_ids)
+ if not has_valid_annotation(annos):
+ continue
+
+ img_info = coco.loadImgs(img_id)[0]
+ file_name = img_info['file_name']
+ caption = img_info['caption']
+
+ regions = {}
+
+ for anno in annos:
+ box = anno['bbox']
+ tokens_positive = anno['tokens_positive']
+ x1, y1, w, h = box
+ inter_w = max(0, min(x1 + w, int(img_info['width'])) - max(x1, 0))
+ inter_h = max(0, min(y1 + h, int(img_info['height'])) - max(y1, 0))
+ if inter_w * inter_h == 0:
+ continue
+ if anno['area'] <= 0 or w < 1 or h < 1:
+ continue
+
+ if anno.get('iscrowd', False):
+ continue
+ bbox_xyxy = [
+ x1, y1,
+ min(x1 + w, int(img_info['width'])),
+ min(y1 + h, int(img_info['height']))
+ ]
+
+ tokens_positive = sorted(tokens_positive, key=lambda x: x[0])
+
+ phrase = []
+ pre_end_index = -10
+ for token in tokens_positive:
+ start_index = token[0]
+ end_index = token[1]
+ if pre_end_index + 1 == start_index:
+ if caption[token[0] - 1] == ' ':
+ phrase[
+ -1] = phrase[-1] + ' ' + caption[token[0]:token[1]]
+ else:
+ phrase.append(caption[token[0]:token[1]])
+ else:
+ phrase.append(caption[token[0]:token[1]])
+ pre_end_index = end_index
+
+ key = ' '.join(phrase)
+
+ if key not in regions:
+ regions[key] = {
+ 'bbox': bbox_xyxy,
+ 'phrase': phrase,
+ 'tokens_positive': tokens_positive
+ }
+ else:
+ old_box = regions[key]['bbox']
+ if isinstance(old_box[0], list):
+ old_box.append(bbox_xyxy)
+ else:
+ old_box = [old_box, bbox_xyxy]
+
+ regions[key]['bbox'] = old_box
+
+ out_dict = {
+ 'filename': file_name,
+ 'height': int(img_info['height']),
+ 'width': int(img_info['width']),
+ 'grounding': {
+ 'caption': caption
+ }
+ }
+
+ region_list = []
+ for key, value in regions.items():
+ phrase = value['phrase']
+ if len(phrase) == 1:
+ phrase = phrase[0]
+ region_list.append({
+ 'bbox': value['bbox'],
+ 'phrase': phrase,
+ 'tokens_positive': value['tokens_positive']
+ })
+ out_dict['grounding']['regions'] = region_list
+ out_results.append(out_dict)
+
+ if args.out_ann is None:
+ out_path = args.input[:-5] + '_vg.json'
+ else:
+ out_path = args.out_ann
+
+ with jsonlines.open(out_path, mode='w') as writer:
+ writer.write_all(out_results)
+ print(f'save to {out_path}')
+
+
+# goldg+: final_mixed_train_no_coco.json +
+# final_flickr_separateGT_train.json +
+# final_mixed_train_only_coco.json
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser('goldg to odvg format.', add_help=True)
+ parser.add_argument('input', type=str, help='input json file name')
+ parser.add_argument('--out-ann', '-o', type=str)
+ args = parser.parse_args()
+
+ goldg2odvg(args)
diff --git a/tools/dataset_converters/grit2odvg.py b/tools/dataset_converters/grit2odvg.py
new file mode 100644
index 00000000000..3d1c6d1a5e7
--- /dev/null
+++ b/tools/dataset_converters/grit2odvg.py
@@ -0,0 +1,189 @@
+import argparse
+import json
+import multiprocessing
+import os
+import os.path as osp
+
+import emoji
+import jsonlines
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+is_debug = False
+
+
+def is_valid_caption(caption, rules={'↙️', '[CLS]', '[SEP]'}):
+ check_anno = caption.strip(
+ )[:-1] # Remove the ending delimiter from the caption.
+ for ch in rules:
+ if ch in check_anno:
+ return False
+ return True
+
+
+def process_one_file(anno_file, result_queue):
+ print('processing', anno_file)
+ with open(anno_file, 'r') as f:
+ metas = json.load(f)
+
+ results = []
+ for meta in metas:
+ # print('============================')
+ file_name = meta['key'][0:5] + '/' + meta['key'] + '.jpg'
+ file_name = osp.join('images', file_name)
+
+ h = meta['height']
+ w = meta['width']
+
+ caption = meta['caption']
+ # Weird captions are filtered out from the beginning.
+ if not is_valid_caption(caption):
+ if is_debug:
+ print('=====caption filtered====', caption)
+ continue
+
+ # Captions exceeding 240 tokens are filtered out,
+ # where 240 is an empirical value.
+ tokenized = tokenizer([caption], return_tensors='pt')
+ if tokenized.input_ids.shape[1] >= 240:
+ if is_debug:
+ print('=====token filtered====', caption)
+ continue
+
+ ref_exps = meta['ref_exps']
+ ref_captions = [i[0:2] for i in ref_exps]
+ ref_token_positives = [i[0:2] for i in ref_exps]
+ ref_captions = [caption[int(i[0]):int(i[1])] for i in ref_captions]
+ ref_boxes = [i[2:6] for i in ref_exps]
+
+ regions = {}
+ for bbox, ref_caption, tokens_positive in zip(ref_boxes, ref_captions,
+ ref_token_positives):
+ # If the current reference includes special delimiters,
+ # it will be filtered out.
+ if not is_valid_caption(
+ caption, rules={'.', '?', ' ', "\'", "\""}):
+ if is_debug:
+ print('=====ref filtered====', caption)
+ continue
+ # If the current reference contains non-ASCII characters,
+ # it will be filtered out.
+ if not str.isascii(caption):
+ if is_debug:
+ print('=====ref filtered====', caption)
+ continue
+ # If the current reference includes non-ASCII characters,
+ # it will be filtered out.
+ if emoji.emoji_count(caption):
+ if is_debug:
+ print('=====ref filtered====', caption)
+ continue
+
+ box = [
+ round(bbox[0] * w, 3),
+ round(bbox[1] * h, 3),
+ round((bbox[2]) * w, 3),
+ round((bbox[3]) * h, 3)
+ ]
+ x1, y1, x2, y2 = box
+ inter_w = max(0, min(x1 + w, int(w)) - max(x1, 0))
+ inter_h = max(0, min(y1 + h, int(h)) - max(y1, 0))
+ if inter_w * inter_h == 0:
+ if is_debug:
+ print('=====wh filtered====', box)
+ continue
+ if w <= 1 or h <= 1:
+ if is_debug:
+ print('=====area filtered====', box)
+ continue
+
+ if ref_caption not in regions:
+ regions[ref_caption] = {
+ 'bbox':
+ box,
+ 'phrase':
+ ref_caption,
+ 'tokens_positive':
+ [[int(tokens_positive[0]),
+ int(tokens_positive[1])]],
+ }
+ else:
+ old_box = regions[ref_caption]['bbox']
+ if isinstance(old_box[0], list):
+ old_box.append(box)
+ else:
+ old_box = [old_box, box]
+ regions[ref_caption]['bbox'] = old_box
+
+ if len(regions) > 0:
+ print('caption: ', caption)
+ print('regions', regions)
+ else:
+ if is_debug:
+ print('caption: ', caption)
+ print('regions', regions)
+
+ if len(regions) == 0:
+ continue
+
+ out_dict = {
+ 'filename': file_name,
+ 'height': int(h),
+ 'width': int(w),
+ 'grounding': {
+ 'caption': caption
+ }
+ }
+
+ region_list = []
+ for key, value in regions.items():
+ phrase = value['phrase']
+ if len(phrase) == 1:
+ phrase = phrase[0]
+ region_list.append({
+ 'bbox': value['bbox'],
+ 'phrase': phrase,
+ 'tokens_positive': value['tokens_positive']
+ })
+ out_dict['grounding']['regions'] = region_list
+ print(out_dict)
+ results.append(out_dict)
+ result_queue.put(results)
+
+
+def grit2odvg(args):
+ annotations_dir = osp.join(args.data_root, 'annotations')
+ annos_files = [
+ osp.join(annotations_dir, anno) for anno in os.listdir(annotations_dir)
+ if anno.endswith('.json') and not anno.endswith('vg.json')
+ ]
+
+ annos_files = annos_files[:2]
+
+ manager = multiprocessing.Manager()
+ result_queue = manager.Queue()
+ pool = multiprocessing.Pool(processes=min(len(annos_files), 16))
+
+ for anno_file in annos_files:
+ pool.apply_async(process_one_file, args=(anno_file, result_queue))
+
+ pool.close()
+ pool.join()
+
+ out_datas = []
+ while not result_queue.empty():
+ out_datas.extend(result_queue.get())
+
+ out_path = osp.join(args.data_root, 'grit20m_vg.json')
+ with jsonlines.open(out_path, mode='w') as writer:
+ writer.write_all(out_datas)
+ print('save to ', out_path)
+ print('total img: ', len(out_datas))
+
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser('grit to odvg format.', add_help=True)
+ parser.add_argument('data_root', type=str, help='input dir name')
+ args = parser.parse_args()
+
+ grit2odvg(args)
diff --git a/tools/dataset_converters/grit_processing.py b/tools/dataset_converters/grit_processing.py
new file mode 100644
index 00000000000..ebf3791a80e
--- /dev/null
+++ b/tools/dataset_converters/grit_processing.py
@@ -0,0 +1,121 @@
+import argparse
+import json
+import logging
+import os
+import tarfile
+from functools import partial
+from multiprocessing import Pool
+
+
+def create_logger(output_file):
+ logger = logging.getLogger('grit_logger')
+ logger.setLevel(logging.INFO) # set logger output level
+ formatter = logging.Formatter('%(asctime)s - %(message)s')
+
+ fh = logging.FileHandler(output_file)
+ fh.setLevel(logging.INFO)
+ fh.setFormatter(formatter)
+
+ console = logging.StreamHandler()
+ console.setLevel(logging.INFO)
+
+ logger.addHandler(fh)
+ logger.addHandler(console)
+
+ return logger
+
+
+def count_download_image(download_json_dir, logger):
+ parquet_files = [
+ f for f in os.listdir(download_json_dir) if f.endswith('.json')
+ ]
+ len = 0
+
+ for file in parquet_files:
+ with open(os.path.join(download_json_dir, file), 'r') as f:
+ data = json.load(f)
+ len = len + int(data['successes'])
+ logger.info(file + 'has ' + str(data['successes']) +
+ ' successful images')
+
+ logger.info('all files finished.', str(len),
+ 'images have been successfully downloaded.')
+
+
+def tar_processing(tar_path, output_dir, logger):
+ filepath = untar(tar_path, logger)
+ json_files = [f for f in os.listdir(filepath) if f.endswith('.json')]
+ all_data = []
+ cnt = 0
+
+ for file in json_files:
+ with open(os.path.join(filepath, file), 'r') as f:
+ df = json.load(f)
+ cnt = cnt + 1
+ all_data.extend([df])
+ dir_name = os.path.basename(filepath)
+ # write all data to a json file
+ logger.info(f'{dir_name} has {cnt} jsons')
+ json_name = os.path.basename(filepath) + '.json'
+ if not os.path.exists(os.path.join(output_dir, 'annotations')):
+ os.mkdir(os.path.join(output_dir, 'annotations'))
+ with open(os.path.join(output_dir, 'annotations', json_name), 'w') as f:
+ json.dump(all_data, f)
+ logger.info(f'{dir_name} completed')
+ cp_rm(filepath, output_dir)
+ return os.path.basename(filepath)
+
+
+def untar(filepath, logger):
+ if tarfile.is_tarfile(filepath):
+ new_folder = os.path.splitext(filepath)[0]
+ tar_name = os.path.basename(filepath)
+ with tarfile.open(filepath) as tar:
+ members = tar.getmembers()
+ if not os.path.exists(new_folder):
+ os.mkdir(new_folder)
+ else:
+ f = os.listdir(new_folder)
+ if len(members) == len(f):
+ logger.info(f'{tar_name} already decompressed')
+ return new_folder
+ logger.info(f'{tar_name} decompressing...')
+ os.system(f'tar -xf {filepath} -C {new_folder}')
+ logger.info(f'{tar_name} decompressed!')
+ return new_folder
+
+
+def cp_rm(filepath, output_dir):
+ # delete txt/json
+ for file in os.listdir(filepath):
+ if file.endswith('.txt') or file.endswith('.json'):
+ os.remove(os.path.join(filepath, file))
+ # move images to output dir
+ target_dir = os.path.join(output_dir, 'images')
+ if not os.path.exists(os.path.join(output_dir, 'images')):
+ os.mkdir(os.path.join(output_dir, 'images'))
+ os.system('mv -f {} {}'.format(filepath, target_dir))
+
+
+def main(args):
+ logger = create_logger(args.log_name)
+ all_file_name = [
+ os.path.join(args.image_dir, file)
+ for file in os.listdir(args.image_dir) if file.endswith('.tar')
+ ]
+ all_file_name.sort()
+ func = partial(tar_processing, output_dir=args.output_dir, logger=logger)
+ with Pool(processes=args.num_process) as pool:
+ result = pool.imap(func=func, iterable=all_file_name) # noqa
+ # print(result)
+
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser()
+ parser.add_argument('image_dir', type=str) # grit raw directory
+ parser.add_argument('output_dir', type=str)
+ parser.add_argument('--num-process', default=10)
+ parser.add_argument('--log-name', type=str, default='grit_processing.log')
+ args = parser.parse_args()
+
+ main(args)
diff --git a/tools/dataset_converters/lvis2odvg.py b/tools/dataset_converters/lvis2odvg.py
new file mode 100644
index 00000000000..ce0c4381b35
--- /dev/null
+++ b/tools/dataset_converters/lvis2odvg.py
@@ -0,0 +1,98 @@
+import argparse
+import json
+import os.path
+
+import jsonlines
+from lvis import LVIS
+from tqdm import tqdm
+
+key_list_lvis = [i for i in range(1203)]
+val_list_lvis = [i for i in range(1, 1204)]
+
+
+def dump_lvis_label_map(args):
+ with open(args.input, 'r') as f:
+ j = json.load(f)
+ o_dict = {}
+ for category in j['categories']:
+ index = str(int(category['id']) - 1)
+ name = category['name']
+ o_dict[index] = name
+ if args.output is None:
+ output = os.path.dirname(args.input) + '/lvis_v1_label_map.json'
+ else:
+ output = os.path.dirname(args.output) + '/lvis_v1_label_map.json'
+ with open(output, 'w') as f:
+ json.dump(o_dict, f)
+
+
+def lvis2odvg(args):
+ lvis = LVIS(args.input)
+ cats = lvis.load_cats(lvis.get_cat_ids())
+ nms = {cat['id']: cat['name'] for cat in cats}
+ metas = []
+ if args.output is None:
+ out_path = args.input[:-5] + '_od.json'
+ else:
+ out_path = args.output
+
+ key_list = key_list_lvis
+ val_list = val_list_lvis
+ dump_lvis_label_map(args)
+
+ for img_id, img_info in tqdm(lvis.imgs.items()):
+ file_name = img_info['coco_url'].replace(
+ 'http://images.cocodataset.org/', '')
+ ann_ids = lvis.get_ann_ids(img_ids=[img_id])
+ raw_ann_info = lvis.load_anns(ann_ids)
+ instance_list = []
+ for ann in raw_ann_info:
+ if ann.get('ignore', False):
+ print(f'invalid ignore box of {ann}')
+ continue
+ x1, y1, w, h = ann['bbox']
+ inter_w = max(0, min(x1 + w, img_info['width']) - max(x1, 0))
+ inter_h = max(0, min(y1 + h, img_info['height']) - max(y1, 0))
+ if inter_w * inter_h == 0:
+ print(f'invalid wh box of {ann}')
+ continue
+ if ann['area'] <= 0 or w < 1 or h < 1:
+ print(f'invalid area box of {ann}, '
+ f'w={img_info["width"]}, h={img_info["height"]}')
+ continue
+
+ if ann.get('iscrowd', False):
+ print(f'invalid iscrowd box of {ann}')
+ continue
+
+ bbox_xyxy = [x1, y1, x1 + w, y1 + h]
+ label = ann['category_id']
+ category = nms[label]
+ ind = val_list.index(label)
+ label_trans = key_list[ind]
+ instance_list.append({
+ 'bbox': bbox_xyxy,
+ 'label': label_trans,
+ 'category': category
+ })
+ metas.append({
+ 'filename': file_name,
+ 'height': img_info['height'],
+ 'width': img_info['width'],
+ 'detection': {
+ 'instances': instance_list
+ }
+ })
+
+ with jsonlines.open(out_path, mode='w') as writer:
+ writer.write_all(metas)
+
+ print('save to {}'.format(out_path))
+
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser('lvis to odvg format.', add_help=True)
+ parser.add_argument('input', type=str, help='input list name')
+ parser.add_argument('--output', '-o', type=str, help='input list name')
+ args = parser.parse_args()
+ lvis2odvg(args)
diff --git a/tools/dataset_converters/lvis2ovd.py b/tools/dataset_converters/lvis2ovd.py
new file mode 100644
index 00000000000..3405bf3ad4f
--- /dev/null
+++ b/tools/dataset_converters/lvis2ovd.py
@@ -0,0 +1,41 @@
+import argparse
+import json
+import os.path
+
+import jsonlines
+
+
+def lvis2ovd(args):
+ ann_path = os.path.join(args.data_root, 'annotations/')
+
+ lvis = json.load(open(ann_path + 'lvis_v1_val.json'))
+ base_class_ids = [
+ cat['id'] - 1 for cat in lvis['categories'] if cat['frequency'] != 'r'
+ ]
+
+ with open(ann_path + 'lvis_v1_train_od.json') as f:
+ data = [json.loads(d) for d in f]
+ for i in range(len(data)):
+ instance = [
+ inst for inst in data[i]['detection']['instances']
+ if inst['label'] in base_class_ids
+ ]
+ data[i]['detection']['instances'] = instance
+ with jsonlines.open(
+ ann_path + 'lvis_v1_train_od_norare.json', mode='w') as writer:
+ writer.write_all(data)
+
+ label_map = json.load(open(ann_path + 'lvis_v1_label_map.json'))
+ label_map = {
+ k: v
+ for k, v in label_map.items() if int(k) in base_class_ids
+ }
+ json.dump(label_map, open(ann_path + 'lvis_v1_label_map_norare.json', 'w'))
+
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser('lvis to ovd format.', add_help=True)
+ parser.add_argument('data_root', type=str, help='coco root path')
+ args = parser.parse_args()
+
+ lvis2ovd(args)
diff --git a/tools/dataset_converters/openimages2odvg.py b/tools/dataset_converters/openimages2odvg.py
new file mode 100644
index 00000000000..d700a4146a3
--- /dev/null
+++ b/tools/dataset_converters/openimages2odvg.py
@@ -0,0 +1,187 @@
+import argparse
+import copy
+import csv
+import json
+import os.path as osp
+
+import jsonlines
+from mmcv.image import imfrombytes
+from mmengine.fileio import get
+
+
+def _parse_label_file(label_file):
+ index_list = []
+ classes_names = []
+ with open(label_file, 'r') as f:
+ reader = csv.reader(f)
+ for line in reader:
+ classes_names.append(line[1])
+ index_list.append(line[0])
+ index_mapping = {index: i for i, index in enumerate(index_list)}
+ return classes_names, index_mapping
+
+
+# backend_args = dict(
+# backend='petrel',
+# path_mapping=dict({
+# './data/': 's3://openmmlab/datasets/detection/',
+# 'data/': 's3://openmmlab/datasets/detection/'
+# }))
+backend_args = None
+
+
+def oi2odvg(args):
+ ann_file = osp.join(args.input_dir, 'oidv6-train-annotations-bbox.csv')
+ label_file = osp.join(args.input_dir, 'class-descriptions-boxable.csv')
+
+ classes_names, index_mapping = _parse_label_file(label_file)
+
+ label_map = {}
+ for class_name, idx in index_mapping.items():
+ class_name = classes_names[idx]
+ label_map[str(idx)] = class_name
+
+ if args.out_ann is None:
+ output = osp.join(args.input_dir, 'openimages_label_map.json')
+ else:
+ output = osp.join(
+ osp.dirname(args.out_ann), 'openimages_label_map.json')
+ with open(output, 'w') as f:
+ json.dump(label_map, f)
+
+ metas = []
+ skip_count = 0
+ with open(ann_file, 'r') as f:
+ reader = csv.reader(f)
+ last_img_id = None
+ _filename_shape = [0, 0]
+ instances = []
+ for i, line in enumerate(reader):
+ if i == 0:
+ continue
+ img_id = line[0]
+ if last_img_id is None:
+ last_img_id = img_id
+ label_id = line[2]
+
+ filename = f'{img_id}.jpg'
+ label = index_mapping[label_id]
+ category = label_map[str(label)]
+ bbox = [
+ float(line[4]), # xmin
+ float(line[6]), # ymin
+ float(line[5]), # xmax
+ float(line[7]) # ymax
+ ]
+
+ # is_occluded = True if int(line[8]) == 1 else False
+ # is_truncated = True if int(line[9]) == 1 else False
+ is_group_of = True if int(line[10]) == 1 else False
+ # is_depiction = True if int(line[11]) == 1 else False
+ # is_inside = True if int(line[12]) == 1 else False
+
+ # if any([is_occluded, is_truncated, is_group_of,
+ # is_depiction, is_inside]):
+ if is_group_of:
+ print(f'skip {filename} of one instance')
+ skip_count += 1
+ continue
+
+ # denormalize
+ if filename != _filename_shape[0]:
+ if args.img_prefix is not None:
+ _filename = osp.join(
+ osp.dirname(args.input_dir), args.img_prefix, filename)
+ else:
+ _filename = osp.join(osp.dirname(args.input_dir), filename)
+ img_bytes = get(_filename, backend_args)
+ img = imfrombytes(img_bytes, flag='color')
+ shape = img.shape
+ _filename_shape = [filename, shape]
+ else:
+ shape = _filename_shape[1]
+
+ h, w = shape[:2]
+ bbox = [
+ max(bbox[0] * w, 0),
+ max(bbox[1] * h, 0),
+ min(bbox[2] * w, w),
+ min(bbox[3] * h, h)
+ ]
+
+ x1, y1, x2, y2 = bbox
+ inter_w = max(0, min(x2, w) - max(x1, 0))
+ inter_h = max(0, min(y2, h) - max(y1, 0))
+ if inter_w * inter_h == 0:
+ continue
+ if w < 1 or h < 1:
+ continue
+
+ instance = {
+ 'filename': filename,
+ 'height': h,
+ 'width': w,
+ 'bbox': bbox,
+ 'label': label,
+ 'category': category
+ }
+
+ if img_id != last_img_id:
+ copy_instances = copy.deepcopy(instances)
+ for copy_instance in copy_instances:
+ _filename = copy_instance.pop('filename')
+ _h = copy_instance.pop('height')
+ _w = copy_instance.pop('width')
+
+ meta_ifo = {
+ 'filename': _filename,
+ 'height': _h,
+ 'width': _w,
+ 'detection': {
+ 'instances': copy_instances
+ }
+ }
+ metas.append(meta_ifo)
+ instances = []
+ instances.append(instance)
+ last_img_id = img_id
+
+ for instance in instances:
+ _filename = instance.pop('filename')
+ _h = instance.pop('height')
+ _w = instance.pop('width')
+ meta_ifo = {
+ 'filename': _filename,
+ 'height': _h,
+ 'width': _w,
+ 'detection': {
+ 'instances': instances
+ }
+ }
+ metas.append(meta_ifo)
+
+ if args.out_ann is None:
+ out_path = osp.join(args.input_dir, 'oidv6-train-annotations_od.json')
+ else:
+ out_path = args.out_ann
+
+ with jsonlines.open(out_path, mode='w') as writer:
+ writer.write_all(metas)
+
+ print('skip {} instances'.format(skip_count))
+ print('save to {}'.format(out_path))
+
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser(
+ 'openimages to odvg format.', add_help=True)
+ parser.add_argument(
+ '--input-dir',
+ default='data/OpenImages/annotations',
+ type=str,
+ help='input list name')
+ parser.add_argument('--img-prefix', default='OpenImages/train/')
+ parser.add_argument('--out-ann', '-o', type=str)
+ args = parser.parse_args()
+
+ oi2odvg(args)
diff --git a/tools/dataset_converters/refcoco2odvg.py b/tools/dataset_converters/refcoco2odvg.py
new file mode 100644
index 00000000000..c11869b3855
--- /dev/null
+++ b/tools/dataset_converters/refcoco2odvg.py
@@ -0,0 +1,147 @@
+import argparse
+import os.path as osp
+
+import jsonlines
+from pycocotools.coco import COCO
+from tqdm import tqdm
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='refcoco to odvg')
+ parser.add_argument('mdetr_anno_dir', type=str)
+ parser.add_argument('--out-dir', '-o', type=str)
+ args = parser.parse_args()
+ return args
+
+
+def _has_only_empty_bbox(anno):
+ return all(any(o <= 1 for o in obj['bbox'][2:]) for obj in anno)
+
+
+def has_valid_annotation(anno):
+ # if it's empty, there is no annotation
+ if len(anno) == 0:
+ return False
+ # if all boxes have close to zero area, there is no annotation
+ if _has_only_empty_bbox(anno):
+ return False
+ return True
+
+
+def process_item(args, filename):
+ path = osp.join(args.mdetr_anno_dir, filename)
+ coco = COCO(path)
+
+ ids = list(sorted(coco.imgs.keys()))
+
+ out_results = []
+ for img_id in tqdm(ids):
+ if isinstance(img_id, str):
+ ann_ids = coco.getAnnIds(imgIds=[img_id], iscrowd=0)
+ else:
+ ann_ids = coco.getAnnIds(imgIds=img_id, iscrowd=0)
+ annos = coco.loadAnns(ann_ids)
+ if not has_valid_annotation(annos):
+ continue
+
+ img_info = coco.loadImgs(img_id)[0]
+ file_name = img_info['file_name']
+ caption = img_info['caption']
+
+ regions = {}
+
+ for anno in annos:
+ box = anno['bbox']
+ tokens_positive = anno['tokens_positive']
+ x1, y1, w, h = box
+ inter_w = max(0, min(x1 + w, int(img_info['width'])) - max(x1, 0))
+ inter_h = max(0, min(y1 + h, int(img_info['height'])) - max(y1, 0))
+ if inter_w * inter_h == 0:
+ continue
+ if anno['area'] <= 0 or w < 1 or h < 1:
+ continue
+
+ if anno.get('iscrowd', False):
+ continue
+ bbox_xyxy = [
+ x1, y1,
+ min(x1 + w, int(img_info['width'])),
+ min(y1 + h, int(img_info['height']))
+ ]
+
+ tokens_positive = sorted(tokens_positive, key=lambda x: x[0])
+
+ phrase = []
+ pre_end_index = -10
+ for token in tokens_positive:
+ start_index = token[0]
+ end_index = token[1]
+ if pre_end_index + 1 == start_index:
+ if caption[token[0] - 1] == ' ':
+ phrase[
+ -1] = phrase[-1] + ' ' + caption[token[0]:token[1]]
+ else:
+ phrase.append(caption[token[0]:token[1]])
+ else:
+ phrase.append(caption[token[0]:token[1]])
+ pre_end_index = end_index
+
+ key = ' '.join(phrase)
+
+ if key not in regions:
+ regions[key] = {
+ 'bbox': bbox_xyxy,
+ 'phrase': phrase,
+ 'tokens_positive': tokens_positive
+ }
+ else:
+ old_box = regions[key]['bbox']
+ if isinstance(old_box[0], list):
+ old_box.append(bbox_xyxy)
+ else:
+ old_box = [old_box, bbox_xyxy]
+
+ regions[key]['bbox'] = old_box
+
+ out_dict = {
+ 'filename': file_name,
+ 'height': int(img_info['height']),
+ 'width': int(img_info['width']),
+ 'grounding': {
+ 'caption': caption
+ }
+ }
+
+ region_list = []
+ for key, value in regions.items():
+ phrase = value['phrase']
+ if len(phrase) == 1:
+ phrase = phrase[0]
+ region_list.append({
+ 'bbox': value['bbox'],
+ 'phrase': phrase,
+ 'tokens_positive': value['tokens_positive']
+ })
+ out_dict['grounding']['regions'] = region_list
+ out_results.append(out_dict)
+
+ if args.out_dir is None:
+ out_path = osp.join(args.mdetr_anno_dir, filename[:-5] + '_vg.json')
+ else:
+ out_path = osp.join(args.out_dir, filename[:-5] + '_vg.json')
+
+ with jsonlines.open(out_path, mode='w') as writer:
+ writer.write_all(out_results)
+ print(f'save to {out_path}')
+
+
+def main():
+ args = parse_args()
+ process_item(args, 'finetune_refcoco_train.json')
+ process_item(args, 'finetune_refcoco+_train.json')
+ process_item(args, 'finetune_refcocog_train.json')
+ process_item(args, 'finetune_grefcoco_train.json')
+
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/dataset_converters/remove_cocotrain2017_from_refcoco.py b/tools/dataset_converters/remove_cocotrain2017_from_refcoco.py
new file mode 100644
index 00000000000..7de2a9ec4e2
--- /dev/null
+++ b/tools/dataset_converters/remove_cocotrain2017_from_refcoco.py
@@ -0,0 +1,110 @@
+import argparse
+import json
+import os.path as osp
+
+import mmengine
+from pycocotools.coco import COCO
+
+
+def diff_image_id(coco2017_train_ids, ref_ids):
+ set1 = set(coco2017_train_ids)
+ set2 = set(ref_ids)
+ intersection = set1.intersection(set2)
+ result = set1 - intersection
+ return result
+
+
+def gen_new_json(coco2017_train_path, json_data, coco2017_train_ids):
+ coco = COCO(coco2017_train_path)
+ new_json_data = {
+ 'info': json_data['info'],
+ 'licenses': json_data['licenses'],
+ 'categories': json_data['categories'],
+ 'images': [],
+ 'annotations': []
+ }
+
+ for id in coco2017_train_ids:
+ ann_ids = coco.getAnnIds(imgIds=[id])
+ img_ann_info = coco.loadAnns(ann_ids)
+ img_info = coco.loadImgs([id])[0]
+
+ new_json_data['images'].append(img_info)
+ new_json_data['annotations'].extend(img_ann_info)
+ return new_json_data
+
+
+# coco2017 val and final_mixed_train.json have no intersection,
+# so deduplication is not necessary.
+
+# coco2017 val and datasets like refcoco based on coco2014 train
+# have no intersection, so deduplication is not necessary.
+
+
+# coco2017 train and datasets like refcoco based on coco2014
+# train have overlapping annotations in the validation set,
+# so deduplication is required.
+def exclude_coco(args):
+ with open(args.coco2017_train, 'r') as f:
+ coco2017_train = json.load(f)
+ coco2017_train_ids = [train['id'] for train in coco2017_train['images']]
+ orig_len = len(coco2017_train_ids)
+
+ with open(osp.join(args.mdetr_anno_dir, 'finetune_refcoco_val.json'),
+ 'r') as f:
+ refcoco_ann = json.load(f)
+ refcoco_ids = [refcoco['original_id'] for refcoco in refcoco_ann['images']]
+ coco2017_train_ids = diff_image_id(coco2017_train_ids, refcoco_ids)
+
+ with open(
+ osp.join(args.mdetr_anno_dir, 'finetune_refcoco+_val.json'),
+ 'r') as f:
+ refcoco_plus_ann = json.load(f)
+ refcoco_plus_ids = [
+ refcoco['original_id'] for refcoco in refcoco_plus_ann['images']
+ ]
+ coco2017_train_ids = diff_image_id(coco2017_train_ids, refcoco_plus_ids)
+
+ with open(
+ osp.join(args.mdetr_anno_dir, 'finetune_refcocog_val.json'),
+ 'r') as f:
+ refcocog_ann = json.load(f)
+ refcocog_ids = [
+ refcoco['original_id'] for refcoco in refcocog_ann['images']
+ ]
+ coco2017_train_ids = diff_image_id(coco2017_train_ids, refcocog_ids)
+
+ with open(
+ osp.join(args.mdetr_anno_dir, 'finetune_grefcoco_val.json'),
+ 'r') as f:
+ grefcoco_ann = json.load(f)
+ grefcoco_ids = [
+ refcoco['original_id'] for refcoco in grefcoco_ann['images']
+ ]
+ coco2017_train_ids = diff_image_id(coco2017_train_ids, grefcoco_ids)
+
+ coco2017_train_ids = list(coco2017_train_ids)
+ print(
+ 'remove {} images from coco2017_train'.format(orig_len -
+ len(coco2017_train_ids)))
+
+ new_json_data = gen_new_json(args.coco2017_train, coco2017_train,
+ coco2017_train_ids)
+ if args.out_ann is None:
+ out_ann = osp.dirname(
+ args.coco2017_train) + '/instances_train2017_norefval.json'
+ mmengine.dump(new_json_data, out_ann)
+ print('save new json to {}'.format(out_ann))
+ else:
+ mmengine.dump(new_json_data, args.out_ann)
+ print('save new json to {}'.format(args.out_ann))
+
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser('coco to odvg format.', add_help=True)
+ parser.add_argument('mdetr_anno_dir', type=str)
+ parser.add_argument('coco2017_train', type=str)
+ parser.add_argument('--out-ann', '-o', type=str)
+ args = parser.parse_args()
+
+ exclude_coco(args)
diff --git a/tools/dataset_converters/zhiyuan_objv2_train_names_fix.csv b/tools/dataset_converters/zhiyuan_objv2_train_names_fix.csv
new file mode 100644
index 00000000000..33b0aa946c6
--- /dev/null
+++ b/tools/dataset_converters/zhiyuan_objv2_train_names_fix.csv
@@ -0,0 +1,365 @@
+1,Person,Person
+2,Sneakers,Sneakers
+3,Chair,Chair
+4,Other Shoes,Other Shoes
+5,Hat,Hat
+6,Car,Car
+7,Lamp,Lamp
+8,Glasses,Glasses
+9,Bottle,Bottle
+10,Desk,Desk
+11,Cup,Cup
+12,Street Lights,Street Lights
+13,Cabinet/shelf,Cabinet/shelf
+14,Handbag/Satchel,Handbag/Satchel
+15,Bracelet,Bracelet
+16,Plate,Plate
+17,Picture/Frame,Picture/Frame
+18,Helmet,Helmet
+19,Book,Book
+20,Gloves,Gloves
+21,Storage box,Storage box
+22,Boat,Boat
+23,Leather Shoes,Leather Shoes
+24,Flower,Flower
+25,Bench,Bench
+26,Potted Plant,Potted Plant
+27,Bowl/Basin,Bowl/Basin
+28,Flag,Flag
+29,Pillow,Pillow
+30,Boots,Boots
+31,Vase,Vase
+32,Microphone,Microphone
+33,Necklace,Necklace
+34,Ring,Ring
+35,SUV,SUV
+36,Wine Glass,Wine Glass
+37,Belt,Belt
+38,Moniter/TV,Monitor/TV
+39,Backpack,Backpack
+40,Umbrella,Umbrella
+41,Traffic Light,Traffic Light
+42,Speaker,Speaker
+43,Watch,Watch
+44,Tie,Tie
+45,Trash bin Can,Trash bin Can
+46,Slippers,Slippers
+47,Bicycle,Bicycle
+48,Stool,Stool
+49,Barrel/bucket,Barrel/bucket
+50,Van,Van
+51,Couch,Couch
+52,Sandals,Sandals
+53,Bakset,Basket
+54,Drum,Drum
+55,Pen/Pencil,Pen/Pencil
+56,Bus,Bus
+57,Wild Bird,Wild Bird
+58,High Heels,High Heels
+59,Motorcycle,Motorcycle
+60,Guitar,Guitar
+61,Carpet,Carpet
+62,Cell Phone,Cell Phone
+63,Bread,Bread
+64,Camera,Camera
+65,Canned,Canned
+66,Truck,Truck
+67,Traffic cone,Traffic cone
+68,Cymbal,Cymbal
+69,Lifesaver,Lifesaver
+70,Towel,Towel
+71,Stuffed Toy,Stuffed Toy
+72,Candle,Candle
+73,Sailboat,Sailboat
+74,Laptop,Laptop
+75,Awning,Awning
+76,Bed,Bed
+77,Faucet,Faucet
+78,Tent,Tent
+79,Horse,Horse
+80,Mirror,Mirror
+81,Power outlet,Power outlet
+82,Sink,Sink
+83,Apple,Apple
+84,Air Conditioner,Air Conditioner
+85,Knife,Knife
+86,Hockey Stick,Hockey Stick
+87,Paddle,Paddle
+88,Pickup Truck,Pickup Truck
+89,Fork,Fork
+90,Traffic Sign,Traffic Sign
+91,Ballon,Ballon
+92,Tripod,Tripod
+93,Dog,Dog
+94,Spoon,Spoon
+95,Clock,Clock
+96,Pot,Pot
+97,Cow,Cow
+98,Cake,Cake
+99,Dinning Table,Dining Table
+100,Sheep,Sheep
+101,Hanger,Hanger
+102,Blackboard/Whiteboard,Blackboard/Whiteboard
+103,Napkin,Napkin
+104,Other Fish,Other Fish
+105,Orange/Tangerine,Orange/Tangerine
+106,Toiletry,Toiletry
+107,Keyboard,Keyboard
+108,Tomato,Tomato
+109,Lantern,Lantern
+110,Machinery Vehicle,Machinery Vehicle
+111,Fan,Fan
+112,Green Vegetables,Green Vegetables
+113,Banana,Banana
+114,Baseball Glove,Baseball Glove
+115,Airplane,Airplane
+116,Mouse,Mouse
+117,Train,Train
+118,Pumpkin,Pumpkin
+119,Soccer,Soccer
+120,Skiboard,Skiboard
+121,Luggage,Luggage
+122,Nightstand,Nightstand
+123,Tea pot,Teapot
+124,Telephone,Telephone
+125,Trolley,Trolley
+126,Head Phone,Head Phone
+127,Sports Car,Sports Car
+128,Stop Sign,Stop Sign
+129,Dessert,Dessert
+130,Scooter,Scooter
+131,Stroller,Stroller
+132,Crane,Crane
+133,Remote,Remote
+134,Refrigerator,Refrigerator
+135,Oven,Oven
+136,Lemon,Lemon
+137,Duck,Duck
+138,Baseball Bat,Baseball Bat
+139,Surveillance Camera,Surveillance Camera
+140,Cat,Cat
+141,Jug,Jug
+142,Broccoli,Broccoli
+143,Piano,Piano
+144,Pizza,Pizza
+145,Elephant,Elephant
+146,Skateboard,Skateboard
+147,Surfboard,Surfboard
+148,Gun,Gun
+149,Skating and Skiing shoes,Skating and Skiing shoes
+150,Gas stove,Gas stove
+151,Donut,Donut
+152,Bow Tie,Bow Tie
+153,Carrot,Carrot
+154,Toilet,Toilet
+155,Kite,Kite
+156,Strawberry,Strawberry
+157,Other Balls,Other Balls
+158,Shovel,Shovel
+159,Pepper,Pepper
+160,Computer Box,Computer Box
+161,Toilet Paper,Toilet Paper
+162,Cleaning Products,Cleaning Products
+163,Chopsticks,Chopsticks
+164,Microwave,Microwave
+165,Pigeon,Pigeon
+166,Baseball,Baseball
+167,Cutting/chopping Board,Cutting/chopping Board
+168,Coffee Table,Coffee Table
+169,Side Table,Side Table
+170,Scissors,Scissors
+171,Marker,Marker
+172,Pie,Pie
+173,Ladder,Ladder
+174,Snowboard,Snowboard
+175,Cookies,Cookies
+176,Radiator,Radiator
+177,Fire Hydrant,Fire Hydrant
+178,Basketball,Basketball
+179,Zebra,Zebra
+180,Grape,Grape
+181,Giraffe,Giraffe
+182,Potato,Potato
+183,Sausage,Sausage
+184,Tricycle,Tricycle
+185,Violin,Violin
+186,Egg,Egg
+187,Fire Extinguisher,Fire Extinguisher
+188,Candy,Candy
+189,Fire Truck,Fire Truck
+190,Billards,Billiards
+191,Converter,Converter
+192,Bathtub,Bathtub
+193,Wheelchair,Wheelchair
+194,Golf Club,Golf Club
+195,Briefcase,Briefcase
+196,Cucumber,Cucumber
+197,Cigar/Cigarette,Cigar/Cigarette
+198,Paint Brush,Paint Brush
+199,Pear,Pear
+200,Heavy Truck,Heavy Truck
+201,Hamburger,Hamburger
+202,Extractor,Extractor
+203,Extention Cord,Extension Cord
+204,Tong,Tong
+205,Tennis Racket,Tennis Racket
+206,Folder,Folder
+207,American Football,American Football
+208,earphone,earphone
+209,Mask,Mask
+210,Kettle,Kettle
+211,Tennis,Tennis
+212,Ship,Ship
+213,Swing,Swing
+214,Coffee Machine,Coffee Machine
+215,Slide,Slide
+216,Carriage,Carriage
+217,Onion,Onion
+218,Green beans,Green beans
+219,Projector,Projector
+220,Frisbee,Frisbee
+221,Washing Machine/Drying Machine,Washing Machine/Drying Machine
+222,Chicken,Chicken
+223,Printer,Printer
+224,Watermelon,Watermelon
+225,Saxophone,Saxophone
+226,Tissue,Tissue
+227,Toothbrush,Toothbrush
+228,Ice cream,Ice cream
+229,Hotair ballon,Hot air balloon
+230,Cello,Cello
+231,French Fries,French Fries
+232,Scale,Scale
+233,Trophy,Trophy
+234,Cabbage,Cabbage
+235,Hot dog,Hot dog
+236,Blender,Blender
+237,Peach,Peach
+238,Rice,Rice
+239,Wallet/Purse,Wallet/Purse
+240,Volleyball,Volleyball
+241,Deer,Deer
+242,Goose,Goose
+243,Tape,Tape
+244,Tablet,Tablet
+245,Cosmetics,Cosmetics
+246,Trumpet,Trumpet
+247,Pineapple,Pineapple
+248,Golf Ball,Golf Ball
+249,Ambulance,Ambulance
+250,Parking meter,Parking meter
+251,Mango,Mango
+252,Key,Key
+253,Hurdle,Hurdle
+254,Fishing Rod,Fishing Rod
+255,Medal,Medal
+256,Flute,Flute
+257,Brush,Brush
+258,Penguin,Penguin
+259,Megaphone,Megaphone
+260,Corn,Corn
+261,Lettuce,Lettuce
+262,Garlic,Garlic
+263,Swan,Swan
+264,Helicopter,Helicopter
+265,Green Onion,Green Onion
+266,Sandwich,Sandwich
+267,Nuts,Nuts
+268,Speed Limit Sign,Speed Limit Sign
+269,Induction Cooker,Induction Cooker
+270,Broom,Broom
+271,Trombone,Trombone
+272,Plum,Plum
+273,Rickshaw,Rickshaw
+274,Goldfish,Goldfish
+275,Kiwi fruit,Kiwi fruit
+276,Router/modem,Router/modem
+277,Poker Card,Poker Card
+278,Toaster,Toaster
+279,Shrimp,Shrimp
+280,Sushi,Sushi
+281,Cheese,Cheese
+282,Notepaper,Notepaper
+283,Cherry,Cherry
+284,Pliers,Pliers
+285,CD,CD
+286,Pasta,Pasta
+287,Hammer,Hammer
+288,Cue,Cue
+289,Avocado,Avocado
+290,Hamimelon,Hami melon
+291,Flask,Flask
+292,Mushroon,Mushroom
+293,Screwdriver,Screwdriver
+294,Soap,Soap
+295,Recorder,Recorder
+296,Bear,Bear
+297,Eggplant,Eggplant
+298,Board Eraser,Board Eraser
+299,Coconut,Coconut
+300,Tape Measur/ Ruler,Tape Measure/ Ruler
+301,Pig,Pig
+302,Showerhead,Showerhead
+303,Globe,Globe
+304,Chips,Chips
+305,Steak,Steak
+306,Crosswalk Sign,Crosswalk Sign
+307,Stapler,Stapler
+308,Campel,Camel
+309,Formula 1,Formula 1
+310,Pomegranate,Pomegranate
+311,Dishwasher,Dishwasher
+312,Crab,Crab
+313,Hoverboard,Hoverboard
+314,Meat ball,Meatball
+315,Rice Cooker,Rice Cooker
+316,Tuba,Tuba
+317,Calculator,Calculator
+318,Papaya,Papaya
+319,Antelope,Antelope
+320,Parrot,Parrot
+321,Seal,Seal
+322,Buttefly,Butterfly
+323,Dumbbell,Dumbbell
+324,Donkey,Donkey
+325,Lion,Lion
+326,Urinal,Urinal
+327,Dolphin,Dolphin
+328,Electric Drill,Electric Drill
+329,Hair Dryer,Hair Dryer
+330,Egg tart,Egg tart
+331,Jellyfish,Jellyfish
+332,Treadmill,Treadmill
+333,Lighter,Lighter
+334,Grapefruit,Grapefruit
+335,Game board,Game board
+336,Mop,Mop
+337,Radish,Radish
+338,Baozi,Baozi
+339,Target,Target
+340,French,French
+341,Spring Rolls,Spring Rolls
+342,Monkey,Monkey
+343,Rabbit,Rabbit
+344,Pencil Case,Pencil Case
+345,Yak,Yak
+346,Red Cabbage,Red Cabbage
+347,Binoculars,Binoculars
+348,Asparagus,Asparagus
+349,Barbell,Barbell
+350,Scallop,Scallop
+351,Noddles,Noddles
+352,Comb,Comb
+353,Dumpling,Dumpling
+354,Oyster,Oyster
+355,Table Teniis paddle,Table Tennis paddle
+356,Cosmetics Brush/Eyeliner Pencil,Cosmetics Brush/Eyeliner Pencil
+357,Chainsaw,Chainsaw
+358,Eraser,Eraser
+359,Lobster,Lobster
+360,Durian,Durian
+361,Okra,Okra
+362,Lipstick,Lipstick
+363,Cosmetics Mirror,Cosmetics Mirror
+364,Curling,Curling
+365,Table Tennis,Table Tennis
diff --git a/tools/misc/split_odvg.py b/tools/misc/split_odvg.py
new file mode 100644
index 00000000000..37fae909859
--- /dev/null
+++ b/tools/misc/split_odvg.py
@@ -0,0 +1,80 @@
+import argparse
+import json
+import os
+import shutil
+
+import jsonlines
+import numpy as np
+from mmengine.utils import ProgressBar, mkdir_or_exist
+
+
+def parse_args():
+ parser = argparse.ArgumentParser()
+ parser.add_argument('data_root', type=str, help='The data root.')
+ parser.add_argument('ann_file', type=str)
+ parser.add_argument('img_prefix', type=str)
+ parser.add_argument(
+ 'out_dir',
+ type=str,
+ help='The output directory of coco semi-supervised annotations.')
+ parser.add_argument(
+ '--label-map-file', '-m', type=str, help='label map file')
+ parser.add_argument(
+ '--num-img',
+ '-n',
+ default=200,
+ type=int,
+ help='num of extract image, -1 means all images')
+ parser.add_argument('--seed', default=-1, type=int, help='seed')
+ args = parser.parse_args()
+ return args
+
+
+def main():
+ args = parse_args()
+ assert args.out_dir != args.data_root, \
+ 'The file will be overwritten in place, ' \
+ 'so the same folder is not allowed !'
+
+ seed = int(args.seed)
+ if seed != -1:
+ print(f'Set the global seed: {seed}')
+ np.random.seed(int(args.seed))
+
+ ann_file = os.path.join(args.data_root, args.ann_file)
+ with open(ann_file, 'r') as f:
+ data_list = [json.loads(line) for line in f]
+
+ np.random.shuffle(data_list)
+
+ num_img = args.num_img
+
+ progress_bar = ProgressBar(num_img)
+ for i in range(num_img):
+ file_name = data_list[i]['filename']
+ image_path = os.path.join(args.data_root, args.img_prefix, file_name)
+ out_image_dir = os.path.join(args.out_dir, args.img_prefix)
+ mkdir_or_exist(out_image_dir)
+ out_image_path = os.path.join(out_image_dir, file_name)
+ shutil.copyfile(image_path, out_image_path)
+
+ progress_bar.update()
+
+ out_path = os.path.join(args.out_dir, args.ann_file)
+ out_dir = os.path.dirname(out_path)
+ mkdir_or_exist(out_dir)
+
+ with jsonlines.open(out_path, mode='w') as writer:
+ writer.write_all(data_list[:num_img])
+
+ if args.label_map_file is not None:
+ out_dir = os.path.dirname(
+ os.path.join(args.out_dir, args.label_map_file))
+ mkdir_or_exist(out_dir)
+ shutil.copyfile(
+ os.path.join(args.data_root, args.label_map_file),
+ os.path.join(args.out_dir, args.label_map_file))
+
+
+if __name__ == '__main__':
+ main()