Refine annotations for DAB-DETR Transformer (facebookresearch#61)

* refine annos
* fix
* refine dn annos
* refine README
* refine CondDETR README
* refine README
* add detr image
* refine
* refine
* refine links
* refine links

Co-authored-by: ntianhe ren <rentianhe@dgx061.scc.idea>
niqbal996 · Sep 15, 2022 · f986a0f
1 parent fd8825f
commit f986a0f
Showing 6 changed files with 94 additions and 42 deletions.
diff --git a/README.md b/README.md
@@ -20,7 +20,7 @@
 [📘Documentation]() |
 [🛠️Installation]() |
 [👀Model Zoo]() |
-[🚀Awesome DETR](https://github.com/IDEACVR/awesome-detection-transformer) |
+[🚀Awesome DETR](https://github.com/IDEA-Research/awesome-detection-transformer) |
 [🆕News]() |
 [🤔Reporting Issues](https://github.com/rentainhe/detrex/issues/new/choose)
 
@@ -29,6 +29,9 @@
 
 detrex is an open-source toolbox that provides state-of-the-art transformer based detection algorithms on top of [Detectron2](https://github.com/facebookresearch/detectron2) and the module designs are partially borrowed from [MMDetection](https://github.com/open-mmlab/mmdetection) and [DETR](https://github.com/facebookresearch/detr). Many thanks for their nicely organized code. The main branch works with **Pytorch 1.9+** or higher (we recommend **Pytorch 1.12**).
 
+<div align="center">
+  <img src="./assets/detr_arch.png" width="100%"/>
+</div>
 
 <details open>
 <summary> Major Features </summary>
@@ -41,7 +44,7 @@ detrex is an open-source toolbox that provides state-of-the-art transformer base
   - [LazyConfig System](https://detectron2.readthedocs.io/en/latest/tutorials/lazyconfigs.html) for more flexible syntax and cleaner config files.
   - Light-weight [training engine](./tools/train_net.py) modified from detectron2 [lazyconfig_train_net.py](https://github.com/facebookresearch/detectron2/blob/main/tools/lazyconfig_train_net.py)
 
-Apart from detrex, we also released a repo [Awesome Detection Transformer](https://github.com/IDEACVR/awesome-detection-transformer) to present papers about transformer for detection and segmentation.
+Apart from detrex, we also released a repo [Awesome Detection Transformer](https://github.com/IDEA-Research/awesome-detection-transformer) to present papers about transformer for detection and segmentation.
 
 </details>
 
@@ -59,25 +62,28 @@ Please refer to [Getting Started with detrex]() for the basic usage of detrex.
 Please see [documentation]() for full API documentation and tutorials.
 
 ## Model Zoo
-Results and models are available in [model zoo]()
+Results and models are available in [model zoo]().
 
 <details open>
 <summary> Supported methods </summary>
 
-- [x] [DETR](./projects/detr/)
-- [x] [Deformable-DETR](./projects/dab_deformable_detr/)
-- [x] [Conditional DETR]()
-- [x] [DAB-DETR](./projects/dab_detr/)
-- [x] [DAB-Deformable-DETR](./projects/dab_deformable_detr/)
-- [x] [DN-DETR](./projects/dn_detr/)
-- [x] [DN-Deformable-DETR](./projects/dn_deformable_detr/)
-- [x] [DINO](./projects/dino/)
+- [x] [DETR (ECCV'2020)](./projects/detr/)
+- [x] [Deformable-DETR (ICLR'2021)](./projects/dab_deformable_detr/)
+- [x] [Conditional DETR (ICCV'2021)](./projects/conditional_detr/)
+- [x] [DAB-DETR (ICLR'2022)](./projects/dab_detr/)
+- [x] [DAB-Deformable-DETR (ICLR'2022)](./projects/dab_deformable_detr/)
+- [x] [DN-DETR (CVPR'2022)](./projects/dn_detr/)
+- [x] [DN-Deformable-DETR (CVPR'2022)](./projects/dn_deformable_detr/)
+- [x] [DINO (ArXiv'2022)](./projects/dino/)
 
 Please see [projects](./projects/) for the details about projects that are built based on detrex.
 
+</details>
+
+
 ## Change Log
 
-The beta v0.1.0 version was released in 30/09/2022. Highlights of the released version:
+The **beta v0.1.0** version was released on 30/09/2022. Highlights of the released version:
 - Support various backbones including: [FocalNet](https://arxiv.org/abs/2203.11926), [Swin-T](https://arxiv.org/pdf/2103.14030.pdf), [ResNet](https://arxiv.org/abs/1512.03385) and other [detectron2 builtin backbones](https://github.com/facebookresearch/detectron2/tree/main/detectron2/modeling/backbone).
 - Add [timm](https://github.com/rwightman/pytorch-image-models) backbones wrapper and [torchvision](https://github.com/pytorch/vision) backbones wrapper.
 - Support various transformer based detection algorithms including: [DETR](https://arxiv.org/abs/2005.12872), [Deformable-DETR](https://arxiv.org/abs/2010.04159), [Conditional-DETR](https://arxiv.org/abs/2108.06152), [DAB-DETR](https://arxiv.org/abs/2201.12329), [DN-DETR](https://arxiv.org/abs/2203.01305), [DINO](https://arxiv.org/abs/2203.03605).
@@ -96,3 +102,4 @@ This project is released under the [Apache 2.0 license](LICENSE).
 
 
 ## Citation
+If you find this project useful in your research, please consider citing:
diff --git a/assets/detr_arch.png b/assets/detr_arch.png
diff --git a/projects/conditional_detr/README.md b/projects/conditional_detr/README.md
@@ -19,15 +19,15 @@ Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun
 <th valign="bottom">download</th>
 <!-- TABLE BODY -->
 <!-- ROW: dab_detr_r50_50ep -->
- <tr><td align="left"><a href="configs/dab_detr_r50_50ep.py">Conditional DETR-R50</a></td>
+ <tr><td align="left"><a href="configs/conditional_detr_r50_50ep.py">Conditional DETR-R50</a></td>
 <td align="center">R-50</td>
 <td align="center">IN1k</td>
-<td align="center">43.2</td>
-<td align="center"> <a href="">Google Drive</a></td>
+<td align="center">41.0</td>
+<td align="center"> <a href="">model</a></td>
 </tr>
 </tbody></table>
 
-**Note:** DC5 means using dilated convolution in `res5`.
+**Note:** Here we borrowed the pretrained weights from [ConditionalDETR](https://github.com/Atten4Vis/ConditionalDETR). Our detrex training results will be released in a future version.
 
 
 ## Training

diff --git a/projects/dab_detr/modeling/dab_detr.py b/projects/dab_detr/modeling/dab_detr.py
@@ -1,7 +1,7 @@
 # coding=utf-8
 # Copyright 2022 The IDEA Authors. All rights reserved.
 #
-# Licensed under the Apache License, Version 2.0 (the "License");
+# Licensed under the Apache License, Version 2.0 (the "License");  
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #

diff --git a/projects/dab_detr/modeling/dab_transformer.py b/projects/dab_detr/modeling/dab_transformer.py
@@ -177,16 +177,16 @@ def forward(
         attn_masks=None,
         query_key_padding_mask=None,
         key_padding_mask=None,
-        refpoints_embed=None,
+        anchor_box_embed=None,
         **kwargs,
     ):
         intermediate = []
 
-        reference_points = refpoints_embed.sigmoid()
-        refpoints = [reference_points]
+        reference_boxes = anchor_box_embed.sigmoid()
+        intermediate_ref_boxes = [reference_boxes]
 
         for idx, layer in enumerate(self.layers):
-            obj_center = reference_points[..., : self.embed_dim]
+            obj_center = reference_boxes[..., : self.embed_dim]
             query_sine_embed = get_sine_pos_embed(obj_center)
             query_pos = self.ref_point_head(query_sine_embed)
 
@@ -222,15 +222,16 @@ def forward(
                 **kwargs,
             )
 
-            # iter update
+            # update anchor boxes after each decoder layer using shared box head.
             if self.bbox_embed is not None:
-                temp = self.bbox_embed(query)
-                temp[..., : self.embed_dim] += inverse_sigmoid(reference_points)
-                new_reference_points = temp[..., : self.embed_dim].sigmoid()
+                # predict offsets and add them to the input normalized anchor boxes.
+                offsets = self.bbox_embed(query)
+                offsets[..., : self.embed_dim] += inverse_sigmoid(reference_boxes)
+                new_reference_boxes = offsets[..., : self.embed_dim].sigmoid()
 
                 if idx != self.num_layers - 1:
-                    refpoints.append(new_reference_points)
-                reference_points = new_reference_points.detach()
+                    intermediate_ref_boxes.append(new_reference_boxes)
+                reference_boxes = new_reference_boxes.detach()
 
             if self.return_intermediate:
                 if self.post_norm_layer is not None:
@@ -248,12 +249,12 @@ def forward(
             if self.bbox_embed is not None:
                 return [
                     torch.stack(intermediate).transpose(1, 2),
-                    torch.stack(refpoints).transpose(1, 2),
+                    torch.stack(intermediate_ref_boxes).transpose(1, 2),
                 ]
             else:
                 return [
                     torch.stack(intermediate).transpose(1, 2),
-                    reference_points.unsqueeze(0).transpose(1, 2),
+                    reference_boxes.unsqueeze(0).transpose(1, 2),
                 ]
 
         return query.unsqueeze(0)
@@ -273,11 +274,11 @@ def init_weights(self):
             if p.dim() > 1:
                 nn.init.xavier_uniform_(p)
 
-    def forward(self, x, mask, refpoints_embed, pos_embed):
+    def forward(self, x, mask, anchor_box_embed, pos_embed):
         bs, c, h, w = x.shape
-        x = x.view(bs, c, -1).permute(2, 0, 1)
+        x = x.view(bs, c, -1).permute(2, 0, 1)  # (h * w, bs, c)
         pos_embed = pos_embed.view(bs, c, -1).permute(2, 0, 1)
-        refpoints_embed = refpoints_embed.unsqueeze(1).repeat(1, bs, 1)
+        anchor_box_embed = anchor_box_embed.unsqueeze(1).repeat(1, bs, 1)
         mask = mask.view(bs, -1)
         memory = self.encoder(
             query=x,
@@ -286,15 +287,15 @@ def forward(self, x, mask, refpoints_embed, pos_embed):
             query_pos=pos_embed,
             query_key_padding_mask=mask,
         )
-        num_queries = refpoints_embed.shape[0]
-        target = torch.zeros(num_queries, bs, self.embed_dim, device=refpoints_embed.device)
+        num_queries = anchor_box_embed.shape[0]
+        target = torch.zeros(num_queries, bs, self.embed_dim, device=anchor_box_embed.device)
 
         hidden_state, reference_boxes = self.decoder(
             query=target,
             key=memory,
             value=memory,
             key_pos=pos_embed,
-            refpoints_embed=refpoints_embed,
+            anchor_box_embed=anchor_box_embed,
         )
 
         return hidden_state, reference_boxes
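
Review note: the renamed decoder loop above refines each anchor box by adding the predicted offsets in logit space and squashing the result back into `[0, 1]` with a sigmoid. The sketch below illustrates that update on a single box using plain Python floats instead of tensors. The helper names `sigmoid` and `refine_anchor_box`, the `eps` clamp, and the example offsets are assumptions made for illustration; they are not taken from the detrex code.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Map a logit to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p: float, eps: float = 1e-5) -> float:
    """Map a normalized coordinate back to logit space.

    The eps clamp (an assumption of this sketch) avoids log(0) at the borders.
    """
    p = min(max(p, 0.0), 1.0)
    return math.log(max(p, eps) / max(1.0 - p, eps))


def refine_anchor_box(reference_box: Sequence[float], offsets: Sequence[float]) -> List[float]:
    """Add per-coordinate offsets in logit space, then renormalize with sigmoid."""
    return [sigmoid(inverse_sigmoid(r) + d) for r, d in zip(reference_box, offsets)]


# One normalized (x, y, w, h) anchor box refined by two hypothetical decoder layers.
box = [0.5, 0.5, 0.2, 0.2]
for layer_offsets in ([0.1, -0.1, 0.0, 0.3], [0.0, 0.0, 0.2, -0.1]):
    box = refine_anchor_box(box, layer_offsets)
```

Working in logit space keeps every refined coordinate strictly inside `(0, 1)` regardless of the offset magnitude, which is presumably why the loop applies `inverse_sigmoid` before adding the offsets.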
diff --git a/projects/dn_detr/modeling/dn_detr.py b/projects/dn_detr/modeling/dn_detr.py
@@ -12,12 +12,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-# ------------------------------------------------------------------------------------------------
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-# ------------------------------------------------------------------------------------------------
-# Modified from:
-# https://github.com/facebookresearch/detr/blob/main/d2/detr/detr.py
-# ------------------------------------------------------------------------------------------------
 
 import math
 from typing import List
@@ -33,6 +27,35 @@
 
 
 class DNDETR(nn.Module):
+    """Implement DN-DETR in `DN-DETR: Accelerate DETR Training by Introducing Query DeNoising
+    <https://arxiv.org/abs/2203.01305>`_
+    
+    Args:
+        backbone (nn.Module): Backbone module for feature extraction.
+        in_features (List[str]): Selected backbone output features for transformer module.
+        in_channels (int): Dimension of the last feature in `in_features`.
+        position_embedding (nn.Module): Position encoding layer for generating position embeddings.
+        transformer (nn.Module): Transformer module used for further processing features and input queries.
+        embed_dim (int): Hidden dimension for transformer module.
+        num_classes (int): Number of total categories.
+        num_queries (int): Number of proposal dynamic anchor boxes in Transformer.
+        criterion (nn.Module): Criterion for calculating the total losses.
+        aux_loss (bool): Whether to calculate auxiliary loss in criterion. Default: True.
+        pixel_mean (List[float]): Pixel mean value for image normalization. 
+            Default: [123.675, 116.280, 103.530].
+        pixel_std (List[float]): Pixel std value for image normalization.
+            Default: [58.395, 57.120, 57.375].
+        freeze_anchor_box_centers (bool): If True, freeze the center param ``(x, y)`` for the initialized dynamic anchor boxes
+            in format ``(x, y, w, h)`` and only train ``(w, h)``. Default: True.
+        select_box_nums_for_evaluation (int): Select the top-k confidence predicted boxes for inference.
+            Default: 300.
+        denoising_groups (int): Number of groups for noised ground truths. Default: 5.
+        label_noise_prob (float): The probability of the label being noised. Default: 0.2.
+        box_noise_scale (float): Scaling factor for box noising. Default: 0.4.
+        with_indicator (bool): If True, add indicator in denoising queries part and matching queries part. 
+            Default: True.
+        device (str): Training device. Default: "cuda".
+    """
     def __init__(
         self,
         backbone: nn.Module,
@@ -134,7 +157,28 @@ def init_weights(self):
         nn.init.constant_(self.bbox_embed.layers[-1].bias.data, 0)
 
     def forward(self, batched_inputs):
+        """Forward function of `DN-DETR` which expects a list of dict as inputs.
 
+        Args:
+            batched_inputs (List[dict]): A list of instance dicts; each instance dict must consist of:
+                - dict["image"] (torch.Tensor): The unnormalized image tensor.
+                - dict["height"] (int): The original image height.
+                - dict["width"] (int): The original image width.
+                - dict["instance"] (detectron2.structures.Instances): Image meta information and ground truth boxes and labels during training.
+                    Please refer to https://detectron2.readthedocs.io/en/latest/modules/structures.html#detectron2.structures.Instances
+                    for the basic usage of Instances.
+        
+        Returns:
+            dict: Returns a dict with the following elements:
+                - dict["pred_logits"]: The classification logits for all queries (anchor boxes in DAB-DETR),
+                    with shape ``[batch_size, num_queries, num_classes]``.
+                - dict["pred_boxes"]: The normalized box coordinates for all queries in format
+                    ``(x, y, w, h)``. These values are normalized in [0, 1] relative to the size of
+                    each individual image (disregarding possible padding). See PostProcess for information
+                    on how to retrieve the unnormalized bounding box.
+                - dict["aux_outputs"]: Optional, only returned when auxiliary losses are activated. It is a list of
+                    dictionaries containing the two above keys for each decoder layer.
+        """
         images = self.preprocess_image(batched_inputs)
 
         if self.training:
@@ -147,7 +191,7 @@ def forward(self, batched_inputs):
             batch_size, _, H, W = images.tensor.shape
             img_masks = images.tensor.new_zeros(batch_size, H, W)
 
-        # only use last level feature in DAB-DETR
+        # only use the last level feature, as in DAB-DETR
         features = self.backbone(images.tensor)[self.in_features[-1]]
         features = self.input_proj(features)
         img_masks = F.interpolate(img_masks[None], size=features.shape[-2:]).to(torch.bool)[0]
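
Review note: the new docstring lists default `pixel_mean` and `pixel_std` values for image normalization. As a minimal sketch, assuming the usual per-channel `(value - mean) / std` normalization, the effect on a single pixel looks like this. The function `normalize_pixel` is hypothetical and works on plain Python floats; it does not reproduce the model's `preprocess_image`, whose body is not shown in this diff.

```python
from typing import List, Sequence

# Default values quoted in the DNDETR docstring above.
PIXEL_MEAN = [123.675, 116.280, 103.530]
PIXEL_STD = [58.395, 57.120, 57.375]


def normalize_pixel(
    pixel: Sequence[float],
    mean: Sequence[float] = PIXEL_MEAN,
    std: Sequence[float] = PIXEL_STD,
) -> List[float]:
    """Subtract the per-channel mean and divide by the per-channel std."""
    return [(value - m) / s for value, m, s in zip(pixel, mean, std)]


# A pixel equal to the channel means maps to zero in every channel.
centered = normalize_pixel(PIXEL_MEAN)
```

This matches the docstring's description of `dict["image"]` as an unnormalized tensor: normalization happens inside the model rather than in the data loader.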