[Feature] Official implementation of SETR (#531)

* Adjust vision transformer backbone architectures; * Add DropPath, trunc_normal_ for VisionTransformer implementation; * Add class token buring intermediate period and remove it during final period; * Fix some parameters loss bug; * * Store intermediate token features and impose no processes on them; * Remove class token and reshape entire token feature from NLC to NCHW; * Fix some doc error * Add a arg for VisionTransformer backbone to control if input class token into transformer; * Add stochastic depth decay rule for DropPath; * * Fix output bug when input_cls_token=False; * Add related unit test; * Re-implement of SETR * Add two head -- SETRUPHead (Naive, PUP) & SETRMLAHead (MLA); * * Modify some docs of heads of SETR; * Add MLA auxiliary head of SETR; * * Modify some arg of setr heads; * Add unit test for setr heads; * * Add 768x768 cityscapes dataset config; * Add Backbone: SETR -- Backbone: MLA, PUP, Naive; * Add SETR cityscapes training & testing config; * * Fix the low code coverage of unit test about heads of setr; * Remove some rebundant error capture; * * Add pascal context dataset & ade20k dataset config; * Modify auxiliary head relative config; * Modify folder structure. * add setr * modify vit * Fix the test_cfg arg position; * Fix some learning schedule bug; * optimize setr code * Add arg: final_reshape to control if converting output feature information from NLC to NCHW; * Fix the default value of final_reshape; * Modify arg: final_reshape to arg: out_shape; * Fix some unit test bug; * Add MLA neck; * Modify setr configs to add MLA neck; * Modify MLA decode head to remove rebundant structure; * Remove some rebundant files. * * Fix the code style bug; * Remove some rebundant files; * Modify some unit tests of SETR; * Ignoring CityscapesCoarseDataset and MapillaryDataset. * Fix the activation function loss bug; * Fix the img_size bug of SETR_PUP_ADE20K * * Fix the lint bug of transformers.py; * Add mla neck unit test; * Convert vit of setr out shape from NLC to NCHW. * * Modify Resize action of data pipeline; * Fix deit related bug; * Set find_unused_parameters=False for pascal context dataset; * Remove arg: find_unused_parameters which is False by default. * Error auxiliary head of PUP deit * Remove the minimal restrict of slide inference. * Modify doc string of Resize * Seperate this part of code to a new PR #544 * * Remove some rebundant codes; * Modify unit tests of SETR heads; * Fix the tuple in_channels of mla_deit. * Modify code style * Move detailed definition of auxiliary head into model config dict; * Add some setr config for default cityscapes.py; * Fix the doc string of SETR head; * Modify implementation of SETR Heads * Remove setr aux head and use fcn head to replace it; * Remove arg: img_size and remove last interpolate op of heads; * Rename arg: conv3x3_conv1x1 to kernel_size of SETRUPHead; * non-square input support for setr heads * Modify config argument for above commits * Remove norm_layer argument of SETRMLAHead * Add mla_align_corners for MLAModule interpolate * [Refactor]Refactor of SETRMLAHead * Modify Head implementation; * Modify Head unit test; * Modify related config file; * [Refactor]MLA Neck * Fix config bug * [Refactor]SETR Naive Head and SETR PUP Head * [Fix]Fix the lack of arg: act_cfg and arg: norm_cfg * Fix config error * Refactor of SETR MLA, Naive, PUP heads. * Modify some attribute name of SETR Heads. * Modify setr configs to adapt new vit code. * Fix trunc_normal_ bug * Parameters init adjustment. * Remove redundant doc string of SETRUPHead * Fix pretrained bug * [Fix] Fix vit init bug * Add some vit unit tests * Modify module import * Remove norm from PatchEmbed * Fix pretrain weights bug * Modify pretrained judge * Fix some gradient backward bugs. * Add some unit tests to improve code cov * Fix init_weights of setr up head * Add DropPath in FFN * Finish benchmark of SETR 1. Add benchmark information into README.MD of SETR; 2. Fix some name bugs of vit; * Remove DropPath implementation and use DropPath from mmcv. * Modify out_indices arg * Fix out_indices bug. * Remove cityscapes base dataset config. Co-authored-by: sennnnn <201730271412@mail.scut.edu.cn> Co-authored-by: CuttlefishXuan <zhaoxinxuan1997@gmail.com>
open-mmlab · Jun 23, 2021 · ec91893 · ec91893
1 parent 3e70d93
commit ec91893
Show file tree

Hide file tree

Showing 21 changed files with 914 additions and 101 deletions.
diff --git a/configs/_base_/models/setr_mla.py b/configs/_base_/models/setr_mla.py
@@ -0,0 +1,96 @@
+# model settings
+backbone_norm_cfg = dict(type='LN', eps=1e-6, requires_grad=True)
+norm_cfg = dict(type='SyncBN', requires_grad=True)
+model = dict(
+    type='EncoderDecoder',
+    pretrained=\
+    'https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_large_p16_384-b3be5167.pth',  # noqa
+    backbone=dict(
+        type='VisionTransformer',
+        img_size=(768, 768),
+        patch_size=16,
+        in_channels=3,
+        embed_dims=1024,
+        num_layers=24,
+        num_heads=16,
+        out_indices=(5, 11, 17, 23),
+        drop_rate=0.1,
+        norm_cfg=backbone_norm_cfg,
+        with_cls_token=False,
+        interpolate_mode='bilinear',
+    ),
+    neck=dict(
+        type='MLANeck',
+        in_channels=[1024, 1024, 1024, 1024],
+        out_channels=256,
+        norm_cfg=norm_cfg,
+        act_cfg=dict(type='ReLU'),
+    ),
+    decode_head=dict(
+        type='SETRMLAHead',
+        in_channels=(256, 256, 256, 256),
+        channels=512,
+        in_index=(0, 1, 2, 3),
+        dropout_ratio=0,
+        mla_channels=128,
+        num_classes=19,
+        norm_cfg=norm_cfg,
+        align_corners=False,
+        loss_decode=dict(
+            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
+    auxiliary_head=[
+        dict(
+            type='FCNHead',
+            in_channels=256,
+            channels=256,
+            in_index=0,
+            dropout_ratio=0,
+            num_convs=0,
+            kernel_size=1,
+            concat_input=False,
+            num_classes=19,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+        dict(
+            type='FCNHead',
+            in_channels=256,
+            channels=256,
+            in_index=1,
+            dropout_ratio=0,
+            num_convs=0,
+            kernel_size=1,
+            concat_input=False,
+            num_classes=19,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+        dict(
+            type='FCNHead',
+            in_channels=256,
+            channels=256,
+            in_index=2,
+            dropout_ratio=0,
+            num_convs=0,
+            kernel_size=1,
+            concat_input=False,
+            num_classes=19,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+        dict(
+            type='FCNHead',
+            in_channels=256,
+            channels=256,
+            in_index=3,
+            dropout_ratio=0,
+            num_convs=0,
+            kernel_size=1,
+            concat_input=False,
+            num_classes=19,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+    ],
+    train_cfg=dict(),
+    test_cfg=dict(mode='whole'))
diff --git a/configs/_base_/models/setr_naive.py b/configs/_base_/models/setr_naive.py
@@ -0,0 +1,81 @@
+# model settings
+backbone_norm_cfg = dict(type='LN', eps=1e-6, requires_grad=True)
+norm_cfg = dict(type='SyncBN', requires_grad=True)
+model = dict(
+    type='EncoderDecoder',
+    pretrained=\
+    'https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_large_p16_384-b3be5167.pth',  # noqa
+    backbone=dict(
+        type='VisionTransformer',
+        img_size=(768, 768),
+        patch_size=16,
+        in_channels=3,
+        embed_dims=1024,
+        num_layers=24,
+        num_heads=16,
+        out_indices=(9, 14, 19, 23),
+        drop_rate=0.1,
+        norm_cfg=backbone_norm_cfg,
+        with_cls_token=True,
+        interpolate_mode='bilinear',
+    ),
+    decode_head=dict(
+        type='SETRUPHead',
+        in_channels=1024,
+        channels=256,
+        in_index=3,
+        num_classes=19,
+        dropout_ratio=0,
+        norm_cfg=norm_cfg,
+        num_convs=1,
+        up_scale=4,
+        kernel_size=1,
+        align_corners=False,
+        loss_decode=dict(
+            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
+    auxiliary_head=[
+        dict(
+            type='SETRUPHead',
+            in_channels=1024,
+            channels=256,
+            in_index=0,
+            num_classes=19,
+            dropout_ratio=0,
+            norm_cfg=norm_cfg,
+            num_convs=1,
+            up_scale=4,
+            kernel_size=1,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+        dict(
+            type='SETRUPHead',
+            in_channels=1024,
+            channels=256,
+            in_index=1,
+            num_classes=19,
+            dropout_ratio=0,
+            norm_cfg=norm_cfg,
+            num_convs=1,
+            up_scale=4,
+            kernel_size=1,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+        dict(
+            type='SETRUPHead',
+            in_channels=1024,
+            channels=256,
+            in_index=2,
+            num_classes=19,
+            dropout_ratio=0,
+            norm_cfg=norm_cfg,
+            num_convs=1,
+            up_scale=4,
+            kernel_size=1,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4))
+    ],
+    train_cfg=dict(),
+    test_cfg=dict(mode='whole'))
diff --git a/configs/_base_/models/setr_pup.py b/configs/_base_/models/setr_pup.py
@@ -0,0 +1,81 @@
+# model settings
+backbone_norm_cfg = dict(type='LN', eps=1e-6, requires_grad=True)
+norm_cfg = dict(type='SyncBN', requires_grad=True)
+model = dict(
+    type='EncoderDecoder',
+    pretrained=\
+    'https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_large_p16_384-b3be5167.pth',  # noqa
+    backbone=dict(
+        type='VisionTransformer',
+        img_size=(768, 768),
+        patch_size=16,
+        in_channels=3,
+        embed_dims=1024,
+        num_layers=24,
+        num_heads=16,
+        out_indices=(9, 14, 19, 23),
+        drop_rate=0.1,
+        norm_cfg=backbone_norm_cfg,
+        with_cls_token=True,
+        interpolate_mode='bilinear',
+    ),
+    decode_head=dict(
+        type='SETRUPHead',
+        in_channels=1024,
+        channels=256,
+        in_index=3,
+        num_classes=19,
+        dropout_ratio=0,
+        norm_cfg=norm_cfg,
+        num_convs=4,
+        up_scale=2,
+        kernel_size=3,
+        align_corners=False,
+        loss_decode=dict(
+            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
+    auxiliary_head=[
+        dict(
+            type='SETRUPHead',
+            in_channels=1024,
+            channels=256,
+            in_index=0,
+            num_classes=19,
+            dropout_ratio=0,
+            norm_cfg=norm_cfg,
+            num_convs=1,
+            up_scale=4,
+            kernel_size=3,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+        dict(
+            type='SETRUPHead',
+            in_channels=1024,
+            channels=256,
+            in_index=1,
+            num_classes=19,
+            dropout_ratio=0,
+            norm_cfg=norm_cfg,
+            num_convs=1,
+            up_scale=4,
+            kernel_size=3,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+        dict(
+            type='SETRUPHead',
+            in_channels=1024,
+            channels=256,
+            in_index=2,
+            num_classes=19,
+            dropout_ratio=0,
+            norm_cfg=norm_cfg,
+            num_convs=1,
+            up_scale=4,
+            kernel_size=3,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+    ],
+    train_cfg=dict(),
+    test_cfg=dict(mode='whole'))
diff --git a/configs/setr/README.md b/configs/setr/README.md
@@ -0,0 +1,25 @@
+# Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
+
+## Introduction
+
+<!-- [ALGORITHM] -->
+
+```latex
+@article{zheng2020rethinking,
+  title={Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers},
+  author={Zheng, Sixiao and Lu, Jiachen and Zhao, Hengshuang and Zhu, Xiatian and Luo, Zekun and Wang, Yabiao and Fu, Yanwei and Feng, Jianfeng and Xiang, Tao and Torr, Philip HS and others},
+  journal={arXiv preprint arXiv:2012.15840},
+  year={2020}
+}
+```
+
+## Results and models
+
+### ADE20K
+
+| Method | Backbone | Crop Size | Batch Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU  | mIoU(ms+flip) | config                                                                                                                          | download                                                                                                                                                                                                                                                                                                                                                     |
+| ------ | -------- | --------- | ---------- | ------- | -------- | -------------- | ----- | ------------: | ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| SETR-Naive | ViT-L | 512x512  | 16          | 160000   | 18.40        | 4.72              | 48.28 |             49.56 | [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/setr/setr_naive_512x512_160k_b16_ade20k.py)  | [model](https://download.openmmlab.com/mmsegmentation/v0.5/setr/setr_naive_512x512_160k_b16_ade20k/setr_naive_512x512_160k_b16_ade20k_20210619_191258-061f24f5.pth) &#124; [log](https://download.openmmlab.com/mmsegmentation/v0.5/setr/setr_naive_512x512_160k_b16_ade20k/setr_naive_512x512_160k_b16_ade20k_20210619_191258.log.json)     |
+| SETR-PUP | ViT-L | 512x512  | 16          | 160000   | 19.54        | 4.50              | 48.24 |             49.99 | [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/setr/setr_pup_512x512_160k_b16_ade20k.py)  | [model](https://download.openmmlab.com/mmsegmentation/v0.5/setr/setr_pup_512x512_160k_b16_ade20k/setr_pup_512x512_160k_b16_ade20k_20210619_191343-7e0ce826.pth) &#124; [log](https://download.openmmlab.com/mmsegmentation/v0.5/setr/setr_pup_512x512_160k_b16_ade20k/setr_pup_512x512_160k_b16_ade20k_20210619_191343.log.json)     |
+| SETR-MLA | ViT-L | 512x512  | 8           | 160000   | 10.96        | -              | 47.34 |             49.05 | [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/setr/setr_mla_512x512_160k_b8_ade20k.py)  | [model](https://download.openmmlab.com/mmsegmentation/v0.5/setr/setr_mla_512x512_160k_b8_ade20k/setr_mla_512x512_160k_b8_ade20k_20210619_191118-c6d21df0.pth) &#124; [log](https://download.openmmlab.com/mmsegmentation/v0.5/setr/setr_mla_512x512_160k_b8_ade20k/setr_mla_512x512_160k_b8_ade20k_20210619_191118.log.json)     |
+| SETR-MLA | ViT-L | 512x512  | 16          | 160000   | 17.30        | 5.25              | 47.54 |             49.37 | [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/setr/setr_mla_512x512_160k_b16_ade20k.py)  | [model](https://download.openmmlab.com/mmsegmentation/v0.5/setr/setr_mla_512x512_160k_b16_ade20k/setr_mla_512x512_160k_b16_ade20k_20210619_191057-f9741de7.pth) &#124; [log](https://download.openmmlab.com/mmsegmentation/v0.5/setr/setr_mla_512x512_160k_b16_ade20k/setr_mla_512x512_160k_b16_ade20k_20210619_191057.log.json)     |
diff --git a/configs/setr/setr_mla_512x512_160k_b16_ade20k.py b/configs/setr/setr_mla_512x512_160k_b16_ade20k.py
@@ -0,0 +1,4 @@
+_base_ = ['./setr_mla_512x512_160k_b8_ade20k.py']
+
+# num_gpus: 8 -> batch_size: 16
+data = dict(samples_per_gpu=2)
diff --git a/configs/setr/setr_mla_512x512_160k_b8_ade20k.py b/configs/setr/setr_mla_512x512_160k_b8_ade20k.py
@@ -0,0 +1,80 @@
+_base_ = [
+    '../_base_/models/setr_mla.py', '../_base_/datasets/ade20k.py',
+    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
+]
+norm_cfg = dict(type='SyncBN', requires_grad=True)
+model = dict(
+    backbone=dict(img_size=(512, 512), drop_rate=0.),
+    decode_head=dict(num_classes=150),
+    auxiliary_head=[
+        dict(
+            type='FCNHead',
+            in_channels=256,
+            channels=256,
+            in_index=0,
+            dropout_ratio=0,
+            norm_cfg=norm_cfg,
+            act_cfg=dict(type='ReLU'),
+            num_convs=0,
+            kernel_size=1,
+            concat_input=False,
+            num_classes=150,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+        dict(
+            type='FCNHead',
+            in_channels=256,
+            channels=256,
+            in_index=1,
+            dropout_ratio=0,
+            norm_cfg=norm_cfg,
+            act_cfg=dict(type='ReLU'),
+            num_convs=0,
+            kernel_size=1,
+            concat_input=False,
+            num_classes=150,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+        dict(
+            type='FCNHead',
+            in_channels=256,
+            channels=256,
+            in_index=2,
+            dropout_ratio=0,
+            norm_cfg=norm_cfg,
+            act_cfg=dict(type='ReLU'),
+            num_convs=0,
+            kernel_size=1,
+            concat_input=False,
+            num_classes=150,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+        dict(
+            type='FCNHead',
+            in_channels=256,
+            channels=256,
+            in_index=3,
+            dropout_ratio=0,
+            norm_cfg=norm_cfg,
+            act_cfg=dict(type='ReLU'),
+            num_convs=0,
+            kernel_size=1,
+            concat_input=False,
+            num_classes=150,
+            align_corners=False,
+            loss_decode=dict(
+                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
+    ],
+    test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)),
+)
+
+optimizer = dict(
+    lr=0.001,
+    weight_decay=0.0,
+    paramwise_cfg=dict(custom_keys={'head': dict(lr_mult=10.)}))
+
+# num_gpus: 8 -> batch_size: 8
+data = dict(samples_per_gpu=1)