Skip to content

Files

Latest commit

c448646 · Mar 31, 2023

History

History

maskformer

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
Mar 31, 2023
Dec 1, 2022
Feb 1, 2023
Dec 1, 2022
Dec 1, 2022
Mar 17, 2023

MaskFormer

MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation

Introduction

Official Repo

Code Snippet

Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

Usage

pip install "mmdet>=3.0.0rc4"

Results and models

ADE20K

Method Backbone Crop Size Lr schd Mem (GB) Inf time (fps) Device mIoU mIoU(ms+flip) config download
MaskFormer R-50-D32 512x512 160000 3.29 A100 42.20 44.29 - config model | log
MaskFormer R-101-D32 512x512 160000 4.12 A100 34.90 45.11 - config model | log
MaskFormer Swin-T 512x512 160000 3.73 A100 40.53 46.69 - config model | log
MaskFormer Swin-S 512x512 160000 5.33 A100 26.98 49.36 - config model | log

Note:

  • All experiments of MaskFormer are implemented with 8 V100 (32G) GPUs with 2 samplers per GPU.
  • The results of MaskFormer are relatively not stable. The accuracy (mIoU) of model with R-101-D32 is from 44.7 to 46.0, and with Swin-S is from 49.0 to 49.8.
  • The ResNet backbones utilized in MaskFormer models are standard ResNet rather than ResNetV1c.
  • Test time augmentation is not supported in MMSegmentation 1.x version yet, we would add "ms+flip" results as soon as possible.

Citation

@article{cheng2021per,
  title={Per-pixel classification is not all you need for semantic segmentation},
  author={Cheng, Bowen and Schwing, Alex and Kirillov, Alexander},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  pages={17864--17875},
  year={2021}
}