[Docs] Add abstract and image for every paper. (open-mmlab#546)
mzr1996 authored Nov 24, 2021
1 parent c1af106 commit 36c9d92
Showing 23 changed files with 162 additions and 53 deletions.
12 changes: 10 additions & 2 deletions configs/lenet/README.md
@@ -1,10 +1,18 @@
# Backpropagation Applied to Handwritten Zip Code Recognition
<!-- {LeNet} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract

<!-- [ALGORITHM] -->
<!-- [ABSTRACT] -->
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
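
As a concrete illustration of building task-domain constraints into the architecture, here is a minimal PyTorch sketch of a LeNet-style network (the layer sizes follow the later LeNet-5 variant rather than the exact 1989 zip-code network, and the class name is illustrative): local receptive fields, shared weights, and subsampling are all expressed directly in the module structure.

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    """Minimal LeNet-style sketch: the task-domain constraints (local receptive
    fields, weight sharing, subsampling) are encoded in the architecture.
    Layer sizes are illustrative, not the exact 1989 network."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # local, shared weights
            nn.AvgPool2d(2),                             # subsampling
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),
            nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNetStyle()(torch.randn(1, 1, 32, 32))  # -> shape (1, 10)
```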

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142561080-cd1c4bdc-8739-46ca-bc32-76d462a32901.png" width="50%"/>
</div>

## Citation
```latex
@ARTICLE{6795724,
author={Y. {LeCun} and B. {Boser} and J. S. {Denker} and D. {Henderson} and R. E. {Howard} and W. {Hubbard} and L. D. {Jackel}},
13 changes: 11 additions & 2 deletions configs/mobilenet_v2/README.md
@@ -1,10 +1,19 @@
# MobileNetV2: Inverted Residuals and Linear Bottlenecks
<!-- {MobileNet V2} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3.

<!-- [ALGORITHM] -->
The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers, opposite to traditional residual models which use expanded representations in the input. MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, COCO object detection, and VOC image segmentation. We evaluate the trade-offs between accuracy and number of operations measured by multiply-adds (MAdd), as well as the number of parameters.
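
A minimal PyTorch sketch of the inverted residual block described above (the class name, channel counts, and `expand_ratio` default are illustrative, not the configuration of any released model): thin input/output bottlenecks, an expanded depthwise stage, and a linear, ReLU-free projection.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of an inverted residual block: thin input/output bottlenecks,
    an expanded intermediate depthwise conv, and no non-linearity after the
    final (narrow) projection. Illustrative sizes only."""

    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),            # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),               # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),           # linear projection
            nn.BatchNorm2d(out_ch),                              # no ReLU here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

y = InvertedResidual(32, 32)(torch.randn(1, 32, 56, 56))  # same shape out
```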

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142563365-7a9ea577-8f79-4c21-a750-ebcaad9bcc2f.png" width="40%"/>
</div>

## Citation
```latex
@INPROCEEDINGS{8578572,
author={M. {Sandler} and A. {Howard} and M. {Zhu} and A. {Zhmoginov} and L. {Chen}},
13 changes: 10 additions & 3 deletions configs/mobilenet_v3/README.md
@@ -1,10 +1,17 @@
# Searching for MobileNetV3
<!-- {MobileNet V3} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 15% compared to MobileNetV2. MobileNetV3-Small is 4.6% more accurate while reducing latency by 5% compared to MobileNetV2. MobileNetV3-Large detection is 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 30% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.

<!-- [ALGORITHM] -->
<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142563801-ef4feacc-ecd7-4d14-a411-8c9d63571749.png" width="70%"/>
</div>

## Citation
```latex
@inproceedings{Howard_2019_ICCV,
author = {Howard, Andrew and Sandler, Mark and Chu, Grace and Chen, Liang-Chieh and Chen, Bo and Tan, Mingxing and Wang, Weijun and Zhu, Yukun and Pang, Ruoming and Vasudevan, Vijay and Le, Quoc V. and Adam, Hartwig},
@@ -23,8 +30,8 @@ The pre-trained models are converted from [torchvision](https://pytorch.org/visi

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Download |
|:---------------------:|:---------:|:--------:|:---------:|:---------:|:--------:|
| MobileNetV3-Large | 5.48 | 0.23 | 74.04 | 91.34 | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth)|
| MobileNetV3-Small | 2.54 | 0.06 | 67.66 | 87.41 | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_small-8427ecf0.pth)|
| MobileNetV3-Large | 5.48 | 0.23 | 74.04 | 91.34 | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth)|

## Results and models

11 changes: 9 additions & 2 deletions configs/regnet/README.md
@@ -1,10 +1,17 @@
# Designing Network Design Spaces
<!-- {RegNet} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
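
The quantized linear width rule can be made concrete with a short sketch; the parameter values below are illustrative only, not those of any specific released RegNet model.

```python
import numpy as np

def regnet_widths(w0=24, wa=36.44, wm=2.49, depth=13, q=8):
    """Sketch of the RegNet quantized linear width rule: per-block widths follow
    a linear function u_j = w0 + wa * j, quantized to powers of wm and rounded
    to multiples of q channels. Parameter values here are illustrative."""
    u = w0 + wa * np.arange(depth)                 # continuous linear widths
    s = np.round(np.log(u / w0) / np.log(wm))      # quantize the exponents
    w = w0 * np.power(wm, s)                       # quantized widths
    w = (np.round(w / q) * q).astype(int)          # multiples of q channels
    stage_widths, stage_depths = np.unique(w, return_counts=True)
    return w.tolist(), stage_widths.tolist(), stage_depths.tolist()

per_block, widths, depths = regnet_widths()
# widths -> channel count per stage, depths -> number of blocks per stage
```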

<!-- [ALGORITHM] -->
<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142572813-5dad3317-9d58-4177-971f-d346e01fb3c4.png" width=60%/>
</div>

## Citation
```latex
@article{radosavovic2020designing,
title={Designing Network Design Spaces},
11 changes: 9 additions & 2 deletions configs/repvgg/README.md
@@ -1,10 +1,17 @@
# Repvgg: Making vgg-style convnets great again
<!-- {RepVGG} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG. On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model, to the best of our knowledge. On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet.
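
A small sketch of the structural re-parameterization idea, assuming standard BatchNorm folding and omitting the identity branch for brevity: the training-time 3x3 and 1x1 branches are merged into a single equivalent 3x3 convolution for inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv, bn):
    """Fold a BatchNorm into the preceding (bias-free) conv."""
    std = (bn.running_var + bn.eps).sqrt()
    w = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

def reparameterize(conv3x3, bn3, conv1x1, bn1):
    """Merge a 3x3 branch and a 1x1 branch into one equivalent 3x3 kernel
    (identity branch omitted in this sketch)."""
    w3, b3 = fuse_conv_bn(conv3x3, bn3)
    w1, b1 = fuse_conv_bn(conv1x1, bn1)
    w1 = F.pad(w1, [1, 1, 1, 1])          # place the 1x1 kernel at the 3x3 center
    return w3 + w1, b3 + b1

# Check equivalence on random weights and default BN statistics (eval mode):
conv3, bn3 = nn.Conv2d(8, 8, 3, padding=1, bias=False), nn.BatchNorm2d(8).eval()
conv1, bn1 = nn.Conv2d(8, 8, 1, bias=False), nn.BatchNorm2d(8).eval()
x = torch.randn(1, 8, 14, 14)
y_train = bn3(conv3(x)) + bn1(conv1(x))              # multi-branch (training-time)
w, b = reparameterize(conv3, bn3, conv1, bn1)
y_deploy = F.conv2d(x, w, b, padding=1)              # single 3x3 (inference-time)
print(torch.allclose(y_train, y_deploy, atol=1e-5))  # -> True
```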

<!-- [ALGORITHM] -->
<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142573223-f7f14d32-ea08-43a1-81ad-5a6a83ee0122.png" width="60%"/>
</div>

## Citation
```latex
@inproceedings{ding2021repvgg,
title={Repvgg: Making vgg-style convnets great again},
11 changes: 9 additions & 2 deletions configs/res2net/README.md
@@ -1,10 +1,17 @@
# Res2Net: A New Multi-scale Backbone Architecture
<!-- {Res2Net} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods.
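
A simplified sketch of the hierarchical connections inside a Res2Net block (the surrounding 1x1 convolutions and the outer residual connection of the full bottleneck are omitted, and sizes are illustrative): channels are split into `scales` groups, and each group's 3x3 convolution also receives the output of the previous group, enlarging the receptive field group by group.

```python
import torch
import torch.nn as nn

class Res2NetLayer(nn.Module):
    """Sketch of the core Res2Net idea: hierarchical residual-like connections
    among channel splits within one block. Illustrative sizes only."""

    def __init__(self, channels=64, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1) for _ in range(scales - 1)
        )

    def forward(self, x):
        splits = torch.chunk(x, self.scales, dim=1)
        out = [splits[0]]                  # the first split passes through unchanged
        y = None
        for i, conv in enumerate(self.convs):
            xi = splits[i + 1]
            y = conv(xi if y is None else xi + y)   # hierarchical connection
            out.append(y)
        return torch.cat(out, dim=1)

y = Res2NetLayer()(torch.randn(1, 64, 28, 28))   # same shape as the input
```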

<!-- [ALGORITHM] -->
<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142573547-cde68abf-287b-46db-a848-5cffe3068faf.png" width="50%"/>
</div>

## Citation
```latex
@article{gao2019res2net,
title={Res2Net: A New Multi-scale Backbone Architecture},
12 changes: 10 additions & 2 deletions configs/resnest/README.md
@@ -1,10 +1,18 @@
# ResNeSt: Split-Attention Networks
<!-- {ResNeSt} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
It is well known that featuremap attention and multi-path representation are important for visual recognition. In this paper, we present a modularized architecture, which applies the channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations. Our design results in a simple and unified computation block, which can be parameterized using only a few variables. Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification. In addition, ResNeSt has achieved superior transfer learning results on several public benchmarks serving as the backbone, and has been adopted by the winning entries of the COCO-LVIS challenge. The source code for the complete system and pretrained models is publicly available.
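
A simplified sketch of split attention over feature-map splits (the cardinality dimension and the normalization layers of the full ResNeSt block are omitted, and names are illustrative): channel-wise weights are computed from the pooled sum of the splits and applied with a softmax across splits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Simplified split-attention sketch over `radix` feature-map splits;
    cardinality handling is omitted for brevity."""

    def __init__(self, channels=64, radix=2, reduction=4):
        super().__init__()
        self.radix = radix
        inner = channels // reduction
        self.fc1 = nn.Conv2d(channels, inner, 1)
        self.fc2 = nn.Conv2d(inner, channels * radix, 1)

    def forward(self, x):                       # x: (N, radix * C, H, W)
        n, rc, h, w = x.shape
        c = rc // self.radix
        splits = x.view(n, self.radix, c, h, w)
        gap = splits.sum(dim=1).mean(dim=(2, 3), keepdim=True)   # (N, C, 1, 1)
        attn = self.fc2(F.relu(self.fc1(gap)))                    # (N, radix*C, 1, 1)
        attn = attn.view(n, self.radix, c, 1, 1).softmax(dim=1)   # softmax over splits
        return (splits * attn).sum(dim=1)                         # (N, C, H, W)

y = SplitAttention()(torch.randn(2, 128, 28, 28))   # radix=2, C=64 -> (2, 64, 28, 28)
```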

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142573827-a8189607-614b-4385-b579-b0db148b3db7.png" width="60%"/>
</div>

<!-- [ALGORITHM] -->

## Citation
```latex
@misc{zhang2020resnest,
title={ResNeSt: Split-Attention Networks},
13 changes: 11 additions & 2 deletions configs/resnet/README.md
@@ -1,10 +1,19 @@
# Deep Residual Learning for Image Recognition
<!-- {ResNet} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

<!-- [ALGORITHM] -->
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
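
A minimal sketch of the residual reformulation for the stride-1, equal-channel case (names and sizes are illustrative): the stacked layers learn a residual function F(x) and the identity shortcut adds the input back, so the block outputs F(x) + x.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Sketch of a basic residual block: layers learn a residual F(x) and the
    identity shortcut adds the input back, so the block computes F(x) + x."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)      # identity shortcut

y = BasicBlock(64)(torch.randn(1, 64, 56, 56))   # same shape as the input
```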

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142574068-60cfdeea-c4ec-4c49-abb2-5dc2facafc3b.png" width="40%"/>
</div>

## Citation
```latex
@inproceedings{he2016deep,
title={Deep residual learning for image recognition},
11 changes: 9 additions & 2 deletions configs/resnext/README.md
@@ -1,10 +1,17 @@
# Aggregated Residual Transformations for Deep Neural Networks
<!-- {ResNeXt} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.
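
A sketch of how the aggregated transformations are typically realized with a grouped convolution, where `groups` plays the role of cardinality; the channel counts below are illustrative, not those of any particular configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNeXtBottleneck(nn.Module):
    """Sketch of a ResNeXt-style bottleneck: the aggregated transformations with
    shared topology are implemented as a single grouped 3x3 convolution, with
    `cardinality` parallel paths. Illustrative sizes only."""

    def __init__(self, channels=256, bottleneck_width=128, cardinality=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, bottleneck_width, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck_width)
        self.conv2 = nn.Conv2d(bottleneck_width, bottleneck_width, 3, padding=1,
                               groups=cardinality, bias=False)   # 32 parallel paths
        self.bn2 = nn.BatchNorm2d(bottleneck_width)
        self.conv3 = nn.Conv2d(bottleneck_width, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)           # identity shortcut

y = ResNeXtBottleneck()(torch.randn(1, 256, 56, 56))
```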

<!-- [ALGORITHM] -->
<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142574479-21fb00a2-e63e-4bc6-a9f2-989cd6e15528.png" width="70%"/>
</div>

## Citation
```latex
@inproceedings{xie2017aggregated,
title={Aggregated residual transformations for deep neural networks},
11 changes: 9 additions & 2 deletions configs/seresnet/README.md
@@ -1,10 +1,17 @@
# Squeeze-and-Excitation Networks
<!-- {SE-ResNet} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%.
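
A minimal sketch of an SE block (names and the reduction ratio are illustrative): global average pooling squeezes spatial information into channel descriptors, a small two-layer bottleneck excites them, and the resulting weights recalibrate the channels of the input feature map.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of a Squeeze-and-Excitation block: squeeze via global pooling,
    excite via a two-layer bottleneck, then recalibrate channels."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        scale = self.excite(self.squeeze(x).view(n, c)).view(n, c, 1, 1)
        return x * scale                 # channel-wise recalibration

y = SEBlock(256)(torch.randn(1, 256, 28, 28))   # same shape, channels re-weighted
```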

<!-- [ALGORITHM] -->
<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142574668-3464d087-b962-48ba-ad1d-5d6b33c3ba0b.png" width="50%"/>
</div>

## Citation
```latex
@inproceedings{hu2018squeeze,
title={Squeeze-and-excitation networks},
16 changes: 0 additions & 16 deletions configs/seresnext/README.md

This file was deleted.

11 changes: 9 additions & 2 deletions configs/shufflenet_v1/README.md
@@ -1,10 +1,17 @@
# ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
<!-- {ShuffleNet V1} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 7.8%) than recent MobileNet on ImageNet classification task, under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves ~13x actual speedup over AlexNet while maintaining comparable accuracy.
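
The channel shuffle operation itself is only a few lines; a sketch follows. It reshapes the channel dimension into (groups, channels_per_group), transposes, and flattens, so that the next grouped convolution mixes information across groups.

```python
import torch

def channel_shuffle(x, groups):
    """Sketch of the channel shuffle operation: interleave channels across
    groups so grouped convolutions can exchange information."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.randn(2, 12, 8, 8)
y = channel_shuffle(x, groups=3)   # channels interleaved across the 3 groups
```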

<!-- [ALGORITHM] -->
<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142575730-dc2f616d-80df-4fb1-93e1-77ebb2b835cf.png" width="70%"/>
</div>

## Citation
```latex
@inproceedings{zhang2018shufflenet,
title={Shufflenet: An extremely efficient convolutional neural network for mobile devices},
11 changes: 9 additions & 2 deletions configs/shufflenet_v2/README.md
@@ -1,10 +1,17 @@
# Shufflenet v2: Practical guidelines for efficient cnn architecture design
<!-- {ShuffleNet V2} -->
<!-- [ALGORITHM] -->

## Introduction
## Abstract
<!-- [ABSTRACT] -->
Currently, the neural network architecture design is mostly guided by the *indirect* metric of computation complexity, i.e., FLOPs. However, the *direct* metric, e.g., speed, also depends on other factors such as memory access cost and platform characteristics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical *guidelines* for efficient network design. Accordingly, a new architecture is presented, called *ShuffleNet V2*. Comprehensive ablation experiments verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff.
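
A sketch of a stride-1 ShuffleNet V2 unit reflecting the guidelines above (equal input/output widths, no heavy group convolution on the 1x1 layers); the channel counts and class name are illustrative.

```python
import torch
import torch.nn as nn

class ShuffleV2Unit(nn.Module):
    """Sketch of a stride-1 ShuffleNet V2 unit: channel split, a 1x1 -> depthwise
    3x3 -> 1x1 branch with equal widths, concatenation, and channel shuffle.
    Illustrative sizes only."""

    def __init__(self, channels=116):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False), nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                 # channel split
        out = torch.cat((x1, self.branch(x2)), dim=1)
        n, c, h, w = out.shape                     # channel shuffle with 2 groups
        return out.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)

y = ShuffleV2Unit()(torch.randn(1, 116, 28, 28))   # same shape as the input
```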

<!-- [ALGORITHM] -->
<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142576336-e0db2866-3add-44e6-a792-14d4f11bd983.png" width="80%"/>
</div>

## Citation
```latex
@inproceedings{ma2018shufflenet,
title={Shufflenet v2: Practical guidelines for efficient cnn architecture design},