[Docs] Add docs and update algo README #259

Merged
merged 2 commits into from
Aug 30, 2022
3 changes: 2 additions & 1 deletion configs/distill/mmcls/abloss/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,8 @@

An activation boundary for a neuron refers to a separating hyperplane that determines whether the neuron is activated or deactivated. It has been long considered in neural networks that the activations of neurons, rather than their exact output values, play the most important role in forming classification-friendly partitions of the hidden feature space. However, as far as we know, this aspect of neural networks has not been considered in the literature of knowledge transfer. In this paper, we propose a knowledge transfer method via distillation of activation boundaries formed by hidden neurons. For the distillation, we propose an activation transfer loss that has the minimum value when the boundaries generated by the student coincide with those by the teacher. Since the activation transfer loss is not differentiable, we design a piecewise differentiable loss approximating the activation transfer loss. By the proposed method, the student learns a separating boundary between activation region and deactivation region formed by each neuron in the teacher. Through the experiments in various aspects of knowledge transfer, it is verified that the proposed method outperforms the current state-of-the-art. [link](https://github.com/bhheo/AB_distillation)

![pipeline](/docs/en/imgs/model_zoo/abloss/pipeline.png)
<img width="1184" alt="pipeline" src="https://user-images.githubusercontent.com/88702197/187422794-d681ed58-293a-4d9e-9e5b-9937289136a7.png">


## Results and models

Expand Down
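For readers of the abloss README, the "piecewise differentiable loss" mentioned in the abstract can be pictured with a small sketch. This is only an illustration under assumptions: the squared-hinge form, the `margin` parameter, and the function name `ab_loss` are not stated in the README text shown in this diff, and the inputs are taken to be flat lists of pre-activation values.

```python
def ab_loss(student_pre, teacher_pre, margin=1.0):
    """Surrogate loss that is zero when the student activates exactly
    where the teacher does, with a safety margin (assumed form).

    student_pre, teacher_pre: flat lists of pre-activation values.
    margin: hypothetical hyperparameter, not taken from the README.
    """
    total = 0.0
    for s, t in zip(student_pre, teacher_pre):
        if t > 0 and s <= margin:
            # Teacher neuron is active: push the student above +margin.
            total += (s - margin) ** 2
        elif t <= 0 and s > -margin:
            # Teacher neuron is inactive: push the student below -margin.
            total += (s + margin) ** 2
    return total
```

The loss depends only on the sign of the teacher value, which matches the abstract's point that boundaries, not exact output values, are being transferred.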
3 changes: 2 additions & 1 deletion configs/distill/mmcls/byot/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,8 @@ Convolutional neural networks have been widely deployed in various application s

## Pipeline

![pipeline](../../../../docs/en/imgs/model_zoo/byot/byot.png)
![byot](https://user-images.githubusercontent.com/88702197/187422992-e7bd692d-b6d4-44d8-8b36-741e0cf1c4f6.png)


## Results and models

Expand Down
3 changes: 2 additions & 1 deletion configs/distill/mmcls/dafl/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,8 @@

Learning portable neural networks is very essential for computer vision for the purpose that pre-trained heavy deep models can be well applied on edge devices such as mobile phones and micro sensors. Most existing deep neural network compression and speed-up methods are very effective for training compact deep models, when we can directly access the training dataset. However, training data for the given deep network are often unavailable due to some practical problems (e.g. privacy, legal issue, and transmission), and the architecture of the given network is also unknown except some interfaces. To this end, we propose a novel framework for training efficient deep neural networks by exploiting generative adversarial networks (GANs). To be specific, the pre-trained teacher networks are regarded as a fixed discriminator and the generator is utilized for deriving training samples which can obtain the maximum response on the discriminator. Then, an efficient network with smaller model size and computational complexity is trained using the generated data and the teacher network, simultaneously. Efficient student networks learned using the proposed Data-Free Learning (DAFL) method achieve 92.22% and 74.47% accuracies using ResNet-18 without any training data on the CIFAR-10 and CIFAR-100 datasets, respectively. Meanwhile, our student network obtains an 80.56% accuracy on the CelebA benchmark.

![pipeline](/docs/en/imgs/model_zoo/dafl/pipeline.png)
<img width="910" alt="pipeline" src="https://user-images.githubusercontent.com/88702197/187423163-b34896fc-8516-403b-acd7-4c0b8e43af5b.png">


## Results and models

Expand Down
2 changes: 1 addition & 1 deletion configs/distill/mmcls/dfad/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@

Knowledge Distillation (KD) has made remarkable progress in the last few years and become a popular paradigm for model compression and knowledge transfer. However, almost all existing KD algorithms are data-driven, i.e., relying on a large amount of original training data or alternative data, which is usually unavailable in real-world scenarios. In this paper, we devote ourselves to this challenging problem and propose a novel adversarial distillation mechanism to craft a compact student model without any real-world data. We introduce a model discrepancy to quantificationally measure the difference between student and teacher models and construct an optimizable upper bound. In our work, the student and the teacher jointly act the role of the discriminator to reduce this discrepancy, when a generator adversarially produces some "hard samples" to enlarge it. Extensive experiments demonstrate that the proposed data-free method yields comparable performance to existing data-driven methods. More strikingly, our approach can be directly extended to semantic segmentation, which is more complicated than classification, and our approach achieves state-of-the-art results.

![pipeline](/docs/en/imgs/model_zoo/dfad/pipeline.png)
<img width="1001" alt="pipeline" src="https://user-images.githubusercontent.com/88702197/187423332-30a5d409-6f83-45d7-9e11-e306f7ffec78.png">

## Results and models

Expand Down
2 changes: 1 addition & 1 deletion configs/distill/mmcls/dkd/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@

State-of-the-art distillation methods are mainly based on distilling deep features from intermediate layers, while the significance of logit distillation is greatly overlooked. To provide a novel viewpoint to study logit distillation, we reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD). We empirically investigate and prove the effects of the two parts: TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works. More importantly, we reveal that the classical KD loss is a coupled formulation, which (1) suppresses the effectiveness of NCKD and (2) limits the flexibility to balance these two parts. To address these issues, we present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly. Compared with complex feature-based methods, our DKD achieves comparable or even better results and has better training efficiency on CIFAR-100, ImageNet, and MS-COCO datasets for image classification and object detection tasks. This paper proves the great potential of logit distillation, and we hope it will be helpful for future research. The code is available at https://github.com/megvii-research/mdistiller.

![avatar](../../../../docs/en/imgs/model_zoo/dkd/dkd.png)
<img width="921" alt="dkd" src="https://user-images.githubusercontent.com/88702197/187423438-c9eadb93-826f-471c-9553-bdae2e434541.png">

## Results and models

Expand Down
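The decoupling described in the dkd abstract can be checked numerically with a short sketch. It assumes the usual formulation in which TCKD is a KL divergence over the binary (target, non-target) split and NCKD is a KL divergence over the renormalized non-target classes. The weights `alpha` and `beta`, the `temperature` default, and all function names are illustrative assumptions, not values from this README.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dkd_terms(student_logits, teacher_logits, target, temperature=1.0):
    """Return (TCKD, NCKD) for one sample."""
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    # Binary distributions: target class vs. everything else.
    tckd = kl([p_t[target], 1 - p_t[target]],
              [p_s[target], 1 - p_s[target]])
    # Distributions over non-target classes only, renormalized.
    n_t = [p / (1 - p_t[target]) for i, p in enumerate(p_t) if i != target]
    n_s = [p / (1 - p_s[target]) for i, p in enumerate(p_s) if i != target]
    return tckd, kl(n_t, n_s)

def dkd_loss(student_logits, teacher_logits, target,
             alpha=1.0, beta=8.0, temperature=4.0):
    # alpha and beta weight the two parts independently (assumed defaults).
    tckd, nckd = dkd_terms(student_logits, teacher_logits, target, temperature)
    return (alpha * tckd + beta * nckd) * temperature ** 2
```

Under this formulation the classical KD loss equals `tckd + (1 - p_t[target]) * nckd`, so the NCKD weight is tied to the teacher's confidence. That is the coupling the abstract refers to, and replacing the factor with a free `beta` removes it.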
3 changes: 2 additions & 1 deletion configs/distill/mmcls/fitnet/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,8 @@ allows one to train deeper students that can generalize better or run faster, a
controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with
almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.

![pipeline](/docs/en/imgs/model_zoo/fitnet/pipeline.png)
<img width="743" alt="pipeline" src="https://user-images.githubusercontent.com/88702197/187423686-68719140-a978-4a19-a684-42b1d793d1fb.png">


## Results and models

Expand Down
2 changes: 1 addition & 1 deletion configs/distill/mmcls/kd/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

![pipeline](/docs/en/imgs/model_zoo/kd/pipeline.png)
![pipeline](https://user-images.githubusercontent.com/88702197/187423762-e932dd3e-16cb-4714-a85f-cddfc906c1b7.png)

## Results and models

Expand Down
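As a companion to the kd abstract, here is a minimal sketch of a distillation loss on softened class probabilities. The temperature-scaled softmax and the `T ** 2` factor are the commonly used form and are assumptions here, since the README excerpt in this diff does not spell out the loss. The names `softmax` and `kd_loss` are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened probabilities."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2
```

A higher temperature flattens both distributions, so the relative probabilities the teacher assigns to wrong classes contribute more to the loss.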
4 changes: 2 additions & 2 deletions configs/distill/mmcls/ofd/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,11 +8,11 @@ We investigate the design aspects of feature distillation methods achieving netw

### Feature-based Distillation

![structure](../../../../docs/en/imgs/model_zoo/overhaul/feature_base.png)
![feature_base](https://user-images.githubusercontent.com/88702197/187423965-bb3bde16-c71a-43c6-903c-69aff1005415.png)

### Margin ReLU

![margin_relu](../../../../docs/en/imgs/model_zoo/overhaul/margin_relu.png)
![margin_relu](https://user-images.githubusercontent.com/88702197/187423981-67106ac2-48d9-4002-8b32-b92a90b1dacd.png)

## Results and models

Expand Down
2 changes: 1 addition & 1 deletion configs/distill/mmcls/rkd/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ proposed method improves educated student models with a significant margin.
In particular for metric learning, it allows students to outperform their
teachers' performance, achieving the state of the arts on standard benchmark datasets.

![pipeline](/docs/en/imgs/model_zoo/rkd/pipeline.png)
![pipeline](https://user-images.githubusercontent.com/88702197/187424092-b58742aa-6724-4a89-8d28-62960efb58b4.png)

## Results and models

Expand Down
2 changes: 1 addition & 1 deletion configs/distill/mmcls/wsld/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ empirically find that completely filtering out regularization samples also deter
weighted soft labels to help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the
effectiveness of our method.

![pipeline](/docs/en/imgs/model_zoo/wsld/pipeline.png)
<img width="1032" alt="pipeline" src="https://user-images.githubusercontent.com/88702197/187424195-a3ea3d72-5ee7-4ffc-b562-65677076c18e.png">

## Results and models

Expand Down
4 changes: 2 additions & 2 deletions configs/distill/mmcls/zskt/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,11 +10,11 @@ Performing knowledge transfer from a large teacher network to a smaller student

## The teacher and student decision boundaries

![ZSKT_Distribution](/docs/en/imgs/model_zoo/zskt/zskt_distribution.png)
<img width="766" alt="distribution" src="https://user-images.githubusercontent.com/88702197/187424317-9f3c5547-a838-4858-b63e-608eee8165f5.png">

## Pseudo images sampled from the generator

![ZSKT_Fakeimgs](/docs/en/imgs/model_zoo/zskt/zskt_synthesis.png)
<img width="1176" alt="synthesis" src="https://user-images.githubusercontent.com/88702197/187424322-79be0b07-66b5-4775-8e23-6c2ddca0ad0f.png">

## Results and models

Expand Down
3 changes: 2 additions & 1 deletion configs/distill/mmdet/cwd/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,8 @@

Knowledge distillation (KD) has been proven to be a simple and effective tool for training compact models. Almost all KD variants for dense prediction tasks align the student and teacher networks' feature maps in the spatial domain, typically by minimizing point-wise and/or pair-wise discrepancy. Observing that in semantic segmentation, some layers' feature activations of each channel tend to encode saliency of scene categories (analogue to class activation mapping), we propose to align features channel-wise between the student and teacher networks. To this end, we first transform the feature map of each channel into a probability map using softmax normalization, and then minimize the Kullback-Leibler (KL) divergence of the corresponding channels of the two networks. By doing so, our method focuses on mimicking the soft distributions of channels between networks. In particular, the KL divergence enables learning to pay more attention to the most salient regions of the channel-wise maps, presumably corresponding to the most useful signals for semantic segmentation. Experiments demonstrate that our channel-wise distillation outperforms almost all existing spatial distillation methods for semantic segmentation considerably, and requires less computational cost during training. We consistently achieve superior performance on three benchmarks with various network structures.

![pipeline](/docs/en/imgs/model_zoo/cwd/pipeline.png)
![pipeline](https://user-images.githubusercontent.com/88702197/187424502-d8efb7a3-c40c-4e53-a36c-bd947de464a4.png)


## Results and models

Expand Down
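The two steps named in the cwd abstract (a softmax over each channel's spatial positions, then a KL divergence between corresponding channels) can be sketched as follows. The data layout (a list of channels, each a flat list of H*W activations), the `temperature` parameter, and the averaging over channels are assumptions made for illustration.

```python
import math

def spatial_softmax(channel, temperature=1.0):
    """Turn one channel's activations into a probability map."""
    scaled = [v / temperature for v in channel]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cwd_loss(student_feat, teacher_feat, temperature=1.0):
    """Channel-wise KL(teacher || student), averaged over channels.

    student_feat, teacher_feat: lists of channels with equal shapes,
    each channel being a flat list of spatial activations.
    """
    total = 0.0
    for s_ch, t_ch in zip(student_feat, teacher_feat):
        p_s = spatial_softmax(s_ch, temperature)
        p_t = spatial_softmax(t_ch, temperature)
        total += sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    return temperature ** 2 * total / len(teacher_feat)
```

Because each channel is normalized on its own, a constant offset in a student channel does not change the loss. Only the relative saliency across positions is compared.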
3 changes: 2 additions & 1 deletion configs/distill/mmdet/fbkd/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,8 @@

Knowledge distillation, in which a student model is trained to mimic a teacher model, has been proved as an effective technique for model compression and model accuracy boosting. However, most knowledge distillation methods, designed for image classification, have failed on more challenging tasks, such as object detection. In this paper, we suggest that the failure of knowledge distillation on object detection is mainly caused by two reasons: (1) the imbalance between pixels of foreground and background and (2) lack of distillation on the relation between different pixels. Observing the above reasons, we propose attention-guided distillation and non-local distillation to address the two problems, respectively. Attention-guided distillation is proposed to find the crucial pixels of foreground objects with attention mechanism and then make the students take more effort to learn their features. Non-local distillation is proposed to enable students to learn not only the feature of an individual pixel but also the relation between different pixels captured by non-local modules. Experiments show that our methods achieve excellent AP improvements on both one-stage and two-stage, both anchor-based and anchor-free detectors. For example, Faster RCNN (ResNet101 backbone) with our distillation achieves 43.9 AP on COCO2017, which is 4.1 higher than the baseline.

![pipeline](/docs/en/imgs/model_zoo/fbkd/pipeline.png)
<img width="836" alt="pipeline" src="https://user-images.githubusercontent.com/88702197/187424617-6259a7fc-b610-40ae-92eb-f21450dcbaa1.png">


## Results and models

Expand Down
2 changes: 1 addition & 1 deletion configs/distill/mmseg/cwd/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@

Knowledge distillation (KD) has been proven to be a simple and effective tool for training compact models. Almost all KD variants for dense prediction tasks align the student and teacher networks' feature maps in the spatial domain, typically by minimizing point-wise and/or pair-wise discrepancy. Observing that in semantic segmentation, some layers' feature activations of each channel tend to encode saliency of scene categories (analogue to class activation mapping), we propose to align features channel-wise between the student and teacher networks. To this end, we first transform the feature map of each channel into a probability map using softmax normalization, and then minimize the Kullback-Leibler (KL) divergence of the corresponding channels of the two networks. By doing so, our method focuses on mimicking the soft distributions of channels between networks. In particular, the KL divergence enables learning to pay more attention to the most salient regions of the channel-wise maps, presumably corresponding to the most useful signals for semantic segmentation. Experiments demonstrate that our channel-wise distillation outperforms almost all existing spatial distillation methods for semantic segmentation considerably, and requires less computational cost during training. We consistently achieve superior performance on three benchmarks with various network structures.

![pipeline](/docs/en/imgs/model_zoo/cwd/pipeline.png)
![pipeline](https://user-images.githubusercontent.com/88702197/187424502-d8efb7a3-c40c-4e53-a36c-bd947de464a4.png)

## Results and models

Expand Down
2 changes: 1 addition & 1 deletion configs/nas/mmcls/darts/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@

This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, our method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. Extensive experiments on CIFAR-10, ImageNet, Penn Treebank and WikiText-2 show that our algorithm excels in discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable techniques. Our implementation has been made publicly available to facilitate further research on efficient architecture search algorithms.

![pipeline](/docs/en/imgs/model_zoo/darts/pipeline.png)
![pipeline](https://user-images.githubusercontent.com/88702197/187425171-2dfe7fbf-7c2c-4c22-9219-2234aa83e47d.png)

## Results and models

Expand Down
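The "continuous relaxation of the architecture representation" in the darts abstract can be illustrated with scalar stand-ins for operations. In this sketch the candidate operations are plain Python functions and `alphas` are the architecture parameters for one edge. Both names, and the use of scalars instead of tensors, are simplifications assumed for illustration.

```python
import math

def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, ops, alphas):
    """Relaxed edge: a softmax-weighted sum of every candidate operation."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

def derive_op_index(alphas):
    """After search, keep only the operation with the largest alpha."""
    return max(range(len(alphas)), key=lambda i: alphas[i])
```

Since `mixed_op` is a smooth function of `alphas`, the architecture parameters can be updated by gradient descent alongside the network weights, which is what makes the search differentiable.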
3 changes: 2 additions & 1 deletion configs/nas/mmcls/spos/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,8 @@
We revisit the one-shot Neural Architecture Search (NAS) paradigm and analyze its advantages over existing NAS approaches. Existing one-shot method, however, is hard to train and not yet effective on large scale datasets like ImageNet. This work proposes a Single Path One-Shot model to address the challenge in the training. Our central idea is to construct a simplified supernet, where all architectures are single paths so that weight co-adaption problem is alleviated. Training is performed by uniform path sampling. All architectures (and their weights) are trained fully and equally.
Comprehensive experiments verify that our approach is flexible and effective. It is easy to train and fast to search. It effortlessly supports complex search spaces (e.g., building blocks, channel, mixed-precision quantization) and different search constraints (e.g., FLOPs, latency). It is thus convenient to use for various needs. It achieves state-of-the-art performance on the large dataset ImageNet.

![pipeline](/docs/en/imgs/model_zoo/spos/pipeline.jpg)
![pipeline](https://user-images.githubusercontent.com/88702197/187424862-c2f3fde1-4a48-4eda-9ff7-c65971b683ba.jpg)


## Introduction

Expand Down
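The "uniform path sampling" in the spos abstract amounts to picking one candidate block per layer with equal probability at every training step. A minimal sketch follows, assuming the supernet is described only by the number of candidate choices per layer. The function name and the list-of-indices representation are hypothetical.

```python
import random

def sample_single_path(choices_per_layer, rng=random):
    """Pick one candidate index per layer, uniformly at random.

    choices_per_layer: list of ints, the number of candidate blocks
    available in each supernet layer (assumed representation).
    """
    return [rng.randrange(n) for n in choices_per_layer]

# Example: a 4-layer supernet with 4 candidate blocks per layer.
path = sample_single_path([4, 4, 4, 4], random.Random(0))
```

Only the blocks on the sampled path are run and updated in that step, so over many steps every architecture in the search space is trained with equal probability.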
2 changes: 1 addition & 1 deletion configs/nas/mmdet/detnas/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@

Object detectors are usually equipped with backbone networks designed for image classification. It might be sub-optimal because of the gap between the tasks of image classification and object detection. In this work, we present DetNAS to use Neural Architecture Search (NAS) for the design of better backbones for object detection. It is non-trivial because detection training typically needs ImageNet pre-training while NAS systems require accuracies on the target detection task as supervisory signals. Based on the technique of one-shot supernet, which contains all possible networks in the search space, we propose a framework for backbone search on object detection. We train the supernet under the typical detector training schedule: ImageNet pre-training and detection fine-tuning. Then, the architecture search is performed on the trained supernet, using the detection task as the guidance. This framework makes NAS on backbones very efficient. In experiments, we show the effectiveness of DetNAS on various detectors, for instance, one-stage RetinaNet and the two-stage FPN. We empirically find that networks searched on object detection shows consistent superiority compared to those searched on ImageNet classification. The resulting architecture achieves superior performance than hand-crafted networks on COCO with much less FLOPs complexity.

![pipeline](/docs/en/imgs/model_zoo/detnas/pipeline.jpg)
![pipeline](https://user-images.githubusercontent.com/88702197/187425296-64baa22a-9422-46cd-bd95-47e3e5707f75.jpg)

## Introduction

Expand Down