diff --git a/.gitignore b/.gitignore index acb2f56524..a0f8ba7486 100644 --- a/.gitignore +++ b/.gitignore @@ -51,6 +51,7 @@ build/Release # Dependency directories node_modules/ jspm_packages/ +**/package-lock.json # TypeScript v1 declaration files typings/ @@ -81,6 +82,8 @@ __pycache__ build *.egg-info setup.pye +**/__init__.pye +**/.ipynb_checkpoints # Environments .env diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000000..ced2b5c962 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,68 @@ +# Contributing to NNI + +Welcome, and thank you for your interest in contributing to NNI! + +There are many ways in which you can contribute, beyond writing code. The goal of this document is to provide a high-level overview of how you can get involved. + +# Provide feedback or ask a question + +* [File an issue](https://github.com/microsoft/nni/issues/new/choose) on GitHub. +* Ask a question with NNI tags on [Stack Overflow](https://stackoverflow.com/questions/tagged/nni?sort=Newest&edited=true). +* Discuss on the NNI [Gitter](https://gitter.im/Microsoft/nni?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) in NNI. + +Join IM discussion groups: +|Gitter||WeChat| +|----|----|----| +|![image](https://user-images.githubusercontent.com/39592018/80665738-e0574a80-8acc-11ea-91bc-0836dc4cbf89.png)| OR |![image](https://github.com/scarlett2018/nniutil/raw/master/wechat.png)| + + +# Look for an existing issue +Before you create a new issue, please do a search in [open issues](https://github.com/microsoft/nni/issues) to see if the issue or feature request has already been filed. + +Be sure to scan through the [most popular](https://github.com/microsoft/nni/issues?q=is%3Aopen+is%3Aissue+label%3AFAQ+sort%3Areactions-%2B1-desc) feature requests. + +If you find your issue already exists, make relevant comments and add your [reaction](https://github.com/blog/2119-add-reactions-to-pull-requests-issues-and-comments). 
Use a reaction in place of a "+1" comment:
+
+* 👍 - upvote
+* 👎 - downvote
+
+If you cannot find an existing issue that describes your bug or feature, create a new issue using the guidelines below.
+
+# Writing good bug reports or feature requests
+File a single issue per problem or feature request. Do not enumerate multiple bugs or feature requests in the same issue.
+
+Provide as much information as you think might be relevant to the context (imagine the issue were assigned to you: what information would you need to debug it?). To give you a general idea of what information helps developers investigate an issue, we provide an issue template.
+
+Once you have submitted an issue, be sure to follow it for questions and discussions.
+
+Once the bug is fixed or the feature is addressed, be sure to close the issue.
+
+# Contributing fixes or examples
+
+This project welcomes contributions and suggestions. Most contributions require you to agree to a
+Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
+the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
+
+When you submit a pull request, a CLA bot will automatically determine whether you need to provide
+a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
+provided by the bot. You will only need to do this once across all repos using our CLA.
+
+# Code of Conduct
+
+This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
+For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
+contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
+
+# How to Contribute
+
+After getting familiar with the contribution agreements, you are ready to create your first PR =). Follow the NNI developer tutorials to get started:
+
+* We recommend that new contributors start with simple issues: ['good first issue'](https://github.com/Microsoft/nni/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or ['help-wanted'](https://github.com/microsoft/nni/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22).
+* [NNI developer environment installation tutorial](docs/en_US/Tutorial/SetupNniDeveloperEnvironment.md)
+* [How to debug](docs/en_US/Tutorial/HowToDebug.md)
+* If you have any questions on usage, review the [FAQ](https://github.com/microsoft/nni/blob/master/docs/en_US/Tutorial/FAQ.md) first. If there are no relevant issues or answers to your question, try contacting the NNI dev team and users on [Gitter](https://gitter.im/Microsoft/nni?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) or [file an issue](https://github.com/microsoft/nni/issues/new/choose) on GitHub.
+* [Customize your own Tuner](docs/en_US/Tuner/CustomizeTuner.md)
+* [Implement customized TrainingService](docs/en_US/TrainingService/HowToImplementTrainingService.md)
+* [Implement a new NAS trainer on NNI](docs/en_US/NAS/Advanced.md)
+* [Customize your own Advisor](docs/en_US/Tuner/CustomizeAdvisor.md)
+
diff --git a/README.md b/README.md
index 41d0a87960..e03b339b80 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@
 [简体中文](README_zh_CN.md)
-**NNI (Neural Network Intelligence)** is a lightweight but powerful toolkit to help users **automate** Feature Engineering, Neural Architecture Search, Hyperparameter Tuning and Model Compression.
+**NNI (Neural Network Intelligence)** is a lightweight but powerful toolkit to help users **automate** Feature Engineering, Neural Architecture Search, Hyperparameter Tuning and Model Compression.
The tool manages automated machine learning (AutoML) experiments, **dispatches and runs** experiments' trial jobs generated by tuning algorithms to search the best neural architecture and/or hyper-parameters in **different training environments** like Local Machine, Remote Servers, OpenPAI, Kubeflow, FrameworkController on K8S (AKS etc.), DLWorkspace (aka. DLTS), AML (Azure Machine Learning) and other cloud options. @@ -135,24 +135,25 @@ Within the following table, we summarized the current NNI capabilities, we are g
  • ProxylessNAS
  • Network Morphism
  • TextNAS
  • +
  • Cream
  • - Model Compression + Model Compression Feature Engineering (Beta) @@ -331,8 +332,8 @@ With authors' permission, we listed a set of NNI usage examples and relevant art * [Hyperparameter Tuning for Matrix Factorization](https://github.com/microsoft/recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb) with NNI * [scikit-nni](https://github.com/ksachdeva/scikit-nni) Hyper-parameter search for scikit-learn pipelines using NNI * ### **Relevant Articles** ### - * [Hyper Parameter Optimization Comparison](docs/en_US/CommunitySharings/HpoComparision.md) - * [Neural Architecture Search Comparison](docs/en_US/CommunitySharings/NasComparision.md) + * [Hyper Parameter Optimization Comparison](docs/en_US/CommunitySharings/HpoComparison.md) + * [Neural Architecture Search Comparison](docs/en_US/CommunitySharings/NasComparison.md) * [Parallelizing a Sequential Algorithm TPE](docs/en_US/CommunitySharings/ParallelizingTpeSearch.md) * [Automatically tuning SVD with NNI](docs/en_US/CommunitySharings/RecommendersSvd.md) * [Automatically tuning SPTAG with NNI](docs/en_US/CommunitySharings/SptagAutoTune.md) diff --git a/README_zh_CN.md b/README_zh_CN.md index d123d04dea..3e7a52eb4c 100644 --- a/README_zh_CN.md +++ b/README_zh_CN.md @@ -10,7 +10,7 @@ **NNI (Neural Network Intelligence)** 是一个轻量但强大的工具包,帮助用户**自动**的进行[特征工程](docs/zh_CN/FeatureEngineering/Overview.md),[神经网络架构搜索](docs/zh_CN/NAS/Overview.md),[超参调优](docs/zh_CN/Tuner/BuiltinTuner.md)以及[模型压缩](docs/zh_CN/Compressor/Overview.md)。 -NNI 管理自动机器学习 (AutoML) 的 Experiment,**调度运行**由调优算法生成的 Trial 任务来找到最好的神经网络架构和/或超参,支持**各种训练环境**,如[本机](docs/zh_CN/TrainingService/LocalMode. md),[远程服务器](docs/zh_CN/TrainingService/RemoteMachineMode. md),[OpenPAI](docs/zh_CN/TrainingService/PaiMode. md),[Kubeflow](docs/zh_CN/TrainingService/KubeflowMode. md),[基于 K8S 的 FrameworkController(如,AKS 等)](docs/zh_CN/TrainingService/FrameworkControllerMode. md), [DLWorkspace](docs/zh_CN/TrainingService/DLTSMode. 
md) (又称 DLTS), [AML](docs/zh_CN/TrainingService/AMLMode.md) (Azure Machine Learning) 以及其它环境。 +NNI 管理自动机器学习 (AutoML) 的 Experiment,**调度运行**由调优算法生成的 Trial 任务来找到最好的神经网络架构和/或超参,支持**各种训练环境**,如[本机](docs/zh_CN/TrainingService/LocalMode.md),[远程服务器](docs/zh_CN/TrainingService/RemoteMachineMode.md),[OpenPAI](docs/zh_CN/TrainingService/PaiMode.md),[Kubeflow](docs/zh_CN/TrainingService/KubeflowMode.md),[基于 K8S 的 FrameworkController(如,AKS 等)](docs/zh_CN/TrainingService/FrameworkControllerMode.md), [DLWorkspace](docs/zh_CN/TrainingService/DLTSMode.md) (又称 DLTS), [AML](docs/zh_CN/TrainingService/AMLMode.md) (Azure Machine Learning) 以及其它环境。 ## **使用场景** @@ -328,8 +328,8 @@ You can use these commands to get more information about the experiment * 使用 NNI 的 [矩阵分解超参调优](https://github.com/microsoft/recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb) * [scikit-nni](https://github.com/ksachdeva/scikit-nni) 使用 NNI 为 scikit-learn 开发的超参搜索。 * ### **相关文章** ### - * [超参数优化的对比](docs/zh_CN/CommunitySharings/HpoComparision.md) - * [神经网络结构搜索的对比](docs/zh_CN/CommunitySharings/NasComparision.md) + * [超参数优化的对比](docs/zh_CN/CommunitySharings/HpoComparison.md) + * [神经网络结构搜索的对比](docs/zh_CN/CommunitySharings/NasComparison.md) * [并行化顺序算法:TPE](docs/zh_CN/CommunitySharings/ParallelizingTpeSearch.md) * [使用 NNI 为 SVD 自动调参](docs/zh_CN/CommunitySharings/RecommendersSvd.md) * [使用 NNI 为 SPTAG 自动调参](docs/zh_CN/CommunitySharings/SptagAutoTune.md) diff --git a/docs/en_US/Compression/CompressionUtils.md b/docs/en_US/Compression/CompressionUtils.md index 6f99bf5574..63d9719b19 100644 --- a/docs/en_US/Compression/CompressionUtils.md +++ b/docs/en_US/Compression/CompressionUtils.md @@ -121,14 +121,28 @@ fixed_mask = fix_mask_conflict('./resnet18_mask', net, data) ``` ## Model FLOPs/Parameters Counter -We provide a model counter for calculating the model FLOPs and parameters. 
This counter supports calculating FLOPs/parameters of a normal model without masks, it can also calculates FLOPs/parameters of a model with mask wrappers, which helps users easily check model complexity during model compression on NNI. Note that, for sturctured pruning, we only identify the remained filters according to its mask, which not taking the pruned input channels into consideration, so the calculated FLOPs will be larger than real number (i.e., the number calculated after Model Speedup).
+We provide a model counter for calculating the model FLOPs and parameters. This counter supports calculating FLOPs/parameters of a normal model without masks; it can also calculate FLOPs/parameters of a model with mask wrappers, which helps users easily check model complexity during model compression on NNI. Note that, for structured pruning, we only identify the remaining filters according to the mask and do not take the pruned input channels into consideration, so the calculated FLOPs will be larger than the real number (i.e., the number calculated after Model Speedup).
+
+We support two modes for collecting module information. The first mode is `default`, which only collects the information of convolution and linear layers. The second mode is `full`, which also collects the information of other operations. Users can easily use the collected `results` for further analysis.
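As a rough illustration of the kind of numbers such a counter reports for a convolution, the sketch below derives a parameter count and a multiply-accumulate count from a layer's weight and output shapes. The helper `conv2d_cost` is hypothetical and is not part of NNI. It assumes a bias term is counted in the parameters and that one multiply-accumulate counts as one FLOP; the convention used by `count_flops_params` itself may differ.

```python
from math import prod


def conv2d_cost(weight_size, output_size, bias=True):
    """Return (flops, params) for a 2D convolution.

    Hypothetical helper for illustration only, not an NNI API.
    weight_size: (out_channels, in_channels_per_group, kernel_h, kernel_w)
    output_size: (batch, out_channels, out_h, out_w), batch assumed to be 1
    """
    out_channels, in_channels, kernel_h, kernel_w = weight_size
    _, _, out_h, out_w = output_size
    # Each output element needs one multiply-accumulate per weight in its filter.
    macs_per_output = in_channels * kernel_h * kernel_w
    flops = out_channels * out_h * out_w * macs_per_output
    # Parameters: all weights, plus one bias per output channel if present.
    params = prod(weight_size) + (out_channels if bias else 0)
    return flops, params


print(conv2d_cost((5, 3, 1, 1), (1, 5, 2, 2)))  # (60, 20)
print(conv2d_cost((5, 5, 1, 1), (1, 5, 2, 2)))  # (100, 30)
```

Under these assumptions the two calls reproduce the `flops` and `params` values listed for `conv` and `conv2` in the sample `results` of the usage section.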
 ### Usage
 ```
 from nni.compression.pytorch.utils.counter import count_flops_params
-# Given input size (1, 1, 28, 28)
-flops, params = count_flops_params(model, (1, 1, 28, 28))
+# Given input size (1, 1, 28, 28)
+flops, params, results = count_flops_params(model, (1, 1, 28, 28))
+
+# Given input tensor with size (1, 1, 28, 28) and switch to full mode
+x = torch.randn(1, 1, 28, 28)
+
+flops, params, results = count_flops_params(model, (x,), mode='full') # tuple of tensor as input
+
 # Format output size to M (i.e., 10^6)
 print(f'FLOPs: {flops/1e6:.3f}M, Params: {params/1e6:.3f}M')
+print(results)
+{
+'conv': {'flops': [60], 'params': [20], 'weight_size': [(5, 3, 1, 1)], 'input_size': [(1, 3, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']},
+'conv2': {'flops': [100], 'params': [30], 'weight_size': [(5, 5, 1, 1)], 'input_size': [(1, 5, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']}
+}
+
 ```
diff --git a/docs/en_US/NAS/CDARTS.md b/docs/en_US/NAS/CDARTS.md
index 4152b8efa2..07f8faf22d 100644
--- a/docs/en_US/NAS/CDARTS.md
+++ b/docs/en_US/NAS/CDARTS.md
@@ -1,16 +1,17 @@
+
 # CDARTS
 ## Introduction
-CDARTS builds a cyclic feedback mechanism between the search and evaluation networks. First, the search network generates an initial topology for evaluation, so that the weights of the evaluation network can be optimized. Second, the architecture topology in the search network is further optimized by the label supervision in classification, as well as the regularization from the evaluation network through feature distillation. Repeating the above cycle results in a joint optimization of the search and evaluation networks, and thus enables the evolution of the topology to fit the final evaluation network.
+[CDARTS](https://arxiv.org/pdf/2006.10724.pdf) builds a cyclic feedback mechanism between the search and evaluation networks.
First, the search network generates an initial topology for evaluation, so that the weights of the evaluation network can be optimized. Second, the architecture topology in the search network is further optimized by the label supervision in classification, as well as the regularization from the evaluation network through feature distillation. Repeating the above cycle results in a joint optimization of the search and evaluation networks, and thus enables the evolution of the topology to fit the final evaluation network. -In implementation of `CdartsTrainer`, it first instantiates two models and two mutators (one for each). The first model is the so-called "search network", which is mutated with a `RegularizedDartsMutator` -- a mutator with subtle differences with `DartsMutator`. The second model is the "evaluation network", which is mutated with a discrete mutator that leverages the previous search network mutator, to sample a single path each time. Trainers train models and mutators alternatively. Users can refer to [references](#reference) if they are interested in more details on these trainers and mutators. +In implementation of `CdartsTrainer`, it first instantiates two models and two mutators (one for each). The first model is the so-called "search network", which is mutated with a `RegularizedDartsMutator` -- a mutator with subtle differences with `DartsMutator`. The second model is the "evaluation network", which is mutated with a discrete mutator that leverages the previous search network mutator, to sample a single path each time. Trainers train models and mutators alternatively. Users can refer to [paper](https://arxiv.org/pdf/2006.10724.pdf) if they are interested in more details on these trainers and mutators. ## Reproduction Results This is CDARTS based on the NNI platform, which currently supports CIFAR10 search and retrain. ImageNet search and retrain should also be supported, and we provide corresponding interfaces. 
Our reproduced results on NNI are slightly lower than the paper, but much higher than the original DARTS. Here we show the results of three independent experiments on CIFAR10. -| Runs | Paper | NNI | +| Runs | Paper | NNI | | ---- |:-------------:| :-----:| | 1 | 97.52 | 97.44 | | 2 | 97.53 | 97.48 | @@ -19,7 +20,7 @@ This is CDARTS based on the NNI platform, which currently supports CIFAR10 searc ## Examples -[Example code](https://github.com/microsoft/nni/tree/v1.9/examples/nas/cdarts) +[Example code](https://github.com/microsoft/nni/tree/master/examples/nas/cdarts) ```bash # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder. @@ -55,3 +56,4 @@ bash run_retrain_cifar.sh .. autoclass:: nni.algorithms.nas.pytorch.cdarts.RegularizedMutatorParallel :members: ``` + diff --git a/docs/en_US/NAS/Cream.md b/docs/en_US/NAS/Cream.md new file mode 100644 index 0000000000..beb232c085 --- /dev/null +++ b/docs/en_US/NAS/Cream.md @@ -0,0 +1,127 @@ +# Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search + +**[[Paper]](https://papers.nips.cc/paper/2020/file/d072677d210ac4c03ba046120f0802ec-Paper.pdf) [[Models-Google Drive]](https://drive.google.com/drive/folders/1NLGAbBF9bA1IUAxKlk2VjgRXhr6RHvRW?usp=sharing)[[Models-Baidu Disk (PWD: wqw6)]](https://pan.baidu.com/s/1TqQNm2s14oEdyNPimw3T9g) [[BibTex]](https://scholar.googleusercontent.com/scholar.bib?q=info:ICWVXc_SsKAJ:scholar.google.com/&output=citation&scisdr=CgUmooXfEMfTi0cV5aU:AAGBfm0AAAAAX7sQ_aXoamdKRaBI12tAVN8REq1VKNwM&scisig=AAGBfm0AAAAAX7sQ_RdYtp6BSro3zgbXVJU2MCgsG730&scisf=4&ct=citation&cd=-1&hl=ja)**
    + +In this work, we present a simple yet effective architecture distillation method. The central idea is that subnetworks can learn collaboratively and teach each other throughout the training process, aiming to boost the convergence of individual models. We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training. Distilling knowledge from the prioritized paths is able to boost the training of subnetworks. Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop. The discovered architectures achieve superior performance compared to the recent [MobileNetV3](https://arxiv.org/abs/1905.02244) and [EfficientNet](https://arxiv.org/abs/1905.11946) families under aligned settings. + +
    + +
    + + +## Reproduced Results +Top-1 Accuracy on ImageNet. The top-1 accuracy of Cream search algorithm surpasses MobileNetV3 and EfficientNet-B0/B1 on ImageNet. +The training with 16 Gpus is a little bit superior than 8 Gpus, as below. + +| Model (M Flops) | 8Gpus | 16Gpus | +| ---- |:-------------:| :-----:| +| 14M | 53.7 | 53.8 | +| 43M | 65.8 | 66.5 | +| 114M | 72.1 | 72.8 | +| 287M | 76.7 | 77.6 | +| 481M | 78.9 | 79.2 | +| 604M | 79.4 | 80.0 | + + + + +
+
+## Examples
+
+[Example code](https://github.com/microsoft/nni/tree/master/examples/nas/cream)
+
+Please run the following scripts in the example folder.
+
+## Data Preparation
+
+You need to first download the [ImageNet-2012](http://www.image-net.org/) to the folder `./data/imagenet` and move the validation set to the subfolder `./data/imagenet/val`. To move the validation set, you could use the following script:
+
+Put the ImageNet data in `./data`. It should look like the following:
+
+```
+./data/imagenet/train
+./data/imagenet/val
+...
+```
+
+## Quick Start
+
+### I. Search
+
+First, build environments for searching.
+
+```
+pip install -r ./requirements
+
+git clone https://github.com/NVIDIA/apex.git
+cd apex
+python setup.py install --cpp_ext --cuda_ext
+```
+
+To search for an architecture, you need to configure the parameters `FLOPS_MINIMUM` and `FLOPS_MAXIMUM` to specify the desired range of model FLOPs, such as [0, 600] (in millions of FLOPs). You can specify the FLOPs interval by changing these two parameters in `./configs/train.yaml`:
+
+```
+FLOPS_MINIMUM: 0 # Minimum Flops of Architecture
+FLOPS_MAXIMUM: 600 # Maximum Flops of Architecture
+```
+
+For example, if you expect to search an architecture with model flops <= 200M, please set the `FLOPS_MINIMUM` and `FLOPS_MAXIMUM` to be `0` and `200`.
+
+After you specify the flops of the architectures you would like to search, you can search an architecture now by running:
+
+```
+python -m torch.distributed.launch --nproc_per_node=8 ./train.py --cfg ./configs/train.yaml
+```
+
+The searched architectures need to be retrained to obtain the final model. The final model is saved in `.pth.tar` format. Retraining code will be released soon.
+
+### II. Retrain
+
+To train searched architectures, you need to configure the parameter `MODEL_SELECTION` to specify the model Flops. To specify which model to train, you should add `MODEL_SELECTION` in `./configs/retrain.yaml`.
You can select one from [14,43,112,287,481,604], which stand for different model Flops (in millions).
+
+```
+MODEL_SELECTION: 43 # Retrain 43m model
+MODEL_SELECTION: 481 # Retrain 481m model
+......
+```
+
+To train random architectures, you need to set `MODEL_SELECTION` to `-1` and configure the parameter `INPUT_ARCH`:
+
+```
+MODEL_SELECTION: -1 # Train random architectures
+INPUT_ARCH: [[0], [3], [3, 3], [3, 1, 3], [3, 3, 3, 3], [3, 3, 3], [0]] # Random Architectures
+......
+```
+
+After adding `MODEL_SELECTION` in `./configs/retrain.yaml`, you need to use the following command to train the model.
+
+```
+python -m torch.distributed.launch --nproc_per_node=8 ./retrain.py --cfg ./configs/retrain.yaml
+```
+
+### III. Test
+
+To test our trained models, you need to use `MODEL_SELECTION` in `./configs/test.yaml` to specify which model to test.
+
+```
+MODEL_SELECTION: 43 # test 43m model
+MODEL_SELECTION: 481 # test 481m model
+......
+```
+
+After specifying the flops of the model, you need to write the path to the resume model in `./test.sh`.
+
+```
+RESUME_PATH: './43.pth.tar'
+RESUME_PATH: './481.pth.tar'
+......
+```
+
+We provide 14M/43M/114M/287M/481M/604M pretrained models in [google drive](https://drive.google.com/drive/folders/1CQjyBryZ4F20Rutj7coF8HWFcedApUn2) or [[Models-Baidu Disk (password: wqw6)]](https://pan.baidu.com/s/1TqQNm2s14oEdyNPimw3T9g) .
+
+After downloading the pretrained models and adding `MODEL_SELECTION` and `RESUME_PATH` in `./configs/test.yaml`, you need to use the following command to test the model.
+ +``` +python -m torch.distributed.launch --nproc_per_node=8 ./test.py --cfg ./configs/test.yaml +``` diff --git a/docs/en_US/NAS/one_shot_nas.rst b/docs/en_US/NAS/one_shot_nas.rst index cc7fa688b6..77b3cfcc94 100644 --- a/docs/en_US/NAS/one_shot_nas.rst +++ b/docs/en_US/NAS/one_shot_nas.rst @@ -14,4 +14,5 @@ One-shot NAS algorithms leverage weight sharing among models in neural architect SPOS CDARTS ProxylessNAS - TextNAS \ No newline at end of file + TextNAS + Cream diff --git a/docs/en_US/TrainingService/AdaptDLMode.md b/docs/en_US/TrainingService/AdaptDLMode.md new file mode 100644 index 0000000000..4b95a00865 --- /dev/null +++ b/docs/en_US/TrainingService/AdaptDLMode.md @@ -0,0 +1,188 @@ +# Run an Experiment on AdaptDL + +Now NNI supports running experiment on [AdaptDL](https://github.com/petuum/adaptdl). Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. In AdaptDL mode, your trial program will run as AdaptDL job in Kubernetes cluster. + +AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud. + +## Prerequisite for Kubernetes Service + +1. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes [on Azure](https://azure.microsoft.com/en-us/services/kubernetes-service/), or [on-premise](https://kubernetes.io/docs/setup/) with [cephfs](https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd), or [microk8s with storage add-on enabled](https://microk8s.io/docs/addons). +2. Helm install **AdaptDL Scheduler** to your Kubernetes cluster. 
Follow this [guideline](https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html) to setup AdaptDL scheduler. +3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer this [guideline]( https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig. +4. If your NNI trial job needs GPU resource, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure **Nvidia device plugin for Kubernetes**. +5. (Optional) Prepare a **NFS server** and export a general purpose mount as external storage. +6. Install **NNI**, follow the install guide [here](../Tutorial/QuickStart.md). + +### Verify Prerequisites + +```bash +nnictl --version +# Expected: +``` + +```bash +kubectl version +# Expected that the kubectl client version matches the server version. +``` + +```bash +kubectl api-versions | grep adaptdl +# Expected: adaptdl.petuum.com/v1 +``` + +## Run an experiment + +We have a CIFAR10 example that fully leverages the AdaptDL scheduler under `examples/trials/cifar10_pytorch` folder. (`main_adl.py` and `config_adl.yaml`) + +Here is a template configuration specification to use AdaptDL as a training service. + +```yaml +authorName: default +experimentName: minimal_adl + +trainingServicePlatform: adl +nniManagerIp: 10.1.10.11 +logCollection: http + +tuner: + builtinTunerName: GridSearch +searchSpacePath: search_space.json + +trialConcurrency: 2 +maxTrialNum: 2 + +trial: + adaptive: false # optional. + image: + imagePullSecrets: # optional + - name: stagingsecret + codeDir: . 
+  command: python main.py
+  gpuNum: 1
+  cpuNum: 1 # optional
+  memorySize: 8Gi # optional
+  nfs: # optional
+    server: 10.20.41.55
+    path: /
+    containerMountPath: /nfs
+  checkpoint: # optional
+    storageClass: microk8s-hostpath
+    storageSize: 1Gi
+```
+
+Configs not mentioned below follow the
+[default specs defined in the NNI doc](https://nni.readthedocs.io/en/latest/Tutorial/ExperimentConfig.html#configuration-spec).
+
+* **trainingServicePlatform**: Choose `adl` to use the Kubernetes cluster with AdaptDL scheduler.
+* **nniManagerIp**: *Required* for the `adl` training service to get the correct info and metrics back from the cluster.
+IP address of the machine with NNI manager (NNICTL) that launches the NNI experiment.
+* **logCollection**: *Recommended* to set as `http`. It will collect the trial logs on the cluster back to your machine via http.
+* **tuner**: It supports the Tuun tuner and all NNI built-in tuners (except for the checkpoint feature of the NNI PBT tuner).
+* **trial**: It defines the specs of an `adl` trial.
+  * **adaptive**: (*Optional*) Boolean for AdaptDL trainer. When `true`, the job is preemptible and adaptive.
+  * **image**: Docker image for the trial
+  * **imagePullSecrets**: (*Optional*) If you are using a private registry,
+  you need to provide the secret to successfully pull the image.
+  * **codeDir**: the working directory of the container. `.` means the default working directory defined by the image.
+  * **command**: the bash command to start the trial
+  * **gpuNum**: the number of GPUs requested for this trial. It must be a non-negative integer.
+  * **cpuNum**: (*Optional*) the number of CPUs requested for this trial. It must be a non-negative integer.
+  * **memorySize**: (*Optional*) the size of memory requested for this trial. It must follow the Kubernetes
+  [default format](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory).
+  * **nfs**: (*Optional*) mounting external storage.
For more information about using NFS, see the NFS Storage section below.
+  * **checkpoint**: (*Optional*) [storage settings](https://kubernetes.io/docs/concepts/storage/storage-classes/) for AdaptDL internal checkpoints. You can leave it unset if you are not a dev user.
+
+### NFS Storage
+
+As you may have noticed in the above configuration spec,
+an *optional* section is available to configure NFS external storage. It can be omitted when no external storage is required, for example when a Docker image already contains the code and data.
+
+Note that the `adl` training service does NOT mount an NFS to the local dev machine; you can mount it locally yourself to manage the filesystem, copy data or code, etc.
+The `adl` training service can then mount it into Kubernetes for every trial, given the proper configuration:
+
+* **server**: NFS server address, e.g. IP address or domain
+* **path**: NFS server export path, i.e. the absolute path in NFS that can be mounted to trials
+* **containerMountPath**: the absolute path in the container at which to mount the NFS **path** above,
+so that every trial has access to the NFS.
+In the trial containers, you can access the NFS with this path.
+
+Use cases:
+
+* If your training trials depend on a large dataset, you may want to download it onto the NFS first,
+  and mount it so that it can be shared across multiple trials.
+* The storage for containers is ephemeral and the trial containers will be deleted after a trial's lifecycle is over.
+So if you want to export your trained models,
+you may mount the NFS to the trial to persist and export your trained models.
+
+In short, there is no restriction on how a trial reads from or writes to the NFS storage, so you may use it flexibly as per your needs.
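The model-export use case above can be sketched as follows. This is a minimal illustration, assuming the NFS is mounted at `/nfs` (the `containerMountPath` from the template configuration) and that `NNI_TRIAL_JOB_ID` is set inside the trial container. The helper name `trial_export_dir`, the `exports` subfolder, and the `local-debug` fallback are hypothetical choices, not part of NNI or AdaptDL.

```python
import os
from pathlib import Path


def trial_export_dir(mount_path="/nfs", subdir="exports"):
    """Create and return a per-trial directory on the mounted NFS.

    Hypothetical layout: <mount_path>/<subdir>/<trial id>.
    """
    # Fall back to a fixed name when running outside an NNI trial.
    trial_id = os.environ.get("NNI_TRIAL_JOB_ID", "local-debug")
    export_dir = Path(mount_path) / subdir / trial_id
    # parents/exist_ok make the call safe to repeat within a trial.
    export_dir.mkdir(parents=True, exist_ok=True)
    return export_dir
```

A trial could then write its artifacts under the returned path, for example a model checkpoint file, so that they outlive the trial container.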
+
+
+## Monitor via Log Stream
+
+Follow the log streaming of a certain trial:
+
+```bash
+nnictl log trial --trial_id=
+```
+
+```bash
+nnictl log trial --trial_id=
+```
+
+Note that *after* a trial has finished and its pod has been deleted,
+no logs can be retrieved via this command.
+However, you may still be able to access the past trial logs
+using the following approach.
+
+
+## Monitor via TensorBoard
+
+In the context of NNI, an experiment has multiple trials.
+For easy comparison across trials for a model tuning process,
+we support TensorBoard integration. Each experiment has
+an independent TensorBoard logging directory and therefore its own dashboard.
+
+
+You can only use the TensorBoard while the monitored experiment is running.
+In other words, monitoring stopped experiments is not supported.
+
+
+In the trial container you may have access to two environment variables:
+
+* `ADAPTDL_TENSORBOARD_LOGDIR`: the TensorBoard logging directory for the current experiment,
+* `NNI_TRIAL_JOB_ID`: the `trial` job id for the current trial.
+
+It is recommended to join them to form the logging directory for the trial,
+for example in Python:
+
+```python
+import os
+tensorboard_logdir = os.path.join(
+    os.getenv("ADAPTDL_TENSORBOARD_LOGDIR"),
+    os.getenv("NNI_TRIAL_JOB_ID")
+)
+```
+
+If an experiment is stopped, the data logged here
+(defined by *the above envs* for monitoring with the following commands)
+will be lost. To persist the logged data, you can use external storage (e.g. mount an NFS)
+to export it and view the TensorBoard locally.
+
+
+With the above setting, you can monitor the experiment easily
+via TensorBoard by
+
+```bash
+nnictl tensorboard start
+```
+
+If multiple experiments are running at the same time, you may use
+
+```bash
+nnictl tensorboard start
+```
+
+It will provide you with the web URL to access the TensorBoard.
+
+Note that you have the flexibility to set up the local `--port`
+for the TensorBoard.
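The directory-joining snippet in the TensorBoard section assumes both environment variables are set. When the same trial code is run outside an `adl` trial, for instance while debugging locally, `os.getenv` returns `None` and `os.path.join` raises a `TypeError`. The sketch below adds fallbacks; the function name `resolve_tensorboard_logdir` and the default values are hypothetical and only illustrate one way to handle the unset case.

```python
import os


def resolve_tensorboard_logdir(default_root="./tensorboard", default_trial="local"):
    """Join the experiment log root and the trial id, with local fallbacks.

    The defaults are illustrative assumptions, used only when the
    AdaptDL/NNI environment variables are not set.
    """
    root = os.getenv("ADAPTDL_TENSORBOARD_LOGDIR", default_root)
    trial = os.getenv("NNI_TRIAL_JOB_ID", default_trial)
    return os.path.join(root, trial)
```

Inside a trial container the result is the per-trial subdirectory of the experiment's TensorBoard directory; elsewhere it is a local path built from the defaults.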
diff --git a/docs/en_US/TrainingService/Overview.md b/docs/en_US/TrainingService/Overview.md index bd87037f0f..4daa21aceb 100644 --- a/docs/en_US/TrainingService/Overview.md +++ b/docs/en_US/TrainingService/Overview.md @@ -4,7 +4,7 @@ NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled. -Users can use training service provided by NNI, to run trial jobs on [local machine](./LocalMode.md), [remote machines](./RemoteMachineMode.md), and on clusters like [PAI](./PaiMode.md), [Kubeflow](./KubeflowMode.md), [FrameworkController](./FrameworkControllerMode.md), [DLTS](./DLTSMode.md) and [AML](./AMLMode.md). These are called *built-in training services*. +Users can use training service provided by NNI, to run trial jobs on [local machine](./LocalMode.md), [remote machines](./RemoteMachineMode.md), and on clusters like [PAI](./PaiMode.md), [Kubeflow](./KubeflowMode.md), [AdaptDL](./AdaptDLMode.md), [FrameworkController](./FrameworkControllerMode.md), [DLTS](./DLTSMode.md) and [AML](./AMLMode.md). These are called *built-in training services*. If the computing resource customers try to use is not listed above, NNI provides interface that allows users to build their own training service easily. Please refer to "[how to implement training service](./HowToImplementTrainingService)" for details. @@ -24,6 +24,7 @@ In case users intend to use large files in their experiment (like large-scaled d |[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. 
NNI will submit the trial jobs in remote machine, and schedule suitable machine with enough gpu resource if specified.| |[__PAI__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in PAI's container created by Docker.| |[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.| +|[__AdaptDL__](./AdaptDLMode.md)|NNI supports running an experiment on [AdaptDL](https://github.com/petuum/adaptdl), called AdaptDL mode. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster.| |[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator.
Now you can use FrameworkController as the training service to run NNI experiment.| |[__DLTS__](./DLTSMode.md)|NNI supports running experiment using [DLTS](https://github.com/microsoft/DLWorkspace.git), which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.| |[__AML__](./AMLMode.md)|NNI supports running an experiment on [AML](https://azure.microsoft.com/en-us/services/machine-learning/) , called aml mode. diff --git a/docs/en_US/TrainingService/PaiMode.md b/docs/en_US/TrainingService/PaiMode.md index 4e9acd234c..b86eb914e0 100644 --- a/docs/en_US/TrainingService/PaiMode.md +++ b/docs/en_US/TrainingService/PaiMode.md @@ -2,25 +2,27 @@ === NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker. +[toc] + ## Setup environment -Step 1. Install NNI, follow the install guide [here](../Tutorial/QuickStart.md). +**Step 1. Install NNI, follow the install guide [here](../Tutorial/QuickStart.md).** -Step 2. Get token. +**Step 2. Get token.** Open web portal of OpenPAI, and click `My profile` button in the top-right side. -![](../../img/pai_profile.jpg) + Click `copy` button in the page to copy a jwt token. -![](../../img/pai_token.jpg) + -Step 3. Mount NFS storage to local machine. +**Step 3. Mount NFS storage to local machine.** Click `Submit job` button in web portal. -![](../../img/pai_job_submission_page.jpg) + Find the data management region in job submission page. 
-![](../../img/pai_data_management_page.jpg) + The `Preview container paths` is the NFS host and path that OpenPAI provided, you need to mount the corresponding host and path to your local machine first, then NNI could use the OpenPAI's NFS storage. For example, use the following command: @@ -33,9 +35,9 @@ You could use the following configuration in your NNI's config file: ```yaml nniManagerNFSMountPath: /local/mnt -``` +``` -Step 4. Get OpenPAI's storage config name and nniManagerMountPath +**Step 4. Get OpenPAI's storage config name and nniManagerMountPath** The `Team share storage` field is storage configuration used to specify storage value in OpenPAI. You can get `paiStorageConfigName` and `containerNFSMountPath` field in `Team share storage`, for example: @@ -44,7 +46,10 @@ paiStorageConfigName: confignfs-data containerNFSMountPath: /mnt/confignfs-data ``` + + ## Run an experiment + Use `examples/trials/mnist-annotation` as an example. The NNI config YAML file's content is like: ```yaml @@ -88,9 +93,11 @@ paiConfig: Note: You should set `trainingServicePlatform: pai` in NNI config YAML file if you want to start experiment in pai mode. The host field in configuration file is PAI's job submission page uri, like `10.10.5.1`, the default http protocol in NNI is `http`, if your PAI's cluster enabled https, please use the uri in `https://10.10.5.1` format. + + ### Trial configurations -Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMode.md), `trial` configuration in pai mode have these additional keys: +Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMode.md), `trial` configuration in pai mode has the following additional keys: * cpuNum @@ -136,6 +143,8 @@ Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMod 2. 
If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taksRoles and start multiple tasks in one trial job, users should ensure that only one taskRole report metric to NNI, otherwise there might be some conflict error. + + ### OpenPAI configurations `paiConfig` includes OpenPAI specific configurations, @@ -171,17 +180,23 @@ Notice: In pai mode, NNIManager will start a rest server and listen on a port wh Once a trial job is completed, you can goto NNI WebUI's overview page (like http://localhost:8080/oview) to check trial's information. Expand a trial information in trial list view, click the logPath link like: -![](../../img/nni_webui_joblist.jpg) + And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS: -![](../../img/nni_trial_hdfs_output.jpg) + You can see there're three fils in output folder: stderr, stdout, and trial.log + + ## data management + Before using NNI to start your experiment, users should set the corresponding mount data path in your nniManager machine. OpenPAI has their own storage(NFS, AzureBlob ...), and the storage will used in OpenPAI will be mounted to the container when it start a job. Users should set the OpenPAI storage type by `paiStorageConfigName` field to choose a storage in OpenPAI. Then users should mount the storage to their nniManager machine, and set the `nniManagerNFSMountPath` field in configuration file, NNI will generate bash files and copy data in `codeDir` to the `nniManagerNFSMountPath` folder, then NNI will start a trial job. The data in `nniManagerNFSMountPath` will be sync to OpenPAI storage, and will be mounted to OpenPAI's container. The data path in container is set in `containerNFSMountPath`, NNI will enter this folder first, and then run scripts to start a trial job. + + ## version check + NNI support version check feature in since version 0.6. 
It is a policy to insure the version of NNIManager is consistent with trialKeeper, and avoid errors caused by version incompatibility. Check policy: @@ -190,4 +205,5 @@ Check policy: 3. Note that the version check feature only check first two digits of version.For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7. If you could not run your experiment and want to know if it is caused by version check, you could check your webUI, and there will be an error message about version check. -![](../../img/version_check.png) + + diff --git a/docs/en_US/Tutorial/ExperimentConfig.md b/docs/en_US/Tutorial/ExperimentConfig.md index 75e14bcaca..d41f7a5d0a 100644 --- a/docs/en_US/Tutorial/ExperimentConfig.md +++ b/docs/en_US/Tutorial/ExperimentConfig.md @@ -260,6 +260,8 @@ Specifies the platform to run the experiment, including __local__, __remote__, _ * __kubeflow__ submit trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/), NNI support kubeflow based on normal kubernetes and [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For detail please refer to [Kubeflow Docs](../TrainingService/KubeflowMode.md) +* __adl__ submit trial jobs to [AdaptDL](https://github.com/petuum/adaptdl), NNI supports AdaptDL on a Kubernetes cluster. For details, please refer to [AdaptDL Docs](../TrainingService/AdaptDLMode.md) + * TODO: explain frameworkcontroller.
### searchSpacePath diff --git a/docs/en_US/Tutorial/InstallationLinux.md b/docs/en_US/Tutorial/InstallationLinux.md index a2f5cc0242..089ef08bbb 100644 --- a/docs/en_US/Tutorial/InstallationLinux.md +++ b/docs/en_US/Tutorial/InstallationLinux.md @@ -118,3 +118,4 @@ Due to potential programming changes, the minimum system requirements of NNI may * [How to run an experiment on OpenPAI?](../TrainingService/PaiMode.md) * [How to run an experiment on Kubernetes through Kubeflow?](../TrainingService/KubeflowMode.md) * [How to run an experiment on Kubernetes through FrameworkController?](../TrainingService/FrameworkControllerMode.md) +* [How to run an experiment on Kubernetes through AdaptDL?](../TrainingService/AdaptDLMode.md) \ No newline at end of file diff --git a/docs/en_US/Tutorial/QuickStart.md b/docs/en_US/Tutorial/QuickStart.md index e70fe7ca25..a70410c219 100644 --- a/docs/en_US/Tutorial/QuickStart.md +++ b/docs/en_US/Tutorial/QuickStart.md @@ -281,3 +281,4 @@ Below is the status of all trials. 
Specifically: * [How to run an experiment on OpenPAI?](../TrainingService/PaiMode.md) * [How to run an experiment on Kubernetes through Kubeflow?](../TrainingService/KubeflowMode.md) * [How to run an experiment on Kubernetes through FrameworkController?](../TrainingService/FrameworkControllerMode.md) +* [How to run an experiment on Kubernetes through AdaptDL?](../TrainingService/AdaptDLMode.md) \ No newline at end of file diff --git a/docs/en_US/training_services.rst b/docs/en_US/training_services.rst index 71b2cf6f8c..da286b9d50 100644 --- a/docs/en_US/training_services.rst +++ b/docs/en_US/training_services.rst @@ -8,6 +8,7 @@ Introduction to NNI Training Services OpenPAI<./TrainingService/PaiMode> OpenPAI Yarn Mode<./TrainingService/PaiYarnMode> Kubeflow<./TrainingService/KubeflowMode> + AdaptDL<./TrainingService/AdaptDLMode> FrameworkController<./TrainingService/FrameworkControllerMode> DLTS<./TrainingService/DLTSMode> AML<./TrainingService/AMLMode> diff --git a/docs/img/cream.png b/docs/img/cream.png new file mode 100644 index 0000000000..99a24840a7 Binary files /dev/null and b/docs/img/cream.png differ diff --git a/docs/img/cream_flops100.jpg b/docs/img/cream_flops100.jpg new file mode 100644 index 0000000000..a31078dd8f Binary files /dev/null and b/docs/img/cream_flops100.jpg differ diff --git a/docs/img/cream_flops600.jpg b/docs/img/cream_flops600.jpg new file mode 100644 index 0000000000..e9f7a5a6d0 Binary files /dev/null and b/docs/img/cream_flops600.jpg differ diff --git a/examples/__init__.py b/examples/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/nas/__init__.py b/examples/nas/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/nas/cream/Cream.md b/examples/nas/cream/Cream.md new file mode 100644 index 0000000000..a871bddf78 --- /dev/null +++ b/examples/nas/cream/Cream.md @@ -0,0 +1 @@ +[Documentation](https://nni.readthedocs.io/en/latest/NAS/Cream.html) diff --git 
a/examples/nas/cream/__init__.py b/examples/nas/cream/__init__.py new file mode 100755 index 0000000000..e69de29bb2 diff --git a/examples/nas/cream/configs/retrain.yaml b/examples/nas/cream/configs/retrain.yaml new file mode 100644 index 0000000000..2339dea982 --- /dev/null +++ b/examples/nas/cream/configs/retrain.yaml @@ -0,0 +1,52 @@ +AUTO_RESUME: False +DATA_DIR: './data/imagenet' +MODEL: '604m_retrain' +RESUME_PATH: './experiments/workspace/retrain/resume.pth.tar' +SAVE_PATH: './' +SEED: 42 +LOG_INTERVAL: 50 +RECOVERY_INTERVAL: 0 +WORKERS: 4 +NUM_GPU: 2 +SAVE_IMAGES: False +AMP: False +OUTPUT: 'None' +EVAL_METRICS: 'prec1' +TTA: 0 +LOCAL_RANK: 0 + +DATASET: + NUM_CLASSES: 1000 + IMAGE_SIZE: 224 # image patch size + INTERPOLATION: 'random' # Image resize interpolation type + BATCH_SIZE: 32 # batch size + NO_PREFECHTER: False + +NET: + GP: 'avg' + DROPOUT_RATE: 0.0 + SELECTION: 42 + + EMA: + USE: True + FORCE_CPU: False # force model ema to be tracked on CPU + DECAY: 0.9998 + +OPT: 'sgd' +OPT_EPS: 1e-2 +MOMENTUM: 0.9 +DECAY_RATE: 0.1 + +SCHED: 'sgd' +LR_NOISE: None +LR_NOISE_PCT: 0.67 +LR_NOISE_STD: 1.0 +WARMUP_LR: 1e-4 +MIN_LR: 1e-5 +EPOCHS: 200 +START_EPOCH: None +DECAY_EPOCHS: 30.0 +WARMUP_EPOCHS: 3 +COOLDOWN_EPOCHS: 10 +PATIENCE_EPOCHS: 10 +LR: 1e-2 \ No newline at end of file diff --git a/examples/nas/cream/configs/test.yaml b/examples/nas/cream/configs/test.yaml new file mode 100644 index 0000000000..4bf568517f --- /dev/null +++ b/examples/nas/cream/configs/test.yaml @@ -0,0 +1,37 @@ +AUTO_RESUME: True +DATA_DIR: './data/imagenet' +MODEL: 'Childnet_Testing' +RESUME_PATH: './experiments/workspace/ckps/42.pth.tar' +SAVE_PATH: './' +SEED: 42 +LOG_INTERVAL: 50 +RECOVERY_INTERVAL: 0 +WORKERS: 4 +NUM_GPU: 2 +SAVE_IMAGES: False +AMP: False +OUTPUT: 'None' +EVAL_METRICS: 'prec1' +TTA: 0 +LOCAL_RANK: 0 + +DATASET: + NUM_CLASSES: 1000 + IMAGE_SIZE: 224 # image patch size + INTERPOLATION: 'bilinear' # Image resize interpolation type + BATCH_SIZE: 32 # batch size + 
NO_PREFECHTER: False + +NET: + GP: 'avg' + DROPOUT_RATE: 0.0 + SELECTION: 42 + + EMA: + USE: True + FORCE_CPU: False # force model ema to be tracked on CPU + DECAY: 0.9998 + +OPTIMIZER: + MOMENTUM: 0.9 + WEIGHT_DECAY: 1e-3 \ No newline at end of file diff --git a/examples/nas/cream/configs/train.yaml b/examples/nas/cream/configs/train.yaml new file mode 100644 index 0000000000..85164e0eda --- /dev/null +++ b/examples/nas/cream/configs/train.yaml @@ -0,0 +1,53 @@ +AUTO_RESUME: False +DATA_DIR: './data/imagenet' +MODEL: 'Supernet_Training' +RESUME_PATH: './experiments/workspace/train/resume.pth.tar' +SAVE_PATH: './' +SEED: 42 +LOG_INTERVAL: 50 +RECOVERY_INTERVAL: 0 +WORKERS: 8 +NUM_GPU: 8 +SAVE_IMAGES: False +AMP: False +OUTPUT: 'None' +EVAL_METRICS: 'prec1' +TTA: 0 +LOCAL_RANK: 0 + +DATASET: + NUM_CLASSES: 1000 + IMAGE_SIZE: 224 # image patch size + INTERPOLATION: 'bilinear' # Image resize interpolation type + BATCH_SIZE: 128 # batch size + +NET: + GP: 'avg' + DROPOUT_RATE: 0.0 + + EMA: + USE: True + FORCE_CPU: False # force model ema to be tracked on CPU + DECAY: 0.9998 + +OPT: 'sgd' +LR: 1.0 +EPOCHS: 120 +META_LR: 1e-4 + +BATCHNORM: + SYNC_BN: False + +SUPERNET: + UPDATE_ITER: 200 + SLICE: 4 + POOL_SIZE: 10 + RESUNIT: False + DIL_CONV: False + UPDATE_2ND: True + FLOPS_MINIMUM: 0 + FLOPS_MAXIMUM: 600 + PICK_METHOD: 'meta' + META_STA_EPOCH: 20 + HOW_TO_PROB: 'pre_prob' + PRE_PROB: (0.05,0.2,0.05,0.5,0.05,0.15) \ No newline at end of file diff --git a/examples/nas/cream/lib/config.py b/examples/nas/cream/lib/config.py new file mode 100644 index 0000000000..fd50b4a9a5 --- /dev/null +++ b/examples/nas/cream/lib/config.py @@ -0,0 +1,123 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. 
+# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +from yacs.config import CfgNode as CN + +DEFAULT_CROP_PCT = 0.875 +IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406) +IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225) + +__C = CN() + +cfg = __C + +__C.AUTO_RESUME = True +__C.DATA_DIR = './data/imagenet' +__C.MODEL = 'cream' +__C.RESUME_PATH = './experiments/ckps/resume.pth.tar' +__C.SAVE_PATH = './experiments/ckps/' +__C.SEED = 42 +__C.LOG_INTERVAL = 50 +__C.RECOVERY_INTERVAL = 0 +__C.WORKERS = 4 +__C.NUM_GPU = 1 +__C.SAVE_IMAGES = False +__C.AMP = False +__C.ACC_GAP = 5 +__C.OUTPUT = 'output/path/' +__C.EVAL_METRICS = 'prec1' +__C.TTA = 0 # Test or inference time augmentation +__C.LOCAL_RANK = 0 +__C.VERBOSE = False + +# dataset configs +__C.DATASET = CN() +__C.DATASET.NUM_CLASSES = 1000 +__C.DATASET.IMAGE_SIZE = 224 # image patch size +__C.DATASET.INTERPOLATION = 'bilinear' # Image resize interpolation type +__C.DATASET.BATCH_SIZE = 32 # batch size +__C.DATASET.NO_PREFECHTER = False +__C.DATASET.PIN_MEM = True +__C.DATASET.VAL_BATCH_MUL = 4 + + +# model configs +__C.NET = CN() +__C.NET.SELECTION = 14 +__C.NET.GP = 'avg' # type of global pool ["avg", "max", "avgmax", "avgmaxc"] +__C.NET.DROPOUT_RATE = 0.0 # dropout rate +__C.NET.INPUT_ARCH = [[0], [3], [3, 3], [3, 1, 3], [3, 3, 3, 3], [3, 3, 3], [0]] + +# model ema parameters +__C.NET.EMA = CN() +__C.NET.EMA.USE = True +__C.NET.EMA.FORCE_CPU = False # force model ema to be tracked on CPU +__C.NET.EMA.DECAY = 0.9998 + +# optimizer configs +__C.OPT = 'sgd' +__C.OPT_EPS = 1e-2 +__C.MOMENTUM = 0.9 +__C.WEIGHT_DECAY = 1e-4 +__C.OPTIMIZER = CN() +__C.OPTIMIZER.NAME = 'sgd' +__C.OPTIMIZER.MOMENTUM = 0.9 +__C.OPTIMIZER.WEIGHT_DECAY = 1e-3 + +# scheduler configs +__C.SCHED = 'sgd' +__C.LR_NOISE = None +__C.LR_NOISE_PCT = 0.67 
+__C.LR_NOISE_STD = 1.0 +__C.WARMUP_LR = 1e-4 +__C.MIN_LR = 1e-5 +__C.EPOCHS = 200 +__C.START_EPOCH = None +__C.DECAY_EPOCHS = 30.0 +__C.WARMUP_EPOCHS = 3 +__C.COOLDOWN_EPOCHS = 10 +__C.PATIENCE_EPOCHS = 10 +__C.DECAY_RATE = 0.1 +__C.LR = 1e-2 +__C.META_LR = 1e-4 + +# data augmentation parameters +__C.AUGMENTATION = CN() +__C.AUGMENTATION.AA = 'rand-m9-mstd0.5' +__C.AUGMENTATION.COLOR_JITTER = 0.4 +__C.AUGMENTATION.RE_PROB = 0.2 # random erase prob +__C.AUGMENTATION.RE_MODE = 'pixel' # random erase mode +__C.AUGMENTATION.MIXUP = 0.0 # mixup alpha +__C.AUGMENTATION.MIXUP_OFF_EPOCH = 0 # turn off mixup after this epoch +__C.AUGMENTATION.SMOOTHING = 0.1 # label smoothing parameters + +# batch norm parameters (only works with gen_efficientnet based models +# currently) +__C.BATCHNORM = CN() +__C.BATCHNORM.SYNC_BN = True +__C.BATCHNORM.BN_TF = False +__C.BATCHNORM.BN_MOMENTUM = 0.1 # batchnorm momentum override +__C.BATCHNORM.BN_EPS = 1e-5 # batchnorm eps override + +# supernet training hyperparameters +__C.SUPERNET = CN() +__C.SUPERNET.UPDATE_ITER = 1300 +__C.SUPERNET.SLICE = 4 +__C.SUPERNET.POOL_SIZE = 10 +__C.SUPERNET.RESUNIT = False +__C.SUPERNET.DIL_CONV = False +__C.SUPERNET.UPDATE_2ND = True +__C.SUPERNET.FLOPS_MAXIMUM = 600 +__C.SUPERNET.FLOPS_MINIMUM = 0 +__C.SUPERNET.PICK_METHOD = 'meta' # pick teacher method +__C.SUPERNET.META_STA_EPOCH = 20 # start using meta picking method +__C.SUPERNET.HOW_TO_PROB = 'pre_prob' # sample method +__C.SUPERNET.PRE_PROB = (0.05, 0.2, 0.05, 0.5, 0.05, + 0.15) # sample prob in 'pre_prob' diff --git a/examples/nas/cream/lib/core/retrain.py b/examples/nas/cream/lib/core/retrain.py new file mode 100644 index 0000000000..7468db2bb5 --- /dev/null +++ b/examples/nas/cream/lib/core/retrain.py @@ -0,0 +1,135 @@ +import os +import time +import torch +import torchvision + +from collections import OrderedDict + +from lib.utils.util import AverageMeter, accuracy, reduce_tensor + +def train_epoch( + epoch, model, loader, optimizer, loss_fn, 
cfg, + lr_scheduler=None, saver=None, output_dir='', use_amp=False, + model_ema=None, logger=None, writer=None, local_rank=0): + batch_time_m = AverageMeter() + data_time_m = AverageMeter() + losses_m = AverageMeter() + prec1_m = AverageMeter() + prec5_m = AverageMeter() + + model.train() + + end = time.time() + last_idx = len(loader) - 1 + num_updates = epoch * len(loader) + optimizer.zero_grad() + for batch_idx, (input, target) in enumerate(loader): + last_batch = batch_idx == last_idx + data_time_m.update(time.time() - end) + + input = input.cuda() + target = target.cuda() + output = model(input) + + loss = loss_fn(output, target) + + prec1, prec5 = accuracy(output, target, topk=(1, 5)) + + if cfg.NUM_GPU > 1: + reduced_loss = reduce_tensor(loss.data, cfg.NUM_GPU) + prec1 = reduce_tensor(prec1, cfg.NUM_GPU) + prec5 = reduce_tensor(prec5, cfg.NUM_GPU) + else: + reduced_loss = loss.data + + optimizer.zero_grad() + loss.backward() + optimizer.step() + + torch.cuda.synchronize() + + losses_m.update(reduced_loss.item(), input.size(0)) + prec1_m.update(prec1.item(), output.size(0)) + prec5_m.update(prec5.item(), output.size(0)) + + if model_ema is not None: + model_ema.update(model) + num_updates += 1 + + batch_time_m.update(time.time() - end) + if last_batch or batch_idx % cfg.LOG_INTERVAL == 0: + lrl = [param_group['lr'] for param_group in optimizer.param_groups] + lr = sum(lrl) / len(lrl) + + if local_rank == 0: + logger.info( + 'Train: {} [{:>4d}/{}] ' + 'Loss: {loss.val:>9.6f} ({loss.avg:>6.4f}) ' + 'Prec@1: {top1.val:>7.4f} ({top1.avg:>7.4f}) ' + 'Prec@5: {top5.val:>7.4f} ({top5.avg:>7.4f}) ' + 'Time: {batch_time.val:.3f}s, {rate:>7.2f}/s ' + '({batch_time.avg:.3f}s, {rate_avg:>7.2f}/s) ' + 'LR: {lr:.3e} ' + 'Data: {data_time.val:.3f} ({data_time.avg:.3f})'.format( + epoch, + batch_idx, + len(loader), + loss=losses_m, + top1=prec1_m, + top5=prec5_m, + batch_time=batch_time_m, + rate=input.size(0) * + cfg.NUM_GPU / + batch_time_m.val, + rate_avg=input.size(0) * +
cfg.NUM_GPU / + batch_time_m.avg, + lr=lr, + data_time=data_time_m)) + + writer.add_scalar( + 'Loss/train', + losses_m.avg, + epoch * + len(loader) + + batch_idx) + writer.add_scalar( + 'Accuracy/train', + prec1_m.avg, + epoch * + len(loader) + + batch_idx) + writer.add_scalar( + 'Learning_Rate', + optimizer.param_groups[0]['lr'], + epoch * len(loader) + batch_idx) + + if cfg.SAVE_IMAGES and output_dir: + torchvision.utils.save_image( + input, os.path.join( + output_dir, 'train-batch-%d.jpg' % + batch_idx), padding=0, normalize=True) + + if saver is not None and cfg.RECOVERY_INTERVAL and ( + last_batch or (batch_idx + 1) % cfg.RECOVERY_INTERVAL == 0): + saver.save_recovery( + model, + optimizer, + cfg, + epoch, + model_ema=model_ema, + use_amp=use_amp, + batch_idx=batch_idx) + + if lr_scheduler is not None: + lr_scheduler.step_update( + num_updates=num_updates, + metric=losses_m.avg) + + end = time.time() + # end for + + if hasattr(optimizer, 'sync_lookahead'): + optimizer.sync_lookahead() + + return OrderedDict([('loss', losses_m.avg)]) diff --git a/examples/nas/cream/lib/core/test.py b/examples/nas/cream/lib/core/test.py new file mode 100644 index 0000000000..7ab69b57c0 --- /dev/null +++ b/examples/nas/cream/lib/core/test.py @@ -0,0 +1,87 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License.
+# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +import time +import torch + +from collections import OrderedDict +from lib.utils.util import AverageMeter, accuracy, reduce_tensor + + +def validate(epoch, model, loader, loss_fn, cfg, log_suffix='', logger=None, writer=None, local_rank=0): + batch_time_m = AverageMeter() + losses_m = AverageMeter() + prec1_m = AverageMeter() + prec5_m = AverageMeter() + + model.eval() + + end = time.time() + last_idx = len(loader) - 1 + with torch.no_grad(): + for batch_idx, (input, target) in enumerate(loader): + last_batch = batch_idx == last_idx + + output = model(input) + if isinstance(output, (tuple, list)): + output = output[0] + + # augmentation reduction + reduce_factor = cfg.TTA + if reduce_factor > 1: + output = output.unfold( + 0, + reduce_factor, + reduce_factor).mean( + dim=2) + target = target[0:target.size(0):reduce_factor] + + loss = loss_fn(output, target) + prec1, prec5 = accuracy(output, target, topk=(1, 5)) + + if cfg.NUM_GPU > 1: + reduced_loss = reduce_tensor(loss.data, cfg.NUM_GPU) + prec1 = reduce_tensor(prec1, cfg.NUM_GPU) + prec5 = reduce_tensor(prec5, cfg.NUM_GPU) + else: + reduced_loss = loss.data + + torch.cuda.synchronize() + + losses_m.update(reduced_loss.item(), input.size(0)) + prec1_m.update(prec1.item(), output.size(0)) + prec5_m.update(prec5.item(), output.size(0)) + + batch_time_m.update(time.time() - end) + end = time.time() + if local_rank == 0 and (last_batch or batch_idx % cfg.LOG_INTERVAL == 0): + log_name = 'Test' + log_suffix + logger.info( + '{0}: [{1:>4d}/{2}] ' + 'Time: {batch_time.val:.3f} ({batch_time.avg:.3f}) ' + 'Loss: {loss.val:>7.4f} ({loss.avg:>6.4f}) ' + 'Prec@1: {top1.val:>7.4f} ({top1.avg:>7.4f}) ' + 'Prec@5: {top5.val:>7.4f} ({top5.avg:>7.4f})'.format( + log_name, batch_idx, last_idx, + batch_time=batch_time_m, loss=losses_m, + top1=prec1_m, top5=prec5_m)) + + writer.add_scalar( + 'Loss' + log_suffix + '/vaild', + 
losses_m.avg, + epoch * len(loader) + batch_idx) + writer.add_scalar( + 'Accuracy' + + log_suffix + + '/vaild', + prec1_m.avg, + epoch * + len(loader) + + batch_idx) + + metrics = OrderedDict( + [('loss', losses_m.avg), ('prec1', prec1_m.avg), ('prec5', prec5_m.avg)]) + + return metrics diff --git a/examples/nas/cream/lib/models/blocks/__init__.py b/examples/nas/cream/lib/models/blocks/__init__.py new file mode 100644 index 0000000000..83a19f2b91 --- /dev/null +++ b/examples/nas/cream/lib/models/blocks/__init__.py @@ -0,0 +1,2 @@ +from lib.models.blocks.residual_block import get_Bottleneck, get_BasicBlock +from lib.models.blocks.inverted_residual_block import InvertedResidual \ No newline at end of file diff --git a/examples/nas/cream/lib/models/blocks/inverted_residual_block.py b/examples/nas/cream/lib/models/blocks/inverted_residual_block.py new file mode 100644 index 0000000000..2f501b561b --- /dev/null +++ b/examples/nas/cream/lib/models/blocks/inverted_residual_block.py @@ -0,0 +1,113 @@ +# This file is downloaded from +# https://github.com/rwightman/pytorch-image-models + +import torch.nn as nn + +from timm.models.layers import create_conv2d +from timm.models.efficientnet_blocks import make_divisible, resolve_se_args, \ + SqueezeExcite, drop_path + + +class InvertedResidual(nn.Module): + """ Inverted residual block w/ optional SE and CondConv routing""" + + def __init__( + self, + in_chs, + out_chs, + dw_kernel_size=3, + stride=1, + dilation=1, + pad_type='', + act_layer=nn.ReLU, + noskip=False, + exp_ratio=1.0, + exp_kernel_size=1, + pw_kernel_size=1, + se_ratio=0., + se_kwargs=None, + norm_layer=nn.BatchNorm2d, + norm_kwargs=None, + conv_kwargs=None, + drop_path_rate=0.): + super(InvertedResidual, self).__init__() + norm_kwargs = norm_kwargs or {} + conv_kwargs = conv_kwargs or {} + mid_chs = make_divisible(in_chs * exp_ratio) + has_se = se_ratio is not None and se_ratio > 0.
+ self.has_residual = (in_chs == out_chs and stride == 1) and not noskip + self.drop_path_rate = drop_path_rate + + # Point-wise expansion + self.conv_pw = create_conv2d( + in_chs, + mid_chs, + exp_kernel_size, + padding=pad_type, + **conv_kwargs) + self.bn1 = norm_layer(mid_chs, **norm_kwargs) + self.act1 = act_layer(inplace=True) + + # Depth-wise convolution + self.conv_dw = create_conv2d( + mid_chs, mid_chs, dw_kernel_size, stride=stride, dilation=dilation, + padding=pad_type, depthwise=True, **conv_kwargs) + self.bn2 = norm_layer(mid_chs, **norm_kwargs) + self.act2 = act_layer(inplace=True) + + # Squeeze-and-excitation + if has_se: + se_kwargs = resolve_se_args(se_kwargs, in_chs, act_layer) + self.se = SqueezeExcite(mid_chs, se_ratio=se_ratio, **se_kwargs) + else: + self.se = None + + # Point-wise linear projection + self.conv_pwl = create_conv2d( + mid_chs, + out_chs, + pw_kernel_size, + padding=pad_type, + **conv_kwargs) + self.bn3 = norm_layer(out_chs, **norm_kwargs) + + def feature_info(self, location): + if location == 'expansion': # after SE, input to PWL + info = dict( + module='conv_pwl', + hook_type='forward_pre', + num_chs=self.conv_pwl.in_channels) + else: # location == 'bottleneck', block output + info = dict( + module='', + hook_type='', + num_chs=self.conv_pwl.out_channels) + return info + + def forward(self, x): + residual = x + + # Point-wise expansion + x = self.conv_pw(x) + x = self.bn1(x) + x = self.act1(x) + + # Depth-wise convolution + x = self.conv_dw(x) + x = self.bn2(x) + x = self.act2(x) + + # Squeeze-and-excitation + if self.se is not None: + x = self.se(x) + + # Point-wise linear projection + x = self.conv_pwl(x) + x = self.bn3(x) + + if self.has_residual: + if self.drop_path_rate > 0.: + x = drop_path(x, self.drop_path_rate, self.training) + x += residual + + return x diff --git a/examples/nas/cream/lib/models/blocks/residual_block.py b/examples/nas/cream/lib/models/blocks/residual_block.py new file mode 100644 index 
0000000000..75892eee79 --- /dev/null +++ b/examples/nas/cream/lib/models/blocks/residual_block.py @@ -0,0 +1,105 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +import torch +import torch.nn as nn +import torch.nn.functional as F + + +def conv3x3(in_planes, out_planes, stride=1): + "3x3 convolution with padding" + return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, + padding=1, bias=True) + + +class BasicBlock(nn.Module): + expansion = 1 + + def __init__(self, inplanes, planes, stride=1, downsample=None): + super(BasicBlock, self).__init__() + self.conv1 = conv3x3(inplanes, planes, stride) + self.bn1 = nn.BatchNorm2d(planes) + self.relu = nn.ReLU(inplace=True) + self.conv2 = conv3x3(planes, planes) + self.bn2 = nn.BatchNorm2d(planes) + self.downsample = downsample + self.stride = stride + + def forward(self, x): + residual = x + + out = self.conv1(x) + out = self.bn1(out) + out = self.relu(out) + + out = self.conv2(out) + out = self.bn2(out) + + if self.downsample is not None: + residual = self.downsample(x) + + out += residual + out = self.relu(out) + + return out + + +class Bottleneck(nn.Module): + + def __init__(self, inplanes, planes, stride=1, expansion=4): + super(Bottleneck, self).__init__() + planes = int(planes / expansion) + self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=True) + self.bn1 = nn.BatchNorm2d(planes) + self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, + padding=1, bias=True) + self.bn2 = nn.BatchNorm2d(planes) + self.conv3 = nn.Conv2d( + planes, + planes * expansion, + kernel_size=1, + bias=True) + self.bn3 = nn.BatchNorm2d(planes * expansion) + self.relu = nn.ReLU(inplace=True) + self.stride = stride + self.expansion = expansion + if inplanes != planes * self.expansion: + self.downsample = nn.Sequential( + nn.Conv2d(inplanes, planes * self.expansion, + 
kernel_size=1, stride=stride, bias=True), + nn.BatchNorm2d(planes * self.expansion), + ) + else: + self.downsample = None + + def forward(self, x): + residual = x + + out = self.conv1(x) + out = self.bn1(out) + out = self.relu(out) + + out = self.conv2(out) + out = self.bn2(out) + out = self.relu(out) + + out = self.conv3(out) + out = self.bn3(out) + + if self.downsample is not None: + residual = self.downsample(x) + + out += residual + out = self.relu(out) + + return out + + +def get_Bottleneck(in_c, out_c, stride): + return Bottleneck(in_c, out_c, stride=stride) + + +def get_BasicBlock(in_c, out_c, stride): + return BasicBlock(in_c, out_c, stride=stride) diff --git a/examples/nas/cream/lib/models/builders/build_childnet.py b/examples/nas/cream/lib/models/builders/build_childnet.py new file mode 100755 index 0000000000..8ddfb40024 --- /dev/null +++ b/examples/nas/cream/lib/models/builders/build_childnet.py @@ -0,0 +1,181 @@ +from lib.utils.util import * + +from timm.models.efficientnet_blocks import * + + +class ChildNetBuilder: + def __init__( + self, + channel_multiplier=1.0, + channel_divisor=8, + channel_min=None, + output_stride=32, + pad_type='', + act_layer=None, + se_kwargs=None, + norm_layer=nn.BatchNorm2d, + norm_kwargs=None, + drop_path_rate=0., + feature_location='', + verbose=False, + logger=None): + self.channel_multiplier = channel_multiplier + self.channel_divisor = channel_divisor + self.channel_min = channel_min + self.output_stride = output_stride + self.pad_type = pad_type + self.act_layer = act_layer + self.se_kwargs = se_kwargs + self.norm_layer = norm_layer + self.norm_kwargs = norm_kwargs + self.drop_path_rate = drop_path_rate + self.feature_location = feature_location + assert feature_location in ('pre_pwl', 'post_exp', '') + self.verbose = verbose + self.in_chs = None + self.features = OrderedDict() + self.logger = logger + + def _round_channels(self, chs): + return round_channels( + chs, + self.channel_multiplier, + self.channel_divisor, 
+ self.channel_min) + + def _make_block(self, ba, block_idx, block_count): + drop_path_rate = self.drop_path_rate * block_idx / block_count + bt = ba.pop('block_type') + ba['in_chs'] = self.in_chs + ba['out_chs'] = self._round_channels(ba['out_chs']) + if 'fake_in_chs' in ba and ba['fake_in_chs']: + ba['fake_in_chs'] = self._round_channels(ba['fake_in_chs']) + ba['norm_layer'] = self.norm_layer + ba['norm_kwargs'] = self.norm_kwargs + ba['pad_type'] = self.pad_type + # block act fn overrides the model default + ba['act_layer'] = ba['act_layer'] if ba['act_layer'] is not None else self.act_layer + assert ba['act_layer'] is not None + if bt == 'ir': + ba['drop_path_rate'] = drop_path_rate + ba['se_kwargs'] = self.se_kwargs + if self.verbose: + self.logger.info( + ' InvertedResidual {}, Args: {}'.format( + block_idx, str(ba))) + block = InvertedResidual(**ba) + elif bt == 'ds' or bt == 'dsa': + ba['drop_path_rate'] = drop_path_rate + ba['se_kwargs'] = self.se_kwargs + if self.verbose: + self.logger.info( + ' DepthwiseSeparable {}, Args: {}'.format( + block_idx, str(ba))) + block = DepthwiseSeparableConv(**ba) + elif bt == 'cn': + if self.verbose: + self.logger.info( + ' ConvBnAct {}, Args: {}'.format( + block_idx, str(ba))) + block = ConvBnAct(**ba) + else: + assert False, 'Unknown block type (%s) while building model.' % bt + self.in_chs = ba['out_chs'] # update in_chs for arg of next block + + return block + + def __call__(self, in_chs, model_block_args): + """ Build the blocks + Args: + in_chs: Number of input-channels passed to first block + model_block_args: A list of lists, outer list defines stages, inner + list contains strings defining block configuration(s) + Return: + List of block stacks (each stack wrapped in nn.Sequential) + """ + if self.verbose: + self.logger.info( + 'Building model trunk with %d stages...'
% + len(model_block_args)) + self.in_chs = in_chs + total_block_count = sum([len(x) for x in model_block_args]) + total_block_idx = 0 + current_stride = 2 + current_dilation = 1 + feature_idx = 0 + stages = [] + # outer list of block_args defines the stacks ('stages' by some + # conventions) + for stage_idx, stage_block_args in enumerate(model_block_args): + last_stack = stage_idx == (len(model_block_args) - 1) + if self.verbose: + self.logger.info('Stack: {}'.format(stage_idx)) + assert isinstance(stage_block_args, list) + + blocks = [] + # each stack (stage) contains a list of block arguments + for block_idx, block_args in enumerate(stage_block_args): + last_block = block_idx == (len(stage_block_args) - 1) + extract_features = '' # No features extracted + if self.verbose: + self.logger.info(' Block: {}'.format(block_idx)) + + # Sort out stride, dilation, and feature extraction details + assert block_args['stride'] in (1, 2) + if block_idx >= 1: + # only the first block in any stack can have a stride > 1 + block_args['stride'] = 1 + + do_extract = False + if self.feature_location == 'pre_pwl': + if last_block: + next_stage_idx = stage_idx + 1 + if next_stage_idx >= len(model_block_args): + do_extract = True + else: + do_extract = model_block_args[next_stage_idx][0]['stride'] > 1 + elif self.feature_location == 'post_exp': + if block_args['stride'] > 1 or (last_stack and last_block): + do_extract = True + if do_extract: + extract_features = self.feature_location + + next_dilation = current_dilation + if block_args['stride'] > 1: + next_output_stride = current_stride * block_args['stride'] + if next_output_stride > self.output_stride: + next_dilation = current_dilation * block_args['stride'] + block_args['stride'] = 1 + if self.verbose: + self.logger.info( + ' Converting stride to dilation to maintain output_stride=={}'.format( + self.output_stride)) + else: + current_stride = next_output_stride + block_args['dilation'] = current_dilation + if next_dilation != 
current_dilation: + current_dilation = next_dilation + + # create the block + block = self._make_block( + block_args, total_block_idx, total_block_count) + blocks.append(block) + + # stash feature module name and channel info for model feature + # extraction + if extract_features: + feature_module = block.feature_module(extract_features) + if feature_module: + feature_module = 'blocks.{}.{}.'.format( + stage_idx, block_idx) + feature_module + feature_channels = block.feature_channels(extract_features) + self.features[feature_idx] = dict( + name=feature_module, + num_chs=feature_channels + ) + feature_idx += 1 + + # incr global block idx (across all stacks) + total_block_idx += 1 + stages.append(nn.Sequential(*blocks)) + return stages diff --git a/examples/nas/cream/lib/models/builders/build_supernet.py b/examples/nas/cream/lib/models/builders/build_supernet.py new file mode 100644 index 0000000000..37d9c575c8 --- /dev/null +++ b/examples/nas/cream/lib/models/builders/build_supernet.py @@ -0,0 +1,214 @@ +from copy import deepcopy + +from lib.utils.builder_util import modify_block_args +from lib.models.blocks import get_Bottleneck, InvertedResidual + +from timm.models.efficientnet_blocks import * + +from nni.nas.pytorch import mutables + +class SuperNetBuilder: + """ Build Trunk Blocks + """ + + def __init__( + self, + choices, + channel_multiplier=1.0, + channel_divisor=8, + channel_min=None, + output_stride=32, + pad_type='', + act_layer=None, + se_kwargs=None, + norm_layer=nn.BatchNorm2d, + norm_kwargs=None, + drop_path_rate=0., + feature_location='', + verbose=False, + resunit=False, + dil_conv=False, + logger=None): + + # dict + # choices = {'kernel_size': [3, 5, 7], 'exp_ratio': [4, 6]} + self.choices = [[x, y] for x in choices['kernel_size'] + for y in choices['exp_ratio']] + self.choices_num = len(self.choices) - 1 + self.channel_multiplier = channel_multiplier + self.channel_divisor = channel_divisor + self.channel_min = channel_min + self.output_stride = 
output_stride + self.pad_type = pad_type + self.act_layer = act_layer + self.se_kwargs = se_kwargs + self.norm_layer = norm_layer + self.norm_kwargs = norm_kwargs + self.drop_path_rate = drop_path_rate + self.feature_location = feature_location + assert feature_location in ('pre_pwl', 'post_exp', '') + self.verbose = verbose + self.resunit = resunit + self.dil_conv = dil_conv + self.logger = logger + + # state updated during build, consumed by model + self.in_chs = None + + def _round_channels(self, chs): + return round_channels( + chs, + self.channel_multiplier, + self.channel_divisor, + self.channel_min) + + def _make_block( + self, + ba, + choice_idx, + block_idx, + block_count, + resunit=False, + dil_conv=False): + drop_path_rate = self.drop_path_rate * block_idx / block_count + bt = ba.pop('block_type') + ba['in_chs'] = self.in_chs + ba['out_chs'] = self._round_channels(ba['out_chs']) + if 'fake_in_chs' in ba and ba['fake_in_chs']: + # FIXME this is a hack to work around mismatch in origin impl input + # filters + ba['fake_in_chs'] = self._round_channels(ba['fake_in_chs']) + ba['norm_layer'] = self.norm_layer + ba['norm_kwargs'] = self.norm_kwargs + ba['pad_type'] = self.pad_type + # block act fn overrides the model default + ba['act_layer'] = ba['act_layer'] if ba['act_layer'] is not None else self.act_layer + assert ba['act_layer'] is not None + if bt == 'ir': + ba['drop_path_rate'] = drop_path_rate + ba['se_kwargs'] = self.se_kwargs + if self.verbose: + self.logger.info( + ' InvertedResidual {}, Args: {}'.format( + block_idx, str(ba))) + block = InvertedResidual(**ba) + elif bt == 'ds' or bt == 'dsa': + ba['drop_path_rate'] = drop_path_rate + ba['se_kwargs'] = self.se_kwargs + if self.verbose: + self.logger.info( + ' DepthwiseSeparable {}, Args: {}'.format( + block_idx, str(ba))) + block = DepthwiseSeparableConv(**ba) + elif bt == 'cn': + if self.verbose: + self.logger.info( + ' ConvBnAct {}, Args: {}'.format( + block_idx, str(ba))) + block = 
ConvBnAct(**ba) + else: + assert False, 'Unknown block type (%s) while building model.' % bt + if choice_idx == self.choice_num - 1: + self.in_chs = ba['out_chs'] # update in_chs for arg of next block + + return block + + def __call__(self, in_chs, model_block_args): + """ Build the blocks + Args: + in_chs: Number of input-channels passed to first block + model_block_args: A list of lists, outer list defines stages, inner + list contains strings defining block configuration(s) + Return: + List of block stacks (each stack wrapped in nn.Sequential) + """ + if self.verbose: + self.logger.info('Building model trunk with %d stages...' % len(model_block_args)) + self.in_chs = in_chs + total_block_count = sum([len(x) for x in model_block_args]) + total_block_idx = 0 + current_stride = 2 + current_dilation = 1 + feature_idx = 0 + stages = [] + # outer list of block_args defines the stacks ('stages' by some conventions) + for stage_idx, stage_block_args in enumerate(model_block_args): + last_stack = stage_idx == (len(model_block_args) - 1) + if self.verbose: + self.logger.info('Stack: {}'.format(stage_idx)) + assert isinstance(stage_block_args, list) + + # blocks = [] + # each stack (stage) contains a list of block arguments + for block_idx, block_args in enumerate(stage_block_args): + last_block = block_idx == (len(stage_block_args) - 1) + if self.verbose: + self.logger.info(' Block: {}'.format(block_idx)) + + # Sort out stride, dilation, and feature extraction details + assert block_args['stride'] in (1, 2) + if block_idx >= 1: + # only the first block in any stack can have a stride > 1 + block_args['stride'] = 1 + + next_dilation = current_dilation + if block_args['stride'] > 1: + next_output_stride = current_stride * block_args['stride'] + if next_output_stride > self.output_stride: + next_dilation = current_dilation * block_args['stride'] + block_args['stride'] = 1 + else: + current_stride = next_output_stride + block_args['dilation'] = current_dilation + if next_dilation
!= current_dilation: + current_dilation = next_dilation + + + if stage_idx==0 or stage_idx==6: + self.choice_num = 1 + else: + self.choice_num = len(self.choices) + + if self.dil_conv: + self.choice_num += 2 + + choice_blocks = [] + block_args_copy = deepcopy(block_args) + if self.choice_num == 1: + # create the block + block = self._make_block(block_args, 0, total_block_idx, total_block_count) + choice_blocks.append(block) + else: + for choice_idx, choice in enumerate(self.choices): + # create the block + block_args = deepcopy(block_args_copy) + block_args = modify_block_args(block_args, choice[0], choice[1]) + block = self._make_block(block_args, choice_idx, total_block_idx, total_block_count) + choice_blocks.append(block) + if self.dil_conv: + block_args = deepcopy(block_args_copy) + block_args = modify_block_args(block_args, 3, 0) + block = self._make_block(block_args, self.choice_num - 2, total_block_idx, total_block_count, + resunit=self.resunit, dil_conv=self.dil_conv) + choice_blocks.append(block) + + block_args = deepcopy(block_args_copy) + block_args = modify_block_args(block_args, 5, 0) + block = self._make_block(block_args, self.choice_num - 1, total_block_idx, total_block_count, + resunit=self.resunit, dil_conv=self.dil_conv) + choice_blocks.append(block) + + if self.resunit: + block = get_Bottleneck(block.conv_pw.in_channels, + block.conv_pwl.out_channels, + block.conv_dw.stride[0]) + choice_blocks.append(block) + + choice_block = mutables.LayerChoice(choice_blocks) + stages.append(choice_block) + # create the block + # block = self._make_block(block_args, total_block_idx, total_block_count) + total_block_idx += 1 # incr global block idx (across all stacks) + + # stages.append(blocks) + return stages diff --git a/examples/nas/cream/lib/models/structures/childnet.py b/examples/nas/cream/lib/models/structures/childnet.py new file mode 100755 index 0000000000..668b92e157 --- /dev/null +++ b/examples/nas/cream/lib/models/structures/childnet.py @@ -0,0 
+1,145 @@ +from lib.utils.builder_util import * +from lib.models.builders.build_childnet import * + +from timm.models.layers import SelectAdaptivePool2d +from timm.models.layers.activations import hard_sigmoid + + +class ChildNet(nn.Module): + + def __init__( + self, + block_args, + num_classes=1000, + in_chans=3, + stem_size=16, + num_features=1280, + head_bias=True, + channel_multiplier=1.0, + pad_type='', + act_layer=nn.ReLU, + drop_rate=0., + drop_path_rate=0., + se_kwargs=None, + norm_layer=nn.BatchNorm2d, + norm_kwargs=None, + global_pool='avg', + logger=None, + verbose=False): + super(ChildNet, self).__init__() + + self.num_classes = num_classes + self.num_features = num_features + self.drop_rate = drop_rate + self._in_chs = in_chans + self.logger = logger + + # Stem + stem_size = round_channels(stem_size, channel_multiplier) + self.conv_stem = create_conv2d( + self._in_chs, stem_size, 3, stride=2, padding=pad_type) + self.bn1 = norm_layer(stem_size, **norm_kwargs) + self.act1 = act_layer(inplace=True) + self._in_chs = stem_size + + # Middle stages (IR/ER/DS Blocks) + builder = ChildNetBuilder( + channel_multiplier, 8, None, 32, pad_type, act_layer, se_kwargs, + norm_layer, norm_kwargs, drop_path_rate, verbose=verbose) + self.blocks = nn.Sequential(*builder(self._in_chs, block_args)) + # self.blocks = builder(self._in_chs, block_args) + self._in_chs = builder.in_chs + + # Head + Pooling + self.global_pool = SelectAdaptivePool2d(pool_type=global_pool) + self.conv_head = create_conv2d( + self._in_chs, + self.num_features, + 1, + padding=pad_type, + bias=head_bias) + self.act2 = act_layer(inplace=True) + + # Classifier + self.classifier = nn.Linear( + self.num_features * + self.global_pool.feat_mult(), + self.num_classes) + + efficientnet_init_weights(self) + + def get_classifier(self): + return self.classifier + + def reset_classifier(self, num_classes, global_pool='avg'): + self.global_pool = SelectAdaptivePool2d(pool_type=global_pool) + self.num_classes = 
num_classes + self.classifier = nn.Linear( + self.num_features * self.global_pool.feat_mult(), + num_classes) if self.num_classes else None + + def forward_features(self, x): + # architecture = [[0], [], [], [], [], [0]] + x = self.conv_stem(x) + x = self.bn1(x) + x = self.act1(x) + x = self.blocks(x) + x = self.global_pool(x) + x = self.conv_head(x) + x = self.act2(x) + return x + + def forward(self, x): + x = self.forward_features(x) + x = x.flatten(1) + if self.drop_rate > 0.: + x = F.dropout(x, p=self.drop_rate, training=self.training) + x = self.classifier(x) + return x + + +def gen_childnet(arch_list, arch_def, **kwargs): + # arch_list = [[0], [], [], [], [], [0]] + choices = {'kernel_size': [3, 5, 7], 'exp_ratio': [4, 6]} + choices_list = [[x, y] for x in choices['kernel_size'] + for y in choices['exp_ratio']] + + num_features = 1280 + + # act_layer = HardSwish + act_layer = Swish + + new_arch = [] + # change to child arch_def + for i, (layer_choice, layer_arch) in enumerate(zip(arch_list, arch_def)): + if len(layer_arch) == 1: + new_arch.append(layer_arch) + continue + else: + new_layer = [] + for j, (block_choice, block_arch) in enumerate( + zip(layer_choice, layer_arch)): + kernel_size, exp_ratio = choices_list[block_choice] + elements = block_arch.split('_') + block_arch = block_arch.replace( + elements[2], 'k{}'.format(str(kernel_size))) + block_arch = block_arch.replace( + elements[4], 'e{}'.format(str(exp_ratio))) + new_layer.append(block_arch) + new_arch.append(new_layer) + + model_kwargs = dict( + block_args=decode_arch_def(new_arch), + num_features=num_features, + stem_size=16, + norm_kwargs=resolve_bn_args(kwargs), + act_layer=act_layer, + se_kwargs=dict( + act_layer=nn.ReLU, + gate_fn=hard_sigmoid, + reduce_mid=True, + divisor=8), + **kwargs, + ) + model = ChildNet(**model_kwargs) + return model diff --git a/examples/nas/cream/lib/models/structures/supernet.py b/examples/nas/cream/lib/models/structures/supernet.py new file mode 100644 index 
0000000000..ea09377eb5 --- /dev/null +++ b/examples/nas/cream/lib/models/structures/supernet.py @@ -0,0 +1,202 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +from lib.utils.builder_util import * +from lib.utils.search_structure_supernet import * +from lib.models.builders.build_supernet import * +from lib.utils.op_by_layer_dict import flops_op_dict + +from timm.models.layers import SelectAdaptivePool2d +from timm.models.layers.activations import hard_sigmoid + + +class SuperNet(nn.Module): + + def __init__( + self, + block_args, + choices, + num_classes=1000, + in_chans=3, + stem_size=16, + num_features=1280, + head_bias=True, + channel_multiplier=1.0, + pad_type='', + act_layer=nn.ReLU, + drop_rate=0., + drop_path_rate=0., + slice=4, + se_kwargs=None, + norm_layer=nn.BatchNorm2d, + logger=None, + norm_kwargs=None, + global_pool='avg', + resunit=False, + dil_conv=False, + verbose=False): + super(SuperNet, self).__init__() + + self.num_classes = num_classes + self.num_features = num_features + self.drop_rate = drop_rate + self._in_chs = in_chans + self.logger = logger + + # Stem + stem_size = round_channels(stem_size, channel_multiplier) + self.conv_stem = create_conv2d( + self._in_chs, stem_size, 3, stride=2, padding=pad_type) + self.bn1 = norm_layer(stem_size, **norm_kwargs) + self.act1 = act_layer(inplace=True) + self._in_chs = stem_size + + # Middle stages (IR/ER/DS Blocks) + builder = SuperNetBuilder( + choices, + channel_multiplier, + 8, + None, + 32, + pad_type, + act_layer, + se_kwargs, + norm_layer, + norm_kwargs, + drop_path_rate, + verbose=verbose, + resunit=resunit, + dil_conv=dil_conv, + logger=self.logger) + blocks = builder(self._in_chs, block_args) + self.blocks = nn.Sequential(*blocks) + self._in_chs = builder.in_chs + + # Head + Pooling + self.global_pool = SelectAdaptivePool2d(pool_type=global_pool) + 
self.conv_head = create_conv2d( + self._in_chs, + self.num_features, + 1, + padding=pad_type, + bias=head_bias) + self.act2 = act_layer(inplace=True) + + # Classifier + self.classifier = nn.Linear( + self.num_features * + self.global_pool.feat_mult(), + self.num_classes) + + self.meta_layer = nn.Linear(self.num_classes * slice, 1) + efficientnet_init_weights(self) + + def get_classifier(self): + return self.classifier + + def reset_classifier(self, num_classes, global_pool='avg'): + self.global_pool = SelectAdaptivePool2d(pool_type=global_pool) + self.num_classes = num_classes + self.classifier = nn.Linear( + self.num_features * self.global_pool.feat_mult(), + num_classes) if self.num_classes else None + + def forward_features(self, x): + x = self.conv_stem(x) + x = self.bn1(x) + x = self.act1(x) + x = self.blocks(x) + x = self.global_pool(x) + x = self.conv_head(x) + x = self.act2(x) + return x + + def forward(self, x): + x = self.forward_features(x) + x = x.flatten(1) + if self.drop_rate > 0.: + x = F.dropout(x, p=self.drop_rate, training=self.training) + return self.classifier(x) + + def forward_meta(self, features): + return self.meta_layer(features.view(1, -1)) + + def rand_parameters(self, architecture, meta=False): + for name, param in self.named_parameters(recurse=True): + if 'meta' in name and meta: + yield param + elif 'blocks' not in name and 'meta' not in name and (not meta): + yield param + + if not meta: + for layer, layer_arch in zip(self.blocks, architecture): + for blocks, arch in zip(layer, layer_arch): + if arch == -1: + continue + for name, param in blocks[arch].named_parameters( + recurse=True): + yield param + + +class Classifier(nn.Module): + def __init__(self, num_classes=1000): + super(Classifier, self).__init__() + self.classifier = nn.Linear(num_classes, num_classes) + + def forward(self, x): + return self.classifier(x) + + +def gen_supernet(flops_minimum=0, flops_maximum=600, **kwargs): + choices = {'kernel_size': [3, 5, 7], 'exp_ratio': 
[4, 6]} + + num_features = 1280 + + # act_layer = HardSwish + act_layer = Swish + arch_def = [ + # stage 0, 112x112 in + ['ds_r1_k3_s1_e1_c16_se0.25'], + # stage 1, 112x112 in + ['ir_r1_k3_s2_e4_c24_se0.25', 'ir_r1_k3_s1_e4_c24_se0.25', 'ir_r1_k3_s1_e4_c24_se0.25', + 'ir_r1_k3_s1_e4_c24_se0.25'], + # stage 2, 56x56 in + ['ir_r1_k5_s2_e4_c40_se0.25', 'ir_r1_k5_s1_e4_c40_se0.25', 'ir_r1_k5_s2_e4_c40_se0.25', + 'ir_r1_k5_s2_e4_c40_se0.25'], + # stage 3, 28x28 in + ['ir_r1_k3_s2_e6_c80_se0.25', 'ir_r1_k3_s1_e4_c80_se0.25', 'ir_r1_k3_s1_e4_c80_se0.25', + 'ir_r2_k3_s1_e4_c80_se0.25'], + # stage 4, 14x14in + ['ir_r1_k3_s1_e6_c96_se0.25', 'ir_r1_k3_s1_e6_c96_se0.25', 'ir_r1_k3_s1_e6_c96_se0.25', + 'ir_r1_k3_s1_e6_c96_se0.25'], + # stage 5, 14x14in + ['ir_r1_k5_s2_e6_c192_se0.25', 'ir_r1_k5_s1_e6_c192_se0.25', 'ir_r1_k5_s2_e6_c192_se0.25', + 'ir_r1_k5_s2_e6_c192_se0.25'], + # stage 6, 7x7 in + ['cn_r1_k1_s1_c320_se0.25'], + ] + + sta_num, arch_def, resolution = search_for_layer( + flops_op_dict, arch_def, flops_minimum, flops_maximum) + + if sta_num is None or arch_def is None or resolution is None: + raise ValueError('Invalid FLOPs Settings') + + model_kwargs = dict( + block_args=decode_arch_def(arch_def), + choices=choices, + num_features=num_features, + stem_size=16, + norm_kwargs=resolve_bn_args(kwargs), + act_layer=act_layer, + se_kwargs=dict( + act_layer=nn.ReLU, + gate_fn=hard_sigmoid, + reduce_mid=True, + divisor=8), + **kwargs, + ) + model = SuperNet(**model_kwargs) + return model, sta_num, resolution diff --git a/examples/nas/cream/lib/utils/builder_util.py b/examples/nas/cream/lib/utils/builder_util.py new file mode 100644 index 0000000000..138e08299c --- /dev/null +++ b/examples/nas/cream/lib/utils/builder_util.py @@ -0,0 +1,273 @@ +import math +import torch.nn as nn + +from timm.utils import * +from timm.models.layers.activations import Swish +from timm.models.layers import CondConv2d, get_condconv_initializer + + +def parse_ksize(ss): + if ss.isdigit(): + 
return int(ss) + else: + return [int(k) for k in ss.split('.')] + + +def decode_arch_def( + arch_def, + depth_multiplier=1.0, + depth_trunc='ceil', + experts_multiplier=1): + arch_args = [] + for stack_idx, block_strings in enumerate(arch_def): + assert isinstance(block_strings, list) + stack_args = [] + repeats = [] + for block_str in block_strings: + assert isinstance(block_str, str) + ba, rep = decode_block_str(block_str) + if ba.get('num_experts', 0) > 0 and experts_multiplier > 1: + ba['num_experts'] *= experts_multiplier + stack_args.append(ba) + repeats.append(rep) + arch_args.append( + scale_stage_depth( + stack_args, + repeats, + depth_multiplier, + depth_trunc)) + return arch_args + + +def modify_block_args(block_args, kernel_size, exp_ratio): + block_type = block_args['block_type'] + if block_type == 'cn': + block_args['kernel_size'] = kernel_size + elif block_type == 'er': + block_args['exp_kernel_size'] = kernel_size + else: + block_args['dw_kernel_size'] = kernel_size + + if block_type == 'ir' or block_type == 'er': + block_args['exp_ratio'] = exp_ratio + return block_args + + +def decode_block_str(block_str): + """ Decode block definition string + Gets a list of block arg (dicts) through a string notation of arguments. + E.g. ir_r2_k3_s2_e1_i32_o16_se0.25_noskip + All args can exist in any order with the exception of the leading string which + is assumed to indicate the block type. + leading string - block type ( + ir = InvertedResidual, ds = DepthwiseSep, dsa = DepthwiseSep with pw act, cn = ConvBnAct) + r - number of repeat blocks, + k - kernel size, + s - strides (1-9), + e - expansion ratio, + c - output channels, + se - squeeze/excitation ratio + n - activation fn ('re', 'r6', 'hs', or 'sw') + Args: + block_str: a string representation of block arguments.
+ Returns: + A list of block args (dicts) + Raises: + ValueError: if the string def not properly specified (TODO) + """ + assert isinstance(block_str, str) + ops = block_str.split('_') + block_type = ops[0] # take the block type off the front + ops = ops[1:] + options = {} + noskip = False + for op in ops: + # string options being checked on individual basis, combine if they + # grow + if op == 'noskip': + noskip = True + elif op.startswith('n'): + # activation fn + key = op[0] + v = op[1:] + if v == 're': + value = nn.ReLU + elif v == 'r6': + value = nn.ReLU6 + elif v == 'sw': + value = Swish + else: + continue + options[key] = value + else: + # all numeric options + splits = re.split(r'(\d.*)', op) + if len(splits) >= 2: + key, value = splits[:2] + options[key] = value + + # if act_layer is None, the model default (passed to model init) will be + # used + act_layer = options['n'] if 'n' in options else None + exp_kernel_size = parse_ksize(options['a']) if 'a' in options else 1 + pw_kernel_size = parse_ksize(options['p']) if 'p' in options else 1 + # FIXME hack to deal with in_chs issue in TPU def + fake_in_chs = int(options['fc']) if 'fc' in options else 0 + + num_repeat = int(options['r']) + # each type of block has different valid arguments, fill accordingly + if block_type == 'ir': + block_args = dict( + block_type=block_type, + dw_kernel_size=parse_ksize(options['k']), + exp_kernel_size=exp_kernel_size, + pw_kernel_size=pw_kernel_size, + out_chs=int(options['c']), + exp_ratio=float(options['e']), + se_ratio=float(options['se']) if 'se' in options else None, + stride=int(options['s']), + act_layer=act_layer, + noskip=noskip, + ) + if 'cc' in options: + block_args['num_experts'] = int(options['cc']) + elif block_type == 'ds' or block_type == 'dsa': + block_args = dict( + block_type=block_type, + dw_kernel_size=parse_ksize(options['k']), + pw_kernel_size=pw_kernel_size, + out_chs=int(options['c']), + se_ratio=float(options['se']) if 'se' in options else None, + 
stride=int(options['s']), + act_layer=act_layer, + pw_act=block_type == 'dsa', + noskip=block_type == 'dsa' or noskip, + ) + elif block_type == 'cn': + block_args = dict( + block_type=block_type, + kernel_size=int(options['k']), + out_chs=int(options['c']), + stride=int(options['s']), + act_layer=act_layer, + ) + else: + assert False, 'Unknown block type (%s)' % block_type + + return block_args, num_repeat + + +def scale_stage_depth( + stack_args, + repeats, + depth_multiplier=1.0, + depth_trunc='ceil'): + """ Per-stage depth scaling + Scales the block repeats in each stage. This depth scaling impl maintains + compatibility with the EfficientNet scaling method, while allowing sensible + scaling for other models that may have multiple block arg definitions in each stage. + """ + + # We scale the total repeat count for each stage, there may be multiple + # block arg defs per stage so we need to sum. + num_repeat = sum(repeats) + if depth_trunc == 'round': + # Truncating to int by rounding allows stages with few repeats to remain + # proportionally smaller for longer. This is a good choice when stage definitions + # include single repeat stages that we'd prefer to keep that way as + # long as possible + num_repeat_scaled = max(1, round(num_repeat * depth_multiplier)) + else: + # The default for EfficientNet truncates repeats to int via 'ceil'. + # Any multiplier > 1.0 will result in an increased depth for every + # stage. + num_repeat_scaled = int(math.ceil(num_repeat * depth_multiplier)) + + # Proportionally distribute repeat count scaling to each block definition in the stage. + # Allocation is done in reverse as it results in the first block being less likely to be scaled. + # The first block makes less sense to repeat in most of the arch + # definitions. 
+ repeats_scaled = [] + for r in repeats[::-1]: + rs = max(1, round((r / num_repeat * num_repeat_scaled))) + repeats_scaled.append(rs) + num_repeat -= r + num_repeat_scaled -= rs + repeats_scaled = repeats_scaled[::-1] + + # Apply the calculated scaling to each block arg in the stage + sa_scaled = [] + for ba, rep in zip(stack_args, repeats_scaled): + sa_scaled.extend([deepcopy(ba) for _ in range(rep)]) + return sa_scaled + + +def init_weight_goog(m, n='', fix_group_fanout=True, last_bn=None): + """ Weight initialization as per Tensorflow official implementations. + Args: + m (nn.Module): module to init + n (str): module name + fix_group_fanout (bool): enable correct (matching Tensorflow TPU impl) fanout calculation w/ group convs + Handles layers in EfficientNet, EfficientNet-CondConv, MixNet, MnasNet, MobileNetV3, etc: + * https://github.com/tensorflow/tpu/blob/master/models/official/mnasnet/mnasnet_model.py + * https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/efficientnet_model.py + """ + if isinstance(m, CondConv2d): + fan_out = m.kernel_size[0] * m.kernel_size[1] * m.out_channels + if fix_group_fanout: + fan_out //= m.groups + init_weight_fn = get_condconv_initializer(lambda w: w.data.normal_( + 0, math.sqrt(2.0 / fan_out)), m.num_experts, m.weight_shape) + init_weight_fn(m.weight) + if m.bias is not None: + m.bias.data.zero_() + elif isinstance(m, nn.Conv2d): + fan_out = m.kernel_size[0] * m.kernel_size[1] * m.out_channels + if fix_group_fanout: + fan_out //= m.groups + m.weight.data.normal_(0, math.sqrt(2.0 / fan_out)) + if m.bias is not None: + m.bias.data.zero_() + elif isinstance(m, nn.BatchNorm2d): + if n in last_bn: + m.weight.data.zero_() + m.bias.data.zero_() + else: + m.weight.data.fill_(1.0) + m.bias.data.zero_() + m.weight.data.fill_(1.0) + m.bias.data.zero_() + elif isinstance(m, nn.Linear): + fan_out = m.weight.size(0) # fan-out + fan_in = 0 + if 'routing_fn' in n: + fan_in = m.weight.size(1) + init_range = 1.0 / 
math.sqrt(fan_in + fan_out) + m.weight.data.uniform_(-init_range, init_range) + m.bias.data.zero_() + + +def efficientnet_init_weights( + model: nn.Module, + init_fn=None, + zero_gamma=False): + last_bn = [] + if zero_gamma: + prev_n = '' + for n, m in model.named_modules(): + if isinstance(m, nn.BatchNorm2d): + if ''.join( + prev_n.split('.')[ + :- + 1]) != ''.join( + n.split('.')[ + :- + 1]): + last_bn.append(prev_n) + prev_n = n + last_bn.append(prev_n) + + init_fn = init_fn or init_weight_goog + for n, m in model.named_modules(): + init_fn(m, n, last_bn=last_bn) + init_fn(m, n, last_bn=last_bn) diff --git a/examples/nas/cream/lib/utils/flops_table.py b/examples/nas/cream/lib/utils/flops_table.py new file mode 100644 index 0000000000..254241a075 --- /dev/null +++ b/examples/nas/cream/lib/utils/flops_table.py @@ -0,0 +1,79 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +import torch + +from ptflops import get_model_complexity_info + + +class FlopsEst(object): + def __init__(self, model, input_shape=(2, 3, 224, 224), device='cpu'): + self.block_num = len(model.blocks) + self.choice_num = len(model.blocks[0]) + self.flops_dict = {} + self.params_dict = {} + + if device == 'cpu': + model = model.cpu() + else: + model = model.cuda() + + self.params_fixed = 0 + self.flops_fixed = 0 + + input = torch.randn(input_shape) + + flops, params = get_model_complexity_info( + model.conv_stem, (3, 224, 224), as_strings=False, print_per_layer_stat=False) + self.params_fixed += params / 1e6 + self.flops_fixed += flops / 1e6 + + input = model.conv_stem(input) + + for block_id, block in enumerate(model.blocks): + self.flops_dict[block_id] = {} + self.params_dict[block_id] = {} + for module_id, module in enumerate(block): + flops, params = get_model_complexity_info(module, tuple( + input.shape[1:]), as_strings=False, print_per_layer_stat=False) + # 
Flops(M) + self.flops_dict[block_id][module_id] = flops / 1e6 + # Params(M) + self.params_dict[block_id][module_id] = params / 1e6 + + input = module(input) + + # global_pool + flops, params = get_model_complexity_info(model.global_pool, tuple( + input.shape[1:]), as_strings=False, print_per_layer_stat=False) + self.params_fixed += params / 1e6 + self.flops_fixed += flops / 1e6 + + input = model.global_pool(input) + + # conv_head + flops, params = get_model_complexity_info(model.conv_head, tuple( + input.shape[1:]), as_strings=False, print_per_layer_stat=False) + self.params_fixed += params / 1e6 + self.flops_fixed += flops / 1e6 + + # return params (M) + def get_params(self, arch): + params = 0 + for block_id, block in enumerate(arch): + if block == -1: + continue + params += self.params_dict[block_id][block] + return params + self.params_fixed + + # return flops (M) + def get_flops(self, arch): + flops = 0 + for block_id, block in enumerate(arch): + if block == 'LayerChoice1' or block == 'LayerChoice23': + continue + for idx, choice in enumerate(arch[block]): + flops += self.flops_dict[block_id][idx] * (1 if choice else 0) + return flops + self.flops_fixed diff --git a/examples/nas/cream/lib/utils/op_by_layer_dict.py b/examples/nas/cream/lib/utils/op_by_layer_dict.py new file mode 100644 index 0000000000..47ca509ce4 --- /dev/null +++ b/examples/nas/cream/lib/utils/op_by_layer_dict.py @@ -0,0 +1,42 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +# This dictionary holds the precomputed FLOPs of each operation in each stage so that layers can be searched quickly.
+# flops_op_dict[which_stage][which_operation] = +# (flops_of_operation_with_stride1, flops_of_operation_with_stride2) + +flops_op_dict = {} +for i in range(5): + flops_op_dict[i] = {} +flops_op_dict[0][0] = (21.828704, 18.820752) +flops_op_dict[0][1] = (32.669328, 28.16048) +flops_op_dict[0][2] = (25.039968, 23.637648) +flops_op_dict[0][3] = (37.486224, 35.385824) +flops_op_dict[0][4] = (29.856864, 30.862992) +flops_op_dict[0][5] = (44.711568, 46.22384) +flops_op_dict[1][0] = (11.808656, 11.86712) +flops_op_dict[1][1] = (17.68624, 17.780848) +flops_op_dict[1][2] = (13.01288, 13.87416) +flops_op_dict[1][3] = (19.492576, 20.791408) +flops_op_dict[1][4] = (14.819216, 16.88472) +flops_op_dict[1][5] = (22.20208, 25.307248) +flops_op_dict[2][0] = (8.198, 10.99632) +flops_op_dict[2][1] = (12.292848, 16.5172) +flops_op_dict[2][2] = (8.69976, 11.99984) +flops_op_dict[2][3] = (13.045488, 18.02248) +flops_op_dict[2][4] = (9.4524, 13.50512) +flops_op_dict[2][5] = (14.174448, 20.2804) +flops_op_dict[3][0] = (12.006112, 15.61632) +flops_op_dict[3][1] = (18.028752, 23.46096) +flops_op_dict[3][2] = (13.009632, 16.820544) +flops_op_dict[3][3] = (19.534032, 25.267296) +flops_op_dict[3][4] = (14.514912, 18.62688) +flops_op_dict[3][5] = (21.791952, 27.9768) +flops_op_dict[4][0] = (11.307456, 15.292416) +flops_op_dict[4][1] = (17.007072, 23.1504) +flops_op_dict[4][2] = (11.608512, 15.894528) +flops_op_dict[4][3] = (17.458656, 24.053568) +flops_op_dict[4][4] = (12.060096, 16.797696) +flops_op_dict[4][5] = (18.136032, 25.40832) \ No newline at end of file diff --git a/examples/nas/cream/lib/utils/search_structure_supernet.py b/examples/nas/cream/lib/utils/search_structure_supernet.py new file mode 100644 index 0000000000..b13491c2c7 --- /dev/null +++ b/examples/nas/cream/lib/utils/search_structure_supernet.py @@ -0,0 +1,47 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. 
+# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +def search_for_layer(flops_op_dict, arch_def, flops_minimum, flops_maximum): + sta_num = [1, 1, 1, 1, 1] + order = [2, 3, 4, 1, 0, 2, 3, 4, 1, 0] + limits = [3, 3, 3, 2, 2, 4, 4, 4, 4, 4] + size_factor = 224 // 32 + base_min_flops = sum([flops_op_dict[i][0][0] for i in range(5)]) + base_max_flops = sum([flops_op_dict[i][5][0] for i in range(5)]) + + if base_min_flops > flops_maximum: + while base_min_flops > flops_maximum and size_factor >= 2: + size_factor = size_factor - 1 + flops_minimum = flops_minimum * (7. / size_factor) + flops_maximum = flops_maximum * (7. / size_factor) + if size_factor < 2: + return None, None, None + elif base_max_flops < flops_minimum: + cur_ptr = 0 + while base_max_flops < flops_minimum and cur_ptr <= 9: + if sta_num[order[cur_ptr]] >= limits[cur_ptr]: + cur_ptr += 1 + continue + base_max_flops = base_max_flops + \ + flops_op_dict[order[cur_ptr]][5][1] + sta_num[order[cur_ptr]] += 1 + if cur_ptr > 7 and base_max_flops < flops_minimum: + return None, None, None + + cur_ptr = 0 + while cur_ptr <= 9: + if sta_num[order[cur_ptr]] >= limits[cur_ptr]: + cur_ptr += 1 + continue + base_max_flops = base_max_flops + flops_op_dict[order[cur_ptr]][5][1] + if base_max_flops <= flops_maximum: + sta_num[order[cur_ptr]] += 1 + else: + break + + arch_def = [item[:i] for i, item in zip([1] + sta_num + [1], arch_def)] + # print(arch_def) + + return sta_num, arch_def, size_factor * 32 diff --git a/examples/nas/cream/lib/utils/util.py b/examples/nas/cream/lib/utils/util.py new file mode 100644 index 0000000000..9324a003cc --- /dev/null +++ b/examples/nas/cream/lib/utils/util.py @@ -0,0 +1,178 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. 
+# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +import sys +import argparse +import torch.nn as nn + +from torch import optim as optim +from thop import profile, clever_format + +from timm.utils import * + +from lib.config import cfg + + +def get_path_acc(model, path, val_loader, args, val_iters=50): + prec1_m = AverageMeter() + prec5_m = AverageMeter() + with torch.no_grad(): + for batch_idx, (input, target) in enumerate(val_loader): + if batch_idx >= val_iters: + break + if not args.prefetcher: + input = input.cuda() + target = target.cuda() + + output = model(input, path) + if isinstance(output, (tuple, list)): + output = output[0] + + # augmentation reduction + reduce_factor = args.tta + if reduce_factor > 1: + output = output.unfold( + 0, + reduce_factor, + reduce_factor).mean( + dim=2) + target = target[0:target.size(0):reduce_factor] + + prec1, prec5 = accuracy(output, target, topk=(1, 5)) + + torch.cuda.synchronize() + + prec1_m.update(prec1.item(), output.size(0)) + prec5_m.update(prec5.item(), output.size(0)) + + return (prec1_m.avg, prec5_m.avg) + + +def get_logger(file_path): + """ Make python logger """ + log_format = '%(asctime)s | %(message)s' + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt='%m/%d %I:%M:%S %p') + logger = logging.getLogger('') + + formatter = logging.Formatter(log_format, datefmt='%m/%d %I:%M:%S %p') + file_handler = logging.FileHandler(file_path) + file_handler.setFormatter(formatter) + + logger.addHandler(file_handler) + + return logger + + +def add_weight_decay_supernet(model, args, weight_decay=1e-5, skip_list=()): + decay = [] + no_decay = [] + meta_layer_no_decay = [] + meta_layer_decay = [] + for name, param in model.named_parameters(): + if not param.requires_grad: + continue # frozen weights + if len(param.shape) == 1 or name.endswith( + ".bias") or name in skip_list: + if 'meta_layer' in name: + 
meta_layer_no_decay.append(param) + else: + no_decay.append(param) + else: + if 'meta_layer' in name: + meta_layer_decay.append(param) + else: + decay.append(param) + return [ + {'params': no_decay, 'weight_decay': 0., 'lr': args.lr}, + {'params': decay, 'weight_decay': weight_decay, 'lr': args.lr}, + {'params': meta_layer_no_decay, 'weight_decay': 0., 'lr': args.meta_lr}, + {'params': meta_layer_decay, 'weight_decay': 0, 'lr': args.meta_lr}, + ] + + +def create_optimizer_supernet(args, model, has_apex, filter_bias_and_bn=True): + opt_lower = args.opt.lower() + weight_decay = args.weight_decay + if 'adamw' in opt_lower or 'radam' in opt_lower: + weight_decay /= args.lr + if weight_decay and filter_bias_and_bn: + parameters = add_weight_decay_supernet(model, args, weight_decay) + weight_decay = 0. + else: + parameters = model.parameters() + + if 'fused' in opt_lower: + assert has_apex and torch.cuda.is_available( + ), 'APEX and CUDA required for fused optimizers' + + opt_split = opt_lower.split('_') + opt_lower = opt_split[-1] + if opt_lower == 'sgd' or opt_lower == 'nesterov': + optimizer = optim.SGD( + parameters, + momentum=args.momentum, + weight_decay=weight_decay, + nesterov=True) + elif opt_lower == 'momentum': + optimizer = optim.SGD( + parameters, + momentum=args.momentum, + weight_decay=weight_decay, + nesterov=False) + elif opt_lower == 'adam': + optimizer = optim.Adam( + parameters, weight_decay=weight_decay, eps=args.opt_eps) + else: + assert False and "Invalid optimizer" + raise ValueError + + return optimizer + + +def convert_lowercase(cfg): + keys = cfg.keys() + lowercase_keys = [key.lower() for key in keys] + values = [cfg.get(key) for key in keys] + for lowercase_key, value in zip(lowercase_keys, values): + cfg.setdefault(lowercase_key, value) + return cfg + + +def parse_config_args(exp_name): + parser = argparse.ArgumentParser(description=exp_name) + parser.add_argument( + '--cfg', + type=str, + 
default='../experiments/workspace/retrain/retrain.yaml', + help='configuration of cream') + parser.add_argument('--local_rank', type=int, default=0, + help='local_rank') + args = parser.parse_args() + + cfg.merge_from_file(args.cfg) + converted_cfg = convert_lowercase(cfg) + + return args, converted_cfg + + +def get_model_flops_params(model, input_size=(1, 3, 224, 224)): + input = torch.randn(input_size) + macs, params = profile(deepcopy(model), inputs=(input,), verbose=False) + macs, params = clever_format([macs, params], "%.3f") + return macs, params + + +def cross_entropy_loss_with_soft_target(pred, soft_target): + logsoftmax = nn.LogSoftmax(dim=1) + return torch.mean(torch.sum(- soft_target * logsoftmax(pred), 1)) + + +def create_supernet_scheduler(cfg, optimizer): + ITERS = cfg.EPOCHS * \ + (1280000 / (cfg.NUM_GPU * cfg.DATASET.BATCH_SIZE)) + lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: ( + cfg.LR - step / ITERS) if step <= ITERS else 0, last_epoch=-1) + return lr_scheduler, cfg.EPOCHS diff --git a/examples/nas/cream/requirements b/examples/nas/cream/requirements new file mode 100644 index 0000000000..5ddae72e4c --- /dev/null +++ b/examples/nas/cream/requirements @@ -0,0 +1,12 @@ +yacs +numpy==1.17 +opencv-python==4.0.1.24 +torchvision==0.2.1 +thop +git+https://github.com/sovrasov/flops-counter.pytorch.git +pillow==6.1.0 +torch==1.2 +timm==0.1.20 +tensorboardx==1.2 +tensorboard +future \ No newline at end of file diff --git a/examples/nas/cream/retrain.py b/examples/nas/cream/retrain.py new file mode 100644 index 0000000000..566c929b8c --- /dev/null +++ b/examples/nas/cream/retrain.py @@ -0,0 +1,321 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. 
+# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +import os +import warnings +import datetime +import torch +import numpy as np +import torch.nn as nn + +from torchscope import scope +from torch.utils.tensorboard import SummaryWriter + +# import timm packages +from timm.optim import create_optimizer +from timm.models import resume_checkpoint +from timm.scheduler import create_scheduler +from timm.data import Dataset, create_loader +from timm.utils import ModelEma, update_summary +from timm.loss import LabelSmoothingCrossEntropy + +# import apex as distributed package +try: + from apex import amp + from apex.parallel import DistributedDataParallel as DDP + from apex.parallel import convert_syncbn_model + HAS_APEX = True +except ImportError: + from torch.nn.parallel import DistributedDataParallel as DDP + HAS_APEX = False + +# import models and training functions +from lib.core.test import validate +from lib.core.retrain import train_epoch +from lib.models.structures.childnet import gen_childnet +from lib.utils.util import parse_config_args, get_logger, get_model_flops_params +from lib.config import DEFAULT_CROP_PCT, IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD + + +def main(): + args, cfg = parse_config_args('nni.cream.childnet') + + # resolve logging + output_dir = os.path.join(cfg.SAVE_PATH, + "{}-{}".format(datetime.date.today().strftime('%m%d'), + cfg.MODEL)) + if not os.path.exists(output_dir): + os.mkdir(output_dir) + + if args.local_rank == 0: + logger = get_logger(os.path.join(output_dir, 'retrain.log')) + writer = SummaryWriter(os.path.join(output_dir, 'runs')) + else: + writer, logger = None, None + + # retrain model selection + if cfg.NET.SELECTION == 481: + arch_list = [ + [0], [ + 3, 4, 3, 1], [ + 3, 2, 3, 0], [ + 3, 3, 3, 1], [ + 3, 3, 3, 3], [ + 3, 3, 3, 3], [0]] + cfg.DATASET.IMAGE_SIZE = 224 + elif cfg.NET.SELECTION == 43: + arch_list = [[0], [3], [3, 1], [3, 1], [3, 3, 3], [3, 3], [0]] + 
cfg.DATASET.IMAGE_SIZE = 96 + elif cfg.NET.SELECTION == 14: + arch_list = [[0], [3], [3, 3], [3, 3], [3], [3], [0]] + cfg.DATASET.IMAGE_SIZE = 64 + elif cfg.NET.SELECTION == 112: + arch_list = [[0], [3], [3, 3], [3, 3], [3, 3, 3], [3, 3], [0]] + cfg.DATASET.IMAGE_SIZE = 160 + elif cfg.NET.SELECTION == 287: + arch_list = [[0], [3], [3, 3], [3, 1, 3], [3, 3, 3, 3], [3, 3, 3], [0]] + cfg.DATASET.IMAGE_SIZE = 224 + elif cfg.NET.SELECTION == 604: + arch_list = [ + [0], [ + 3, 3, 2, 3, 3], [ + 3, 2, 3, 2, 3], [ + 3, 2, 3, 2, 3], [ + 3, 3, 2, 2, 3, 3], [ + 3, 3, 2, 3, 3, 3], [0]] + cfg.DATASET.IMAGE_SIZE = 224 + elif cfg.NET.SELECTION == -1: + arch_list = cfg.NET.INPUT_ARCH + cfg.DATASET.IMAGE_SIZE = 224 + else: + raise ValueError("Model Retrain Selection is not Supported!") + + # define childnet architecture from arch_list + stem = ['ds_r1_k3_s1_e1_c16_se0.25', 'cn_r1_k1_s1_c320_se0.25'] + choice_block_pool = ['ir_r1_k3_s2_e4_c24_se0.25', + 'ir_r1_k5_s2_e4_c40_se0.25', + 'ir_r1_k3_s2_e6_c80_se0.25', + 'ir_r1_k3_s1_e6_c96_se0.25', + 'ir_r1_k3_s2_e6_c192_se0.25'] + arch_def = [[stem[0]]] + [[choice_block_pool[idx] + for repeat_times in range(len(arch_list[idx + 1]))] + for idx in range(len(choice_block_pool))] + [[stem[1]]] + + # generate childnet + model = gen_childnet( + arch_list, + arch_def, + num_classes=cfg.DATASET.NUM_CLASSES, + drop_rate=cfg.NET.DROPOUT_RATE, + global_pool=cfg.NET.GP) + + # initialize training parameters + eval_metric = cfg.EVAL_METRICS + best_metric, best_epoch, saver = None, None, None + + # initialize distributed parameters + distributed = cfg.NUM_GPU > 1 + torch.cuda.set_device(args.local_rank) + torch.distributed.init_process_group(backend='nccl', init_method='env://') + if args.local_rank == 0: + logger.info( + 'Training on Process {} with {} GPUs.'.format( + args.local_rank, cfg.NUM_GPU)) + + # fix random seeds + torch.manual_seed(cfg.SEED) + torch.cuda.manual_seed_all(cfg.SEED) + np.random.seed(cfg.SEED) + torch.backends.cudnn.deterministic 
= True + torch.backends.cudnn.benchmark = False + + # get parameters and FLOPs of model + if args.local_rank == 0: + macs, params = get_model_flops_params(model, input_size=( + 1, 3, cfg.DATASET.IMAGE_SIZE, cfg.DATASET.IMAGE_SIZE)) + logger.info( + '[Model-{}] Flops: {} Params: {}'.format(cfg.NET.SELECTION, macs, params)) + + # create optimizer + optimizer = create_optimizer(cfg, model) + model = model.cuda() + + # optionally resume from a checkpoint + resume_state, resume_epoch = {}, None + if cfg.AUTO_RESUME: + resume_state, resume_epoch = resume_checkpoint(model, cfg.RESUME_PATH) + optimizer.load_state_dict(resume_state['optimizer']) + del resume_state + + model_ema = None + if cfg.NET.EMA.USE: + model_ema = ModelEma( + model, + decay=cfg.NET.EMA.DECAY, + device='cpu' if cfg.NET.EMA.FORCE_CPU else '', + resume=cfg.RESUME_PATH if cfg.AUTO_RESUME else None) + + if distributed: + if cfg.BATCHNORM.SYNC_BN: + try: + if HAS_APEX: + model = convert_syncbn_model(model) + else: + model = torch.nn.SyncBatchNorm.convert_sync_batchnorm( + model) + if args.local_rank == 0: + logger.info( + 'Converted model to use Synchronized BatchNorm.') + except Exception as e: + if args.local_rank == 0: + logger.error( + 'Failed to enable Synchronized BatchNorm. Install Apex or Torch >= 1.1 with exception {}'.format(e)) + if HAS_APEX: + model = DDP(model, delay_allreduce=True) + else: + if args.local_rank == 0: + logger.info( + "Using torch DistributedDataParallel. 
Install NVIDIA Apex for Apex DDP.") + # can use device str in Torch >= 1.1 + model = DDP(model, device_ids=[args.local_rank]) + + # imagenet train dataset + train_dir = os.path.join(cfg.DATA_DIR, 'train') + if not os.path.exists(train_dir) and args.local_rank == 0: + logger.error('Training folder does not exist at: {}'.format(train_dir)) + exit(1) + dataset_train = Dataset(train_dir) + loader_train = create_loader( + dataset_train, + input_size=(3, cfg.DATASET.IMAGE_SIZE, cfg.DATASET.IMAGE_SIZE), + batch_size=cfg.DATASET.BATCH_SIZE, + is_training=True, + color_jitter=cfg.AUGMENTATION.COLOR_JITTER, + auto_augment=cfg.AUGMENTATION.AA, + num_aug_splits=0, + crop_pct=DEFAULT_CROP_PCT, + mean=IMAGENET_DEFAULT_MEAN, + std=IMAGENET_DEFAULT_STD, + num_workers=cfg.WORKERS, + distributed=distributed, + collate_fn=None, + pin_memory=cfg.DATASET.PIN_MEM, + interpolation='random', + re_mode=cfg.AUGMENTATION.RE_MODE, + re_prob=cfg.AUGMENTATION.RE_PROB + ) + + # imagenet validation dataset + eval_dir = os.path.join(cfg.DATA_DIR, 'val') + if not os.path.exists(eval_dir) and args.local_rank == 0: + logger.error( + 'Validation folder does not exist at: {}'.format(eval_dir)) + exit(1) + dataset_eval = Dataset(eval_dir) + loader_eval = create_loader( + dataset_eval, + input_size=(3, cfg.DATASET.IMAGE_SIZE, cfg.DATASET.IMAGE_SIZE), + batch_size=cfg.DATASET.VAL_BATCH_MUL * cfg.DATASET.BATCH_SIZE, + is_training=False, + interpolation=cfg.DATASET.INTERPOLATION, + crop_pct=DEFAULT_CROP_PCT, + mean=IMAGENET_DEFAULT_MEAN, + std=IMAGENET_DEFAULT_STD, + num_workers=cfg.WORKERS, + distributed=distributed, + pin_memory=cfg.DATASET.PIN_MEM + ) + + # whether to use label smoothing + if cfg.AUGMENTATION.SMOOTHING > 0.: + train_loss_fn = LabelSmoothingCrossEntropy( + smoothing=cfg.AUGMENTATION.SMOOTHING).cuda() + validate_loss_fn = nn.CrossEntropyLoss().cuda() + else: + train_loss_fn = nn.CrossEntropyLoss().cuda() + validate_loss_fn = train_loss_fn + + # create learning rate scheduler + 
lr_scheduler, num_epochs = create_scheduler(cfg, optimizer) + start_epoch = resume_epoch if resume_epoch is not None else 0 + if start_epoch > 0: + lr_scheduler.step(start_epoch) + if args.local_rank == 0: + logger.info('Scheduled epochs: {}'.format(num_epochs)) + + try: + best_record, best_ep = 0, 0 + for epoch in range(start_epoch, num_epochs): + if distributed: + loader_train.sampler.set_epoch(epoch) + + train_metrics = train_epoch( + epoch, + model, + loader_train, + optimizer, + train_loss_fn, + cfg, + lr_scheduler=lr_scheduler, + saver=saver, + output_dir=output_dir, + model_ema=model_ema, + logger=logger, + writer=writer, + local_rank=args.local_rank) + + eval_metrics = validate( + epoch, + model, + loader_eval, + validate_loss_fn, + cfg, + logger=logger, + writer=writer, + local_rank=args.local_rank) + + if model_ema is not None and not cfg.NET.EMA.FORCE_CPU: + ema_eval_metrics = validate( + epoch, + model_ema.ema, + loader_eval, + validate_loss_fn, + cfg, + log_suffix='_EMA', + logger=logger, + writer=writer) + eval_metrics = ema_eval_metrics + + if lr_scheduler is not None: + lr_scheduler.step(epoch + 1, eval_metrics[eval_metric]) + + update_summary(epoch, train_metrics, eval_metrics, os.path.join( + output_dir, 'summary.csv'), write_header=best_metric is None) + + if saver is not None: + # save proper checkpoint with eval metric + save_metric = eval_metrics[eval_metric] + best_metric, best_epoch = saver.save_checkpoint( + model, optimizer, cfg, + epoch=epoch, model_ema=model_ema, metric=save_metric) + + if best_record < eval_metrics[eval_metric]: + best_record = eval_metrics[eval_metric] + best_ep = epoch + + if args.local_rank == 0: + logger.info( + '*** Best metric: {0} (epoch {1})'.format(best_record, best_ep)) + + except KeyboardInterrupt: + pass + + if best_metric is not None: + logger.info( + '*** Best metric: {0} (epoch {1})'.format(best_metric, best_epoch)) + + +if __name__ == '__main__': + main() diff --git a/examples/nas/cream/test.py 
b/examples/nas/cream/test.py new file mode 100644 index 0000000000..67ee822853 --- /dev/null +++ b/examples/nas/cream/test.py @@ -0,0 +1,158 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +import os +import warnings +import datetime +import torch +import torch.nn as nn + +from torch.utils.tensorboard import SummaryWriter + +# import timm packages +from timm.utils import ModelEma +from timm.models import resume_checkpoint +from timm.data import Dataset, create_loader + +# import apex as distributed package +try: + from apex.parallel import convert_syncbn_model + from apex.parallel import DistributedDataParallel as DDP + HAS_APEX = True +except ImportError: + from torch.nn.parallel import DistributedDataParallel as DDP + HAS_APEX = False + +# import models and training functions +from lib.core.test import validate +from lib.models.structures.childnet import gen_childnet +from lib.utils.util import parse_config_args, get_logger, get_model_flops_params +from lib.config import DEFAULT_CROP_PCT, IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD + + +def main(): + args, cfg = parse_config_args('child net testing') + + # resolve logging + output_dir = os.path.join(cfg.SAVE_PATH, + "{}-{}".format(datetime.date.today().strftime('%m%d'), + cfg.MODEL)) + if not os.path.exists(output_dir): + os.mkdir(output_dir) + + if args.local_rank == 0: + logger = get_logger(os.path.join(output_dir, 'test.log')) + writer = SummaryWriter(os.path.join(output_dir, 'runs')) + else: + writer, logger = None, None + + # retrain model selection + if cfg.NET.SELECTION == 481: + arch_list = [ + [0], [ + 3, 4, 3, 1], [ + 3, 2, 3, 0], [ + 3, 3, 3, 1], [ + 3, 3, 3, 3], [ + 3, 3, 3, 3], [0]] + cfg.DATASET.IMAGE_SIZE = 224 + elif cfg.NET.SELECTION == 43: + arch_list = [[0], [3], [3, 1], [3, 1], [3, 3, 3], [3, 3], [0]] + cfg.DATASET.IMAGE_SIZE = 96 + elif cfg.NET.SELECTION == 14: 
+ arch_list = [[0], [3], [3, 3], [3, 3], [3], [3], [0]] + cfg.DATASET.IMAGE_SIZE = 64 + elif cfg.NET.SELECTION == 112: + arch_list = [[0], [3], [3, 3], [3, 3], [3, 3, 3], [3, 3], [0]] + cfg.DATASET.IMAGE_SIZE = 160 + elif cfg.NET.SELECTION == 287: + arch_list = [[0], [3], [3, 3], [3, 1, 3], [3, 3, 3, 3], [3, 3, 3], [0]] + cfg.DATASET.IMAGE_SIZE = 224 + elif cfg.NET.SELECTION == 604: + arch_list = [[0], [3, 3, 2, 3, 3], [3, 2, 3, 2, 3], [3, 2, 3, 2, 3], + [3, 3, 2, 2, 3, 3], [3, 3, 2, 3, 3, 3], [0]] + cfg.DATASET.IMAGE_SIZE = 224 + else: + raise ValueError("Model Test Selection is not Supported!") + + # define childnet architecture from arch_list + stem = ['ds_r1_k3_s1_e1_c16_se0.25', 'cn_r1_k1_s1_c320_se0.25'] + choice_block_pool = ['ir_r1_k3_s2_e4_c24_se0.25', + 'ir_r1_k5_s2_e4_c40_se0.25', + 'ir_r1_k3_s2_e6_c80_se0.25', + 'ir_r1_k3_s1_e6_c96_se0.25', + 'ir_r1_k3_s2_e6_c192_se0.25'] + arch_def = [[stem[0]]] + [[choice_block_pool[idx] + for repeat_times in range(len(arch_list[idx + 1]))] + for idx in range(len(choice_block_pool))] + [[stem[1]]] + + # generate childnet + model = gen_childnet( + arch_list, + arch_def, + num_classes=cfg.DATASET.NUM_CLASSES, + drop_rate=cfg.NET.DROPOUT_RATE, + global_pool=cfg.NET.GP) + + if args.local_rank == 0: + macs, params = get_model_flops_params(model, input_size=( + 1, 3, cfg.DATASET.IMAGE_SIZE, cfg.DATASET.IMAGE_SIZE)) + logger.info( + '[Model-{}] Flops: {} Params: {}'.format(cfg.NET.SELECTION, macs, params)) + + # initialize distributed parameters + torch.cuda.set_device(args.local_rank) + torch.distributed.init_process_group(backend='nccl', init_method='env://') + if args.local_rank == 0: + logger.info( + "Testing on Process {} with {} GPUs.".format( + args.local_rank, cfg.NUM_GPU)) + + # resume model from checkpoint + assert cfg.AUTO_RESUME is True and os.path.exists(cfg.RESUME_PATH) + _, __ = resume_checkpoint(model, cfg.RESUME_PATH) + + model = model.cuda() + + model_ema = None + if cfg.NET.EMA.USE: + # Important to 
create EMA model after cuda(), DP wrapper, and AMP but + # before SyncBN and DDP wrapper + model_ema = ModelEma( + model, + decay=cfg.NET.EMA.DECAY, + device='cpu' if cfg.NET.EMA.FORCE_CPU else '', + resume=cfg.RESUME_PATH) + + # imagenet validation dataset + eval_dir = os.path.join(cfg.DATA_DIR, 'val') + if not os.path.exists(eval_dir) and args.local_rank == 0: + logger.error( + 'Validation folder does not exist at: {}'.format(eval_dir)) + exit(1) + + dataset_eval = Dataset(eval_dir) + loader_eval = create_loader( + dataset_eval, + input_size=(3, cfg.DATASET.IMAGE_SIZE, cfg.DATASET.IMAGE_SIZE), + batch_size=cfg.DATASET.VAL_BATCH_MUL * cfg.DATASET.BATCH_SIZE, + is_training=False, + num_workers=cfg.WORKERS, + distributed=True, + pin_memory=cfg.DATASET.PIN_MEM, + crop_pct=DEFAULT_CROP_PCT, + mean=IMAGENET_DEFAULT_MEAN, + std=IMAGENET_DEFAULT_STD + ) + + # only test accuracy of model-EMA + validate_loss_fn = nn.CrossEntropyLoss().cuda() + validate(0, model_ema.ema, loader_eval, validate_loss_fn, cfg, + log_suffix='_EMA', logger=logger, + writer=writer, local_rank=args.local_rank) + + +if __name__ == '__main__': + main() diff --git a/examples/nas/cream/train.py b/examples/nas/cream/train.py new file mode 100644 index 0000000000..50d340c1ef --- /dev/null +++ b/examples/nas/cream/train.py @@ -0,0 +1,213 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. 
+# Written by Hao Du and Houwen Peng +# email: haodu8-c@my.cityu.edu.hk and houwen.peng@microsoft.com + +import os +import sys +import datetime +import torch +import numpy as np +import torch.nn as nn + +# import timm packages +from timm.loss import LabelSmoothingCrossEntropy +from timm.data import Dataset, create_loader +from timm.models import resume_checkpoint + +# import apex as distributed package +try: + from apex.parallel import DistributedDataParallel as DDP + from apex.parallel import convert_syncbn_model + USE_APEX = True +except ImportError: + from torch.nn.parallel import DistributedDataParallel as DDP + USE_APEX = False + +# import models and training functions +from lib.utils.flops_table import FlopsEst +from lib.models.structures.supernet import gen_supernet +from lib.config import DEFAULT_CROP_PCT, IMAGENET_DEFAULT_STD, IMAGENET_DEFAULT_MEAN +from lib.utils.util import parse_config_args, get_logger, \ + create_optimizer_supernet, create_supernet_scheduler + +from nni.nas.pytorch.callbacks import LRSchedulerCallback +from nni.nas.pytorch.callbacks import ModelCheckpoint +from nni.algorithms.nas.pytorch.cream import CreamSupernetTrainer +from nni.algorithms.nas.pytorch.random import RandomMutator + +def main(): + args, cfg = parse_config_args('nni.cream.supernet') + + # resolve logging + output_dir = os.path.join(cfg.SAVE_PATH, + "{}-{}".format(datetime.date.today().strftime('%m%d'), + cfg.MODEL)) + if not os.path.exists(output_dir): + os.mkdir(output_dir) + + if args.local_rank == 0: + logger = get_logger(os.path.join(output_dir, "train.log")) + else: + logger = None + + # initialize distributed parameters + torch.cuda.set_device(args.local_rank) + torch.distributed.init_process_group(backend='nccl', init_method='env://') + if args.local_rank == 0: + logger.info( + 'Training on Process %d with %d GPUs.', + args.local_rank, cfg.NUM_GPU) + + # fix random seeds + torch.manual_seed(cfg.SEED) + torch.cuda.manual_seed_all(cfg.SEED) + 
np.random.seed(cfg.SEED) + torch.backends.cudnn.deterministic = True + torch.backends.cudnn.benchmark = False + + # generate supernet + model, sta_num, resolution = gen_supernet( + flops_minimum=cfg.SUPERNET.FLOPS_MINIMUM, + flops_maximum=cfg.SUPERNET.FLOPS_MAXIMUM, + num_classes=cfg.DATASET.NUM_CLASSES, + drop_rate=cfg.NET.DROPOUT_RATE, + global_pool=cfg.NET.GP, + resunit=cfg.SUPERNET.RESUNIT, + dil_conv=cfg.SUPERNET.DIL_CONV, + slice=cfg.SUPERNET.SLICE, + verbose=cfg.VERBOSE, + logger=logger) + + # number of choice blocks in supernet + choice_num = len(model.blocks[7]) + if args.local_rank == 0: + logger.info('Supernet created, param count: %d', ( + sum([m.numel() for m in model.parameters()]))) + logger.info('resolution: %d', (resolution)) + logger.info('choice number: %d', (choice_num)) + + # initialize flops look-up table + model_est = FlopsEst(model) + flops_dict, flops_fixed = model_est.flops_dict, model_est.flops_fixed + + # optionally resume from a checkpoint + optimizer_state = None + resume_epoch = None + if cfg.AUTO_RESUME: + optimizer_state, resume_epoch = resume_checkpoint( + model, cfg.RESUME_PATH) + + # create optimizer and resume from checkpoint + optimizer = create_optimizer_supernet(cfg, model, USE_APEX) + if optimizer_state is not None: + optimizer.load_state_dict(optimizer_state['optimizer']) + model = model.cuda() + + # convert model to distributed mode + if cfg.BATCHNORM.SYNC_BN: + try: + if USE_APEX: + model = convert_syncbn_model(model) + else: + model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) + if args.local_rank == 0: + logger.info('Converted model to use Synchronized BatchNorm.') + except Exception as exception: + logger.info( + 'Failed to enable Synchronized BatchNorm. ' + 'Install Apex or Torch >= 1.1 with Exception %s', exception) + if USE_APEX: + model = DDP(model, delay_allreduce=True) + else: + if args.local_rank == 0: + logger.info( + "Using torch DistributedDataParallel. 
Install NVIDIA Apex for Apex DDP.") + # can use device str in Torch >= 1.1 + model = DDP(model, device_ids=[args.local_rank]) + + # create learning rate scheduler + lr_scheduler, num_epochs = create_supernet_scheduler(cfg, optimizer) + + start_epoch = resume_epoch if resume_epoch is not None else 0 + if start_epoch > 0: + lr_scheduler.step(start_epoch) + + if args.local_rank == 0: + logger.info('Scheduled epochs: %d', num_epochs) + + # imagenet train dataset + train_dir = os.path.join(cfg.DATA_DIR, 'train') + if not os.path.exists(train_dir): + logger.info('Training folder does not exist at: %s', train_dir) + sys.exit() + + dataset_train = Dataset(train_dir) + loader_train = create_loader( + dataset_train, + input_size=(3, cfg.DATASET.IMAGE_SIZE, cfg.DATASET.IMAGE_SIZE), + batch_size=cfg.DATASET.BATCH_SIZE, + is_training=True, + use_prefetcher=True, + re_prob=cfg.AUGMENTATION.RE_PROB, + re_mode=cfg.AUGMENTATION.RE_MODE, + color_jitter=cfg.AUGMENTATION.COLOR_JITTER, + interpolation='random', + num_workers=cfg.WORKERS, + distributed=True, + collate_fn=None, + crop_pct=DEFAULT_CROP_PCT, + mean=IMAGENET_DEFAULT_MEAN, + std=IMAGENET_DEFAULT_STD + ) + + # imagenet validation dataset + eval_dir = os.path.join(cfg.DATA_DIR, 'val') + if not os.path.isdir(eval_dir): + logger.info('Validation folder does not exist at: %s', eval_dir) + sys.exit() + dataset_eval = Dataset(eval_dir) + loader_eval = create_loader( + dataset_eval, + input_size=(3, cfg.DATASET.IMAGE_SIZE, cfg.DATASET.IMAGE_SIZE), + batch_size=4 * cfg.DATASET.BATCH_SIZE, + is_training=False, + use_prefetcher=True, + num_workers=cfg.WORKERS, + distributed=True, + crop_pct=DEFAULT_CROP_PCT, + mean=IMAGENET_DEFAULT_MEAN, + std=IMAGENET_DEFAULT_STD, + interpolation=cfg.DATASET.INTERPOLATION + ) + + # whether to use label smoothing + if cfg.AUGMENTATION.SMOOTHING > 0.: + train_loss_fn = LabelSmoothingCrossEntropy( + smoothing=cfg.AUGMENTATION.SMOOTHING).cuda() + validate_loss_fn = nn.CrossEntropyLoss().cuda() + else: + 
train_loss_fn = nn.CrossEntropyLoss().cuda() + validate_loss_fn = train_loss_fn + + mutator = RandomMutator(model) + + trainer = CreamSupernetTrainer(model, train_loss_fn, validate_loss_fn, + optimizer, num_epochs, loader_train, loader_eval, + mutator=mutator, batch_size=cfg.DATASET.BATCH_SIZE, + log_frequency=cfg.LOG_INTERVAL, + meta_sta_epoch=cfg.SUPERNET.META_STA_EPOCH, + update_iter=cfg.SUPERNET.UPDATE_ITER, + slices=cfg.SUPERNET.SLICE, + pool_size=cfg.SUPERNET.POOL_SIZE, + pick_method=cfg.SUPERNET.PICK_METHOD, + choice_num=choice_num, sta_num=sta_num, acc_gap=cfg.ACC_GAP, + flops_dict=flops_dict, flops_fixed=flops_fixed, local_rank=args.local_rank, + callbacks=[LRSchedulerCallback(lr_scheduler), + ModelCheckpoint(output_dir)]) + + trainer.train() + + +if __name__ == '__main__': + main() diff --git a/examples/trials/cifar10_pytorch/config_adl.yml b/examples/trials/cifar10_pytorch/config_adl.yml new file mode 100644 index 0000000000..b1a994d752 --- /dev/null +++ b/examples/trials/cifar10_pytorch/config_adl.yml @@ -0,0 +1,29 @@ +authorName: default +experimentName: example_pytorch_cifar10 +trialConcurrency: 1 +maxExecDuration: 100h +maxTrialNum: 10 +nniManagerIp: {replace_with_your_ip} +trainingServicePlatform: adl +searchSpacePath: search_space_adl.json +logCollection: http +#choice: true, false +useAnnotation: false +tuner: + #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner + #SMAC (SMAC should be installed through nnictl) + builtinTunerName: TPE + classArgs: + #choice: maximize, minimize + optimize_mode: maximize +trial: + command: python3 main_adl.py + codeDir: . 
+ gpuNum: 1 + image: {replace_with_the_image_that_has_adaptdl_installed} + adaptive: true + checkpoint: + storageClass: dfs + storageSize: 1Gi + cpuNum: 1 + memorySize: 1Gi diff --git a/examples/trials/cifar10_pytorch/main_adl.py b/examples/trials/cifar10_pytorch/main_adl.py new file mode 100644 index 0000000000..c162c2bfc9 --- /dev/null +++ b/examples/trials/cifar10_pytorch/main_adl.py @@ -0,0 +1,170 @@ +# Copyright 2020 Petuum, Inc. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +''' +Train CIFAR10 with PyTorch and AdaptDL. 
This example is based on: +https://github.com/petuum/adaptdl/blob/master/examples/pytorch-cifar/main.py +''' +import torch +import torch.nn as nn +import torch.optim as optim +import torch.backends.cudnn as cudnn +import torch.distributed as dist + +import torchvision +import torchvision.transforms as transforms + +import os +import argparse + +from models import * + +import adaptdl +import adaptdl.torch as adl + +from torch.optim.lr_scheduler import MultiStepLR +from torch.utils.tensorboard import SummaryWriter + +import nni + + +parser = argparse.ArgumentParser(description='PyTorch CIFAR10 Training') +parser.add_argument('--bs', default=128, type=int, help='batch size') +parser.add_argument('--lr', default=0.1, type=float, help='learning rate') +parser.add_argument('--epochs', default=30, type=int, help='number of epochs') +parser.add_argument('--model', default='ResNet18', type=str, help='model') +parser.add_argument('--autoscale-bsz', dest='autoscale_bsz', default=True, action='store_true', help='autoscale batchsize') +args = parser.parse_args() + +# load the parameters from nni +RCV_CONFIG = nni.get_next_parameter() +args.lr = RCV_CONFIG["lr"] + +device = 'cuda' if torch.cuda.is_available() else 'cpu' + +# Data +print('==> Preparing data..') +transform_train = transforms.Compose([ + transforms.RandomCrop(32, padding=4), + transforms.RandomHorizontalFlip(), + transforms.ToTensor(), + transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)), +]) + +transform_test = transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)), +]) + +adaptdl.torch.init_process_group("nccl" if torch.cuda.is_available() else "gloo") + +if adaptdl.env.replica_rank() == 0: + trainset = torchvision.datasets.CIFAR10(root=adaptdl.env.share_path(), train=True, download=True, transform=transform_train) + trainloader = adl.AdaptiveDataLoader(trainset, batch_size=args.bs, shuffle=True, num_workers=2, drop_last=True) 
+ dist.barrier() # We use a barrier here so that non-master replicas would wait for master to download the data +else: + dist.barrier() + trainset = torchvision.datasets.CIFAR10(root=adaptdl.env.share_path(), train=True, download=False, transform=transform_train) + trainloader = adl.AdaptiveDataLoader(trainset, batch_size=args.bs, shuffle=True, num_workers=2, drop_last=True) + +if args.autoscale_bsz: + trainloader.autoscale_batch_size(4096, local_bsz_bounds=(32, 1024), gradient_accumulation=True) + +validset = torchvision.datasets.CIFAR10(root=adaptdl.env.share_path(), train=False, download=False, transform=transform_test) +validloader = adl.AdaptiveDataLoader(validset, batch_size=100, shuffle=False, num_workers=2) + +# Model +print('==> Building model..') +net = eval(args.model)() +net = net.to(device) +if device == 'cuda': + cudnn.benchmark = True + +criterion = nn.CrossEntropyLoss() +optimizer = optim.SGD([{"params": [param]} for param in net.parameters()], + lr=args.lr, momentum=0.9, weight_decay=5e-4) +lr_scheduler = MultiStepLR(optimizer, [30, 45], 0.1) + +net = adl.AdaptiveDataParallel(net, optimizer, lr_scheduler) + +# Training +def train(epoch): + print('\nEpoch: %d' % epoch) + net.train() + stats = adl.Accumulator() + for inputs, targets in trainloader: + inputs, targets = inputs.to(device), targets.to(device) + optimizer.zero_grad() + outputs = net(inputs) + loss = criterion(outputs, targets) + loss.backward() + optimizer.step() + + stats["loss_sum"] += loss.item() * targets.size(0) + _, predicted = outputs.max(1) + stats["total"] += targets.size(0) + stats["correct"] += predicted.eq(targets).sum().item() + + trainloader.to_tensorboard(writer, epoch, tag_prefix="AdaptDL/Data/") + net.to_tensorboard(writer, epoch, tag_prefix="AdaptDL/Model/") + with stats.synchronized(): + stats["loss_avg"] = stats["loss_sum"] / stats["total"] + stats["accuracy"] = stats["correct"] / stats["total"] + writer.add_scalar("Loss/Train", stats["loss_avg"], epoch) + 
writer.add_scalar("Accuracy/Train", stats["accuracy"], epoch) + print("Train:", stats) + +def valid(epoch): + net.eval() + stats = adl.Accumulator() + with torch.no_grad(): + for inputs, targets in validloader: + inputs, targets = inputs.to(device), targets.to(device) + outputs = net(inputs) + loss = criterion(outputs, targets) + + stats["loss_sum"] += loss.item() * targets.size(0) + _, predicted = outputs.max(1) + stats["total"] += targets.size(0) + stats["correct"] += predicted.eq(targets).sum().item() + + with stats.synchronized(): + stats["loss_avg"] = stats["loss_sum"] / stats["total"] + stats["accuracy"] = stats["correct"] / stats["total"] + writer.add_scalar("Loss/Valid", stats["loss_avg"], epoch) + writer.add_scalar("Accuracy/Valid", stats["accuracy"], epoch) + + if adaptdl.env.replica_rank() == 0: + nni.report_intermediate_result(stats["accuracy"], accum=stats) + + print("Valid:", stats) + return stats["accuracy"] + + +tensorboard_dir = os.path.join( + os.getenv("ADAPTDL_TENSORBOARD_LOGDIR", "/adaptdl/tensorboard"), + os.getenv("NNI_TRIAL_JOB_ID", "cifar-adaptdl") +) +if not os.path.exists(tensorboard_dir): + os.makedirs(tensorboard_dir) + +with SummaryWriter(tensorboard_dir) as writer: + acc = 0 + for epoch in adl.remaining_epochs_until(args.epochs): + train(epoch) + acc = valid(epoch) + lr_scheduler.step() + + if adaptdl.env.replica_rank() == 0: + nni.report_final_result(acc) diff --git a/examples/trials/cifar10_pytorch/search_space_adl.json b/examples/trials/cifar10_pytorch/search_space_adl.json new file mode 100644 index 0000000000..0dadf05f6a --- /dev/null +++ b/examples/trials/cifar10_pytorch/search_space_adl.json @@ -0,0 +1,5 @@ +{ + "lr":{"_type":"choice", "_value":[0.1, 0.01, 0.001]}, + "bs":{"_type":"choice","_value":[64, 96, 128]}, + "model":{"_type":"choice", "_value":["ResNet18", "SENet18", "MobileNet"]} +} diff --git a/examples/trials/ga_squad/trial.py b/examples/trials/ga_squad/trial.py index 911da42bc8..1e9b53b895 100644 --- 
a/examples/trials/ga_squad/trial.py +++ b/examples/trials/ga_squad/trial.py @@ -218,8 +218,7 @@ def run_epoch(batches, answer_net, is_training): loss, _, = sess.run( [answer_net.loss, answer_net.train_op], feed_dict=feed_dict) if count % 100 == 0: - logger.debug('%d %g except:%g, loss:%g' % - (count, used, used / count * len(batches), loss)) + logger.debug('%d %g except:%g, loss:%g', count, used, used / count * len(batches), loss) loss_sum += loss else: feed_dict = {answer_net.query_word: query, @@ -239,8 +238,7 @@ def run_epoch(batches, answer_net, is_training): contexts += context ids = np.concatenate((ids, sample_id)) if count % 100 == 0: - logger.debug('%d %g except:%g' % - (count, used, used / count * len(batches))) + logger.debug('%d %g except:%g', count, used, used / count * len(batches)) loss = loss_sum / len(batches) if is_training: return loss @@ -327,7 +325,7 @@ def train_with_graph(graph, qp_pairs, dev_qp_pairs): train_batches = data.get_batches(qp_pairs, cfg.batch_size) train_loss = run_epoch(train_batches, train_model, True) logger.debug('epoch ' + str(epoch) + - ' loss: ' + str(train_loss)) + ' loss: %s', train_loss) dev_batches = list(data.get_batches( dev_qp_pairs, cfg.batch_size)) _, position1, position2, ids, contexts = run_epoch( @@ -361,8 +359,7 @@ def train_with_graph(graph, qp_pairs, dev_qp_pairs): with open(os.path.join(save_path, 'epoch%d.score' % epoch), 'wb') as file: pickle.dump( (position1, position2, ids, contexts), file) - logger.debug('epoch %d acc %g bestacc %g' % - (epoch, acc, bestacc)) + logger.debug('epoch %d acc %g bestacc %g', epoch, acc, bestacc) if patience <= iter: break logger.debug('save done.') diff --git a/examples/trials/mnist-pbt-tuner-pytorch/mnist.py b/examples/trials/mnist-pbt-tuner-pytorch/mnist.py index 417e702b6b..6c632a864a 100644 --- a/examples/trials/mnist-pbt-tuner-pytorch/mnist.py +++ b/examples/trials/mnist-pbt-tuner-pytorch/mnist.py @@ -112,7 +112,7 @@ def main(args): if
os.path.isfile(load_checkpoint_path): model_state_dict = load_checkpoint(load_checkpoint_path) - logger.info("test : " + load_checkpoint_path) + logger.info("test : %s", load_checkpoint_path) logger.info(type(model_state_dict)) model.load_state_dict(model_state_dict) diff --git a/examples/trials/mnist-pytorch/config_adl.yml b/examples/trials/mnist-pytorch/config_adl.yml new file mode 100644 index 0000000000..77d2c04eac --- /dev/null +++ b/examples/trials/mnist-pytorch/config_adl.yml @@ -0,0 +1,21 @@ +authorName: default +experimentName: example_mnist_pytorch +trialConcurrency: 1 +maxExecDuration: 1h +maxTrialNum: 10 + +logCollection: http +trainingServicePlatform: adl + +searchSpacePath: search_space.json +useAnnotation: false +tuner: + builtinTunerName: TPE + classArgs: + optimize_mode: maximize + +trial: + image: {replace_to_your_image_tag} + command: python3 mnist.py + codeDir: . + gpuNum: 1 diff --git a/examples/trials/sklearn/classification/main.py b/examples/trials/sklearn/classification/main.py index 6839e830f6..fff86b41ac 100644 --- a/examples/trials/sklearn/classification/main.py +++ b/examples/trials/sklearn/classification/main.py @@ -63,7 +63,7 @@ def run(X_train, X_test, y_train, y_test, model): '''Train model and predict result''' model.fit(X_train, y_train) score = model.score(X_test, y_test) - LOG.debug('score: %s' % score) + LOG.debug('score: %s', score) nni.report_final_result(score) if __name__ == '__main__': diff --git a/examples/trials/sklearn/regression/main.py b/examples/trials/sklearn/regression/main.py index 9b9a2fb2f8..512111de90 100644 --- a/examples/trials/sklearn/regression/main.py +++ b/examples/trials/sklearn/regression/main.py @@ -74,7 +74,7 @@ def run(X_train, X_test, y_train, y_test, model): model.fit(X_train, y_train) predict_y = model.predict(X_test) score = r2_score(y_test, predict_y) - LOG.debug('r2 score: %s' % score) + LOG.debug('r2 score: %s', score) nni.report_final_result(score) if __name__ == '__main__': diff --git
a/examples/trials/systems_auto_tuning/opevo/src/algorithms/opevo.py b/examples/trials/systems_auto_tuning/opevo/src/algorithms/opevo.py index 31346ab873..5eb439750c 100644 --- a/examples/trials/systems_auto_tuning/opevo/src/algorithms/opevo.py +++ b/examples/trials/systems_auto_tuning/opevo/src/algorithms/opevo.py @@ -387,8 +387,7 @@ def update_search_space(self, search_space): self.population = Population(search_space, self.mutate_rate, self.optimize_mode) - self.logger.debug('Total search space volume: ' - + str(self.population.volume)) + self.logger.debug('Total search space volume: %s', self.population.volume) if not self.serve_list: self.serve_list = self.population.get_offspring( diff --git a/examples/trials/weight_sharing/ga_squad/trial.py b/examples/trials/weight_sharing/ga_squad/trial.py index bafe1e707a..78aeca99b8 100644 --- a/examples/trials/weight_sharing/ga_squad/trial.py +++ b/examples/trials/weight_sharing/ga_squad/trial.py @@ -219,8 +219,7 @@ def run_epoch(batches, answer_net, is_training): loss, _, = sess.run( [answer_net.loss, answer_net.train_op], feed_dict=feed_dict) if count % 100 == 0: - logger.debug('%d %g except:%g, loss:%g' % - (count, used, used / count * len(batches), loss)) + logger.debug('%d %g except:%g, loss:%g', count, used, used / count * len(batches), loss) loss_sum += loss else: feed_dict = {answer_net.query_word: query, @@ -240,8 +239,7 @@ def run_epoch(batches, answer_net, is_training): contexts += context ids = np.concatenate((ids, sample_id)) if count % 100 == 0: - logger.debug('%d %g except:%g' % - (count, used, used / count * len(batches))) + logger.debug('%d %g except:%g', count, used, used / count * len(batches)) loss = loss_sum / len(batches) if is_training: return loss @@ -333,7 +331,7 @@ def train_with_graph(p_graph, qp_pairs, dev_qp_pairs): train_batches = data.get_batches(qp_pairs, cfg.batch_size) train_loss = run_epoch(train_batches, train_model, True) logger.debug('epoch ' + str(epoch) + - ' loss: ' +
str(train_loss)) + ' loss: %s', train_loss) dev_batches = list(data.get_batches( dev_qp_pairs, cfg.batch_size)) _, position1, position2, ids, contexts = run_epoch( @@ -369,8 +367,7 @@ def train_with_graph(p_graph, qp_pairs, dev_qp_pairs): with open(os.path.join(save_path, 'epoch%d.score' % epoch), 'wb') as file: pickle.dump( (position1, position2, ids, contexts), file) - logger.debug('epoch %d acc %g bestacc %g' % - (epoch, acc, bestacc)) + logger.debug('epoch %d acc %g bestacc %g', epoch, acc, bestacc) if patience <= iter: break logger.debug('save done.') diff --git a/nni/algorithms/compression/pytorch/pruning/admm_pruner.py b/nni/algorithms/compression/pytorch/pruning/admm_pruner.py index f65f1405e1..30e73a23f8 100644 --- a/nni/algorithms/compression/pytorch/pruning/admm_pruner.py +++ b/nni/algorithms/compression/pytorch/pruning/admm_pruner.py @@ -4,6 +4,7 @@ import logging import torch from schema import And, Optional +import copy from nni.compression.pytorch.utils.config_validation import CompressorSchema from .constants import MASKER_DICT @@ -53,7 +54,7 @@ def trainer(model, criterion, optimizer, epoch, callback): row : float Penalty parameters for ADMM training. base_algo : str - Base pruning algorithm. `level`, `l1` or `l2`, by default `l1`. Given the sparsity distribution among the ops, + Base pruning algorithm. `level`, `l1`, `l2` or `fpgm`, by default `l1`. Given the sparsity distribution among the ops, the assigned `base_algo` is used to decide which filters/channels/weights to prune.
""" @@ -87,7 +88,7 @@ def validate_config(self, model, config_list): Optional('op_types'): [str], Optional('op_names'): [str], }], model, _logger) - elif self._base_algo in ['l1', 'l2']: + elif self._base_algo in ['l1', 'l2', 'fpgm']: schema = CompressorSchema([{ 'sparsity': And(float, lambda n: 0 < n < 1), 'op_types': ['Conv2d'], @@ -96,7 +97,7 @@ def validate_config(self, model, config_list): schema.validate(config_list) - def _projection(self, weight, sparsity): + def _projection(self, weight, sparsity, wrapper): ''' Return the Euclidean projection of the weight matrix according to the pruning mode. @@ -106,31 +107,17 @@ def _projection(self, weight, sparsity): original matrix sparsity : float the ratio of parameters which need to be set to zero + wrapper: PrunerModuleWrapper + layer wrapper of this layer Returns ------- tensor the projected matrix ''' - w_abs = weight.abs() - if self._base_algo == 'level': - k = int(weight.numel() * sparsity) - if k == 0: - mask_weight = torch.ones(weight.shape).type_as(weight) - else: - threshold = torch.topk(w_abs.view(-1), k, largest=False)[0].max() - mask_weight = torch.gt(w_abs, threshold).type_as(weight) - elif self._base_algo in ['l1', 'l2']: - filters = weight.size(0) - num_prune = int(filters * sparsity) - if filters < 2 or num_prune < 1: - mask_weight = torch.ones(weight.size()).type_as(weight).detach() - else: - w_abs_structured = w_abs.view(filters, -1).sum(dim=1) - threshold = torch.topk(w_abs_structured.view(-1), num_prune, largest=False)[0].max() - mask_weight = torch.gt(w_abs_structured, threshold)[:, None, None, None].expand_as(weight).type_as(weight) - - return weight.data.mul(mask_weight) + wrapper_copy = copy.deepcopy(wrapper) + wrapper_copy.module.weight.data = weight + return weight.data.mul(self.masker.calc_mask(sparsity, wrapper_copy)['weight_mask']) def compress(self): """ @@ -179,7 +166,7 @@ def callback(): # U_i^{k+1} = U^k + W_i^{k+1} - Z_i^{k+1} for i, wrapper in 
enumerate(self.get_modules_wrapper()): z = wrapper.module.weight.data + U[i] - Z[i] = self._projection(z, wrapper.config['sparsity']) + Z[i] = self._projection(z, wrapper.config['sparsity'], wrapper) U[i] = U[i] + wrapper.module.weight.data - Z[i] # apply prune diff --git a/nni/algorithms/compression/pytorch/pruning/auto_compress_pruner.py b/nni/algorithms/compression/pytorch/pruning/auto_compress_pruner.py index d52c6ec42d..fdc27ac2f4 100644 --- a/nni/algorithms/compression/pytorch/pruning/auto_compress_pruner.py +++ b/nni/algorithms/compression/pytorch/pruning/auto_compress_pruner.py @@ -80,7 +80,7 @@ def evaluator(model): optimize_mode : str optimize mode, `maximize` or `minimize`, by default `maximize`. base_algo : str - Base pruning algorithm. `level`, `l1` or `l2`, by default `l1`. Given the sparsity distribution among the ops, + Base pruning algorithm. `level`, `l1`, `l2` or `fpgm`, by default `l1`. Given the sparsity distribution among the ops, the assigned `base_algo` is used to decide which filters/channels/weights to prune. start_temperature : float Start temperature of the simulated annealing process. @@ -151,7 +151,7 @@ def validate_config(self, model, config_list): Optional('op_types'): [str], Optional('op_names'): [str], }], model, _logger) - elif self._base_algo in ['l1', 'l2']: + elif self._base_algo in ['l1', 'l2', 'fpgm']: schema = CompressorSchema([{ 'sparsity': And(float, lambda n: 0 < n < 1), 'op_types': ['Conv2d'], diff --git a/nni/algorithms/compression/pytorch/pruning/constants_pruner.py b/nni/algorithms/compression/pytorch/pruning/constants_pruner.py index 5cc8005286..b0ad5cce37 100644 --- a/nni/algorithms/compression/pytorch/pruning/constants_pruner.py +++ b/nni/algorithms/compression/pytorch/pruning/constants_pruner.py @@ -2,10 +2,11 @@ # Licensed under the MIT license. 
-from .one_shot import LevelPruner, L1FilterPruner, L2FilterPruner +from .one_shot import LevelPruner, L1FilterPruner, L2FilterPruner, FPGMPruner PRUNER_DICT = { 'level': LevelPruner, 'l1': L1FilterPruner, - 'l2': L2FilterPruner + 'l2': L2FilterPruner, + 'fpgm': FPGMPruner } diff --git a/nni/algorithms/compression/pytorch/pruning/net_adapt_pruner.py b/nni/algorithms/compression/pytorch/pruning/net_adapt_pruner.py index 6f55234d5b..08416319ea 100644 --- a/nni/algorithms/compression/pytorch/pruning/net_adapt_pruner.py +++ b/nni/algorithms/compression/pytorch/pruning/net_adapt_pruner.py @@ -73,7 +73,7 @@ def evaluator(model): optimize_mode : str optimize mode, `maximize` or `minimize`, by default `maximize`. base_algo : str - Base pruning algorithm. `level`, `l1` or `l2`, by default `l1`. Given the sparsity distribution among the ops, + Base pruning algorithm. `level`, `l1`, `l2` or `fpgm`, by default `l1`. Given the sparsity distribution among the ops, the assigned `base_algo` is used to decide which filters/channels/weights to prune. sparsity_per_iteration : float sparsity to prune in each iteration. 
@@ -125,7 +125,7 @@ def validate_config(self, model, config_list): Optional('op_types'): [str], Optional('op_names'): [str], }], model, _logger) - elif self._base_algo in ['l1', 'l2']: + elif self._base_algo in ['l1', 'l2', 'fpgm']: schema = CompressorSchema([{ 'sparsity': And(float, lambda n: 0 < n < 1), 'op_types': ['Conv2d'], @@ -149,7 +149,7 @@ def _update_config_list(self, config_list, op_name, sparsity): return config_list_updated # if op_name is not in self._config_list_generated, create a new json item - if self._base_algo in ['l1', 'l2']: + if self._base_algo in ['l1', 'l2', 'fpgm']: config_list_updated.append( {'sparsity': sparsity, 'op_types': ['Conv2d'], 'op_names': [op_name]}) elif self._base_algo == 'level': diff --git a/nni/algorithms/compression/pytorch/pruning/sensitivity_pruner.py b/nni/algorithms/compression/pytorch/pruning/sensitivity_pruner.py index 037e7efc5e..02311fc51f 100644 --- a/nni/algorithms/compression/pytorch/pruning/sensitivity_pruner.py +++ b/nni/algorithms/compression/pytorch/pruning/sensitivity_pruner.py @@ -68,7 +68,7 @@ class SensitivityPruner(Pruner): >>> loss.backward() >>> optimizer.step() base_algo: str - base pruning algorithm. `level`, `l1` or `l2`, by default `l1`. + base pruning algorithm. `level`, `l1`, `l2` or `fpgm`, by default `l1`. sparsity_proportion_calc: function This function generate the sparsity proportion between the conv layers according to the sensitivity analysis results. 
We provide a default function to quantify the sparsity @@ -150,7 +150,7 @@ def validate_config(self, model, config_list): Optional('op_types'): [str], Optional('op_names'): [str], }], model, _logger) - elif self.base_algo in ['l1', 'l2']: + elif self.base_algo in ['l1', 'l2', 'fpgm']: schema = CompressorSchema([{ 'sparsity': And(float, lambda n: 0 < n < 1), 'op_types': ['Conv2d'], diff --git a/nni/algorithms/compression/pytorch/pruning/simulated_annealing_pruner.py b/nni/algorithms/compression/pytorch/pruning/simulated_annealing_pruner.py index 91cf160ff3..b371b2c6fb 100644 --- a/nni/algorithms/compression/pytorch/pruning/simulated_annealing_pruner.py +++ b/nni/algorithms/compression/pytorch/pruning/simulated_annealing_pruner.py @@ -54,7 +54,7 @@ def evaluator(model): optimize_mode : str Optimize mode, `maximize` or `minimize`, by default `maximize`. base_algo : str - Base pruning algorithm. `level`, `l1` or `l2`, by default `l1`. Given the sparsity distribution among the ops, + Base pruning algorithm. `level`, `l1`, `l2` or `fpgm`, by default `l1`. Given the sparsity distribution among the ops, the assigned `base_algo` is used to decide which filters/channels/weights to prune. start_temperature : float Start temperature of the simulated annealing process. 
@@ -120,7 +120,7 @@ def validate_config(self, model, config_list): Optional('op_types'): [str], Optional('op_names'): [str], }], model, _logger) - elif self._base_algo in ['l1', 'l2']: + elif self._base_algo in ['l1', 'l2', 'fpgm']: schema = CompressorSchema([{ 'sparsity': And(float, lambda n: 0 < n < 1), 'op_types': ['Conv2d'], @@ -152,7 +152,7 @@ def _sparsities_2_config_list(self, sparsities): # a layer with more weights will have no less pruning rate for idx, wrapper in enumerate(self.get_modules_wrapper()): # L1Filter Pruner requires to specify op_types - if self._base_algo in ['l1', 'l2']: + if self._base_algo in ['l1', 'l2', 'fpgm']: config_list.append( {'sparsity': sparsities[idx], 'op_types': ['Conv2d'], 'op_names': [wrapper.name]}) elif self._base_algo == 'level': diff --git a/nni/algorithms/nas/pytorch/cream/__init__.py b/nni/algorithms/nas/pytorch/cream/__init__.py new file mode 100755 index 0000000000..43a038b467 --- /dev/null +++ b/nni/algorithms/nas/pytorch/cream/__init__.py @@ -0,0 +1,4 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT license. + +from .trainer import CreamSupernetTrainer diff --git a/nni/algorithms/nas/pytorch/cream/trainer.py b/nni/algorithms/nas/pytorch/cream/trainer.py new file mode 100644 index 0000000000..0c5136d1b4 --- /dev/null +++ b/nni/algorithms/nas/pytorch/cream/trainer.py @@ -0,0 +1,406 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT license. + +import os +import torch +import logging + +from copy import deepcopy +from nni.nas.pytorch.trainer import Trainer +from nni.nas.pytorch.utils import AverageMeterGroup + +from .utils import accuracy, reduce_metrics + +logger = logging.getLogger(__name__) + + +class CreamSupernetTrainer(Trainer): + """ + This trainer trains a supernet and output prioritized architectures that can be used for other tasks. + + Parameters + ---------- + model : nn.Module + Model with mutables. + loss : callable + Called with logits and targets. 
Returns a loss tensor. + val_loss : callable + Called with logits and targets for validation only. Returns a loss tensor. + optimizer : Optimizer + Optimizer that optimizes the model. + num_epochs : int + Number of epochs of training. + train_loader : iterable + Data loader of training. Raise ``StopIteration`` when one epoch is exhausted. + valid_loader : iterable + Data loader of validation. Raise ``StopIteration`` when one epoch is exhausted. + mutator : Mutator + A mutator object that has been initialized with the model. + batch_size : int + Batch size. + log_frequency : int + Number of mini-batches to log metrics. + meta_sta_epoch : int + start epoch of using meta matching network to pick teacher architecture + update_iter : int + interval of updating meta matching networks + slices : int + batch size of mini training data in the process of training meta matching network + pool_size : int + board size + pick_method : basestring + how to pick teacher network + choice_num : int + number of operations in supernet + sta_num : int + layer number of each stage in supernet (5 stage in supernet) + acc_gap : int + maximum accuracy improvement to omit the limitation of flops + flops_dict : Dict + dictionary of each layer's operations in supernet + flops_fixed : int + flops of fixed part in supernet + local_rank : int + index of current rank + callbacks : list of Callback + Callbacks to plug into the trainer. See Callbacks.
+ """ + + def __init__(self, model, loss, val_loss, + optimizer, num_epochs, train_loader, valid_loader, + mutator=None, batch_size=64, log_frequency=None, + meta_sta_epoch=20, update_iter=200, slices=2, + pool_size=10, pick_method='meta', choice_num=6, + sta_num=(4, 4, 4, 4, 4), acc_gap=5, + flops_dict=None, flops_fixed=0, local_rank=0, callbacks=None): + assert torch.cuda.is_available() + super(CreamSupernetTrainer, self).__init__(model, mutator, loss, None, + optimizer, num_epochs, None, None, + batch_size, None, None, log_frequency, callbacks) + self.model = model + self.loss = loss + self.val_loss = val_loss + self.train_loader = train_loader + self.valid_loader = valid_loader + self.log_frequency = log_frequency + self.batch_size = batch_size + self.optimizer = optimizer + self.model = model + self.loss = loss + self.num_epochs = num_epochs + self.meta_sta_epoch = meta_sta_epoch + self.update_iter = update_iter + self.slices = slices + self.pick_method = pick_method + self.pool_size = pool_size + self.local_rank = local_rank + self.choice_num = choice_num + self.sta_num = sta_num + self.acc_gap = acc_gap + self.flops_dict = flops_dict + self.flops_fixed = flops_fixed + + self.current_student_arch = None + self.current_teacher_arch = None + self.main_proc = (local_rank == 0) + self.current_epoch = 0 + + self.prioritized_board = [] + + # size of prioritized board + def _board_size(self): + return len(self.prioritized_board) + + # select teacher architecture according to the logit difference + def _select_teacher(self): + self._replace_mutator_cand(self.current_student_arch) + + if self.pick_method == 'top1': + meta_value, teacher_cand = 0.5, sorted( + self.prioritized_board, reverse=True)[0][3] + elif self.pick_method == 'meta': + meta_value, cand_idx, teacher_cand = -1000000000, -1, None + for now_idx, item in enumerate(self.prioritized_board): + inputx = item[4] + output = torch.nn.functional.softmax(self.model(inputx), dim=1) + weight = 
self.model.module.forward_meta(output - item[5]) + if weight > meta_value: + meta_value = weight + cand_idx = now_idx + teacher_cand = self.prioritized_board[cand_idx][3] + assert teacher_cand is not None + meta_value = torch.nn.functional.sigmoid(-weight) + else: + raise ValueError('Method Not supported') + + return meta_value, teacher_cand + + # check whether to update prioritized board + def _isUpdateBoard(self, prec1, flops): + if self.current_epoch <= self.meta_sta_epoch: + return False + + if len(self.prioritized_board) < self.pool_size: + return True + + if prec1 > self.prioritized_board[-1][1] + self.acc_gap: + return True + + if prec1 > self.prioritized_board[-1][1] and flops < self.prioritized_board[-1][2]: + return True + + return False + + # update prioritized board + def _update_prioritized_board(self, inputs, teacher_output, outputs, prec1, flops): + if self._isUpdateBoard(prec1, flops): + val_prec1 = prec1 + training_data = deepcopy(inputs[:self.slices].detach()) + if len(self.prioritized_board) == 0: + features = deepcopy(outputs[:self.slices].detach()) + else: + features = deepcopy( + teacher_output[:self.slices].detach()) + self.prioritized_board.append( + (val_prec1, + prec1, + flops, + self.current_teacher_arch, + training_data, + torch.nn.functional.softmax( + features, + dim=1))) + self.prioritized_board = sorted( + self.prioritized_board, reverse=True) + + if len(self.prioritized_board) > self.pool_size: + self.prioritized_board = sorted( + self.prioritized_board, reverse=True) + del self.prioritized_board[-1] + + # only update student network weights + def _update_student_weights_only(self, grad_1): + for weight, grad_item in zip( + self.model.module.rand_parameters(self.current_student_arch), grad_1): + weight.grad = grad_item + torch.nn.utils.clip_grad_norm_( + self.model.module.rand_parameters(self.current_student_arch), 1) + self.optimizer.step() + for weight, grad_item in zip( + 
self.model.module.rand_parameters(self.current_student_arch), grad_1): + del weight.grad + + # only update meta networks weights + def _update_meta_weights_only(self, teacher_cand, grad_teacher): + for weight, grad_item in zip(self.model.module.rand_parameters( + teacher_cand, self.pick_method == 'meta'), grad_teacher): + weight.grad = grad_item + + # clip gradients + torch.nn.utils.clip_grad_norm_( + self.model.module.rand_parameters( + self.current_student_arch, self.pick_method == 'meta'), 1) + + self.optimizer.step() + for weight, grad_item in zip(self.model.module.rand_parameters( + teacher_cand, self.pick_method == 'meta'), grad_teacher): + del weight.grad + + # simulate sgd updating + def _simulate_sgd_update(self, w, g, optimizer): + return g * optimizer.param_groups[-1]['lr'] + w + + # split training images into several slices + def _get_minibatch_input(self, input): + slice = self.slices + x = deepcopy(input[:slice].clone().detach()) + return x + + # calculate 1st gradient of student architectures + def _calculate_1st_gradient(self, kd_loss): + self.optimizer.zero_grad() + grad = torch.autograd.grad( + kd_loss, + self.model.module.rand_parameters(self.current_student_arch), + create_graph=True) + return grad + + # calculate 2nd gradient of meta networks + def _calculate_2nd_gradient(self, validation_loss, teacher_cand, students_weight): + self.optimizer.zero_grad() + grad_student_val = torch.autograd.grad( + validation_loss, + self.model.module.rand_parameters(self.current_student_arch), + retain_graph=True) + + grad_teacher = torch.autograd.grad( + students_weight[0], + self.model.module.rand_parameters( + teacher_cand, + self.pick_method == 'meta'), + grad_outputs=grad_student_val) + return grad_teacher + + # forward training data + def _forward_training(self, x, meta_value): + self._replace_mutator_cand(self.current_student_arch) + output = self.model(x) + + with torch.no_grad(): + self._replace_mutator_cand(self.current_teacher_arch) + teacher_output =
self.model(x) + soft_label = torch.nn.functional.softmax(teacher_output, dim=1) + + kd_loss = meta_value * \ + self._cross_entropy_loss_with_soft_target(output, soft_label) + return kd_loss + + # calculate soft target loss + def _cross_entropy_loss_with_soft_target(self, pred, soft_target): + logsoftmax = torch.nn.LogSoftmax(dim=1) + return torch.mean(torch.sum(- soft_target * logsoftmax(pred), 1)) + + # forward validation data + def _forward_validation(self, input, target): + slice = self.slices + x = input[slice:slice * 2].clone() + + self._replace_mutator_cand(self.current_student_arch) + output_2 = self.model(x) + + validation_loss = self.loss(output_2, target[slice:slice * 2]) + return validation_loss + + def _isUpdateMeta(self, batch_idx): + isUpdate = True + isUpdate &= (self.current_epoch > self.meta_sta_epoch) + isUpdate &= (batch_idx > 0) + isUpdate &= (batch_idx % self.update_iter == 0) + isUpdate &= (self._board_size() > 0) + return isUpdate + + def _replace_mutator_cand(self, cand): + self.mutator._cache = cand + + # update meta matching networks + def _run_update(self, input, target, batch_idx): + if self._isUpdateMeta(batch_idx): + x = self._get_minibatch_input(input) + + meta_value, teacher_cand = self._select_teacher() + + kd_loss = self._forward_training(x, meta_value) + + # calculate 1st gradient + grad_1st = self._calculate_1st_gradient(kd_loss) + + # simulate updated student weights + students_weight = [ + self._simulate_sgd_update( + p, grad_item, self.optimizer) for p, grad_item in zip( + self.model.module.rand_parameters(self.current_student_arch), grad_1st)] + + # update student weights + self._update_student_weights_only(grad_1st) + + validation_loss = self._forward_validation(input, target) + + # calculate 2nd gradient + grad_teacher = self._calculate_2nd_gradient(validation_loss, teacher_cand, students_weight) + + # update meta matching networks + self._update_meta_weights_only(teacher_cand, grad_teacher) + + # delete internal variables + del
grad_teacher, grad_1st, x, validation_loss, kd_loss, students_weight + + def _get_cand_flops(self, cand): + flops = 0 + for block_id, block in enumerate(cand): + if block == 'LayerChoice1' or block_id == 'LayerChoice23': + continue + for idx, choice in enumerate(cand[block]): + flops += self.flops_dict[block_id][idx] * (1 if choice else 0) + return flops + self.flops_fixed + + def train_one_epoch(self, epoch): + self.current_epoch = epoch + meters = AverageMeterGroup() + self.steps_per_epoch = len(self.train_loader) + for step, (input_data, target) in enumerate(self.train_loader): + self.mutator.reset() + self.current_student_arch = self.mutator._cache + + input_data, target = input_data.cuda(), target.cuda() + + # calculate flops of current architecture + cand_flops = self._get_cand_flops(self.mutator._cache) + + # update meta matching network + self._run_update(input_data, target, step) + + if self._board_size() > 0: + # select teacher architecture + meta_value, teacher_cand = self._select_teacher() + self.current_teacher_arch = teacher_cand + + # forward supernet + if self._board_size() == 0 or epoch <= self.meta_sta_epoch: + self._replace_mutator_cand(self.current_student_arch) + output = self.model(input_data) + + loss = self.loss(output, target) + kd_loss, teacher_output, teacher_cand = None, None, None + else: + self._replace_mutator_cand(self.current_student_arch) + output = self.model(input_data) + + gt_loss = self.loss(output, target) + + with torch.no_grad(): + self._replace_mutator_cand(self.current_teacher_arch) + teacher_output = self.model(input_data).detach() + + soft_label = torch.nn.functional.softmax(teacher_output, dim=1) + kd_loss = self._cross_entropy_loss_with_soft_target(output, soft_label) + + loss = (meta_value * kd_loss + (2 - meta_value) * gt_loss) / 2 + + # update network + self.optimizer.zero_grad() + loss.backward() + self.optimizer.step() + + # update metrics + prec1, prec5 = accuracy(output, target, topk=(1, 5)) + metrics = 
{"prec1": prec1, "prec5": prec5, "loss": loss} + metrics = reduce_metrics(metrics) + meters.update(metrics) + + # update prioritized board + self._update_prioritized_board(input_data, teacher_output, output, metrics['prec1'], cand_flops) + + if self.main_proc and (step % self.log_frequency == 0 or step + 1 == self.steps_per_epoch): + logger.info("Epoch [%d/%d] Step [%d/%d] %s", epoch + 1, self.num_epochs, + step + 1, len(self.train_loader), meters) + + if self.main_proc and self.num_epochs == epoch + 1: + for idx, i in enumerate(self.best_children_pool): + logger.info("No.%s %s", idx, i[:4]) + + def validate_one_epoch(self, epoch): + self.model.eval() + meters = AverageMeterGroup() + with torch.no_grad(): + for step, (x, y) in enumerate(self.valid_loader): + self.mutator.reset() + logits = self.model(x) + loss = self.val_loss(logits, y) + prec1, prec5 = self.accuracy(logits, y, topk=(1, 5)) + metrics = {"prec1": prec1, "prec5": prec5, "loss": loss} + metrics = self.reduce_metrics(metrics, self.distributed) + meters.update(metrics) + + if self.log_frequency is not None and step % self.log_frequency == 0: + logger.info("Epoch [%s/%s] Validation Step [%s/%s] %s", epoch + 1, + self.num_epochs, step + 1, len(self.valid_loader), meters) diff --git a/nni/algorithms/nas/pytorch/cream/utils.py b/nni/algorithms/nas/pytorch/cream/utils.py new file mode 100644 index 0000000000..e0542b2f3e --- /dev/null +++ b/nni/algorithms/nas/pytorch/cream/utils.py @@ -0,0 +1,37 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT license. 
+ + +import os +import torch.distributed as dist + + +def accuracy(output, target, topk=(1,)): + """ Computes the precision@k for the specified values of k """ + maxk = max(topk) + batch_size = target.size(0) + + _, pred = output.topk(maxk, 1, True, True) + pred = pred.t() + # one-hot case + if target.ndimension() > 1: + target = target.max(1)[1] + + correct = pred.eq(target.view(1, -1).expand_as(pred)) + + res = [] + for k in topk: + correct_k = correct[:k].view(-1).float().sum(0) + res.append(correct_k.mul_(1.0 / batch_size)) + return res + + +def reduce_metrics(metrics): + return {k: reduce_tensor(v).item() for k, v in metrics.items()} + + +def reduce_tensor(tensor): + rt = tensor.clone() + dist.all_reduce(rt, op=dist.ReduceOp.SUM) + rt /= float(os.environ["WORLD_SIZE"]) + return rt diff --git a/nni/common/graph_utils.py b/nni/common/graph_utils.py index 0f18a91c8a..9757f1c533 100644 --- a/nni/common/graph_utils.py +++ b/nni/common/graph_utils.py @@ -15,6 +15,7 @@ LIST_UNPACK_KIND = 'prim::ListUnpack' TUPLE_CONSTRUCT_KIND = 'prim::TupleConstruct' TUPLE_UNPACK_KIND = 'prim::TupleUnpack' +CONSTANT_KIND = 'prim::Constant' _logger = logging.getLogger(__name__) @@ -68,9 +69,11 @@ def __init__(self, model=None, dummy_input=None, traced_model=None): 'Please provide model & dummy_input or the traced_model as inputs') def _trace(self, model, dummy_input): - with torch.onnx.set_training(model, False): - self.trace = torch.jit.trace(model, dummy_input) - torch._C._jit_pass_inline(self.trace.graph) + training = model.training + model.eval() + self.trace = torch.jit.trace(model, dummy_input) + torch._C._jit_pass_inline(self.trace.graph) + model.train(training) class TorchProtoGraph(TorchGraph): @@ -282,27 +285,35 @@ def _expand_key_func_node(self, node, nodes, input_to_node, output_to_node, self.global_count += 1 op_type = node.kind() node_group = [node] - inputs = list() - outputs = list() + inputs = set() + outputs = set() node_queue = queue.Queue() node_queue.put(node) 
while not node_queue.empty(): curr_node = node_queue.get() for _input in curr_node.inputs(): + if _input.node().kind() == CONSTANT_KIND: + continue input_name = _input.debugName() - if input_name in output_to_node and output_to_node[input_name] in nodes: - predecessor_node = output_to_node[input_name] - if not self._is_key_func(predecessor_node): - node_group.append(predecessor_node) - node_queue.put(predecessor_node) - else: - inputs.append(input_name) + if input_name in output_to_node: + for predecessor_node in output_to_node[input_name]: + if predecessor_node in nodes: + if not self._is_key_func(predecessor_node): + if predecessor_node not in node_group: + node_group.append(predecessor_node) + node_queue.put(predecessor_node) + else: + inputs.add(input_name) + else: + inputs.add(input_name) else: - inputs.append(input_name) + inputs.add(input_name) for output in node.outputs(): - outputs.append(output.debugName()) + if output.node().kind() == CONSTANT_KIND: + continue + outputs.add(output.debugName()) nodepy = NodePyGroup(node_name, unique_name, module_type, op_type, - node_group, inputs=inputs, outputs=outputs, key_node=node) + node_group, inputs=list(inputs), outputs=list(outputs), key_node=node) return nodepy def _expand_module_node(self, node, node_name, unique_name, op_type, nodes, @@ -342,36 +353,46 @@ def _expand_module_node(self, node, node_name, unique_name, op_type, nodes, if not op_type: op_type = node.kind() node_group = [node] - inputs = list() - outputs = list() + inputs = set() + outputs = set() node_queue = queue.Queue() node_queue.put(node) visited = {node} while not node_queue.empty(): curr_node = node_queue.get() for _input in curr_node.inputs(): + if _input.node().kind() == CONSTANT_KIND: + continue input_name = _input.debugName() - if input_name in output_to_node and output_to_node[input_name] in nodes: - predecessor_node = output_to_node[input_name] - if predecessor_node not in visited: - node_group.append(predecessor_node) - 
node_queue.put(predecessor_node) - visited.add(predecessor_node) + if input_name in output_to_node: + for predecessor_node in output_to_node[input_name]: + if predecessor_node in nodes: + if predecessor_node not in visited: + node_group.append(predecessor_node) + node_queue.put(predecessor_node) + visited.add(predecessor_node) + else: + inputs.add(input_name) else: - inputs.append(input_name) + inputs.add(input_name) for _output in curr_node.outputs(): + if _output.node().kind() == CONSTANT_KIND: + continue output_name = _output.debugName() - if output_name in input_to_node and input_to_node[output_name] in nodes: - successor_node = input_to_node[output_name] - if successor_node not in visited: - node_group.append(successor_node) - node_queue.put(successor_node) - visited.add(successor_node) + if output_name in input_to_node: + for successor_node in input_to_node[output_name]: + if successor_node in nodes: + if successor_node not in visited: + node_group.append(successor_node) + node_queue.put(successor_node) + visited.add(successor_node) + else: + outputs.add(output_name) else: - outputs.append(output_name) + outputs.add(output_name) nodepy = NodePyGroup(node_name, unique_name, module_type, op_type, - node_group, inputs=inputs, outputs=outputs) + node_group, inputs=list(inputs), outputs=list(outputs)) return nodepy def _extract_cat_info(self, node_group, cpp_node): @@ -544,7 +565,7 @@ def _build_index(self, nodes_op): input_to_node[_input].append(node) for output in node.outputs: assert not output in output_to_node, \ - "One output cannot be generated by multiple nodes" + "One output cannot be generated by multiple nodes %s" % output output_to_node[output] = node return name_to_node, input_to_node, output_to_node @@ -642,12 +663,22 @@ def _build_graph(self): omit_useless_nodes = True graph = self.trace.graph _logger.debug(graph) - # build output mapping, from output debugName to its node - output_to_node = {x.debugName(): n for n in graph.nodes() - for x in 
n.outputs()} - # build input mapping, from input debugName to its node - input_to_node = {x.debugName(): n for n in graph.nodes() - for x in n.inputs()} + # build input/output mapping, from input/output debugName to its node + input_to_node = defaultdict(list) + output_to_node = defaultdict(list) + for node in graph.nodes(): + if node.kind() == CONSTANT_KIND: + continue + for x in node.outputs(): + if x.node().kind() == CONSTANT_KIND: + continue + output_to_node[x.debugName()].append(node) + assert len(output_to_node[x.debugName()]) <= 1, "One output cannot be generated by multiple nodes %s" % x.debugName() + for x in node.inputs(): + if x.node().kind() == CONSTANT_KIND: + continue + input_to_node[x.debugName()].append(node) + # build module mapping, from module name to all nodes (as list) under this module scope module_to_nodes = defaultdict(list) # the mapping of function (non-module in forward) to nodes, key is scope name @@ -668,6 +699,8 @@ def _build_graph(self): # associate module name with their trace graph nodes for node in graph.nodes(): + if node.kind() == CONSTANT_KIND: + continue module_name = self._get_module_name(node.scopeName()) if module_name in self.leaf_modules: module_to_nodes[module_name].append(node) diff --git a/nni/compression/pytorch/utils/counter.py b/nni/compression/pytorch/utils/counter.py index 6061e8a8d0..ff7c1aa006 100644 --- a/nni/compression/pytorch/utils/counter.py +++ b/nni/compression/pytorch/utils/counter.py @@ -1,135 +1,296 @@ # Copyright (c) Microsoft Corporation. # Licensed under the MIT license. 
+import functools +from collections import Counter +from prettytable import PrettyTable + import torch import torch.nn as nn from nni.compression.pytorch.compressor import PrunerModuleWrapper -try: - from thop import profile -except Exception as e: - print('thop is not found, please install the python package: thop') - raise +__all__ = ['count_flops_params'] -def count_flops_params(model: nn.Module, input_size, custom_ops=None, verbose=True): - """ - Count FLOPs and Params of the given model. - This function would identify the mask on the module - and take the pruned shape into consideration. - Note that, for sturctured pruning, we only identify - the remained filters according to its mask, which - not taking the pruned input channels into consideration, - so the calculated FLOPs will be larger than real number. - Parameters - --------- - model : nn.Module - target model. - input_size: list, tuple - the input shape of data - custom_ops: dict - a mapping of (module: custom operation) - the custom operation will overwrite the default operation. - for reference, please see ``custom_mask_ops``. +def _get_params(m): + return sum([p.numel() for p in m.parameters()]) - Returns - ------- - flops: float - total flops of the model - params: - total params of the model - """ - assert input_size is not None +class ModelProfiler: - device = next(model.parameters()).device - inputs = torch.randn(input_size).to(device) + def __init__(self, custom_ops=None, mode='default'): + """ + ModelProfiler is used to share state to hooks. 
- hook_module_list = [] - if custom_ops is None: - custom_ops = {} - custom_mask_ops.update(custom_ops) - prev_m = None - for m in model.modules(): - weight_mask = None - m_type = type(m) - if m_type in custom_mask_ops: - if isinstance(prev_m, PrunerModuleWrapper): - weight_mask = prev_m.weight_mask - - m.register_buffer('weight_mask', weight_mask) - hook_module_list.append(m) - prev_m = m + Parameters + ---------- + custom_ops: dict + a mapping of (module -> torch.nn.Module : custom operation) + the custom operation is a callback function to calculate + the module flops, parameters and the weight shape, it will overwrite the default operation. + for reference, please see ``self.ops``. + mode: + the mode of how to collect information. If the mode is set to `default`, + only the information of convolution and linear will be collected. + If the mode is set to `full`, other operations will also be collected. + """ + self.ops = { + nn.Conv1d: self._count_convNd, + nn.Conv2d: self._count_convNd, + nn.Conv3d: self._count_convNd, + nn.Linear: self._count_linear + } + self._count_bias = False + if mode == 'full': + self.ops.update({ + nn.ConvTranspose1d: self._count_convNd, + nn.ConvTranspose2d: self._count_convNd, + nn.ConvTranspose3d: self._count_convNd, + nn.BatchNorm1d: self._count_bn, + nn.BatchNorm2d: self._count_bn, + nn.BatchNorm3d: self._count_bn, + nn.LeakyReLU: self._count_relu, + nn.AvgPool1d: self._count_avgpool, + nn.AvgPool2d: self._count_avgpool, + nn.AvgPool3d: self._count_avgpool, + nn.AdaptiveAvgPool1d: self._count_adap_avgpool, + nn.AdaptiveAvgPool2d: self._count_adap_avgpool, + nn.AdaptiveAvgPool3d: self._count_adap_avgpool, + nn.Upsample: self._count_upsample, + nn.UpsamplingBilinear2d: self._count_upsample, + nn.UpsamplingNearest2d: self._count_upsample + }) + self._count_bias = True - flops, params = profile(model, inputs=(inputs, ), custom_ops=custom_mask_ops, verbose=verbose) + if custom_ops is not None: + self.ops.update(custom_ops) + self.mode = 
mode + self.results = [] - for m in hook_module_list: - m._buffers.pop("weight_mask") - # Remove registerd buffer on the model, and fixed following issue: - # https://github.com/Lyken17/pytorch-OpCounter/issues/96 - for m in model.modules(): - if 'total_ops' in m._buffers: - m._buffers.pop("total_ops") - if 'total_params' in m._buffers: - m._buffers.pop("total_params") + def _push_result(self, result): + self.results.append(result) - return flops, params + def _get_result(self, m, flops): + # assume weight is called `weight`, otherwise it's not applicable + # if the user customizes the operation, the callback function should + # return the dict result, including calculated flops, params and weight_shape. -def count_convNd_mask(m, x, y): - """ - The forward hook to count FLOPs and Parameters of convolution operation. - Parameters - ---------- - m : torch.nn.Module - convolution module to calculate the FLOPs and Parameters - x : torch.Tensor - input data - y : torch.Tensor - output data - """ - output_channel = y.size()[1] - output_size = torch.zeros(y.size()[2:]).numel() - kernel_size = torch.zeros(m.weight.size()[2:]).numel() + result = { + 'flops': flops, + 'params': _get_params(m), + 'weight_shape': tuple(m.weight.size()) if hasattr(m, 'weight') else 0, + } + return result + + def _count_convNd(self, m, x, y): + cin = m.in_channels + kernel_ops = torch.zeros(m.weight.size()[2:]).numel() # works for 1d, 2d and 3d kernels + output_size = torch.zeros(y.size()[2:]).numel() + cout = y.size()[1] + + if hasattr(m, 'weight_mask'): + cout = m.weight_mask.sum() // (cin * kernel_ops) + + total_ops = cout * output_size * kernel_ops * cin // m.groups # cout x oW x oH + + if self._count_bias: + bias_flops = 1 if m.bias is not None else 0 + total_ops += cout * output_size * bias_flops + + return self._get_result(m, total_ops) + + def _count_linear(self, m, x, y): + out_features = m.out_features + if hasattr(m, 'weight_mask'): + out_features = m.weight_mask.sum() // m.in_features + total_ops = out_features * 
m.in_features + + if self._count_bias: + bias_flops = 1 if m.bias is not None else 0 + total_ops += out_features * bias_flops + + return self._get_result(m, total_ops) + + def _count_bn(self, m, x, y): + total_ops = 2 * x[0].numel() + return self._get_result(m, total_ops) + + def _count_relu(self, m, x, y): + total_ops = x[0].numel() + return self._get_result(m, total_ops) - bias_flops = 1 if m.bias is not None else 0 + def _count_avgpool(self, m, x, y): + total_ops = y.numel() + return self._get_result(m, total_ops) - if m.weight_mask is not None: - output_channel = m.weight_mask.sum() // (m.in_channels * kernel_size) + def _count_adap_avgpool(self, m, x, y): + kernel = torch.Tensor([*(x[0].shape[2:])]) // torch.Tensor(list((m.output_size,))).squeeze() + total_add = int(torch.prod(kernel)) + total_div = 1 + kernel_ops = total_add + total_div + num_elements = y.numel() + total_ops = kernel_ops * num_elements - total_ops = output_channel * output_size * (m.in_channels // m.groups * kernel_size + bias_flops) + return self._get_result(m, total_ops) - m.total_ops += torch.DoubleTensor([int(total_ops)]) + def _count_upsample(self, m, x, y): + if m.mode == 'linear': + total_ops = y.nelement() * 5 # 2 muls + 3 add + elif m.mode == 'bilinear': + # https://en.wikipedia.org/wiki/Bilinear_interpolation + total_ops = y.nelement() * 11 # 6 muls + 5 adds + elif m.mode == 'bicubic': + # https://en.wikipedia.org/wiki/Bicubic_interpolation + # Product matrix [4x4] x [4x4] x [4x4] + ops_solve_A = 224 # 128 muls + 96 adds + ops_solve_p = 35 # 16 muls + 12 adds + 4 muls + 3 adds + total_ops = y.nelement() * (ops_solve_A + ops_solve_p) + elif m.mode == 'trilinear': + # https://en.wikipedia.org/wiki/Trilinear_interpolation + # can viewed as 2 bilinear + 1 linear + total_ops = y.nelement() * (13 * 2 + 5) + else: + total_ops = 0 + return self._get_result(m, total_ops) -def count_linear_mask(m, x, y): + def count_module(self, m, x, y, name): + # assume x is tuple of single tensor + result 
= self.ops[type(m)](m, x, y) + total_result = { + 'name': name, + 'input_size': tuple(x[0].size()), + 'output_size': tuple(y.size()), + 'module_type': type(m).__name__, + **result + } + + self._push_result(total_result) + + def sum_flops(self): + return sum([s['flops'] for s in self.results]) + + def sum_params(self): + return sum({s['name']: s['params'] for s in self.results}.values()) + + def format_results(self): + table = PrettyTable() + name_counter = Counter([s['name'] for s in self.results]) + has_multi_use = any(map(lambda v: v > 1, name_counter.values())) + name_counter = Counter() # clear the counter to count from 0 + + headers = [ + 'Index', + 'Name', + 'Type', + 'Weight Shape', + 'FLOPs', + '#Params', + ] + if has_multi_use: + headers.append('#Call') + + table.field_names = headers + for i, result in enumerate(self.results): + row_values = [ + i, + result['name'], + result['module_type'], + str(result['weight_shape']), + result['flops'], + result['params'], + ] + name_counter[result['name']] += 1 + if has_multi_use: + row_values.append(name_counter[result['name']]) + table.add_row(row_values) + return table + + +def count_flops_params(model, x, custom_ops=None, verbose=True, mode='default'): """ - The forward hook to count FLOPs and Parameters of linear transformation. + Count FLOPs and Params of the given model. This function would + identify the mask on the module and take the pruned shape into consideration. + Note that, for structured pruning, we only identify the remaining filters + according to their masks, and do not take the pruned input channels into consideration, + so the calculated FLOPs will be larger than the real number. + Parameters - ---------- - m : torch.nn.Module - linear to calculate the FLOPs and Parameters - x : torch.Tensor - input data - y : torch.Tensor - output data + --------- + model : nn.Module + Target model. + x : tuple or tensor + The input shape of data (a tuple), a tensor or a tuple of tensors as input data. 
+ custom_ops : dict + A mapping of (module -> torch.nn.Module : custom operation) + the custom operation is a callback function to calculate + the module flops and parameters, it will overwrite the default operation. + for reference, please see ``ops`` in ``ModelProfiler``. + verbose : bool + If False, mute detailed information about modules. Default is True. + mode : str + the mode of how to collect information. If the mode is set to ``default``, + only the information of convolution and linear will be collected. + If the mode is set to ``full``, other operations will also be collected. + + Returns + ------- + tuple of int, int and list + Representing total FLOPs, total parameters, and a detailed list of results respectively. + The list of results is a list of dicts, each of which contains (name, module_type, weight_shape, + flops, params, input_size, output_size) as its keys. """ - output_channel = y.numel() - bias_flops = 1 if m.bias is not None else 0 + assert isinstance(x, tuple) or isinstance(x, torch.Tensor) + assert mode in ['default', 'full'] + + original_device = next(model.parameters()).device + training = model.training + + if isinstance(x, tuple) and all(isinstance(t, int) for t in x): + x = (torch.zeros(x).to(original_device), ) + elif torch.is_tensor(x): + x = (x.to(original_device), ) + else: + x = (t.to(original_device) for t in x) + + handler_collection = [] + profiler = ModelProfiler(custom_ops, mode) + + prev_m = None + for name, m in model.named_modules(): + # dealing with weight mask here + if isinstance(prev_m, PrunerModuleWrapper): + # weight mask is set to weight mask of its parent (wrapper) + weight_mask = prev_m.weight_mask + m.weight_mask = weight_mask + prev_m = m + + if type(m) in profiler.ops: + # if a leaf node + _handler = m.register_forward_hook(functools.partial(profiler.count_module, name=name)) + handler_collection.append(_handler) + + model.eval() - if m.weight_mask is not None: - output_channel = m.weight_mask.sum() // 
m.in_features + with torch.no_grad(): + model(*x) - total_ops = output_channel * (m.in_features + bias_flops) + # restore origin status + for name, m in model.named_modules(): + if hasattr(m, 'weight_mask'): + delattr(m, 'weight_mask') - m.total_ops += torch.DoubleTensor([int(total_ops)]) + model.train(training).to(original_device) + for handler in handler_collection: + handler.remove() + if verbose: + # get detail information + print(profiler.format_results()) + print(f'FLOPs total: {profiler.sum_flops()}') + print(f'#Params total: {profiler.sum_params()}') -custom_mask_ops = { - nn.Conv1d: count_convNd_mask, - nn.Conv2d: count_convNd_mask, - nn.Conv3d: count_convNd_mask, - nn.Linear: count_linear_mask, -} + return profiler.sum_flops(), profiler.sum_params(), profiler.results \ No newline at end of file diff --git a/nni/compression/pytorch/utils/mask_conflict.py b/nni/compression/pytorch/utils/mask_conflict.py index 723f66b8f2..1f8d099d26 100644 --- a/nni/compression/pytorch/utils/mask_conflict.py +++ b/nni/compression/pytorch/utils/mask_conflict.py @@ -36,9 +36,11 @@ def fix_mask_conflict(masks, model=None, dummy_input=None, traced=None): # this traced model. 
if traced is None: assert model is not None and dummy_input is not None - with torch.onnx.set_training(model, False): - # We need to trace the model in this way, else it will have problems - traced = torch.jit.trace(model, dummy_input) + training = model.training + model.eval() + # We need to trace the model in eval mode + traced = torch.jit.trace(model, dummy_input) + model.train(training) fix_group_mask = GroupMaskConflict(masks, model, dummy_input, traced) masks = fix_group_mask.fix_mask() diff --git a/nni/runtime/msg_dispatcher.py b/nni/runtime/msg_dispatcher.py index 4f24294fc7..20b9597f9e 100644 --- a/nni/runtime/msg_dispatcher.py +++ b/nni/runtime/msg_dispatcher.py @@ -39,7 +39,7 @@ def _sort_history(history): # Tuner global variables _next_parameter_id = 0 _trial_params = {} -'''key: trial job ID; value: parameters''' +'''key: parameter ID; value: parameters''' _customized_parameter_ids = set() @@ -114,7 +114,7 @@ def handle_import_data(self, data): data: a list of dictionaries, each of which has at least two keys, 'parameter' and 'value' """ for entry in data: - entry['value'] = entry['value'] if type(entry['value']) is str else json_tricks.dumps(entry['value']) + entry['value'] = entry['value'] if type(entry['value']) is str else json_tricks.dumps(entry['value']) entry['value'] = json_tricks.loads(entry['value']) self.tuner.import_data(data) @@ -182,8 +182,11 @@ def _handle_final_metric_data(self, data): customized = True else: customized = False + if id_ in _trial_params: self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized, trial_job_id=data.get('trial_job_id')) + else: + _logger.warning('Find unknown job parameter id %s, maybe something goes wrong.', id_) def _handle_intermediate_metric_data(self, data): """Call assessor to process intermediate results diff --git a/nni/runtime/platform/__init__.py b/nni/runtime/platform/__init__.py index 84f04a9862..2c699e1598 100644 --- a/nni/runtime/platform/__init__.py 
+++ b/nni/runtime/platform/__init__.py @@ -9,7 +9,7 @@ from .standalone import * elif trial_env_vars.NNI_PLATFORM == 'unittest': from .test import * -elif trial_env_vars.NNI_PLATFORM in ('local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'): +elif trial_env_vars.NNI_PLATFORM in ('adl', 'local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'): from .local import * else: raise RuntimeError('Unknown platform %s' % trial_env_vars.NNI_PLATFORM) diff --git a/nni/tools/nnictl/command_utils.py b/nni/tools/nnictl/command_utils.py index 52650b3bae..2bbcc883d1 100644 --- a/nni/tools/nnictl/command_utils.py +++ b/nni/tools/nnictl/command_utils.py @@ -34,7 +34,8 @@ def check_output_command(file_path, head=None, tail=None): def kill_command(pid): """kill command""" if sys.platform == 'win32': - psutil.Process(pid).terminate() + process = psutil.Process(pid=pid) + process.send_signal(signal.CTRL_BREAK_EVENT) else: cmds = ['kill', str(pid)] call(cmds) diff --git a/nni/tools/nnictl/config_schema.py b/nni/tools/nnictl/config_schema.py index d320163595..702d290eac 100644 --- a/nni/tools/nnictl/config_schema.py +++ b/nni/tools/nnictl/config_schema.py @@ -124,7 +124,7 @@ def validate(self, data): Optional('maxExecDuration'): And(Regex(r'^[1-9][0-9]*[s|m|h|d]$', error='ERROR: maxExecDuration format is [digit]{s,m,h,d}')), Optional('maxTrialNum'): setNumberRange('maxTrialNum', int, 1, 99999), 'trainingServicePlatform': setChoice( - 'trainingServicePlatform', 'remote', 'local', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'), + 'trainingServicePlatform', 'adl', 'remote', 'local', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'), Optional('searchSpacePath'): And(os.path.exists, error=SCHEMA_PATH_ERROR % 'searchSpacePath'), Optional('multiPhase'): setType('multiPhase', bool), Optional('multiThread'): setType('multiThread', bool), @@ -262,6 +262,30 @@ def validate(self, data): } } 
+adl_trial_schema = { + 'trial':{ + 'codeDir': setType('codeDir', str), + 'command': setType('command', str), + 'gpuNum': setNumberRange('gpuNum', int, 0, 99999), + 'image': setType('image', str), + Optional('imagePullSecrets'): [{ + 'name': setType('name', str) + }], + Optional('nfs'): { + 'server': setType('server', str), + 'path': setType('path', str), + 'containerMountPath': setType('containerMountPath', str) + }, + Optional('adaptive'): setType('adaptive', bool), + Optional('checkpoint'): { + 'storageClass': setType('storageClass', str), + 'storageSize': setType('storageSize', str) + }, + Optional('cpuNum'): setNumberRange('cpuNum', int, 0, 99999), + Optional('memorySize'): setType('memorySize', str) + } +} + kubeflow_trial_schema = { 'trial': { 'codeDir': setPathCheck('codeDir'), @@ -404,6 +428,7 @@ def validate(self, data): } training_service_schema_dict = { + 'adl': Schema({**common_schema, **adl_trial_schema}), 'local': Schema({**common_schema, **common_trial_schema}), 'remote': Schema({**common_schema, **common_trial_schema, **machine_list_schema, **remote_config_schema}), 'pai': Schema({**common_schema, **pai_trial_schema, **pai_config_schema}), diff --git a/nni/tools/nnictl/constants.py b/nni/tools/nnictl/constants.py index 0654473ed4..f64f93b289 100644 --- a/nni/tools/nnictl/constants.py +++ b/nni/tools/nnictl/constants.py @@ -64,21 +64,21 @@ INSTALLABLE_PACKAGE_META = { 'SMAC': { 'type': 'tuner', - 'class_name': 'nni.smac_tuner.smac_tuner.SMACTuner', + 'class_name': 'nni.algorithms.hpo.smac_tuner.smac_tuner.SMACTuner', 'code_sub_dir': 'smac_tuner', - 'class_args_validator': 'nni.smac_tuner.smac_tuner.SMACClassArgsValidator' + 'class_args_validator': 'nni.algorithms.hpo.smac_tuner.smac_tuner.SMACClassArgsValidator' }, 'BOHB': { 'type': 'advisor', - 'class_name': 'nni.bohb_advisor.bohb_advisor.BOHB', + 'class_name': 'nni.algorithms.hpo.bohb_advisor.bohb_advisor.BOHB', 'code_sub_dir': 'bohb_advisor', - 'class_args_validator': 
'nni.bohb_advisor.bohb_advisor.BOHBClassArgsValidator' + 'class_args_validator': 'nni.algorithms.hpo.bohb_advisor.bohb_advisor.BOHBClassArgsValidator' }, 'PPOTuner': { 'type': 'tuner', - 'class_name': 'nni.ppo_tuner.ppo_tuner.PPOTuner', + 'class_name': 'nni.algorithms.hpo.ppo_tuner.ppo_tuner.PPOTuner', 'code_sub_dir': 'ppo_tuner', - 'class_args_validator': 'nni.ppo_tuner.ppo_tuner.PPOClassArgsValidator' + 'class_args_validator': 'nni.algorithms.hpo.ppo_tuner.ppo_tuner.PPOClassArgsValidator' } } diff --git a/nni/tools/nnictl/launcher.py b/nni/tools/nnictl/launcher.py index 576954e335..6a199ce88a 100644 --- a/nni/tools/nnictl/launcher.py +++ b/nni/tools/nnictl/launcher.py @@ -136,6 +136,14 @@ def set_local_config(experiment_config, port, config_file_name): return set_trial_config(experiment_config, port, config_file_name), None +def set_adl_config(experiment_config, port, config_file_name): + '''set adl configuration''' + result, message = setNNIManagerIp(experiment_config, port, config_file_name) + if not result: + return result, message + #set trial_config + return set_trial_config(experiment_config, port, config_file_name), None + def set_remote_config(experiment_config, port, config_file_name): '''Call setClusterMetadata to pass trial''' #set machine_list @@ -393,7 +401,9 @@ def set_platform_config(platform, experiment_config, port, config_file_name, res '''call set_cluster_metadata for specific platform''' print_normal('Setting {0} config...'.format(platform)) config_result, err_msg = None, None - if platform == 'local': + if platform == 'adl': + config_result, err_msg = set_adl_config(experiment_config, port, config_file_name) + elif platform == 'local': config_result, err_msg = set_local_config(experiment_config, port, config_file_name) elif platform == 'remote': config_result, err_msg = set_remote_config(experiment_config, port, config_file_name) diff --git a/nni/tools/nnictl/nnictl_utils.py b/nni/tools/nnictl/nnictl_utils.py index 7819e83a94..f2e487288b 
100644 --- a/nni/tools/nnictl/nnictl_utils.py +++ b/nni/tools/nnictl/nnictl_utils.py @@ -10,6 +10,7 @@ import shutil import subprocess from functools import cmp_to_key +import traceback from datetime import datetime, timezone from subprocess import Popen from pyhdfs import HdfsClient @@ -21,6 +22,7 @@ from .constants import NNICTL_HOME_DIR, NNI_HOME_DIR, EXPERIMENT_INFORMATION_FORMAT, EXPERIMENT_DETAIL_FORMAT, \ EXPERIMENT_MONITOR_INFO, TRIAL_MONITOR_HEAD, TRIAL_MONITOR_CONTENT, TRIAL_MONITOR_TAIL, REST_TIME_OUT from .common_utils import print_normal, print_error, print_warning, detect_process, get_yml_content, generate_temp_dir +from .common_utils import print_green from .command_utils import check_output_command, kill_command from .ssh_utils import create_ssh_sftp_client, remove_remote_directory @@ -372,6 +374,40 @@ def log_stderr(args): '''get stderr log''' log_internal(args, 'stderr') +def log_trial_adl_helper(args, experiment_id): + # adljob_id format should be consistent to the one in "adlTrainingService.ts": + # const adlJobName: string = `nni-exp-${this.experimentId}-trial-${trialJobId}`.toLowerCase(); + adlJobName = "nni-exp-{}-trial-{}".format(experiment_id, args.trial_id).lower() + print_warning('Note that no log will show when trial is pending or done (succeeded or failed). ' + 'You can retry the command.') + print_green('>>> Trial log streaming:') + try: + subprocess.run( + [ + "kubectl", "logs", + "-l", "adaptdl/job=%s" % adlJobName, + "-f" # Follow the stream + ], # TODO: support remaining argument, uncomment the lines in nnictl.py + ) # TODO: emulate tee behaviors, not necessary tho. + except KeyboardInterrupt: + pass + except Exception: + print_error('Error! 
Please check kubectl:') + traceback.print_exc() + exit(1) + finally: + print_green('<<< [adlJobName:%s]' % adlJobName) + nni_manager_collection_path = os.path.expanduser('~/nni-experiments/%s/trials/%s/stdout_log_collection.log' % + (experiment_id, args.trial_id)) + print_green('>>> (Optional) How to persist the complete trial log locally:') + print( + 'Please ensure `logCollection: http` ' + 'exists in the experiment configuration yaml. ' + 'After trial done, you can check it from the file below: \n %s' + % nni_manager_collection_path + ) + + def log_trial(args): ''''get trial log path''' trial_id_path_dict = {} @@ -394,10 +430,18 @@ def log_trial(args): else: print_error('Restful server is not running...') exit(1) + is_adl = nni_config.get_config('experimentConfig').get('trainingServicePlatform') == 'adl' + if is_adl and not args.trial_id: + print_error('Trial ID is required to retrieve the log for adl. Please specify it with "--trial_id".') + exit(1) if args.trial_id: if args.trial_id not in trial_id_list: print_error('Trial id {0} not correct, please check your command!'.format(args.trial_id)) exit(1) + if is_adl: + log_trial_adl_helper(args, nni_config.get_config('experimentId')) + # adl has its own way to log trial, and it thus returns right after the helper returns + return if trial_id_path_dict.get(args.trial_id): print_normal('id:' + args.trial_id + ' path:' + trial_id_path_dict[args.trial_id]) else: diff --git a/nni/tools/nnictl/tensorboard_utils.py b/nni/tools/nnictl/tensorboard_utils.py index c314f9e219..9bc8e14e48 100644 --- a/nni/tools/nnictl/tensorboard_utils.py +++ b/nni/tools/nnictl/tensorboard_utils.py @@ -10,7 +10,7 @@ from .config_utils import Config, Experiments from .url_utils import trial_jobs_url, get_local_urls from .constants import REST_TIME_OUT -from .common_utils import print_normal, print_error, print_green, detect_process, detect_port, check_tensorboard_version +from .common_utils import print_normal, print_warning, print_error, 
print_green, detect_process, detect_port, check_tensorboard_version from .nnictl_utils import check_experiment_id, check_experiment_id from .ssh_utils import create_ssh_sftp_client, copy_remote_directory_to_local @@ -110,14 +110,36 @@ def stop_tensorboard(args): else: print_error('No tensorboard configuration!') +def adl_tensorboard_helper(args): + '''start tensorboard on adl''' + import subprocess + if args.trial_id is not None: + print_warning('Tensorboard on adl platform will show all trials. No trial ids needed.') + cmd = "kubectl port-forward --address 0.0.0.0 deployment/{} {}:{}".format( + "adaptdl-tensorboard" + "-" + args.id.lower(), + args.port, + 6006 + ) + print_green('Tensorboard is accessible at 0.0.0.0:{port} or localhost:{port}'.format(port=args.port)) + subprocess.run(args=cmd, shell=True) def start_tensorboard(args): '''start tensorboard''' experiment_id = check_experiment_id(args) + if not experiment_id: + return + if args.id is None: + args.id = experiment_id experiment_config = Experiments() experiment_dict = experiment_config.get_all_experiments() + if experiment_dict[args.id]["status"] == "STOPPED": + print_error("Experiment {} is stopped...".format(args.id)) + return config_file_name = experiment_dict[experiment_id]['fileName'] nni_config = Config(config_file_name) + if nni_config.get_config('experimentConfig').get('trainingServicePlatform') == 'adl': + adl_tensorboard_helper(args) + return rest_port = nni_config.get_config('restServerPort') rest_pid = nni_config.get_config('restServerPid') if not detect_process(rest_pid): @@ -144,4 +166,4 @@ def start_tensorboard(args): os.makedirs(temp_nni_path, exist_ok=True) path_list = get_path_list(args, nni_config, trial_content, temp_nni_path) - start_tensorboard_process(args, nni_config, path_list, temp_nni_path) + start_tensorboard_process(args, nni_config, path_list, temp_nni_path) \ No newline at end of file diff --git a/nni/tools/trial_tool/trial_keeper.py b/nni/tools/trial_tool/trial_keeper.py 
index 08688973e0..913ebb82f9 100644 --- a/nni/tools/trial_tool/trial_keeper.py +++ b/nni/tools/trial_tool/trial_keeper.py @@ -28,6 +28,7 @@ regular = re.compile('v?(?P<version>[0-9](\.[0-9]){0,1}).*') _hdfs_client = None +_trial_process = None def get_hdfs_client(args): @@ -62,6 +63,7 @@ def get_hdfs_client(args): def main_loop(args): '''main loop logic for trial keeper''' + global _trial_process if not os.path.exists(LOG_DIR): os.makedirs(LOG_DIR) @@ -90,13 +92,13 @@ def main_loop(args): # Notice: We don't appoint env, which means subprocess wil inherit current environment and that is expected behavior log_pipe_stdout = trial_syslogger_stdout.get_pipelog_reader() - process = Popen(args.trial_command, shell=True, stdout=log_pipe_stdout, stderr=log_pipe_stdout) - nni_log(LogType.Info, 'Trial keeper spawns a subprocess (pid {0}) to run command: {1}'.format(process.pid, + _trial_process = Popen(args.trial_command, shell=True, stdout=log_pipe_stdout, stderr=log_pipe_stdout, preexec_fn=os.setsid) + nni_log(LogType.Info, 'Trial keeper spawns a subprocess (pid {0}) to run command: {1}'.format(_trial_process.pid, shlex.split( args.trial_command))) while True: - retCode = process.poll() + retCode = _trial_process.poll() # child worker process exits and all stdout data is read if retCode is not None and log_pipe_stdout.set_process_exit() and log_pipe_stdout.is_read_completed == True: # In Windows, the retCode -1 is 4294967295. It's larger than c_long, and raise OverflowError.
@@ -213,6 +215,20 @@ def run(self): fetch_file_thread.start() +def _set_adaptdl_signal_handler(): + import signal + global _trial_process + def _handler(signum, frame): + nni_log(LogType.Info, "RECEIVED SIGNAL {}".format(signum)) + nni_log(LogType.Debug, "TRIAL PROCESS ID {}".format(_trial_process.pid)) + if _trial_process and (signum == signal.SIGTERM or signum == signal.SIGINT): + os.killpg(os.getpgid(_trial_process.pid), signal.SIGINT) + os.waitpid(_trial_process.pid, 0) + exit(1) + signal.signal(signal.SIGTERM, _handler) + signal.signal(signal.SIGINT, _handler) + + if __name__ == '__main__': '''NNI Trial Keeper main function''' PARSER = argparse.ArgumentParser() @@ -237,6 +253,8 @@ def run(self): try: if NNI_PLATFORM == 'paiYarn' and is_multi_phase(): fetch_parameter_file(args) + if NNI_PLATFORM == 'adl': + _set_adaptdl_signal_handler() main_loop(args) except SystemExit as se: nni_log(LogType.Info, 'NNI trial keeper exit with code {}'.format(se.code)) diff --git a/nni/trial.py b/nni/trial.py index cdb2b1e683..e85d292b8c 100644 --- a/nni/trial.py +++ b/nni/trial.py @@ -97,6 +97,21 @@ def get_sequence_id(): _intermediate_seq = 0 + +def overwrite_intermediate_seq(value): + """ + Overwrite intermediate sequence value. + + Parameters + ---------- + value: + int + """ + assert isinstance(value, int) + global _intermediate_seq + _intermediate_seq = value + + def report_intermediate_result(metric): """ Reports intermediate result to NNI. 
diff --git a/pipelines/fast-test.yml b/pipelines/fast-test.yml index 5a0607eab0..7e0fd53ece 100644 --- a/pipelines/fast-test.yml +++ b/pipelines/fast-test.yml @@ -34,7 +34,7 @@ jobs: set -e sudo apt-get install -y pandoc python3 -m pip install -U --upgrade pygments - python3 -m pip install -U torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html + python3 -m pip install -U torch==1.7.0+cpu torchvision==0.8.1+cpu -f https://download.pytorch.org/whl/torch_stable.html python3 -m pip install -U tensorflow==2.3.1 python3 -m pip install -U keras==2.4.2 python3 -m pip install -U gym onnx peewee thop @@ -96,7 +96,7 @@ jobs: - script: | set -e - python3 -m pip install -U torch==1.3.1+cpu torchvision==0.4.2+cpu -f https://download.pytorch.org/whl/torch_stable.html + python3 -m pip install -U torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html python3 -m pip install -U tensorflow==1.15.2 python3 -m pip install -U keras==2.1.6 python3 -m pip install -U gym onnx peewee @@ -131,12 +131,16 @@ jobs: # This platform runs TypeScript unit test first. 
steps: + - task: UsePythonVersion@0 + inputs: + versionSpec: 3.8 + displayName: Configure Python + - script: | set -e - export PYTHON38_BIN_DIR=/usr/local/Cellar/python@3.8/`ls /usr/local/Cellar/python@3.8`/bin - echo "##vso[task.setvariable variable=PATH]${PYTHON38_BIN_DIR}:${HOME}/Library/Python/3.8/bin:${PATH}" - python3 -m pip install -U --upgrade pip setuptools - python3 -m pip install -U pytest coverage + echo "##vso[task.setvariable variable=PATH]${PATH}:${HOME}/.local/bin" + python -m pip install -U --upgrade pip setuptools wheel + python -m pip install -U pytest coverage displayName: 'Install Python tools' - script: | @@ -145,10 +149,9 @@ jobs: - script: | set -e - cd ts/nni_manager - yarn test - cd ../nasui - CI=true yarn test + export CI=true + (cd ts/nni_manager && yarn test) + (cd ts/nasui && yarn test) displayName: 'TypeScript unit test' - script: | diff --git a/setup.py b/setup.py index 0c22b67c04..038dc6d4ae 100644 --- a/setup.py +++ b/setup.py @@ -72,7 +72,8 @@ 'colorama', 'scikit-learn>=0.23.2', 'pkginfo', - 'websockets' + 'websockets', + 'prettytable' ] diff --git a/test/async_sharing_test/simple_tuner.py b/test/async_sharing_test/simple_tuner.py index c09b40fc13..ec49c38322 100644 --- a/test/async_sharing_test/simple_tuner.py +++ b/test/async_sharing_test/simple_tuner.py @@ -35,8 +35,7 @@ def generate_parameters(self, parameter_id, **kwargs): 'checksum': None, 'path': '', } - _logger.info('generate parameter for father trial %s' % - parameter_id) + _logger.info('generate parameter for father trial %s', parameter_id) self.thread_lock.release() return { 'prev_id': 0, diff --git a/test/config/naive_test/naive_assessor.py b/test/config/naive_test/naive_assessor.py index 0bc69133cc..54468f6e99 100644 --- a/test/config/naive_test/naive_assessor.py +++ b/test/config/naive_test/naive_assessor.py @@ -18,7 +18,7 @@ def __init__(self, optimize_mode): _logger.info('init') def assess_trial(self, trial_job_id, trial_history): - _logger.info('assess trial %s 
%s' % (trial_job_id, trial_history)) + _logger.info('assess trial %s %s', trial_job_id, trial_history) id_ = trial_history[0] if id_ in self._killed: diff --git a/test/config/naive_test/naive_tuner.py b/test/config/naive_test/naive_tuner.py index 7dfef032ba..28a052050f 100644 --- a/test/config/naive_test/naive_tuner.py +++ b/test/config/naive_test/naive_tuner.py @@ -21,17 +21,17 @@ def __init__(self, optimize_mode): def generate_parameters(self, parameter_id, **kwargs): self.cur += 1 - _logger.info('generate parameters: %s' % self.cur) + _logger.info('generate parameters: %s', self.cur) return { 'x': self.cur } def receive_trial_result(self, parameter_id, parameters, value, **kwargs): reward = extract_scalar_reward(value) - _logger.info('receive trial result: %s, %s, %s' % (parameter_id, parameters, reward)) + _logger.info('receive trial result: %s, %s, %s', parameter_id, parameters, reward) _result.write('%d %d\n' % (parameters['x'], reward)) _result.flush() def update_search_space(self, search_space): - _logger.info('update_search_space: %s' % search_space) + _logger.info('update_search_space: %s', search_space) with open(os.path.join(_pwd, 'tuner_search_space.json'), 'w') as file_: json.dump(search_space, file_) diff --git a/test/ut/sdk/test_compression_utils.py b/test/ut/sdk/test_compression_utils.py index 9c7b9d5ba2..5423f762b0 100644 --- a/test/ut/sdk/test_compression_utils.py +++ b/test/ut/sdk/test_compression_utils.py @@ -12,6 +12,7 @@ from nni.algorithms.compression.pytorch.pruning import L1FilterPruner from nni.compression.pytorch.utils.shape_dependency import ChannelDependency from nni.compression.pytorch.utils.mask_conflict import fix_mask_conflict +from nni.compression.pytorch.utils.counter import count_flops_params device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') prefix = 'analysis_test' @@ -60,7 +61,6 @@ unittest.TestLoader.sortTestMethodsUsing = None -@unittest.skipIf(torch.__version__ >= '1.6.0', 'not supported') class 
AnalysisUtilsTest(TestCase): @unittest.skipIf(torch.__version__ < "1.3.0", "not supported") def test_channel_dependency(self): @@ -138,5 +138,49 @@ def test_mask_conflict(self): assert b_index1 == b_index2 + def test_flops_params(self): + class Model1(nn.Module): + def __init__(self): + super(Model1, self).__init__() + self.conv = nn.Conv2d(3, 5, 1, 1) + self.bn = nn.BatchNorm2d(5) + self.relu = nn.LeakyReLU() + self.linear = nn.Linear(20, 10) + self.upsample = nn.UpsamplingBilinear2d(size=2) + self.pool = nn.AdaptiveAvgPool2d((2, 2)) + + def forward(self, x): + x = self.conv(x) + x = self.bn(x) + x = self.relu(x) + x = self.upsample(x) + x = self.pool(x) + x = x.view(x.size(0), -1) + x = self.linear(x) + return x + + class Model2(nn.Module): + def __init__(self): + super(Model2, self).__init__() + self.conv = nn.Conv2d(3, 5, 1, 1) + self.conv2 = nn.Conv2d(5, 5, 1, 1) + + def forward(self, x): + x = self.conv(x) + for _ in range(5): + x = self.conv2(x) + return x + + flops, params, results = count_flops_params(Model1(), (1, 3, 2, 2), mode='full', verbose=False) + assert (flops, params) == (610, 240) + + flops, params, results = count_flops_params(Model2(), (1, 3, 2, 2), verbose=False) + assert (flops, params) == (560, 50) + + from torchvision.models import resnet50 + flops, params, results = count_flops_params(resnet50(), (1, 3, 224, 224), verbose=False) + assert (flops, params) == (4089184256, 25503912) + + if __name__ == '__main__': main() diff --git a/test/ut/sdk/test_dependecy_aware.py b/test/ut/sdk/test_dependecy_aware.py index 328c987d41..5918f502bf 100644 --- a/test/ut/sdk/test_dependecy_aware.py +++ b/test/ut/sdk/test_dependecy_aware.py @@ -47,7 +47,6 @@ def generate_random_sparsity_v2(model): return cfg_list -@unittest.skipIf(torch.__version__ >= '1.6.0', 'not supported') class DependencyawareTest(TestCase): @unittest.skipIf(torch.__version__ < "1.3.0", "not supported") def test_dependency_aware_pruning(self): diff --git a/test/ut/sdk/test_model_speedup.py 
b/test/ut/sdk/test_model_speedup.py index f6af22b371..c3f41260ab 100644 --- a/test/ut/sdk/test_model_speedup.py +++ b/test/ut/sdk/test_model_speedup.py @@ -177,7 +177,6 @@ def channel_prune(model): pruner.compress() pruner.export_model(model_path=MODEL_FILE, mask_path=MASK_FILE) -@unittest.skipIf(torch.__version__ >= '1.6.0', 'not supported') class SpeedupTestCase(TestCase): def test_speedup_vgg16(self): prune_model_l1(vgg16()) diff --git a/test/ut/sdk/test_pruners.py b/test/ut/sdk/test_pruners.py index aa4c0a58f4..d9b10d63a9 100644 --- a/test/ut/sdk/test_pruners.py +++ b/test/ut/sdk/test_pruners.py @@ -151,12 +151,37 @@ def validate_sparsity(wrapper, sparsity, bias=False): lambda model: validate_sparsity(model.conv1, 0.5, model.bias) ] }, - 'autocompress': { + 'autocompress_l1': { 'pruner_class': AutoCompressPruner, 'config_list': [{ 'sparsity': 0.5, 'op_types': ['Conv2d'], }], + 'base_algo': 'l1', + 'trainer': lambda model, optimizer, criterion, epoch, callback : model, + 'evaluator': lambda model: 0.9, + 'dummy_input': torch.randn([64, 1, 28, 28]), + 'validators': [] + }, + 'autocompress_l2': { + 'pruner_class': AutoCompressPruner, + 'config_list': [{ + 'sparsity': 0.5, + 'op_types': ['Conv2d'], + }], + 'base_algo': 'l2', + 'trainer': lambda model, optimizer, criterion, epoch, callback : model, + 'evaluator': lambda model: 0.9, + 'dummy_input': torch.randn([64, 1, 28, 28]), + 'validators': [] + }, + 'autocompress_fpgm': { + 'pruner_class': AutoCompressPruner, + 'config_list': [{ + 'sparsity': 0.5, + 'op_types': ['Conv2d'], + }], + 'base_algo': 'fpgm', 'trainer': lambda model, optimizer, criterion, epoch, callback : model, 'evaluator': lambda model: 0.9, 'dummy_input': torch.randn([64, 1, 28, 28]), @@ -181,7 +206,7 @@ def __init__(self, bias=True): def forward(self, x): return self.fc(self.pool(self.bn1(self.conv1(x))).view(x.size(0), -1)) -def pruners_test(pruner_names=['level', 'agp', 'slim', 'fpgm', 'l1', 'l2', 'taylorfo', 'mean_activation', 'apoz', 
'netadapt', 'simulatedannealing', 'admm', 'autocompress'], bias=True): +def pruners_test(pruner_names=['level', 'agp', 'slim', 'fpgm', 'l1', 'l2', 'taylorfo', 'mean_activation', 'apoz', 'netadapt', 'simulatedannealing', 'admm', 'autocompress_l1', 'autocompress_l2', 'autocompress_fpgm',], bias=True): for pruner_name in pruner_names: print('testing {}...'.format(pruner_name)) device = torch.device("cuda" if torch.cuda.is_available() else "cpu") @@ -203,8 +228,8 @@ def pruners_test(pruner_names=['level', 'agp', 'slim', 'fpgm', 'l1', 'l2', 'tayl pruner = prune_config[pruner_name]['pruner_class'](model, config_list, evaluator=prune_config[pruner_name]['evaluator']) elif pruner_name == 'admm': pruner = prune_config[pruner_name]['pruner_class'](model, config_list, trainer=prune_config[pruner_name]['trainer']) - elif pruner_name == 'autocompress': - pruner = prune_config[pruner_name]['pruner_class'](model, config_list, trainer=prune_config[pruner_name]['trainer'], evaluator=prune_config[pruner_name]['evaluator'], dummy_input=x) + elif pruner_name.startswith('autocompress'): + pruner = prune_config[pruner_name]['pruner_class'](model, config_list, trainer=prune_config[pruner_name]['trainer'], evaluator=prune_config[pruner_name]['evaluator'], dummy_input=x, base_algo=prune_config[pruner_name]['base_algo']) else: pruner = prune_config[pruner_name]['pruner_class'](model, config_list, optimizer) pruner.compress() @@ -264,7 +289,6 @@ def __getitem__(self, index): def __len__(self): return 1000 -@unittest.skipIf(torch.__version__ >= '1.6.0', 'not supported') class PrunerTestCase(TestCase): def test_pruners(self): pruners_test(bias=True) @@ -273,7 +297,7 @@ def test_pruners_no_bias(self): pruners_test(bias=False) def test_agp_pruner(self): - for pruning_algorithm in ['l1', 'l2', 'taylorfo', 'apoz']: + for pruning_algorithm in ['l1', 'l2', 'fpgm', 'taylorfo', 'apoz']: _test_agp(pruning_algorithm) for pruning_algorithm in ['level']: diff --git 
a/test/ut/tools/annotation/testcase/annotated/mnist.py b/test/ut/tools/annotation/testcase/annotated/mnist.py index 13542109d3..1648f8db93 100644 --- a/test/ut/tools/annotation/testcase/annotated/mnist.py +++ b/test/ut/tools/annotation/testcase/annotated/mnist.py @@ -38,8 +38,7 @@ def build_network(self): input_dim = int(math.sqrt(self.x_dim)) except: logger.debug( - 'input dim cannot be sqrt and reshape. input dim: ' + - str(self.x_dim)) + 'input dim cannot be sqrt and reshape. input dim: %s', str(self.x_dim)) raise x_image = tf.reshape(self.x, [-1, input_dim, input_dim, 1]) with tf.name_scope('conv1'): @@ -132,7 +131,7 @@ def main(): mnist_network.build_network() logger.debug('Mnist build network done.') graph_location = tempfile.mkdtemp() - logger.debug('Saving graph to: %s' % graph_location) + logger.debug('Saving graph to: %s', graph_location) train_writer = tf.summary.FileWriter(graph_location) train_writer.add_graph(tf.get_default_graph()) test_acc = 0.0 diff --git a/test/ut/tools/annotation/testcase/usercode/mnist.py b/test/ut/tools/annotation/testcase/usercode/mnist.py index ea6839553e..f734e6fd78 100644 --- a/test/ut/tools/annotation/testcase/usercode/mnist.py +++ b/test/ut/tools/annotation/testcase/usercode/mnist.py @@ -53,7 +53,7 @@ def build_network(self): input_dim = int(math.sqrt(self.x_dim)) except: #print('input dim cannot be sqrt and reshape. input dim: ' + str(self.x_dim)) - logger.debug('input dim cannot be sqrt and reshape. input dim: ' + str(self.x_dim)) + logger.debug('input dim cannot be sqrt and reshape.
input dim: %s', str(self.x_dim)) raise x_image = tf.reshape(self.x, [-1, input_dim, input_dim, 1]) @@ -147,7 +147,7 @@ def main(): # Write log graph_location = tempfile.mkdtemp() - logger.debug('Saving graph to: %s' % graph_location) + logger.debug('Saving graph to: %s', graph_location) # print('Saving graph to: %s' % graph_location) train_writer = tf.summary.FileWriter(graph_location) train_writer.add_graph(tf.get_default_graph()) diff --git a/ts/nni_manager/common/datastore.ts b/ts/nni_manager/common/datastore.ts index b3f29cd80f..41324d12e5 100644 --- a/ts/nni_manager/common/datastore.ts +++ b/ts/nni_manager/common/datastore.ts @@ -46,6 +46,7 @@ interface TrialJobInfo { trialJobId: string; sequenceId?: number; status: TrialJobStatus; + message?: string; startTime?: number; endTime?: number; hyperParameters?: string[]; diff --git a/ts/nni_manager/common/manager.ts b/ts/nni_manager/common/manager.ts index 1f3972ae43..ea36d02943 100644 --- a/ts/nni_manager/common/manager.ts +++ b/ts/nni_manager/common/manager.ts @@ -105,6 +105,7 @@ abstract class Manager { public abstract getTrialLog(trialJobId: string, logType: LogType): Promise; public abstract getTrialJobStatistics(): Promise; + public abstract getTrialJobMessage(trialJobId: string): string | undefined; public abstract getStatus(): NNIManagerStatus; } diff --git a/ts/nni_manager/common/trainingService.ts b/ts/nni_manager/common/trainingService.ts index 4edcf16ab6..450a2c07b6 100644 --- a/ts/nni_manager/common/trainingService.ts +++ b/ts/nni_manager/common/trainingService.ts @@ -42,6 +42,7 @@ interface TrialJobDetail { readonly workingDirectory: string; readonly form: TrialJobApplicationForm; isEarlyStopped?: boolean; + message?: string; } /** diff --git a/ts/nni_manager/config/adl/adaptdl-crd-v1.json b/ts/nni_manager/config/adl/adaptdl-crd-v1.json new file mode 100644 index 0000000000..368a184168 --- /dev/null +++ b/ts/nni_manager/config/adl/adaptdl-crd-v1.json @@ -0,0 +1,17 @@ +{ + "apiVersion":
"apiextensions.k8s.io/v1beta1", + "kind": "CustomResourceDefinition", + "metadata": { + "name": "adaptdljobs.adaptdl.petuum.com" + }, + "spec": { + "group": "adaptdl.petuum.com", + "version": "v1", + "scope": "Namespaced", + "names": { + "plural": "adaptdljobs", + "singular": "adaptdljob", + "kind": "AdaptDLJob" + } + } +} diff --git a/ts/nni_manager/config/adl/adaptdl-nni-configmap-template.json b/ts/nni_manager/config/adl/adaptdl-nni-configmap-template.json new file mode 100644 index 0000000000..42a0e6ce7d --- /dev/null +++ b/ts/nni_manager/config/adl/adaptdl-nni-configmap-template.json @@ -0,0 +1,19 @@ +{ + "apiVersion": "v1", + "kind": "ConfigMap", + "metadata": { + "name": "", + "ownerReferences": [ + { + "apiVersion": "adaptdl.petuum.com/v1", + "kind": "AdaptDLJob", + "name": "", + "uid": "" + } + ] + }, + "data": { + "run.sh": "", + "cleanup.sh": "" + } +} diff --git a/ts/nni_manager/config/adl/adaptdl-pvc-template.json b/ts/nni_manager/config/adl/adaptdl-pvc-template.json new file mode 100644 index 0000000000..b98c1ef902 --- /dev/null +++ b/ts/nni_manager/config/adl/adaptdl-pvc-template.json @@ -0,0 +1,27 @@ +{ + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": "", + "ownerReferences": [ + { + "apiVersion": "adaptdl.petuum.com/v1", + "kind": "AdaptDLJob", + "name": "", + "uid": "" + } + ] + }, + "spec": { + "accessModes": [ + "ReadWriteMany" + ], + "resources": { + "requests": { + "storage": "" + } + }, + "storageClassName": "", + "volumeMode": "Filesystem" + } +} diff --git a/ts/nni_manager/config/adl/adaptdl-tensorboard-deployment-template.json b/ts/nni_manager/config/adl/adaptdl-tensorboard-deployment-template.json new file mode 100644 index 0000000000..30acc1408c --- /dev/null +++ b/ts/nni_manager/config/adl/adaptdl-tensorboard-deployment-template.json @@ -0,0 +1,55 @@ +{ + "apiVersion": "apps/v1", + "kind": "Deployment", + "metadata": { + "name": "", + "labels": { + "expId": "" + } + }, + "spec": { + "selector": { + 
"matchLabels": { + "app": "" + } + }, + "replicas": 1, + "template": { + "metadata": { + "labels": { + "app": "" + } + }, + "spec": { + "containers": [ + { + "command": ["tensorboard"], + "args": ["--host=0.0.0.0", "--logdir=/adaptdl/tensorboard", "--port=6006"], + "image": "tensorflow/tensorflow", + "name": "tensorboard", + "ports": [ + { + "containerPort": 6006 + } + ], + "volumeMounts": [ + { + "mountPath": "/adaptdl/tensorboard", + "name": "adaptdl-tensorboard-pvc", + "subPath": "adaptdl/tensorboard" + } + ] + } + ], + "volumes": [ + { + "name": "adaptdl-tensorboard-pvc", + "persistentVolumeClaim": { + "claimName": "" + } + } + ] + } + } + } +} \ No newline at end of file diff --git a/ts/nni_manager/config/adl/adaptdl-tensorboard-pvc-template.json b/ts/nni_manager/config/adl/adaptdl-tensorboard-pvc-template.json new file mode 100644 index 0000000000..a2230de16d --- /dev/null +++ b/ts/nni_manager/config/adl/adaptdl-tensorboard-pvc-template.json @@ -0,0 +1,27 @@ +{ + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": "", + "ownerReferences": [ + { + "apiVersion": "apps/v1", + "kind": "Deployment", + "name": "", + "uid": "" + } + ] + }, + "spec": { + "accessModes": [ + "ReadWriteMany" + ], + "resources": { + "requests": { + "storage": "" + } + }, + "storageClassName": "", + "volumeMode": "Filesystem" + } +} diff --git a/ts/nni_manager/config/adl/adaptdljob-template.json b/ts/nni_manager/config/adl/adaptdljob-template.json new file mode 100644 index 0000000000..462f561ca5 --- /dev/null +++ b/ts/nni_manager/config/adl/adaptdljob-template.json @@ -0,0 +1,109 @@ +{ + "apiVersion": "adaptdl.petuum.com/v1", + "kind": "AdaptDLJob", + "metadata": { + "name": "", + "labels": { + "app": "", + "expId": "", + "trialId": "" + } + }, + "spec": { + "preemptible": false, + "template": { + "spec": { + "containers": [ + { + "lifecycle": + { + "preStop": + { + "exec": + { + "command": ["/cleanup.sh"] + } + } + }, + "command": ["/run.sh"], + "env": [ + { 
+ "name": "ADAPTDL_CHECKPOINT_PATH", + "value": "/adaptdl/checkpoint" + }, + { + "name": "ADAPTDL_TENSORBOARD_LOGDIR", + "value": "/adaptdl/tensorboard" + }, + { + "name": "ADAPTDL_SHARE_PATH", + "value": "/adaptdl/share" + } + ], + "image": "", + "imagePullPolicy": "Always", + "name": "main", + "resources": { + "requests": { + "memory": "", + "cpu": "" + }, + "limits": { + "nvidia.com/gpu": 1 + } + }, + "volumeMounts": [ + { + "mountPath": "/adaptdl/checkpoint", + "name": "adaptdl-pvc", + "subPath": "adaptdl/checkpoint" + }, + { + "mountPath": "/adaptdl/share", + "name": "adaptdl-pvc", + "subPath": "adaptdl/share" + }, + { + "mountPath": "/adaptdl/tensorboard", + "name": "adaptdl-tensorboard-pvc", + "subPath": "adaptdl/tensorboard" + }, + { + "mountPath": "/cleanup.sh", + "name": "adaptdl-nni-configmap", + "subPath": "cleanup.sh" + }, + { + "mountPath": "/run.sh", + "name": "adaptdl-nni-configmap", + "subPath": "run.sh" + } + ] + } + ], + "imagePullSecrets": [], + "volumes": [ + { + "name": "adaptdl-pvc", + "persistentVolumeClaim": { + "claimName": "" + } + }, + { + "name": "adaptdl-tensorboard-pvc", + "persistentVolumeClaim": { + "claimName": "" + } + }, + { + "name": "adaptdl-nni-configmap", + "configMap": { + "name": "", + "defaultMode": 511 + } + } + ] + } + } + } +} diff --git a/ts/nni_manager/core/nnimanager.ts b/ts/nni_manager/core/nnimanager.ts index 6fa1271523..e9cd44e20a 100644 --- a/ts/nni_manager/core/nnimanager.ts +++ b/ts/nni_manager/core/nnimanager.ts @@ -345,6 +345,14 @@ class NNIManager implements Manager { return this.status; } + public getTrialJobMessage(trialJobId: string): string | undefined { + const trialJob = this.trialJobs.get(trialJobId); + if (trialJob !== undefined){ + return trialJob.message + } + return undefined + } + public async listTrialJobs(status?: TrialJobStatus): Promise { return this.dataStore.listTrialJobs(status); } @@ -501,6 +509,10 @@ class NNIManager implements Manager { this.trialJobs.set(trialJobId, Object.assign({}, 
trialJobDetail)); await this.dataStore.storeTrialJobEvent(trialJobDetail.status, trialJobDetail.id, undefined, trialJobDetail); } + const newTrialJobDetail: TrialJobDetail | undefined = this.trialJobs.get(trialJobId); + if (newTrialJobDetail !== undefined) { + newTrialJobDetail.message = trialJobDetail.message; + } let hyperParams: string | undefined = undefined; switch (trialJobDetail.status) { case 'SUCCEEDED': @@ -678,11 +690,15 @@ class NNIManager implements Manager { private async onTrialJobMetrics(metric: TrialJobMetric): Promise { this.log.debug(`NNIManager received trial job metrics: ${metric}`); - await this.dataStore.storeMetricData(metric.id, metric.data); - if (this.dispatcher === undefined) { - throw new Error('Error: tuner has not been setup'); + if (this.trialJobs.has(metric.id)){ + await this.dataStore.storeMetricData(metric.id, metric.data); + if (this.dispatcher === undefined) { + throw new Error('Error: tuner has not been setup'); + } + this.dispatcher.sendCommand(REPORT_METRIC_DATA, metric.data); + } else { + this.log.warning(`NNIManager received non-existent trial job metrics: ${metric}`); } - this.dispatcher.sendCommand(REPORT_METRIC_DATA, metric.data); } private requestTrialJobs(jobNum: number): void { diff --git a/ts/nni_manager/main.ts b/ts/nni_manager/main.ts index 86a7a2583a..f00c757425 100644 --- a/ts/nni_manager/main.ts +++ b/ts/nni_manager/main.ts @@ -19,6 +19,7 @@ import { NNIManager } from './core/nnimanager'; import { SqlDB } from './core/sqlDatabase'; import { NNIRestServer } from './rest_server/nniRestServer'; import { FrameworkControllerTrainingService } from './training_service/kubernetes/frameworkcontroller/frameworkcontrollerTrainingService'; +import { AdlTrainingService } from './training_service/kubernetes/adl/adlTrainingService'; import { KubeflowTrainingService } from './training_service/kubernetes/kubeflow/kubeflowTrainingService'; import { LocalTrainingService } from './training_service/local/localTrainingService'; 
import { RouterTrainingService } from './training_service/reusable/routerTrainingService'; @@ -34,7 +35,11 @@ function initStartupInfo( } async function initContainer(foreground: boolean, platformMode: string, logFileName?: string): Promise { - if (platformMode === 'local') { + if (platformMode === 'adl') { + Container.bind(TrainingService) + .to(AdlTrainingService) + .scope(Scope.Singleton); + } else if (platformMode === 'local') { Container.bind(TrainingService) .to(LocalTrainingService) .scope(Scope.Singleton); @@ -94,7 +99,7 @@ async function initContainer(foreground: boolean, platformMode: string, logFileN function usage(): void { console.info('usage: node main.js --port --mode \ - --start_mode --experiment_id --foreground '); + --start_mode --experiment_id --foreground '); } const strPort: string = parseArg(['--port', '-p']); @@ -114,7 +119,7 @@ const foreground: boolean = foregroundArg.toLowerCase() === 'true' ? true : fals const port: number = parseInt(strPort, 10); const mode: string = parseArg(['--mode', '-m']); -if (!['local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'].includes(mode)) { +if (!['adl', 'local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'].includes(mode)) { console.log(`FATAL: unknown mode: ${mode}`); usage(); process.exit(1); @@ -174,20 +179,7 @@ mkDirP(getLogDir()) console.error(`Failed to create log dir: ${err.stack}`); }); -function getStopSignal(): any { - return 'SIGTERM'; -} - -function getCtrlCSignal(): any { - return 'SIGINT'; -} - -process.on(getCtrlCSignal(), async () => { - const log: Logger = getLogger(); - log.info(`Get SIGINT signal!`); -}); - -process.on(getStopSignal(), async () => { +async function cleanUp(): Promise { const log: Logger = getLogger(); let hasError: boolean = false; try { @@ -201,7 +193,11 @@ process.on(getStopSignal(), async () => { hasError = true; log.error(`${err.stack}`); } finally { - await log.close(); + log.close(); 
process.exit(hasError ? 1 : 0); } -}); +} + +process.on('SIGTERM', cleanUp); +process.on('SIGBREAK', cleanUp); +process.on('SIGINT', cleanUp); diff --git a/ts/nni_manager/rest_server/restHandler.ts b/ts/nni_manager/rest_server/restHandler.ts index 1a4d87ec02..2b1cf89c58 100644 --- a/ts/nni_manager/rest_server/restHandler.ts +++ b/ts/nni_manager/rest_server/restHandler.ts @@ -15,12 +15,13 @@ import { ExperimentProfile, Manager, TrialJobStatistics } from '../common/manage import { ValidationSchemas } from './restValidationSchemas'; import { NNIRestServer } from './nniRestServer'; import { getVersion } from '../common/utils'; +import { NNIManager } from "../core/nnimanager"; const expressJoi = require('express-joi-validator'); class NNIRestHandler { private restServer: NNIRestServer; - private nniManager: Manager; + private nniManager: NNIManager; private log: Logger; constructor(rs: NNIRestServer) { @@ -209,6 +210,7 @@ class NNIRestHandler { this.nniManager.listTrialJobs(req.query.status).then((jobInfos: TrialJobInfo[]) => { jobInfos.forEach((trialJob: TrialJobInfo) => { this.setErrorPathForFailedJob(trialJob); + this.setMessageforJob(trialJob); }); res.send(jobInfos); }).catch((err: Error) => { @@ -221,6 +223,7 @@ class NNIRestHandler { router.get('/trial-jobs/:id', (req: Request, res: Response) => { this.nniManager.getTrialJob(req.params.id).then((jobDetail: TrialJobInfo) => { const jobInfo: TrialJobInfo = this.setErrorPathForFailedJob(jobDetail); + this.setMessageforJob(jobInfo); res.send(jobInfo); }).catch((err: Error) => { this.handleError(err, res); @@ -311,6 +314,14 @@ class NNIRestHandler { return jobInfo; } + + private setMessageforJob(jobInfo: TrialJobInfo): TrialJobInfo { + if (jobInfo === undefined){ + return jobInfo + } + jobInfo.message = this.nniManager.getTrialJobMessage(jobInfo.trialJobId); + return jobInfo + } } export function createRestHandler(rs: NNIRestServer): Router { diff --git a/ts/nni_manager/rest_server/restValidationSchemas.ts 
b/ts/nni_manager/rest_server/restValidationSchemas.ts index 1b33925a78..be2c08d427 100644 --- a/ts/nni_manager/rest_server/restValidationSchemas.ts +++ b/ts/nni_manager/rest_server/restValidationSchemas.ts @@ -32,6 +32,9 @@ export namespace ValidationSchemas { outputDir: joi.string(), cpuNum: joi.number().min(1), memoryMB: joi.number().min(100), + // ############## adl cpu and memory config ############### + memorySize: joi.string(), + // ######################################################## gpuNum: joi.number().min(0), command: joi.string().min(1), virtualCluster: joi.string(), @@ -93,6 +96,20 @@ export namespace ValidationSchemas { minFailedTaskCount: joi.number(), minSucceededTaskCount: joi.number() }) + }), + imagePullSecrets: joi.array({ + name: joi.string().min(1).required() + }), + // ############## adl ############### + adaptive: joi.boolean(), + checkpoint: joi.object({ + storageClass: joi.string().min(1).required(), + storageSize: joi.string().min(1).required() + }), + nfs: joi.object({ + server: joi.string().min(1).required(), + path: joi.string().min(1).required(), + containerMountPath: joi.string().min(1).required() }) }), pai_yarn_config: joi.object({ // eslint-disable-line @typescript-eslint/camelcase diff --git a/ts/nni_manager/rest_server/test/mockedNNIManager.ts b/ts/nni_manager/rest_server/test/mockedNNIManager.ts index afa0adc1a1..d9e3e8b9b6 100644 --- a/ts/nni_manager/rest_server/test/mockedNNIManager.ts +++ b/ts/nni_manager/rest_server/test/mockedNNIManager.ts @@ -110,6 +110,11 @@ export class MockedNNIManager extends Manager { return deferred.promise; } + + public getTrialJobMessage(trialJobId: string): string | undefined { + return "TEST-MESSAGE" + } + public stopExperiment(): Promise { throw new MethodNotImplementedError(); } diff --git a/ts/nni_manager/training_service/kubernetes/adl/adlApiClient.ts b/ts/nni_manager/training_service/kubernetes/adl/adlApiClient.ts new file mode 100644 index 0000000000..36584b66be --- /dev/null +++ 
b/ts/nni_manager/training_service/kubernetes/adl/adlApiClient.ts @@ -0,0 +1,56 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT license. + +'use strict'; + +import * as fs from 'fs'; +import { GeneralK8sClient, KubernetesCRDClient } from '../kubernetesApiClient'; + +/** + * Adl ClientV1 + */ +class AdlClientV1 extends KubernetesCRDClient { + /** + * constructor, to initialize adl CRD definition + */ + public constructor() { + super(); + this.crdSchema = JSON.parse(fs.readFileSync('./config/adl/adaptdl-crd-v1.json', 'utf8')); + this.client.addCustomResourceDefinition(this.crdSchema); + } + + protected get operator(): any { + return this.client.apis['adaptdl.petuum.com'].v1.namespaces('default').adaptdljobs; + } + + public get containerName(): string { + return 'main'; + } + + public async getKubernetesPods(jobName: string): Promise { + let result: Promise; + const response = await this.client.api.v1.namespaces('default').pods + .get({ qs: { labelSelector: `adaptdl/job=${jobName}` } }); + if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { + result = Promise.resolve(response.body); + } else { + result = Promise.reject(`AdlClient getKubernetesPods failed, statusCode is ${response.statusCode}`); + } + return result; + } +} + +/** + * Adl Client + */ +class AdlClientFactory { + /** + * Factory method to generate operator client + */ + public static createClient(): KubernetesCRDClient { + return new AdlClientV1(); + } +} + +export { AdlClientFactory, GeneralK8sClient }; +export { AdlClientV1 } diff --git a/ts/nni_manager/training_service/kubernetes/adl/adlConfig.ts b/ts/nni_manager/training_service/kubernetes/adl/adlConfig.ts new file mode 100644 index 0000000000..682a3e01cb --- /dev/null +++ b/ts/nni_manager/training_service/kubernetes/adl/adlConfig.ts @@ -0,0 +1,93 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT license. 
+ +'use strict'; + +import {KubernetesTrialConfig} from "../kubernetesConfig"; + +/** + * Checkpoint Config + */ +export class CheckpointConfig { + public readonly storageClass: string; + + public readonly storageSize: string; + + constructor(storageClass: string, storageSize: string) { + this.storageClass = storageClass; + this.storageSize = storageSize; + } +} + +/** + * imagePullSecret Config + */ +export class ImagePullSecretConfig{ + public readonly name: string; + + constructor(name: string) { + this.name = name + } +} + +/** + * NFS Config + */ +export class NFSConfig { + public readonly server: string; + + public readonly path: string; + + public readonly containerMountPath: string; + + constructor(server: string, path: string, containerMountPath: string) { + this.server = server; + this.path = path; + this.containerMountPath = containerMountPath; + } +} + +/** + * Trial job configuration for Adl + */ +export class AdlTrialConfig extends KubernetesTrialConfig { + + public readonly command: string; + + public readonly gpuNum: number; + + public readonly image: string; + + public readonly imagePullSecrets?: ImagePullSecretConfig[]; + + public readonly nfs?: NFSConfig; + + public readonly checkpoint?: CheckpointConfig; + + public readonly cpuNum?: number; + + public readonly memorySize?: string; + + public readonly adaptive?: boolean; // adaptive == preemptible + + constructor(codeDir: string, + command: string, gpuNum: number, + image: string, imagePullSecrets?: ImagePullSecretConfig[], + nfs?: NFSConfig, checkpoint?: CheckpointConfig, + cpuNum?: number, memorySize?: string, + adaptive?: boolean + ) { + super(codeDir); + this.command = command; + this.gpuNum = gpuNum; + this.image = image; + this.imagePullSecrets = imagePullSecrets; + this.nfs = nfs; + this.checkpoint = checkpoint; + this.cpuNum = cpuNum; + this.memorySize = memorySize; + this.adaptive = adaptive; + } +} + +export type AdlJobStatus = "Pending" | "Running" | "Starting" | "Stopping" | "Failed" 
| "Succeeded"; diff --git a/ts/nni_manager/training_service/kubernetes/adl/adlJobInfoCollector.ts b/ts/nni_manager/training_service/kubernetes/adl/adlJobInfoCollector.ts new file mode 100644 index 0000000000..5e9eb5bc60 --- /dev/null +++ b/ts/nni_manager/training_service/kubernetes/adl/adlJobInfoCollector.ts @@ -0,0 +1,94 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT license. + +'use strict'; + +import { AdlClientV1 } from './adlApiClient'; +import { KubernetesTrialJobDetail} from '../kubernetesData'; +import { KubernetesJobInfoCollector } from '../kubernetesJobInfoCollector'; +import { AdlJobStatus } from './adlConfig'; + +/** + * Collector Adl jobs info from Kubernetes cluster, and update adl job status locally + */ +export class AdlJobInfoCollector extends KubernetesJobInfoCollector { + constructor(jobMap: Map) { + super(jobMap); + } + + protected async retrieveSingleTrialJobInfo(kubernetesCRDClient: AdlClientV1 | undefined, + kubernetesTrialJob: KubernetesTrialJobDetail): Promise { + if (!this.statusesNeedToCheck.includes(kubernetesTrialJob.status)) { + return Promise.resolve(); + } + + if (kubernetesCRDClient === undefined) { + return Promise.reject('kubernetesCRDClient is undefined'); + } + + let kubernetesJobInfo: any; + let kubernetesPodsInfo: any; + try { + kubernetesJobInfo = await kubernetesCRDClient.getKubernetesJob(kubernetesTrialJob.kubernetesJobName); + kubernetesPodsInfo = await kubernetesCRDClient.getKubernetesPods(kubernetesTrialJob.kubernetesJobName); + } catch (error) { + // Notice: it maynot be a 'real' error since cancel trial job can also cause getKubernetesJob failed. 
+ this.log.error(`Get job ${kubernetesTrialJob.kubernetesJobName} info failed, error is ${error}`); + + //This is not treat as a error status + return Promise.resolve(); + } + /* eslint-disable require-atomic-updates */ + if (kubernetesJobInfo.status) { + const phase: AdlJobStatus = kubernetesJobInfo.status.phase + switch (phase) { + case 'Pending': + case 'Starting': + kubernetesTrialJob.status = 'WAITING'; + if (kubernetesPodsInfo.items.length > 0){ + if (kubernetesPodsInfo.items[0].status.containerStatuses != undefined) { + const currState: any = kubernetesPodsInfo.items[0].status.containerStatuses[0].state + if (currState.waiting != undefined) { + const msg: string = currState.waiting.reason + if (msg == "ImagePullBackOff" || msg == "ErrImagePull") { + kubernetesTrialJob.status = 'FAILED'; + } + } + } + kubernetesTrialJob.message = kubernetesPodsInfo.items + .map((pod: any) => JSON.stringify(pod.status.containerStatuses)) + .join('\n'); + } + kubernetesTrialJob.startTime = Date.parse(kubernetesJobInfo.metadata.creationTimestamp); + break; + case 'Running': + case 'Stopping': + kubernetesTrialJob.status = 'RUNNING'; + kubernetesTrialJob.message = `Use 'nnictl log trial --trial_id ${kubernetesTrialJob.id}' to check the log stream.`; + if (kubernetesTrialJob.startTime === undefined) { + kubernetesTrialJob.startTime = Date.parse(kubernetesJobInfo.metadata.creationTimestamp); + } + break; + case 'Failed': + kubernetesTrialJob.status = 'FAILED'; + kubernetesTrialJob.message = kubernetesJobInfo.status.message; + if (kubernetesPodsInfo.items.length > 0) { + kubernetesTrialJob.message += " ; "; + kubernetesTrialJob.message += `Use 'nnictl log trial --trial_id ${kubernetesTrialJob.id}' for the path of the collected logs.`; + } + // undefined => NaN as endTime here + kubernetesTrialJob.endTime = Date.parse(kubernetesJobInfo.status.completionTimestamp); + break; + case 'Succeeded': + kubernetesTrialJob.status = 'SUCCEEDED'; + kubernetesTrialJob.endTime = 
Date.parse(kubernetesJobInfo.status.completionTimestamp); + kubernetesTrialJob.message = `Succeeded at ${kubernetesJobInfo.status.completionTimestamp}` + break; + default: + } + } + /* eslint-enable require-atomic-updates */ + + return Promise.resolve(); + } +} diff --git a/ts/nni_manager/training_service/kubernetes/adl/adlJobRestServer.ts b/ts/nni_manager/training_service/kubernetes/adl/adlJobRestServer.ts new file mode 100644 index 0000000000..a22b0a78e6 --- /dev/null +++ b/ts/nni_manager/training_service/kubernetes/adl/adlJobRestServer.ts @@ -0,0 +1,22 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT license. + +'use strict'; + +import * as component from '../../../common/component'; +import { KubernetesJobRestServer } from '../kubernetesJobRestServer'; +import { AdlTrainingService } from './adlTrainingService'; + +/** + * Adl Training service Rest server, provides rest API to support adl job metrics update + * + */ +@component.Singleton +export class AdlJobRestServer extends KubernetesJobRestServer { + /** + * constructor to provide NNIRestServer's own rest property, e.g. port + */ + constructor() { + super(component.get(AdlTrainingService)); + } +} diff --git a/ts/nni_manager/training_service/kubernetes/adl/adlTrainingService.ts b/ts/nni_manager/training_service/kubernetes/adl/adlTrainingService.ts new file mode 100644 index 0000000000..2abcc9cfee --- /dev/null +++ b/ts/nni_manager/training_service/kubernetes/adl/adlTrainingService.ts @@ -0,0 +1,342 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT license. 
+ +'use strict'; + +import * as fs from 'fs'; +import * as component from '../../../common/component'; + +import { String } from 'typescript-string-operations'; +import { getExperimentId } from '../../../common/experimentStartupInfo'; +import { + NNIManagerIpConfig, TrialJobApplicationForm, TrialJobDetail, TrialJobStatus +} from '../../../common/trainingService'; +import { delay, generateParamFileName, getVersion, uniqueString } from '../../../common/utils'; +import { TrialConfigMetadataKey } from '../../common/trialConfigMetadataKey'; +import { KubernetesTrialJobDetail } from '../kubernetesData'; +import { KubernetesTrainingService } from '../kubernetesTrainingService'; +import { AdlClientFactory } from './adlApiClient' +import { AdlJobInfoCollector } from './adlJobInfoCollector'; +import { AdlJobRestServer } from './adlJobRestServer'; +import { AdlTrialConfig } from './adlConfig' + +/** + * Training Service implementation for Adl + */ +@component.Singleton +class AdlTrainingService extends KubernetesTrainingService implements KubernetesTrainingService { + private adlTrialConfig?: AdlTrialConfig; + private readonly adlJobInfoCollector: AdlJobInfoCollector; + private configmapTemplateStr: string; + private jobTemplateStr: string; + private pvcTemplateStr: string; + private tensorboardPvcTemplate: any; + private tensorboardDeploymentTemplate: any; + //TODO: change the logic here when we want to support multiple tensorboard + private tensorboardName: string = "adaptdl-tensorboard-" + getExperimentId().toLowerCase(); + + constructor() { + super(); + this.adlJobInfoCollector = new AdlJobInfoCollector(this.trialJobsMap); + this.experimentId = getExperimentId(); + this.kubernetesCRDClient = AdlClientFactory.createClient(); + this.configmapTemplateStr = fs.readFileSync( + './config/adl/adaptdl-nni-configmap-template.json', 'utf8'); + this.jobTemplateStr = fs.readFileSync('./config/adl/adaptdljob-template.json', 'utf8'); + this.pvcTemplateStr = 
fs.readFileSync('./config/adl/adaptdl-pvc-template.json', 'utf8'); + this.tensorboardPvcTemplate = JSON.parse( + fs.readFileSync('./config/adl/adaptdl-tensorboard-pvc-template.json', 'utf8')); + this.tensorboardDeploymentTemplate = JSON.parse( + fs.readFileSync('./config/adl/adaptdl-tensorboard-deployment-template.json', 'utf8')); + + this.log.info('Construct Adl training service.'); + } + + public async run(): Promise { + this.log.info(this.tensorboardName); + this.log.info('Start tensorboard deployment.'); + await this.launchTensorboard() + + this.log.info('Run Adl training service.'); + this.kubernetesJobRestServer = component.get(AdlJobRestServer); + if (this.kubernetesJobRestServer === undefined) { + throw new Error('kubernetesJobRestServer not initialized!'); + } + await this.kubernetesJobRestServer.start(); + this.kubernetesJobRestServer.setEnableVersionCheck = this.versionCheck; + this.log.info(`Adl Training service rest server listening on: ${this.kubernetesJobRestServer.endPoint}`); + while (!this.stopping) { + // collect metrics for Adl jobs by interacting with Kubernetes API server + await delay(3000); + await this.adlJobInfoCollector.retrieveTrialStatus(this.kubernetesCRDClient); + if (this.kubernetesJobRestServer.getErrorMessage !== undefined) { + throw new Error(this.kubernetesJobRestServer.getErrorMessage); + } + } + this.log.info('Adl training service exit.'); + } + private async launchTensorboard(): Promise { + // Start the tensorboard at the beginning of the experiment. 
+ if (this.adlTrialConfig === undefined) { + throw new Error('Adl trial config is undefined'); + } + // Create tensorboard deployment + this.tensorboardDeploymentTemplate.metadata.name = this.tensorboardName + this.tensorboardDeploymentTemplate.metadata.labels.expId = this.experimentId + this.tensorboardDeploymentTemplate.spec.selector.matchLabels.app = this.tensorboardName + this.tensorboardDeploymentTemplate.spec.template.metadata.labels.app = this.tensorboardName + this.tensorboardDeploymentTemplate.spec.template.spec.volumes[0] + .persistentVolumeClaim.claimName = this.tensorboardName + const deploymentUid: string = await this.genericK8sClient.createDeployment(this.tensorboardDeploymentTemplate); + // Create pvc + this.tensorboardPvcTemplate.metadata.name = this.tensorboardName; + this.tensorboardPvcTemplate.metadata.ownerReferences[0].name = this.tensorboardName; + this.tensorboardPvcTemplate.metadata.ownerReferences[0].uid = deploymentUid + if (this.adlTrialConfig.checkpoint != undefined) { + this.tensorboardPvcTemplate.spec.resources.requests.storage = this.adlTrialConfig.checkpoint.storageSize; + this.tensorboardPvcTemplate.spec.storageClassName = this.adlTrialConfig.checkpoint.storageClass; + } + else { + this.tensorboardPvcTemplate.spec.resources.requests.storage = "1Gi" + this.tensorboardPvcTemplate.spec.storageClassName = await this.genericK8sClient.getStorageClass(); + } + await this.genericK8sClient.createPersistentVolumeClaim(this.tensorboardPvcTemplate); + + return Promise.resolve() + } + + public async submitTrialJob(form: TrialJobApplicationForm): Promise { + if (this.kubernetesCRDClient === undefined) { + throw new Error('Adl job operator client is undefined'); + } + + if (this.adlTrialConfig === undefined) { + throw new Error('Adl trial config is undefined'); + } + + if (this.kubernetesRestServerPort === undefined) { + const restServer: AdlJobRestServer = component.get(AdlJobRestServer); + this.kubernetesRestServerPort = 
restServer.clusterRestServerPort; + } + + const trialJobId: string = uniqueString(5); + const adlJobName: string = `nni-exp-${this.experimentId}-trial-${trialJobId}`.toLowerCase(); + const initStatus: TrialJobStatus = 'WAITING'; + const codeDir = this.adlTrialConfig.codeDir; + const outputDir = "output" + const trialJobDetail: KubernetesTrialJobDetail = new KubernetesTrialJobDetail( + trialJobId, + initStatus, + Date.now(), + codeDir, + form, + adlJobName, + outputDir + ); + + // Create adljob + const job: any = JSON.parse(this.jobTemplateStr); + job.metadata.name = adlJobName + job.metadata.labels.app = this.NNI_KUBERNETES_TRIAL_LABEL + job.metadata.labels.expId = this.experimentId + job.metadata.labels.trialId = trialJobId + if (this.adlTrialConfig.adaptive !== undefined){ + job.spec.preemptible = this.adlTrialConfig.adaptive + } + job.spec.template.spec.containers[0] + .image = this.adlTrialConfig.image; + job.spec.template.spec.volumes[0] + .persistentVolumeClaim.claimName = adlJobName + job.spec.template.spec.volumes[1] + .persistentVolumeClaim.claimName = this.tensorboardName + job.spec.template.spec.volumes[2] + .configMap.name = adlJobName + // Handle Pod Resource + let cpu: number = 1; + let memory: string = "1Gi"; + if (this.adlTrialConfig.cpuNum !== undefined) { + cpu = this.adlTrialConfig.cpuNum; + } + if (this.adlTrialConfig.memorySize !== undefined) { + memory = this.adlTrialConfig.memorySize; + } + job.spec.template.spec.containers[0] + .resources.requests.memory = memory; + job.spec.template.spec.containers[0] + .resources.requests.cpu = cpu; + job.spec.template.spec.containers[0] + .resources.limits["nvidia.com/gpu"] = this.adlTrialConfig.gpuNum; + // Handle imagePullSecrets + if (this.adlTrialConfig.imagePullSecrets !== undefined) { + job.spec.template.spec.imagePullSecrets = job.spec.template.spec + .imagePullSecrets.concat(this.adlTrialConfig.imagePullSecrets); + } + // Handle NFS + if (this.adlTrialConfig.nfs !== undefined) { + 
job.spec.template.spec.volumes.push({ + "name": "nfs", + "nfs": { + "server": this.adlTrialConfig.nfs.server, + "path": this.adlTrialConfig.nfs.path, + "readOnly": false + } + }); + job.spec.template.spec.containers[0].volumeMounts.push({ + "name": "nfs", + "mountPath": this.adlTrialConfig.nfs.containerMountPath + }); + } + await this.kubernetesCRDClient.createKubernetesJob(job); + const k8sadlJob: any = await this.kubernetesCRDClient.getKubernetesJob(adlJobName); + + // Create pvc + const pvc: any = JSON.parse(this.pvcTemplateStr); + pvc.metadata.name = adlJobName; + pvc.metadata.ownerReferences[0].name = adlJobName; + pvc.metadata.ownerReferences[0].uid = k8sadlJob.metadata.uid; + if (this.adlTrialConfig.checkpoint != undefined) { + pvc.spec.resources.requests.storage = this.adlTrialConfig + .checkpoint.storageSize; + pvc.spec.storageClassName = this.adlTrialConfig.checkpoint.storageClass; + } + else { + pvc.spec.resources.requests.storage = "1Gi" + pvc.spec.storageClassName = await this.genericK8sClient.getStorageClass(); + } + await this.genericK8sClient.createPersistentVolumeClaim(pvc); + + // prepare the runscript and convert it to configmap and mount it + const configmap: any = JSON.parse(this.configmapTemplateStr); + configmap.metadata.name = adlJobName; + configmap.metadata.ownerReferences[0].name = adlJobName; + configmap.metadata.ownerReferences[0].uid = k8sadlJob.metadata.uid; + configmap.data["run.sh"] = await this.prepareRunScript( + trialJobId, form, codeDir, outputDir) + const cleanupScriptTemplate: string = +`#!/bin/bash +ps aux | grep "python3 -m nni_trial_tool.trial_keeper" | awk '{print $2}' | xargs kill -2 +while true; +do + proc=\`ps aux | grep "python3 -m nni_trial_tool.trial_keeper" | awk '{print $2}' | grep "" -c\` + if (( $proc == 1 )); then + exit 0 + else + echo "waiting" + fi + sleep 1 +done +`; + configmap.data["cleanup.sh"] = cleanupScriptTemplate + await this.genericK8sClient.createConfigMap(configmap) + + // Set trial job detail 
until create Adl job successfully + this.trialJobsMap.set(trialJobId, trialJobDetail); + + return Promise.resolve(trialJobDetail); + } + + private async prepareRunScript(jobId: string, + form: TrialJobApplicationForm, + codeDir: string, + outputDir: string): Promise { + if (this.adlTrialConfig === undefined) { + throw new Error('Adl trial config is undefined'); + } + + if (this.kubernetesRestServerPort === undefined) { + throw new Error('Adl rest server port is undefined'); + } + + if (this.nniManagerIpConfig === undefined) { + throw new Error('Adl nniManager ip config is undefined'); + } + + const expId: string = this.experimentId; + const seqId: string = form.sequenceId.toString(); + const command: string = this.adlTrialConfig.command; + const hyperParameters: string = form.hyperParameters.value; + const hyperParametersFile: string = generateParamFileName(form.hyperParameters); + const nniManagerPort: string = this.kubernetesRestServerPort.toString(); + const nniManagerIp: string = this.nniManagerIpConfig.nniManagerIp; + let nniManagerVersion: string = ''; + if (this.versionCheck) { + nniManagerVersion = await getVersion(); + } + + let nvidiaScript: string = ''; + if (this.adlTrialConfig.gpuNum == 0) { + nvidiaScript = 'export CUDA_VISIBLE_DEVICES='; + } + + const runScriptTemplate: string = +`#!/bin/bash +export NNI_PLATFORM=adl +export MULTI_PHASE=false +export NNI_SYS_DIR={0} +export NNI_CODE_DIR={0} +export NNI_OUTPUT_DIR={1} +export NNI_TRIAL_JOB_ID={2} +export NNI_EXP_ID={3} +export NNI_TRIAL_SEQ_ID={4} +mkdir -p $NNI_OUTPUT_DIR +{5} +echo '{6}' > $NNI_CODE_DIR/{7} +python3 -m nni_trial_tool.trial_keeper --trial_command '{8}' \ +--nnimanager_ip {9} --nnimanager_port {10} \ +--nni_manager_version '{11}' --log_collection '{12}' +`; + const runScript = String.Format( + runScriptTemplate, codeDir, outputDir, + jobId, expId, seqId, nvidiaScript, + hyperParameters, hyperParametersFile, command, + nniManagerIp, nniManagerPort, nniManagerVersion, + 
this.logCollection); + return Promise.resolve(runScript); + } + + public async setClusterMetadata(key: string, value: string): Promise { + this.log.info('SetCluster ' + key + ', ' +value); + switch (key) { + case TrialConfigMetadataKey.NNI_MANAGER_IP: + this.nniManagerIpConfig = JSON.parse(value); + break; + case TrialConfigMetadataKey.TRIAL_CONFIG: + this.adlTrialConfig = JSON.parse(value); + break; + case TrialConfigMetadataKey.VERSION_CHECK: + this.versionCheck = (value === 'true' || value === 'True'); + break; + case TrialConfigMetadataKey.LOG_COLLECTION: + this.logCollection = value; + break; + default: + } + + return Promise.resolve(); + } + + public getClusterMetadata(key: string): Promise { + let result: string; + switch (key) { + case TrialConfigMetadataKey.TRIAL_CONFIG: + if (this.adlTrialConfig === undefined) { + return Promise.reject(`${key} is not set yet`); + } + + result = JSON.stringify(this.adlTrialConfig); + break; + case TrialConfigMetadataKey.NNI_MANAGER_IP: + if (this.nniManagerIpConfig === undefined) { + return Promise.reject(`${key} is not set yet`); + } + + result = JSON.stringify(this.nniManagerIpConfig); + break; + default: + return Promise.reject(`${key} not set`); + } + + return Promise.resolve(result); + } +} +export { AdlTrainingService }; diff --git a/ts/nni_manager/training_service/kubernetes/kubernetesApiClient.ts b/ts/nni_manager/training_service/kubernetes/kubernetesApiClient.ts index e34b468bb2..4156a2bc4a 100644 --- a/ts/nni_manager/training_service/kubernetes/kubernetesApiClient.ts +++ b/ts/nni_manager/training_service/kubernetes/kubernetesApiClient.ts @@ -19,6 +19,94 @@ class GeneralK8sClient { this.client.loadSpec(); } + private matchStorageClass(response: any): string { + const adlSupportedProvisioners: RegExp[] = [ + new RegExp("microk8s.io/hostpath"), + new RegExp(".*cephfs.csi.ceph.com"), + new RegExp(".*azure.*"), + new RegExp("\\b" + "efs" + "\\b") + ] + const templateLen = adlSupportedProvisioners.length, + responseLen 
= response.items.length + let i = 0, + j = 0; + for (; i < responseLen; i++) { + const provisioner: string = response.items[i].provisioner + for (j = 0; j < templateLen; j++) { + if (provisioner.match(adlSupportedProvisioners[j])) { + return response.items[i].metadata.name; + } + } + } + return "Not Found!"; + } + + public async getStorageClass(): Promise { + let result: Promise; + const response: any = await this.client.apis["storage.k8s.io"].v1beta1.storageclasses.get() + const storageClassType: string = this.matchStorageClass(response.body) + if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { + if (storageClassType != "Not Found!") { + result = Promise.resolve(storageClassType); + } + else { + result = Promise.reject("No StorageClasses are supported!") + } + } else { + result = Promise.reject(`List storageclasses failed, statusCode is ${response.statusCode}`); + } + return result; + } + + public async createDeployment(deploymentManifest: any): Promise { + let result: Promise; + const response: any = await this.client.apis.apps.v1.namespaces('default').deployments.post({ body: deploymentManifest }) + if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { + result = Promise.resolve(response.body.metadata.uid); + } else { + result = Promise.reject(`Create deployment failed, statusCode is ${response.statusCode}`); + } + return result; + } + + public async deleteDeployment(deploymentName: string): Promise { + let result: Promise; + // TODO: change this hard coded deployment name after demo + const response: any = await this.client.apis.apps.v1.namespaces('default') + .deployment(deploymentName).delete(); + if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { + result = Promise.resolve(true); + } else { + result = Promise.reject(`Delete deployment failed, statusCode is ${response.statusCode}`); + } + return result; + } + + public async 
createConfigMap(configMapManifest: any): Promise { + let result: Promise; + const response: any = await this.client.api.v1.namespaces('default') + .configmaps.post({body: configMapManifest}); + if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { + result = Promise.resolve(true); + } else { + result = Promise.reject(`Create configMap failed, statusCode is ${response.statusCode}`); + } + + return result; + } + + public async createPersistentVolumeClaim(pvcManifest: any): Promise { + let result: Promise; + const response: any = await this.client.api.v1.namespaces('default') + .persistentvolumeclaims.post({body: pvcManifest}); + if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { + result = Promise.resolve(true); + } else { + result = Promise.reject(`Create pvc failed, statusCode is ${response.statusCode}`); + } + return result; + } + public async createSecret(secretManifest: any): Promise { let result: Promise; const response: any = await this.client.api.v1.namespaces('default').secrets @@ -77,7 +165,7 @@ abstract class KubernetesCRDClient { if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { result = Promise.resolve(true); } else { - result = Promise.reject(`Create kubernetes job failed, statusCode is ${response.statusCode}`); + result = Promise.reject(`KubernetesApiClient createKubernetesJob failed, statusCode is ${response.statusCode}`); } return result; @@ -91,7 +179,7 @@ abstract class KubernetesCRDClient { if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { result = Promise.resolve(response.body); } else { - result = Promise.reject(`KubeflowOperatorClient get tfjobs failed, statusCode is ${response.statusCode}`); + result = Promise.reject(`KubernetesApiClient getKubernetesJob failed, statusCode is ${response.statusCode}`); } return result; @@ -115,7 +203,7 @@ abstract class KubernetesCRDClient { result = 
Promise.resolve(true); } else { result = Promise.reject( - `KubeflowOperatorClient, delete labels ${matchQuery} get wrong statusCode ${deleteResult.statusCode}`); + `KubernetesApiClient, delete labels ${matchQuery} get wrong statusCode ${deleteResult.statusCode}`); } } catch (err) { result = Promise.reject(err); diff --git a/ts/nni_manager/training_service/kubernetes/kubernetesData.ts b/ts/nni_manager/training_service/kubernetes/kubernetesData.ts index eb96202aed..0f96d541dc 100644 --- a/ts/nni_manager/training_service/kubernetes/kubernetesData.ts +++ b/ts/nni_manager/training_service/kubernetes/kubernetesData.ts @@ -11,6 +11,7 @@ import { TrialJobApplicationForm, TrialJobDetail, TrialJobStatus } from '../../ export class KubernetesTrialJobDetail implements TrialJobDetail { public id: string; public status: TrialJobStatus; + public message?: string; public submitTime: number; public startTime?: number; public endTime?: number; @@ -26,6 +27,7 @@ export class KubernetesTrialJobDetail implements TrialJobDetail { kubernetesJobName: string, url: string) { this.id = id; this.status = status; + this.message = 'Pending for creating the trial job.'; this.submitTime = submitTime; this.workingDirectory = workingDirectory; this.form = form; diff --git a/ts/nni_manager/training_service/kubernetes/kubernetesJobInfoCollector.ts b/ts/nni_manager/training_service/kubernetes/kubernetesJobInfoCollector.ts index 129bd7ba5b..2865ff19fd 100644 --- a/ts/nni_manager/training_service/kubernetes/kubernetesJobInfoCollector.ts +++ b/ts/nni_manager/training_service/kubernetes/kubernetesJobInfoCollector.ts @@ -23,21 +23,16 @@ export class KubernetesJobInfoCollector { this.statusesNeedToCheck = ['RUNNING', 'WAITING']; } - public async retrieveTrialStatus(kubernetesCRDClient: KubernetesCRDClient | undefined): Promise { + public async retrieveTrialStatus(kubernetesCRDClient: KubernetesCRDClient | undefined): Promise { assert(kubernetesCRDClient !== undefined); const updateKubernetesTrialJobs: 
Promise[] = []; for (const [trialJobId, kubernetesTrialJob] of this.trialJobsMap) { if (kubernetesTrialJob === undefined) { throw new NNIError(NNIErrorNames.NOT_FOUND, `trial job id ${trialJobId} not found`); } - // Since Kubeflow needs some delay to schedule jobs, we provide 20 seconds buffer time to check kubeflow job's status - if (Date.now() - kubernetesTrialJob.submitTime < 20 * 1000) { - return Promise.resolve(); - } updateKubernetesTrialJobs.push(this.retrieveSingleTrialJobInfo(kubernetesCRDClient, kubernetesTrialJob)); } - - await Promise.all(updateKubernetesTrialJobs); + return Promise.all(updateKubernetesTrialJobs); } protected async retrieveSingleTrialJobInfo(_kubernetesCRDClient: KubernetesCRDClient | undefined, diff --git a/ts/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts b/ts/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts index 11a54c453c..71e92d19c3 100644 --- a/ts/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts +++ b/ts/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts @@ -209,6 +209,13 @@ abstract class KubernetesTrainingService { return Promise.reject(error); } + try { + await this.genericK8sClient.deleteDeployment("adaptdl-tensorboard-" + getExperimentId().toLowerCase()) + this.log.info('tensorboard deployment deleted') + } catch (error) { + this.log.error(`tensorboard deployment deletion failed: ${error.message}`) + } + return Promise.resolve(); } @@ -377,6 +384,5 @@ abstract class KubernetesTrainingService { } return Promise.resolve(folderUriInAzure); } - } export { KubernetesTrainingService }; diff --git a/ts/nni_manager/training_service/test/adlTrainingService.test.ts b/ts/nni_manager/training_service/test/adlTrainingService.test.ts new file mode 100644 index 0000000000..9ba5509c5b --- /dev/null +++ b/ts/nni_manager/training_service/test/adlTrainingService.test.ts @@ -0,0 +1,138 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT license. 
+ +'use strict'; + +import * as chai from 'chai'; +import * as chaiAsPromised from 'chai-as-promised'; +import * as fs from 'fs'; +import * as tmp from 'tmp'; +import * as component from '../../common/component'; +import { TrialJobApplicationForm, TrialJobDetail, TrainingService } from '../../common/trainingService'; +import { cleanupUnitTest, prepareUnitTest } from '../../common/utils'; +import { TrialConfigMetadataKey } from '../common/trialConfigMetadataKey'; +import { AdlTrainingService } from '../kubernetes/adl/adlTrainingService'; + +const localCodeDir: string = tmp.dirSync().name + +describe('Unit Test for AdlTrainingService', () => { + let skip: boolean = false; + try { + const testKubeflowConfig = fs.readFileSync('/home/vsts/.kube/config', 'utf8'); + } catch (err) { + console.log('Please have kubernetes cluster to enable its training service unit test.'); + skip = true; + } + + let testAdlTrialConfig: any = JSON.stringify({ + "command": "python3 /root/apps/nni_linear_regression/main.py", + "codeDir": ".", + "gpuNum": 0, + "image": "test.image:latest", + "imagePullSecrets": [ + { + "name": "stagingsecrets" + } + ], + "nfs": { + "server": "172.20.188.236", + "path": "/exports", + "containerMountPath": "/nfs" + }, + "memorySize": "1Gi", + "cpuNum": 1 + }); + let testAdlTrialConfig2: any = JSON.stringify({ + "command": "python3 /root/apps/nni_linear_regression/main.py", + "codeDir": ".", + "gpuNum": 0, + "image": "test.image:latest", + "imagePullSecrets": [ + { + "name": "stagingsecrets" + } + ], + "adaptive": true, + "checkpoint": { + "storageClass": "aws-efs", + "storageSize": "1Gi" + }, + "nfs": { + "server": "172.20.188.236", + "path": "/exports", + "containerMountPath": "/nfs" + } + }); + let testNniManagerIp: any = JSON.stringify({ + "nniManagerIp": "0.0.0.0" + }); + let adlTrainingService: AdlTrainingService; + console.log(tmp.dirSync().name); + + before(() => { + chai.should(); + chai.use(chaiAsPromised); + prepareUnitTest(); + }); + + after(() => { + 
cleanupUnitTest(); + }); + + beforeEach(() => { + if (skip) { + return; + } + adlTrainingService = component.get(AdlTrainingService); + adlTrainingService.run() + }); + + afterEach(() => { + if (skip) { + return; + } + adlTrainingService.cleanUp(); + }); + + it('Set and get cluster metadata', async () => { + if (skip) { + return; + } + await adlTrainingService.setClusterMetadata(TrialConfigMetadataKey.TRIAL_CONFIG, testAdlTrialConfig2); + await adlTrainingService.setClusterMetadata(TrialConfigMetadataKey.NNI_MANAGER_IP, testNniManagerIp); + let data:string = await adlTrainingService.getClusterMetadata(TrialConfigMetadataKey.TRIAL_CONFIG); + chai.expect(data).to.be.equals(testAdlTrialConfig2); + }); + + it('Submit job', async () => { + if (skip) { + return; + } + // job without given checkpoint, with resource config + await adlTrainingService.setClusterMetadata(TrialConfigMetadataKey.TRIAL_CONFIG, testAdlTrialConfig); + let form: TrialJobApplicationForm = { + sequenceId: 0, + hyperParameters: { + value: 'mock hyperparameters', + index: 0 + } + }; + let jobDetail: TrialJobDetail = await adlTrainingService.submitTrialJob(form); + chai.expect(jobDetail.status).to.be.equals('WAITING'); + await adlTrainingService.cancelTrialJob(jobDetail.id); + chai.expect(jobDetail.status).to.be.equals('USER_CANCELED'); + // job with given checkpoint + await adlTrainingService.setClusterMetadata(TrialConfigMetadataKey.TRIAL_CONFIG, testAdlTrialConfig2); + form = { + sequenceId: 0, + hyperParameters: { + value: 'mock hyperparameters', + index: 0 + } + }; + jobDetail = await adlTrainingService.submitTrialJob(form); + chai.expect(jobDetail.status).to.be.equals('WAITING'); + await adlTrainingService.cancelTrialJob(jobDetail.id); + chai.expect(jobDetail.status).to.be.equals('USER_CANCELED'); + }).timeout(3000000); +}); diff --git a/ts/webui/src/App.scss b/ts/webui/src/App.scss index 3811759ea8..f129bcf1c0 100644 --- a/ts/webui/src/App.scss +++ b/ts/webui/src/App.scss @@ -29,6 +29,8 @@ width: 
87%; margin: 0 auto; min-width: 1200px; + + /* nav bar: 56 + marginTop: 18 */ margin-top: 74px; margin-bottom: 30px; } diff --git a/ts/webui/src/App.tsx b/ts/webui/src/App.tsx index daaf9be1ce..830fb50a44 100644 --- a/ts/webui/src/App.tsx +++ b/ts/webui/src/App.tsx @@ -6,6 +6,7 @@ import NavCon from './components/NavCon'; import MessageInfo from './components/modals/MessageInfo'; import { SlideNavBtns } from './components/slideNav/SlideNavBtns'; import './App.scss'; +import './static/style/common.scss'; interface AppState { interval: number; diff --git a/ts/webui/src/components/Overview.tsx b/ts/webui/src/components/Overview.tsx index 2188df7d7e..4dd5246cac 100644 --- a/ts/webui/src/components/Overview.tsx +++ b/ts/webui/src/components/Overview.tsx @@ -13,8 +13,9 @@ import { TrialCount } from './overview/count/TrialCount'; import { Command1 } from './overview/command/Command1'; import { Command2 } from './overview/command/Command2'; import { TitleContext } from './overview/TitleContext'; -import { itemStyle1, itemStyleSucceed, itemStyle2, entriesOption } from './overview/overviewConst'; +import { itemStyleSucceed, entriesOption } from './overview/overviewConst'; import '../static/style/overview/overview.scss'; +import '../static/style/overview/topTrial.scss'; import '../static/style/logPath.scss'; interface OverviewState { @@ -89,42 +90,40 @@ class Overview extends React.Component<{}, OverviewState> { {/* duration & trial numbers */} -
    -
    - - - </TitleContext.Provider> - <ExpDurationContext.Provider - value={{ - maxExecDuration, - execDuration, - updateOverviewPage, - maxDurationUnit, - changeMaxDurationUnit - }} - > - <ExpDuration /> - </ExpDurationContext.Provider> - </div> - <div className='trialCount'> - <TitleContext.Provider value={{ text: 'Trial numbers', icon: 'NumberSymbol' }}> - <Title /> - </TitleContext.Provider> - <ExpDurationContext.Provider - value={{ - maxExecDuration, - execDuration, - updateOverviewPage, - maxDurationUnit, - changeMaxDurationUnit - }} - > - <TrialCount /> - </ExpDurationContext.Provider> - </div> + <div className='duration'> + <TitleContext.Provider value={{ text: 'Duration', icon: 'Timer' }}> + <Title /> + </TitleContext.Provider> + <ExpDurationContext.Provider + value={{ + maxExecDuration, + execDuration, + updateOverviewPage, + maxDurationUnit, + changeMaxDurationUnit + }} + > + <ExpDuration /> + </ExpDurationContext.Provider> + </div> + <div className='trialCount'> + <TitleContext.Provider value={{ text: 'Trial numbers', icon: 'NumberSymbol' }}> + <Title /> + </TitleContext.Provider> + <ExpDurationContext.Provider + value={{ + maxExecDuration, + execDuration, + updateOverviewPage, + maxDurationUnit, + changeMaxDurationUnit + }} + > + <TrialCount /> + </ExpDurationContext.Provider> </div> {/* table */} - <div className='overviewTable'> + <div className='overviewBestMetric'> <Stack horizontal> <div style={itemStyleSucceed}> <TitleContext.Provider value={{ text: 'Top trials', icon: 'BulletedList' }}> @@ -167,7 +166,13 @@ class Overview extends React.Component<{}, OverviewState> { </Stack> </div> </Stack> - <SuccessTable trialIds={bestTrials.map(trial => trial.info.trialJobId)} /> + <div className='overviewChart'> + <Accuracy accuracyData={accuracyGraphData} accNodata={noDataMessage} /> + <SuccessTable + trialIds={bestTrials.map(trial => trial.info.trialJobId)} + updateOverviewPage={updateOverviewPage} + /> + </div> </div> <div className='overviewCommand1'> 
<Command1 /> @@ -175,24 +180,6 @@ class Overview extends React.Component<{}, OverviewState> { <div className='overviewCommand2'> <Command2 /> </div> - <div className='overviewChart'> - <Stack horizontal> - <div style={itemStyle1}> - <TitleContext.Provider - value={{ text: 'Trial metric chart', icon: 'HomeGroup' }} - > - <Title /> - </TitleContext.Provider> - </div> - <div style={itemStyle2}> - <Stack className='maxmin' horizontal> - <div className='circle' /> - <div>{`Top ${this.context.metricGraphMode}imal trials`}</div> - </Stack> - </div> - </Stack> - <Accuracy accuracyData={accuracyGraphData} accNodata={noDataMessage} height={380} /> - </div> </div> </div> ); @@ -219,8 +206,8 @@ class Overview extends React.Component<{}, OverviewState> { return { // support max show 0.0000000 grid: { - left: 67, - right: 40 + x: 60, + y: 40 }, tooltip: { trigger: 'item' diff --git a/ts/webui/src/components/overview/Accuracy.tsx b/ts/webui/src/components/overview/Accuracy.tsx index 395e04d364..8130c14a9d 100644 --- a/ts/webui/src/components/overview/Accuracy.tsx +++ b/ts/webui/src/components/overview/Accuracy.tsx @@ -11,7 +11,6 @@ import 'echarts/lib/component/title'; interface AccuracyProps { accuracyData: object; accNodata: string; - height: number; } class Accuracy extends React.Component<AccuracyProps, {}> { @@ -20,17 +19,10 @@ class Accuracy extends React.Component<AccuracyProps, {}> { } render(): React.ReactNode { - const { accNodata, accuracyData, height } = this.props; + const { accNodata, accuracyData } = this.props; return ( - <div style={{ position: 'relative' }}> - <ReactEcharts - option={accuracyData} - style={{ - height: height, - margin: '0 auto' - }} - theme='my_theme' - /> + <div className='defaultMetricContainer'> + <ReactEcharts option={accuracyData} theme='my_theme' /> <div className='showMess'>{accNodata}</div> </div> ); diff --git a/ts/webui/src/components/overview/command/Command1.tsx b/ts/webui/src/components/overview/command/Command1.tsx index 
fc78a795fb..ae4fa4d595 100644 --- a/ts/webui/src/components/overview/command/Command1.tsx +++ b/ts/webui/src/components/overview/command/Command1.tsx @@ -6,24 +6,29 @@ export const Command1 = (): any => { const tuner = EXPERIMENT.profile.params.tuner; const advisor = EXPERIMENT.profile.params.advisor; const assessor = EXPERIMENT.profile.params.assessor; - let title = ''; - let builtinName = ''; + const title: string[] = []; + const builtinName: string[] = []; if (tuner !== undefined) { - title = title.concat('Tuner'); + title.push('Tuner'); if (tuner.builtinTunerName !== undefined) { - builtinName = builtinName.concat(tuner.builtinTunerName); + builtinName.push(tuner.builtinTunerName); } } + if (advisor !== undefined) { - title = title.concat('/ Assessor'); + title.push('Advisor'); if (advisor.builtinAdvisorName !== undefined) { - builtinName = builtinName.concat(advisor.builtinAdvisorName); + builtinName.push(advisor.builtinAdvisorName); + } + if (advisor.className !== undefined) { + builtinName.push(advisor.className); } } + if (assessor !== undefined) { - title = title.concat('/ Addvisor'); + title.push('Assessor'); if (assessor.builtinAssessorName !== undefined) { - builtinName = builtinName.concat(assessor.builtinAssessorName); + builtinName.push(assessor.builtinAssessorName); } } @@ -32,8 +37,8 @@ export const Command1 = (): any => { <div> <p className='command'>Training platform</p> <div className='nowrap'>{EXPERIMENT.profile.params.trainingServicePlatform}</div> - <p className='lineMargin'>{title}</p> - <div className='nowrap'>{builtinName}</div> + <p className='lineMargin'>{title.join('/')}</p> + <div className='nowrap'>{builtinName.join('/')}</div> </div> </div> ); diff --git a/ts/webui/src/components/overview/count/EditExperimentParam.tsx b/ts/webui/src/components/overview/count/EditExperimentParam.tsx index 07eee048a4..bc7168cd69 100644 --- a/ts/webui/src/components/overview/count/EditExperimentParam.tsx +++ 
b/ts/webui/src/components/overview/count/EditExperimentParam.tsx @@ -188,16 +188,16 @@ export const EditExperimentParam = (): any => { /> )} {isShowPencil && ( - <span className='edit' onClick={hidePencil}> + <span className='edit cursor' onClick={hidePencil}> {Edit} </span> )} {!isShowPencil && ( <span className='series'> - <span className='confirm' onClick={confirmEdit}> + <span className='confirm cursor' onClick={confirmEdit}> {CheckMark} </span> - <span className='cancel' onClick={cancelEdit}> + <span className='cancel cursor' onClick={cancelEdit}> {Cancel} </span> </span> diff --git a/ts/webui/src/components/overview/count/ExpDuration.tsx b/ts/webui/src/components/overview/count/ExpDuration.tsx index 12a1d723ff..e5078968f3 100644 --- a/ts/webui/src/components/overview/count/ExpDuration.tsx +++ b/ts/webui/src/components/overview/count/ExpDuration.tsx @@ -6,7 +6,7 @@ import { convertDuration, convertTimeAsUnit } from '../../../static/function'; import { EditExperimentParam } from './EditExperimentParam'; import { ExpDurationContext } from './ExpDurationContext'; import { EditExpeParamContext } from './context'; -import { durationItem1, durationItem2 } from './commonStyle'; +import { leftProgress, durationItem2, progressHeight } from './commonStyle'; import '../../../static/style/overview/count.scss'; export const ExpDuration = (): any => ( @@ -19,7 +19,7 @@ export const ExpDuration = (): any => ( const maxExecDurationStr = convertTimeAsUnit(maxDurationUnit, maxExecDuration).toString(); return ( <Stack horizontal className='ExpDuration'> - <div style={durationItem1}> + <div style={leftProgress}> <TooltipHost content={`${convertDuration(tooltip)} remaining`} directionalHint={DirectionalHint.bottomCenter} @@ -33,7 +33,11 @@ export const ExpDuration = (): any => ( } }} > - <ProgressIndicator className={EXPERIMENT.status} percentComplete={percent} barHeight={15} /> + <ProgressIndicator + className={EXPERIMENT.status} + percentComplete={percent} + 
barHeight={progressHeight} + /> </TooltipHost> {/* execDuration / maxDuration: 20min / 1h */} <div className='exp-progress'> diff --git a/ts/webui/src/components/overview/count/TrialCount.tsx b/ts/webui/src/components/overview/count/TrialCount.tsx index f5846bdae0..0f473675b0 100644 --- a/ts/webui/src/components/overview/count/TrialCount.tsx +++ b/ts/webui/src/components/overview/count/TrialCount.tsx @@ -5,7 +5,7 @@ import { CONTROLTYPE, TOOLTIP_BACKGROUND_COLOR, MAX_TRIAL_NUMBERS } from '../../ import { EditExperimentParam } from './EditExperimentParam'; import { EditExpeParamContext } from './context'; import { ExpDurationContext } from './ExpDurationContext'; -import { trialCountItem1, trialCountItem2 } from './commonStyle'; +import { leftProgress, trialCountItem2, progressHeight } from './commonStyle'; export const TrialCount = (): any => { const count = TRIALS.countStatus(); @@ -23,9 +23,9 @@ export const TrialCount = (): any => { return ( <React.Fragment> <Stack horizontal horizontalAlign='space-between' className='ExpDuration'> - <div style={trialCountItem1}> + <div style={leftProgress}> <TooltipHost - content={bar2.toString()} + content={`${bar2.toString()} trials`} directionalHint={DirectionalHint.bottomCenter} tooltipProps={{ calloutProps: { @@ -40,7 +40,7 @@ export const TrialCount = (): any => { <ProgressIndicator className={EXPERIMENT.status} percentComplete={bar2Percent} - barHeight={15} + barHeight={progressHeight} /> </TooltipHost> <div className='exp-progress'> @@ -81,7 +81,7 @@ export const TrialCount = (): any => { </EditExpeParamContext.Provider> </div> </Stack> - <Stack horizontal horizontalAlign='space-between' className='mess'> + <Stack horizontal horizontalAlign='space-between' className='trialStatus'> <div className='basic'> <p>Running</p> <div>{count.get('RUNNING')}</div> diff --git a/ts/webui/src/components/overview/count/commonStyle.ts b/ts/webui/src/components/overview/count/commonStyle.ts index 2138aa7a5c..f18b6a22d1 100644 --- 
a/ts/webui/src/components/overview/count/commonStyle.ts +++ b/ts/webui/src/components/overview/count/commonStyle.ts @@ -1,16 +1,15 @@ -const durationItem1: React.CSSProperties = { - width: '33%' +const leftProgress: React.CSSProperties = { + width: '33%', + position: 'relative', + top: 6 }; const durationItem2: React.CSSProperties = { - width: '52%', + width: '51.5%', paddingLeft: '15%' }; -const trialCountItem1: React.CSSProperties = { - width: '33%' -}; const trialCountItem2: React.CSSProperties = { - width: '52%' + width: '51.5%' }; - -export { durationItem1, durationItem2, trialCountItem1, trialCountItem2 }; +const progressHeight = 8; +export { leftProgress, durationItem2, trialCountItem2, progressHeight }; diff --git a/ts/webui/src/components/overview/experiment/BasicInfo.tsx b/ts/webui/src/components/overview/experiment/BasicInfo.tsx index 605b9ed025..e5ca1b82d9 100644 --- a/ts/webui/src/components/overview/experiment/BasicInfo.tsx +++ b/ts/webui/src/components/overview/experiment/BasicInfo.tsx @@ -23,76 +23,74 @@ export const ReBasicInfo = (): any => { return ( <div> - <div className='basic'> - <p> - ID: <span>{EXPERIMENT.profile.id}</span> - </p> - <div>{EXPERIMENT.profile.params.experimentName}</div> - </div> - <div className='basic'> - <p>Status</p> - <Stack horizontal className='status'> - <span className={`${EXPERIMENT.status} status-text`}>{EXPERIMENT.status}</span> - {EXPERIMENT.status === 'ERROR' ? ( - <div> - <div className={styles.buttonArea} ref={ref}> - <IconButton - iconProps={{ iconName: 'info' }} - onClick={isCalloutVisible ? 
onDismiss : showCallout} - /> - </div> - {isCalloutVisible && ( - <Callout - className={styles.callout} - ariaLabelledBy={labelId} - ariaDescribedBy={descriptionId} - role='alertdialog' - gapSpace={0} - target={ref} - onDismiss={onDismiss} - setInitialFocus={true} - > - <div className={styles.header}> - <p className={styles.title} id={labelId} style={{ color: '#333' }}> - Error - </p> - </div> - <div className={styles.inner}> - <p className={styles.subtext} id={descriptionId} style={{ color: '#333' }}> - {EXPERIMENT.error} - </p> - <div className={styles.actions}> - <Link className={styles.link} onClick={ShowLogDrawer}> - Learn about - </Link> + <Stack horizontal horizontalAlign='space-between' className='mess'> + <div className='basic'> + <p>Name</p> + <div className='nowrap'>{EXPERIMENT.profile.params.experimentName}</div> + <p className='margin'>ID</p> + <div className='nowrap'>{EXPERIMENT.profile.id}</div> + </div> + <div className='basic'> + <p>Status</p> + <Stack horizontal className='status'> + <span className={`${EXPERIMENT.status} status-text`}>{EXPERIMENT.status}</span> + {EXPERIMENT.status === 'ERROR' ? ( + <div> + <div className={styles.buttonArea} ref={ref}> + <IconButton + iconProps={{ iconName: 'info' }} + onClick={isCalloutVisible ? 
onDismiss : showCallout} + /> + </div> + {isCalloutVisible && ( + <Callout + className={styles.callout} + ariaLabelledBy={labelId} + ariaDescribedBy={descriptionId} + role='alertdialog' + gapSpace={0} + target={ref} + onDismiss={onDismiss} + setInitialFocus={true} + > + <div className={styles.header}> + <p className={`${styles.title} color`} id={labelId}> + Error + </p> + </div> + <div className={styles.inner}> + <p className={`${styles.subtext} color`} id={descriptionId}> + {EXPERIMENT.error} + </p> + <div className={styles.actions}> + <Link className={styles.link} onClick={ShowLogDrawer}> + Learn about + </Link> + </div> </div> - </div> - </Callout> - )} - </div> - ) : null} - </Stack> - </div> - <div className='basic'> - <BestMetricContext.Consumer> - {(value): React.ReactNode => ( - <Stack className='bestMetric'> - <p>Best metric</p> - <div className={EXPERIMENT.status}> - {isNaN(value.bestAccuracy) ? 'N/A' : value.bestAccuracy.toFixed(6)} + </Callout> + )} </div> - </Stack> - )} - </BestMetricContext.Consumer> - </div> - <div className='basic'> - <p>Start time</p> - <div className='nowrap'>{formatTimestamp(EXPERIMENT.profile.startTime)}</div> - </div> - <div className='basic'> - <p>End time</p> - <div className='nowrap'>{formatTimestamp(EXPERIMENT.profile.endTime)}</div> - </div> + ) : null} + </Stack> + <BestMetricContext.Consumer> + {(value): React.ReactNode => ( + <Stack className='bestMetric'> + <p className='margin'>Best metric</p> + <div className={EXPERIMENT.status}> + {isNaN(value.bestAccuracy) ? 'N/A' : value.bestAccuracy.toFixed(6)} + </div> + </Stack> + )} + </BestMetricContext.Consumer> + </div> + <div className='basic'> + <p>Start time</p> + <div className='nowrap'>{formatTimestamp(EXPERIMENT.profile.startTime)}</div> + <p className='margin'>End time</p> + <div className='nowrap'>{formatTimestamp(EXPERIMENT.profile.endTime)}</div> + </div> + </Stack> {/* learn about click -> default active key is dispatcher. */} {isShowLogDrawer ? 
<LogDrawer closeDrawer={closeLogDrawer} activeTab='dispatcher' /> : null} </div> diff --git a/ts/webui/src/components/overview/table/Details.tsx b/ts/webui/src/components/overview/table/Details.tsx deleted file mode 100644 index bf11ddb789..0000000000 --- a/ts/webui/src/components/overview/table/Details.tsx +++ /dev/null @@ -1,37 +0,0 @@ -import * as React from 'react'; -import { DetailsRow, IDetailsRowBaseProps } from '@fluentui/react'; -import OpenRow from '../../public-child/OpenRow'; - -interface DetailsProps { - detailsProps: IDetailsRowBaseProps; -} - -interface DetailsState { - isExpand: boolean; -} - -class Details extends React.Component<DetailsProps, DetailsState> { - constructor(props: DetailsProps) { - super(props); - this.state = { isExpand: false }; - } - - render(): React.ReactNode { - const { detailsProps } = this.props; - const { isExpand } = this.state; - return ( - <div> - <div - onClick={(): void => { - this.setState(() => ({ isExpand: !isExpand })); - }} - > - <DetailsRow {...detailsProps} /> - </div> - {isExpand && <OpenRow trialId={detailsProps.item.id} />} - </div> - ); - } -} - -export default Details; diff --git a/ts/webui/src/components/overview/table/SuccessTable.tsx b/ts/webui/src/components/overview/table/SuccessTable.tsx index 424ea75d5f..db568c6e5a 100644 --- a/ts/webui/src/components/overview/table/SuccessTable.tsx +++ b/ts/webui/src/components/overview/table/SuccessTable.tsx @@ -1,9 +1,22 @@ import * as React from 'react'; -import { DetailsList, IDetailsListProps, IColumn } from '@fluentui/react'; +import { + DetailsList, + IDetailsListProps, + IColumn, + Icon, + DetailsRow, + IRenderFunction, + IDetailsHeaderProps, + Sticky, + StickyPositionType, + ScrollablePane, + ScrollbarVisibility +} from '@fluentui/react'; import DefaultMetric from '../../public-child/DefaultMetric'; -import Details from './Details'; -import { convertDuration } from '../../../static/function'; +import OpenRow from '../../public-child/OpenRow'; +import { 
convertDuration, copyAndSort } from '../../../static/function'; import { TRIALS } from '../../../static/datamodel'; +import { SortInfo } from '../../../static/interface'; import { DETAILTABS } from '../../stateless-component/NNItabs'; import '../../../static/style/succTable.scss'; import '../../../static/style/tableStatus.css'; @@ -11,12 +24,15 @@ import '../../../static/style/openRow.scss'; interface SuccessTableProps { trialIds: string[]; + // eslint-disable-next-line @typescript-eslint/no-unused-vars + updateOverviewPage: () => void; } interface SuccessTableState { columns: IColumn[]; source: Array<any>; - innerWidth: number; + expandRowIdList: Set<string>; + sortInfo: SortInfo; } class SuccessTable extends React.Component<SuccessTableProps, SuccessTableState> { @@ -25,18 +41,42 @@ class SuccessTable extends React.Component<SuccessTableProps, SuccessTableState> this.state = { columns: this.columns, source: TRIALS.table(this.props.trialIds), - innerWidth: window.innerWidth + sortInfo: { field: '', isDescend: false }, + expandRowIdList: new Set() // store expanded row's trial id }; } - private onRenderRow: IDetailsListProps['onRenderRow'] = props => { - if (props) { - return <Details detailsProps={props} />; + componentDidUpdate(prevProps: SuccessTableProps): void { + if (this.props.trialIds !== prevProps.trialIds) { + const { trialIds } = this.props; + this.setState(() => ({ source: TRIALS.table(trialIds) })); } - return null; - }; + } - onColumnClick = (ev: React.MouseEvent<HTMLElement>, getColumn: IColumn): void => { + render(): React.ReactNode { + const { columns, source, sortInfo } = this.state; + const keepSortedSource = copyAndSort(source, sortInfo.field, sortInfo.isDescend); + const isNoneData = source.length === 0 ? 
true : false; + return ( + <div id='succTable'> + <ScrollablePane className='scrollPanel' scrollbarVisibility={ScrollbarVisibility.auto}> + <DetailsList + columns={columns} + items={keepSortedSource} + setKey='set' + compact={true} + onRenderRow={this.onRenderRow} + onRenderDetailsHeader={this.onRenderDetailsHeader} + selectionMode={0} // close selector function + className='succTable' + /> + </ScrollablePane> + {isNoneData && <div className='succTable-tooltip'>{this.tooltipStr}</div>} + </div> + ); + } + + private onColumnClick = (_ev: React.MouseEvent<HTMLElement>, getColumn: IColumn): void => { const { columns, source } = this.state; const newColumns: IColumn[] = columns.slice(); const currColumn: IColumn = newColumns.filter(item => getColumn.key === item.key)[0]; @@ -50,32 +90,51 @@ class SuccessTable extends React.Component<SuccessTableProps, SuccessTableState> } }); // eslint-disable-next-line @typescript-eslint/no-non-null-assertion - const newItems = this.copyAndSort(source, currColumn.fieldName!, currColumn.isSortedDescending); + const newItems = copyAndSort(source, currColumn.fieldName!, currColumn.isSortedDescending); this.setState({ columns: newColumns, - source: newItems + source: newItems, + // eslint-disable-next-line @typescript-eslint/no-non-null-assertion + sortInfo: { field: currColumn.fieldName!, isDescend: currColumn.isSortedDescending } }); }; - private copyAndSort<T>(items: T[], columnKey: string, isSortedDescending?: boolean): T[] { - const key = columnKey as keyof T; - return items.slice(0).sort((a: T, b: T) => ((isSortedDescending ? a[key] < b[key] : a[key] > b[key]) ? 1 : -1)); - } - - tooltipStr = ( + private tooltipStr = ( <React.Fragment> The experiment is running, please wait for the final metric patiently. You could also find status of trial job with <span>{DETAILTABS}</span> button. 
</React.Fragment> ); - columns = [ + private columns = [ + { + key: '_expand', + name: '', + onRender: (item: any): any => ( + <Icon + aria-hidden={true} + iconName='ChevronRight' + styles={{ + root: { + transition: 'all 0.2s', + transform: `rotate(${this.state.expandRowIdList.has(item.id) ? 90 : 0}deg)` + } + }} + className='cursor' + onClick={this.expandTrialId.bind(this, Event, item.id)} + /> + ), + fieldName: 'expand', + isResizable: false, + minWidth: 20, + maxWidth: 20 + }, { name: 'Trial No.', key: 'sequenceId', fieldName: 'sequenceId', // required! - minWidth: 50, - maxWidth: 87, + minWidth: 60, + maxWidth: 100, isResizable: true, data: 'number', onColumnClick: this.onColumnClick, @@ -85,8 +144,8 @@ class SuccessTable extends React.Component<SuccessTableProps, SuccessTableState> name: 'ID', key: 'id', fieldName: 'id', - minWidth: 50, - maxWidth: 87, + minWidth: 60, + maxWidth: 118, isResizable: true, className: 'tableHead leftTitle', data: 'string', @@ -96,8 +155,8 @@ class SuccessTable extends React.Component<SuccessTableProps, SuccessTableState> { name: 'Duration', key: 'duration', - minWidth: 65, - maxWidth: 150, + minWidth: 85, + maxWidth: 166, isResizable: true, fieldName: 'duration', data: 'number', @@ -111,8 +170,8 @@ class SuccessTable extends React.Component<SuccessTableProps, SuccessTableState> { name: 'Status', key: 'status', - minWidth: 80, - maxWidth: 150, + minWidth: 98, + maxWidth: 160, isResizable: true, fieldName: 'status', onRender: (item: any): React.ReactNode => ( @@ -124,7 +183,7 @@ class SuccessTable extends React.Component<SuccessTableProps, SuccessTableState> key: 'accuracy', fieldName: 'accuracy', minWidth: 100, - maxWidth: 160, + maxWidth: 166, isResizable: true, data: 'number', onColumnClick: this.onColumnClick, @@ -132,43 +191,49 @@ class SuccessTable extends React.Component<SuccessTableProps, SuccessTableState> } ]; - setInnerWidth = (): void => { - this.setState(() => ({ innerWidth: window.innerWidth })); + private 
onRenderDetailsHeader: IRenderFunction<IDetailsHeaderProps> = (props, defaultRender) => { + if (!props) { + return null; + } + return ( + <Sticky stickyPosition={StickyPositionType.Header} isScrollSynced> + {// eslint-disable-next-line @typescript-eslint/no-non-null-assertion + defaultRender!({ + ...props + })} + </Sticky> + ); }; - componentDidMount(): void { - window.addEventListener('resize', this.setInnerWidth); - } - componentWillUnmount(): void { - window.removeEventListener('resize', this.setInnerWidth); - } - - componentDidUpdate(prevProps: SuccessTableProps): void { - if (this.props.trialIds !== prevProps.trialIds) { - const { trialIds } = this.props; - this.setState(() => ({ source: TRIALS.table(trialIds) })); + private onRenderRow: IDetailsListProps['onRenderRow'] = props => { + const { expandRowIdList } = this.state; + if (props) { + return ( + <div> + <div> + <DetailsRow {...props} /> + </div> + {Array.from(expandRowIdList).map( + item => item === props.item.id && <OpenRow key={item} trialId={item} /> + )} + </div> + ); } - } - - render(): React.ReactNode { - const { columns, source } = this.state; - const isNoneData = source.length === 0 ? 
true : false; + return null; + }; - return ( - <div id='succTable'> - <DetailsList - columns={columns} - items={source} - setKey='set' - compact={true} - onRenderRow={this.onRenderRow} - selectionMode={0} // close selector function - className='succTable' - /> - {isNoneData && <div className='succTable-tooltip'>{this.tooltipStr}</div>} - </div> - ); - } + private expandTrialId = (_event: any, id: string): void => { + const { expandRowIdList } = this.state; + const { updateOverviewPage } = this.props; + const copyExpandList = expandRowIdList; + if (copyExpandList.has(id)) { + copyExpandList.delete(id); + } else { + copyExpandList.add(id); + } + this.setState(() => ({ expandRowIdList: copyExpandList })); + updateOverviewPage(); + }; } export default SuccessTable; diff --git a/ts/webui/src/components/trial-detail/TableList.tsx b/ts/webui/src/components/trial-detail/TableList.tsx index 6752f78b3a..f0cf341ae6 100644 --- a/ts/webui/src/components/trial-detail/TableList.tsx +++ b/ts/webui/src/components/trial-detail/TableList.tsx @@ -15,8 +15,8 @@ import { import React from 'react'; import { EXPERIMENT, TRIALS } from '../../static/datamodel'; import { TOOLTIP_BACKGROUND_COLOR } from '../../static/const'; -import { convertDuration, formatTimestamp } from '../../static/function'; -import { TableObj } from '../../static/interface'; +import { convertDuration, formatTimestamp, copyAndSort } from '../../static/function'; +import { TableObj, SortInfo } from '../../static/interface'; import '../../static/style/search.scss'; import '../../static/style/tableStatus.css'; import '../../static/style/logPath.scss'; @@ -56,36 +56,6 @@ const searchOptionLiterals = { const defaultDisplayedColumns = ['sequenceId', 'id', 'duration', 'status', 'latestAccuracy']; -interface SortInfo { - field: string; - isDescend?: boolean; -} - -function _copyAndSort<T>(items: T[], columnKey: string, isSortedDescending?: boolean): any { - const key = columnKey as keyof T; - return 
items.slice(0).sort(function(a: T, b: T): any { - if ( - a[key] === undefined || - Object.is(a[key], NaN) || - Object.is(a[key], Infinity) || - Object.is(a[key], -Infinity) || - typeof a[key] === 'object' - ) { - return 1; - } - if ( - b[key] === undefined || - Object.is(b[key], NaN) || - Object.is(b[key], Infinity) || - Object.is(b[key], -Infinity) || - typeof b[key] === 'object' - ) { - return -1; - } - return (isSortedDescending ? a[key] < b[key] : a[key] > b[key]) ? 1 : -1; - }); -} - function _inferColumnTitle(columnKey: string): string { if (columnKey === 'sequenceId') { return 'Trial No.'; @@ -238,7 +208,7 @@ class TableList extends React.Component<TableListProps, TableListState> { const { sortInfo } = this.state; if (sortInfo.field !== '') { - return _copyAndSort(items, sortInfo.field, sortInfo.isDescend); + return copyAndSort(items, sortInfo.field, sortInfo.isDescend); } else { return items; } @@ -255,6 +225,7 @@ class TableList extends React.Component<TableListProps, TableListState> { <Icon aria-hidden={true} iconName='ChevronRight' + className='cursor' styles={{ root: { transition: 'all 0.2s', diff --git a/ts/webui/src/static/function.ts b/ts/webui/src/static/function.ts index 547fce8d8a..40f191594e 100644 --- a/ts/webui/src/static/function.ts +++ b/ts/webui/src/static/function.ts @@ -269,6 +269,30 @@ function caclMonacoEditorHeight(height): number { return height - 178; } +function copyAndSort<T>(items: T[], columnKey: string, isSortedDescending?: boolean): any { + const key = columnKey as keyof T; + return items.slice(0).sort(function(a: T, b: T): any { + if ( + a[key] === undefined || + Object.is(a[key], NaN) || + Object.is(a[key], Infinity) || + Object.is(a[key], -Infinity) || + typeof a[key] === 'object' + ) { + return 1; + } + if ( + b[key] === undefined || + Object.is(b[key], NaN) || + Object.is(b[key], Infinity) || + Object.is(b[key], -Infinity) || + typeof b[key] === 'object' + ) { + return -1; + } + return (isSortedDescending ? 
a[key] < b[key] : a[key] > b[key]) ? 1 : -1; + }); +} export { convertTime, convertDuration, @@ -288,5 +312,6 @@ export { requestAxios, isNaNorInfinity, formatComplexTypeValue, - caclMonacoEditorHeight + caclMonacoEditorHeight, + copyAndSort }; diff --git a/ts/webui/src/static/interface.ts b/ts/webui/src/static/interface.ts index 5eb7fbd82b..3171216a02 100644 --- a/ts/webui/src/static/interface.ts +++ b/ts/webui/src/static/interface.ts @@ -212,6 +212,12 @@ interface EventMap { [key: string]: () => void; } +// table column sort +interface SortInfo { + field: string; + isDescend?: boolean; +} + export { TableObj, TableRecord, @@ -233,5 +239,6 @@ export { NNIManagerStatus, EventMap, SingleAxis, - MultipleAxes + MultipleAxes, + SortInfo }; diff --git a/ts/webui/src/static/style/common.scss b/ts/webui/src/static/style/common.scss new file mode 100644 index 0000000000..6ecf745243 --- /dev/null +++ b/ts/webui/src/static/style/common.scss @@ -0,0 +1,5 @@ +.cursor{ + &:hover, & i:hover{ + cursor: pointer; + } +} \ No newline at end of file diff --git a/ts/webui/src/static/style/overview/count.scss b/ts/webui/src/static/style/overview/count.scss index 5069e40fd4..7ea47e5688 100644 --- a/ts/webui/src/static/style/overview/count.scss +++ b/ts/webui/src/static/style/overview/count.scss @@ -1,11 +1,7 @@ -$seriesIconMargin: 8px; +$seriesIconMargin: 10px; .ExpDuration { - margin-top: 28px; - - span:hover { - cursor: pointer; - } + margin-top: 20px; .maxTrialNum { margin-bottom: 10px; @@ -13,7 +9,7 @@ $seriesIconMargin: 8px; } .exp-progress { - margin-top: 10px; + margin-top: 16px; .bold { font-weight: 500; @@ -57,17 +53,17 @@ $seriesIconMargin: 8px; } &-dropdown { - width: 48px; + width: 65px; display: inline-block; position: relative; top: 13px; left: 4px; margin-right: 3px; - } -} -.ExpDuration .series .confirm { - margin: 0 6px; + .ms-Dropdown-title { + padding-right: 0; + } + } } .series { @@ -114,4 +110,12 @@ $seriesIconMargin: 8px; .basic p { margin-top: 0; } + + p.margin { 
+ margin-top: 20px; + } +} + +.trialStatus { + margin-top: 8px; } diff --git a/ts/webui/src/static/style/overview/overview.scss b/ts/webui/src/static/style/overview/overview.scss index 5cad95f167..2c062485e4 100644 --- a/ts/webui/src/static/style/overview/overview.scss +++ b/ts/webui/src/static/style/overview/overview.scss @@ -4,8 +4,8 @@ $boxGapPadding: 10px; .wrapper { display: grid; - grid-template-columns: repeat(9, 1fr); - grid-auto-rows: 97px; + grid-template-columns: repeat(8, 1fr); + grid-auto-rows: 93px; > div { background: #fff; @@ -14,61 +14,63 @@ $boxGapPadding: 10px; box-sizing: border-box; } - .overviewProgress { - grid-column: 2 / 6; - grid-row: 1 / 5; - display: grid; - grid-auto-rows: 70px; - margin: 0 $boxGapPadding; - padding: 0; - background: transparent; - - .duration, - .trialCount { - background: #fff; - padding: $boxPadding; - border-radius: $boxBorderRadius; - box-sizing: border-box; - - /* for alert message tooltip position */ - position: relative; - } + .duration, + .trialCount { + grid-column: 1 / 5; + background: #fff; + padding: $boxPadding; + border-radius: $boxBorderRadius; + box-sizing: border-box; + margin-top: $boxGapPadding; - .duration { - // grid-row: 1 / 3; - height: 139px; - } + /* for alert message tooltip position */ + position: relative; + } - .trialCount { - margin-top: 79px; - height: 239px; - } + .duration { + grid-row: 3 / 5; + height: 138px; + } + + .trialCount { + grid-row: 4 / 6; + margin-top: 65px; + height: 239px; } .overviewCommand1, .overviewCommand2 { + grid-row-start: 8; + margin-top: -59px; + margin-right: $boxGapPadding; border-radius: 0; } .overviewCommand1 { - grid-column-start: 1; + grid-column: 1 / 5; border-radius: $boxBorderRadius 0 0 $boxBorderRadius; } .overviewCommand2 { - grid-column: 2 / 6; - margin-right: 10px; + grid-column: 2 / 5; + margin-right: $boxGapPadding; padding-left: 30px; border-radius: 0 $boxBorderRadius $boxBorderRadius 0; } } .overviewBasicInfo { - grid-column-start: 1; - grid-row: 
1 / 5; + grid-column: 1 / 5; + grid-row: 1 / 3; z-index: 2; } +.overviewBasicInfo, +.duration, +.trialCount { + margin-right: $boxGapPadding; +} + .basic { line-height: 21px; font-family: "Segoe UI", Tahoma, Geneva, Verdana, sans-serif; @@ -96,9 +98,10 @@ $boxGapPadding: 10px; } } -.overviewTable { - grid-column: 6 / 10; - grid-row: 1 / 11; +.overviewBestMetric { + grid-column: 5 / 9; + grid-row: 1 / 9; + max-height: 736px; overflow: hidden; .topTrialTitle { @@ -114,43 +117,33 @@ $boxGapPadding: 10px; } .mincenter { - margin: 0 13px 0 10px; + margin: 0 13px 0 $boxGapPadding; } .chooseEntry { - margin-right: 10px; + margin-right: $boxGapPadding; line-height: 30px; } } .overviewCommand1, .overviewCommand2 { + grid-row: 7 / 9; height: 144px; overflow: hidden; - margin-top: 10px; + margin-top: $boxGapPadding; } -$circle: 10px; -$bgblue: #0071bc; - .overviewChart { - grid-column: 1 / 6; - grid-row: 7 / 11; - margin-right: $boxGapPadding; - margin-top: -29px; - - .circle { - width: $circle; - height: $circle; - border-radius: 50%; - background-color: $bgblue; - margin-top: 6px; - margin-right: 18px; - } + margin-top: 20px; } -.showMess { - position: absolute; - top: 42%; - left: 48%; +.defaultMetricContainer { + position: relative; + + .showMess { + position: absolute; + top: 42%; + left: 48%; + } } diff --git a/ts/webui/src/static/style/overview/overviewTitle.scss b/ts/webui/src/static/style/overview/overviewTitle.scss index 7182a0e1a9..1214002cf2 100644 --- a/ts/webui/src/static/style/overview/overviewTitle.scss +++ b/ts/webui/src/static/style/overview/overviewTitle.scss @@ -1,4 +1,4 @@ -$iconPaddingVal: 20px; +$iconPaddingVal: 10px; .panelTitle { span { @@ -9,7 +9,7 @@ $iconPaddingVal: 20px; i { font-size: 24px; - margin-right: 20px; + margin-right: $iconPaddingVal; color: #545454; } } @@ -18,22 +18,3 @@ $iconPaddingVal: 20px; #tabsty .anticon { margin-right: 0; } - -.top10bg { - .top10Title { - width: 160px; - } - - .title { - border-left: 2px solid #fff; - } - - 
.minTitle { - // margin-right: $iconPaddingVal; - border-right: 2px solid #fff; - } - - .title:hover { - cursor: pointer; - } -} diff --git a/ts/webui/src/static/style/overview/topTrial.scss b/ts/webui/src/static/style/overview/topTrial.scss new file mode 100644 index 0000000000..05497402f9 --- /dev/null +++ b/ts/webui/src/static/style/overview/topTrial.scss @@ -0,0 +1,15 @@ +$circle: 10px; +$bgblue: #0071bc; + +.maxmin { + margin-top: 40px; + + .circle { + width: $circle; + height: $circle; + border-radius: 50%; + background-color: $bgblue; + margin-top: 6px; + margin-right: 18px; + } +} diff --git a/ts/webui/src/static/style/progress/progress.scss b/ts/webui/src/static/style/progress/progress.scss index a2182903ac..c92fc0dc05 100644 --- a/ts/webui/src/static/style/progress/progress.scss +++ b/ts/webui/src/static/style/progress/progress.scss @@ -6,7 +6,10 @@ .status-text { display: inline-block; - line-height: 30px; + } + + .color { + color: #333; } } @@ -60,10 +63,4 @@ /* office-fabric-ui progressIndicator */ .ms-ProgressIndicator-itemProgress { padding: 0; - border: 2px solid #e6e6e6; -} - -.cursor, -.cursor:hover { - cursor: pointer; } diff --git a/ts/webui/src/static/style/succTable.scss b/ts/webui/src/static/style/succTable.scss index dfa32a4ba3..a4895e7df8 100644 --- a/ts/webui/src/static/style/succTable.scss +++ b/ts/webui/src/static/style/succTable.scss @@ -1,8 +1,17 @@ +$tableHeight: 381px; + +.scrollPanel { + height: $tableHeight; +} + #succTable { - min-height: 400px; - max-height: 1000px; - overflow-y: auto; + height: $tableHeight; position: relative; + top: -10px; + + .ms-DetailsHeader { + padding-top: 0; + } .succTable-tooltip { width: 90%; @@ -21,4 +30,4 @@ padding-left: 6px; box-sizing: border-box; } -} +} \ No newline at end of file