From d8d380a7c79fb7cf9d2dce3d5c95a17c6c1fc382 Mon Sep 17 00:00:00 2001 From: Scarlett Li <39592018+scarlett2018@users.noreply.github.com> Date: Wed, 16 Feb 2022 12:11:15 +0800 Subject: [PATCH 1/8] Update README.md --- README.md | 52 +++++++++++++++++++++++++++++++--------------------- 1 file changed, 31 insertions(+), 21 deletions(-) diff --git a/README.md b/README.md index 3878dbfd0f..2f733fcab9 100644 --- a/README.md +++ b/README.md @@ -1,27 +1,22 @@ -

+

-

+ +**A lightweight toolkit that automates feature engineering, neural architecture search, hyperparameter tuning, and model compression.** + +
[![MIT licensed](https://img.shields.io/badge/license-MIT-brightgreen.svg)](LICENSE) [![Build Status](https://msrasrg.visualstudio.com/NNIOpenSource/_apis/build/status/full%20test%20-%20linux?branchName=master)](https://msrasrg.visualstudio.com/NNIOpenSource/_build/latest?definitionId=62&branchName=master) [![Issues](https://img.shields.io/github/issues-raw/Microsoft/nni.svg)](https://github.com/Microsoft/nni/issues?q=is%3Aissue+is%3Aopen) [![Bugs](https://img.shields.io/github/issues/Microsoft/nni/bug.svg)](https://github.com/Microsoft/nni/issues?q=is%3Aissue+is%3Aopen+label%3Abug) [![Pull Requests](https://img.shields.io/github/issues-pr-raw/Microsoft/nni.svg)](https://github.com/Microsoft/nni/pulls?q=is%3Apr+is%3Aopen) -[![Version](https://img.shields.io/github/release/Microsoft/nni.svg)](https://github.com/Microsoft/nni/releases) [![Join the chat at https://gitter.im/Microsoft/nni](https://badges.gitter.im/Microsoft/nni.svg)](https://gitter.im/Microsoft/nni?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) +[![Version](https://img.shields.io/github/release/Microsoft/nni.svg)](https://github.com/Microsoft/nni/releases) [![Documentation Status](https://readthedocs.org/projects/nni/badge/?version=stable)](https://nni.readthedocs.io/en/stable/?badge=stable) -[NNI Doc](https://nni.readthedocs.io/) | [简体中文](README_zh_CN.md) - -**NNI (Neural Network Intelligence)** is a lightweight but powerful toolkit to help users **automate** Feature Engineering, Neural Architecture Search, Hyperparameter Tuning and Model Compression. 
+______________________________________________________________________ -The tool manages automated machine learning (AutoML) experiments, **dispatches and runs** experiments' trial jobs generated by tuning algorithms to search the best neural architecture and/or hyper-parameters in **different training environments** like Local Machine, Remote Servers, OpenPAI, Kubeflow, FrameworkController on K8S (AKS etc.), DLWorkspace (aka. DLTS), AML (Azure Machine Learning), AdaptDL (aka. ADL) , other cloud options and even Hybrid mode. +Find the latest features, API, examples and tutorials in our **official documentation**: [NNI Doc](https://nni.readthedocs.io/) -## **Who should consider using NNI** - -* Those who want to **try different AutoML algorithms** in their training code/model. -* Those who want to run AutoML trial jobs **in different environments** to speed up search. -* Researchers and data scientists who want to easily **implement and experiment new AutoML algorithms**, may it be: hyperparameter tuning algorithm, neural architect search algorithm or model compression algorithm. -* ML Platform owners who want to **support AutoML in their platform**. ## **What's NEW!**   @@ -35,11 +30,15 @@ The tool manages automated machine learning (AutoML) experiments, **dispatches a

## **NNI capabilities in a glance** +**Neural Network Intelligence (NNI)** is a lightweight and powerful toolkit to help users **automate** Feature Engineering, Neural Architecture Search, Hyperparameter Tuning and Model Compression. -NNI provides CommandLine Tool as well as an user friendly WebUI to manage training experiments. With the extensible API, you can customize your own AutoML algorithms and training services. To make it easy for new users, NNI also provides a set of build-in state-of-the-art AutoML algorithms and out of box support for popular training platforms. +NNI provides CommandLine Tool as well as an user friendly WebUI to manage training experiments. The tool manages automated machine learning (AutoML) experiments, **dispatches and runs** experiments' trial jobs generated by tuning algorithms to search the best neural architecture and/or hyper-parameters in **different training environments** like Local Machine, Remote Servers, OpenPAI, Kubeflow, FrameworkController on K8S (AKS etc.), DLWorkspace (aka. DLTS), AML (Azure Machine Learning), AdaptDL (aka. ADL) , other cloud options and even Hybrid mode. + +With the extensible API, you can customize your own AutoML algorithms and training services. To make it easy for new users, NNI also provides a set of build-in state-of-the-art AutoML algorithms and out of box support for popular training platforms. Within the following table, we summarized the current NNI capabilities, we are gradually adding new capabilities and we'd love to have your contribution. +

@@ -220,6 +219,13 @@ Within the following table, we summarized the current NNI capabilities, we are g +## **Who should consider using NNI** + +* Those who want to **try different AutoML algorithms** in their training code/model. +* Those who want to run AutoML trial jobs **in different environments** to speed up search. +* Researchers and data scientists who want to easily **implement and experiment new AutoML algorithms**, may it be: hyperparameter tuning algorithm, neural architect search algorithm or model compression algorithm. +* ML Platform owners who want to **support AutoML in their platform**. + ## **Installation** ### **Install** @@ -243,14 +249,17 @@ If you want to try latest code, please [install NNI](https://nni.readthedocs.io/ For detail system requirements of NNI, please refer to [here](https://nni.readthedocs.io/en/stable/Tutorial/InstallationLinux.html#system-requirements) for Linux & macOS, and [here](https://nni.readthedocs.io/en/stable/Tutorial/InstallationWin.html#system-requirements) for Windows. -Note: - +
+ Installation FAQ * If there is any privilege issue, add `--user` to install NNI in the user directory. * Currently NNI on Windows supports local, remote and pai mode. Anaconda or Miniconda is highly recommended to install [NNI on Windows](https://nni.readthedocs.io/en/stable/Tutorial/InstallationWin.html). * If there is any error like `Segmentation fault`, please refer to [FAQ](https://nni.readthedocs.io/en/stable/Tutorial/FAQ.html). For FAQ on Windows, please refer to [NNI on Windows](https://nni.readthedocs.io/en/stable/Tutorial/InstallationWin.html#faq). - -### **Verify installation** - +
+ + ### **Run your first example** +
+ set up and run the example + * Download the examples via clone the source code. ```bash @@ -301,8 +310,9 @@ You can use these commands to get more information about the experiment * Open the `Web UI url` in your browser, you can view detailed information of the experiment and all the submitted trial jobs as shown below. [Here](https://nni.readthedocs.io/en/stable/Tutorial/WebUI.html) are more Web UI pages. -webui - +
+ webui + ## **Releases and Contributing** NNI has a monthly release cycle (major releases). Please let us know if you encounter a bug by [filling an issue](https://github.com/microsoft/nni/issues/new/choose). From 7849602bf39129d77869038b401ca67efa3e796b Mon Sep 17 00:00:00 2001 From: jiahangxu Date: Fri, 18 Feb 2022 04:09:18 +0000 Subject: [PATCH 2/8] setup the refactor --- .../TrainingService/RemoteMachineMode.rst | 39 ++++++++++++++++--- 1 file changed, 34 insertions(+), 5 deletions(-) diff --git a/docs/source/TrainingService/RemoteMachineMode.rst b/docs/source/TrainingService/RemoteMachineMode.rst index 7e54ef8d84..7075712c95 100644 --- a/docs/source/TrainingService/RemoteMachineMode.rst +++ b/docs/source/TrainingService/RemoteMachineMode.rst @@ -1,3 +1,4 @@ +==================================== Run an Experiment on Remote Machines ==================================== @@ -5,8 +6,8 @@ NNI can run one experiment on multiple remote machines through SSH, called ``rem The OS of remote machines supports ``Linux``\ , ``Windows 10``\ , and ``Windows Server 2019``. -Requirements ------------- +Prerequisite +============ * @@ -21,18 +22,25 @@ Requirements * Make sure the command of Trial is compatible with remote OSes, if you want to use remote Linux and Windows together. For example, the default python 3.x executable called ``python3`` on Linux, and ``python`` on Windows. +Usage +===== + +Remote setup +------------ + + Linux ^^^^^ -* Follow `installation <../Tutorial/InstallationLinux.rst>`__ to install NNI on the remote machine. +* Follow `installation <../Tutorial/InstallationLinux.rst>`__ to install NNI on the remote Linux machine. Windows ^^^^^^^ * - Follow `installation <../Tutorial/InstallationWin.rst>`__ to install NNI on the remote machine. + Follow `installation <../Tutorial/InstallationWin.rst>`__ to install NNI on the remote Windows machine. * Install and start ``OpenSSH Server``. @@ -95,6 +103,8 @@ e.g. 
there are three machines, which can be logged in with username and password Install and run NNI on one of those three machines or another machine, which has network access to them. +(one example of configuration of this training service) + Use ``examples/trials/mnist-pytorch`` as the example. Below is content of ``examples/trials/mnist-pytorch/config_remote.yml``\ : .. code-block:: yaml @@ -123,12 +133,23 @@ Use ``examples/trials/mnist-pytorch`` as the example. Below is content of ``exam Files in ``trialCodeDirectory`` will be uploaded to remote machines automatically. You can run below command on Windows, Linux, or macOS to spawn trials on remote Linux machines: + + +(explain the configuration if necessary) + + + +(refer to a complete example config, and refer to training service reference) + .. code-block:: bash nnictl create --config examples/trials/mnist-pytorch/config_remote.yml +More features +============= + Configure python environment -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +============================ By default, commands and scripts will be executed in the default environment in remote machine. If there are multiple python virtual environments in your remote machine, and you want to run experiments in a specific environment, then use **pythonPath** to specify a python environment on your remote machine. @@ -137,3 +158,11 @@ For example, with anaconda you can specify: .. 
code-block:: yaml pythonPath: /home/bob/.conda/envs/ENV-NAME/bin + +Configure distributed trial +=========================== + +(some training service, e.g., openpai, kubeflow, already supported distributed trial) + +Configure additional shared storage +=================================== From 002ef3eb86e8eb98dd587c23da6ae48122f49a53 Mon Sep 17 00:00:00 2001 From: jiahangxu Date: Fri, 18 Feb 2022 10:35:57 +0000 Subject: [PATCH 3/8] change title level --- .../TrainingService/RemoteMachineMode.rst | 22 ++++++------------- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/docs/source/TrainingService/RemoteMachineMode.rst b/docs/source/TrainingService/RemoteMachineMode.rst index 7075712c95..f99363b459 100644 --- a/docs/source/TrainingService/RemoteMachineMode.rst +++ b/docs/source/TrainingService/RemoteMachineMode.rst @@ -1,4 +1,3 @@ -==================================== Run an Experiment on Remote Machines ==================================== @@ -7,7 +6,7 @@ NNI can run one experiment on multiple remote machines through SSH, called ``rem The OS of remote machines supports ``Linux``\ , ``Windows 10``\ , and ``Windows Server 2019``. Prerequisite -============ +------------ * @@ -22,12 +21,6 @@ Prerequisite * Make sure the command of Trial is compatible with remote OSes, if you want to use remote Linux and Windows together. For example, the default python 3.x executable called ``python3`` on Linux, and ``python`` on Windows. -Usage -===== - -Remote setup ------------- - Linux ^^^^^ @@ -78,8 +71,8 @@ Windows (py37_default) C:\Users\AzureUser> -Run an experiment ------------------ +Usage +----- e.g. there are three machines, which can be logged in with username and password. @@ -138,7 +131,6 @@ Files in ``trialCodeDirectory`` will be uploaded to remote machines automaticall (explain the configuration if necessary) - (refer to a complete example config, and refer to training service reference) .. 
code-block:: bash @@ -146,10 +138,10 @@ Files in ``trialCodeDirectory`` will be uploaded to remote machines automaticall nnictl create --config examples/trials/mnist-pytorch/config_remote.yml More features -============= +------------- Configure python environment -============================ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ By default, commands and scripts will be executed in the default environment in remote machine. If there are multiple python virtual environments in your remote machine, and you want to run experiments in a specific environment, then use **pythonPath** to specify a python environment on your remote machine. @@ -160,9 +152,9 @@ For example, with anaconda you can specify: pythonPath: /home/bob/.conda/envs/ENV-NAME/bin Configure distributed trial -=========================== +^^^^^^^^^^^^^^^^^^^^^^^^^^^ (some training service, e.g., openpai, kubeflow, already supported distributed trial) Configure additional shared storage -=================================== +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ From a567365ed5d7d1f4ab2188b7cd35638a0695aa2b Mon Sep 17 00:00:00 2001 From: jiahangxu Date: Tue, 1 Mar 2022 15:18:40 +0000 Subject: [PATCH 4/8] refactor --- docs/source/experiment/remote.rst | 105 +++++++++++++++++++++++++++++- 1 file changed, 104 insertions(+), 1 deletion(-) diff --git a/docs/source/experiment/remote.rst b/docs/source/experiment/remote.rst index 8e98e9b2ff..28ebd93157 100644 --- a/docs/source/experiment/remote.rst +++ b/docs/source/experiment/remote.rst @@ -1,4 +1,107 @@ Remote Training Service ======================= -TBD \ No newline at end of file +NNI can run one experiment on multiple remote machines through SSH, called ``remote`` mode. It's like a lightweight training platform. In this mode, NNI can be started from your computer, and dispatch trials to remote machines in parallel. + +The OS of remote machines supports ``Linux``\ , ``Windows 10``\ , and ``Windows Server 2019``. + +Prerequisite +------------ + + +1. 
Make sure the default environment of remote machines meets requirements of your trial code. If the default environment does not meet the requirements, the setup script can be added into ``command`` field of NNI config. + +2. Make sure remote machines can be accessed through SSH from the machine which runs ``nnictl`` command. It supports both password and key authentication of SSH. For advanced usage, please refer to :ref:`reference-remote-config-label` in reference for detailed usage. + +3. Make sure the NNI version on each machine is consistent. Follow the install guide `here <../Tutorial/QuickStart.rst>`__ to install NNI. + +4. Make sure the command of Trial is compatible with remote OSes, if you want to use remote Linux and Windows together. For example, the default python 3.x executable called ``python3`` on Linux, and ``python`` on Windows. + +There are several steps for Windows server. + +1. Install and start ``OpenSSH Server``. + + 1) Open ``Settings`` app on Windows. + + 2) Click ``Apps``\ , then click ``Optional features``. + + 3) Click ``Add a feature``\ , search and select ``OpenSSH Server``\ , and then click ``Install``. + + 4) Once it's installed, run below command to start and set to automatic start. + + .. code-block:: bat + + sc config sshd start=auto + net start sshd + +2. Make sure remote account is administrator, so that it can stop running trials. + +3. Make sure there is no welcome message more than default, since it causes ssh2 failed in NodeJs. For example, if you're using Data Science VM on Azure, it needs to remove extra echo commands in ``C:\dsvm\tools\setup\welcome.bat``. + + The output like below is ok, when opening a new command window. + + .. code-block:: text + + Microsoft Windows [Version 10.0.17763.1192] + (c) 2018 Microsoft Corporation. All rights reserved. + + (py37_default) C:\Users\AzureUser> + +Usage +----- + +Use ``examples/trials/mnist-pytorch`` as the example. 
Suppose there are two machines, which can be logged in with username and password or key authentication of SSH. Install and run NNI on one of those machines or another machine, which has network access to them. Here is a template configuration specification. + +.. code-block:: yaml + + searchSpaceFile: search_space.json + trialCommand: python3 mnist.py + trialCodeDirectory: . # default value, can be omitted + trialGpuNumber: 0 + trialConcurrency: 4 + maxTrialNumber: 20 + tuner: + name: TPE + classArgs: + optimize_mode: maximize + trainingService: + platform: remote + machineList: + - host: 192.0.2.1 + user: alice + ssh_key_file: ~/.ssh/id_rsa + - host: 192.0.2.2 + port: 10022 + user: bob + password: bob123 + +The example configuration is saved in ``examples/trials/mnist-pytorch/config_remote.yml``. + +Files in ``trialCodeDirectory`` will be uploaded to remote machines automatically. You can run below command on Windows, Linux, or macOS to spawn trials on remote Linux machines: + +.. code-block:: bash + + nnictl create --config examples/trials/mnist-pytorch/config_remote.yml + +More features +------------- + +Configure python environment +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +By default, commands and scripts will be executed in the default environment in remote machine. If there are multiple python virtual environments in your remote machine, and you want to run experiments in a specific environment, then use **pythonPath** to specify a python environment on your remote machine. + +For example, with anaconda you can specify: + +.. 
code-block:: yaml + + pythonPath: /home/bob/.conda/envs/ENV-NAME/bin + +Configure distributed trial +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +(some training service, e.g., openpai, kubeflow, already supported distributed trial) + + +Monitor via TensorBoard +^^^^^^^^^^^^^^^^^^^^^^^ From f947e03871ec6c62c63f6e38188d8a8c765da166 Mon Sep 17 00:00:00 2001 From: jiahangxu Date: Tue, 1 Mar 2022 15:26:51 +0000 Subject: [PATCH 5/8] remove lagacy --- .../TrainingService/RemoteMachineMode.rst | 139 ++++++++++++++++++ 1 file changed, 139 insertions(+) create mode 100644 docs/en_US/TrainingService/RemoteMachineMode.rst diff --git a/docs/en_US/TrainingService/RemoteMachineMode.rst b/docs/en_US/TrainingService/RemoteMachineMode.rst new file mode 100644 index 0000000000..7e54ef8d84 --- /dev/null +++ b/docs/en_US/TrainingService/RemoteMachineMode.rst @@ -0,0 +1,139 @@ +Run an Experiment on Remote Machines +==================================== + +NNI can run one experiment on multiple remote machines through SSH, called ``remote`` mode. It's like a lightweight training platform. In this mode, NNI can be started from your computer, and dispatch trials to remote machines in parallel. + +The OS of remote machines supports ``Linux``\ , ``Windows 10``\ , and ``Windows Server 2019``. + +Requirements +------------ + + +* + Make sure the default environment of remote machines meets requirements of your trial code. If the default environment does not meet the requirements, the setup script can be added into ``command`` field of NNI config. + +* + Make sure remote machines can be accessed through SSH from the machine which runs ``nnictl`` command. It supports both password and key authentication of SSH. For advanced usages, please refer to `machineList part of configuration <../Tutorial/ExperimentConfig.rst>`__. + +* + Make sure the NNI version on each machine is consistent. + +* + Make sure the command of Trial is compatible with remote OSes, if you want to use remote Linux and Windows together. 
For example, the default python 3.x executable called ``python3`` on Linux, and ``python`` on Windows. + +Linux +^^^^^ + + +* Follow `installation <../Tutorial/InstallationLinux.rst>`__ to install NNI on the remote machine. + +Windows +^^^^^^^ + + +* + Follow `installation <../Tutorial/InstallationWin.rst>`__ to install NNI on the remote machine. + +* + Install and start ``OpenSSH Server``. + + + #. + Open ``Settings`` app on Windows. + + #. + Click ``Apps``\ , then click ``Optional features``. + + #. + Click ``Add a feature``\ , search and select ``OpenSSH Server``\ , and then click ``Install``. + + #. + Once it's installed, run below command to start and set to automatic start. + + .. code-block:: bat + + sc config sshd start=auto + net start sshd + +* + Make sure remote account is administrator, so that it can stop running trials. + +* + Make sure there is no welcome message more than default, since it causes ssh2 failed in NodeJs. For example, if you're using Data Science VM on Azure, it needs to remove extra echo commands in ``C:\dsvm\tools\setup\welcome.bat``. + + The output like below is ok, when opening a new command window. + + .. code-block:: text + + Microsoft Windows [Version 10.0.17763.1192] + (c) 2018 Microsoft Corporation. All rights reserved. + + (py37_default) C:\Users\AzureUser> + +Run an experiment +----------------- + +e.g. there are three machines, which can be logged in with username and password. + +.. list-table:: + :header-rows: 1 + :widths: auto + + * - IP + - Username + - Password + * - 10.1.1.1 + - bob + - bob123 + * - 10.1.1.2 + - bob + - bob123 + * - 10.1.1.3 + - bob + - bob123 + + +Install and run NNI on one of those three machines or another machine, which has network access to them. + +Use ``examples/trials/mnist-pytorch`` as the example. Below is content of ``examples/trials/mnist-pytorch/config_remote.yml``\ : + +.. code-block:: yaml + + searchSpaceFile: search_space.json + trialCommand: python3 mnist.py + trialCodeDirectory: . 
# default value, can be omitted + trialGpuNumber: 0 + trialConcurrency: 4 + maxTrialNumber: 20 + tuner: + name: TPE + classArgs: + optimize_mode: maximize + trainingService: + platform: remote + machineList: + - host: 192.0.2.1 + user: alice + ssh_key_file: ~/.ssh/id_rsa + - host: 192.0.2.2 + port: 10022 + user: bob + password: bob123 + pythonPath: /usr/bin + +Files in ``trialCodeDirectory`` will be uploaded to remote machines automatically. You can run below command on Windows, Linux, or macOS to spawn trials on remote Linux machines: + +.. code-block:: bash + + nnictl create --config examples/trials/mnist-pytorch/config_remote.yml + +Configure python environment +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +By default, commands and scripts will be executed in the default environment in remote machine. If there are multiple python virtual environments in your remote machine, and you want to run experiments in a specific environment, then use **pythonPath** to specify a python environment on your remote machine. + +For example, with anaconda you can specify: + +.. code-block:: yaml + + pythonPath: /home/bob/.conda/envs/ENV-NAME/bin From e2221e99fdf490acad9a1e8b60cc4eeb17213d7b Mon Sep 17 00:00:00 2001 From: jiahangxu Date: Tue, 1 Mar 2022 15:29:28 +0000 Subject: [PATCH 6/8] remove lagacy --- .../TrainingService/RemoteMachineMode.rst | 31 +++---------------- 1 file changed, 5 insertions(+), 26 deletions(-) diff --git a/docs/source/TrainingService/RemoteMachineMode.rst b/docs/source/TrainingService/RemoteMachineMode.rst index f99363b459..7e54ef8d84 100644 --- a/docs/source/TrainingService/RemoteMachineMode.rst +++ b/docs/source/TrainingService/RemoteMachineMode.rst @@ -5,7 +5,7 @@ NNI can run one experiment on multiple remote machines through SSH, called ``rem The OS of remote machines supports ``Linux``\ , ``Windows 10``\ , and ``Windows Server 2019``. 
-Prerequisite +Requirements ------------ @@ -21,19 +21,18 @@ Prerequisite * Make sure the command of Trial is compatible with remote OSes, if you want to use remote Linux and Windows together. For example, the default python 3.x executable called ``python3`` on Linux, and ``python`` on Windows. - Linux ^^^^^ -* Follow `installation <../Tutorial/InstallationLinux.rst>`__ to install NNI on the remote Linux machine. +* Follow `installation <../Tutorial/InstallationLinux.rst>`__ to install NNI on the remote machine. Windows ^^^^^^^ * - Follow `installation <../Tutorial/InstallationWin.rst>`__ to install NNI on the remote Windows machine. + Follow `installation <../Tutorial/InstallationWin.rst>`__ to install NNI on the remote machine. * Install and start ``OpenSSH Server``. @@ -71,8 +70,8 @@ Windows (py37_default) C:\Users\AzureUser> -Usage ------ +Run an experiment +----------------- e.g. there are three machines, which can be logged in with username and password. @@ -96,8 +95,6 @@ e.g. there are three machines, which can be logged in with username and password Install and run NNI on one of those three machines or another machine, which has network access to them. -(one example of configuration of this training service) - Use ``examples/trials/mnist-pytorch`` as the example. Below is content of ``examples/trials/mnist-pytorch/config_remote.yml``\ : .. code-block:: yaml @@ -126,20 +123,10 @@ Use ``examples/trials/mnist-pytorch`` as the example. Below is content of ``exam Files in ``trialCodeDirectory`` will be uploaded to remote machines automatically. You can run below command on Windows, Linux, or macOS to spawn trials on remote Linux machines: - - -(explain the configuration if necessary) - - -(refer to a complete example config, and refer to training service reference) - .. 
code-block:: bash nnictl create --config examples/trials/mnist-pytorch/config_remote.yml -More features -------------- - Configure python environment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -150,11 +137,3 @@ For example, with anaconda you can specify: .. code-block:: yaml pythonPath: /home/bob/.conda/envs/ENV-NAME/bin - -Configure distributed trial -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -(some training service, e.g., openpai, kubeflow, already supported distributed trial) - -Configure additional shared storage -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ From 639df8987146d5d7e7b2337269875370e17f519c Mon Sep 17 00:00:00 2001 From: jiahangxu Date: Thu, 3 Mar 2022 13:22:04 +0000 Subject: [PATCH 7/8] add share storage and tensorboard --- .../TrainingService/RemoteMachineMode.rst | 139 ------------------ docs/source/experiment/remote.rst | 14 +- 2 files changed, 7 insertions(+), 146 deletions(-) delete mode 100644 docs/en_US/TrainingService/RemoteMachineMode.rst diff --git a/docs/en_US/TrainingService/RemoteMachineMode.rst b/docs/en_US/TrainingService/RemoteMachineMode.rst deleted file mode 100644 index 7e54ef8d84..0000000000 --- a/docs/en_US/TrainingService/RemoteMachineMode.rst +++ /dev/null @@ -1,139 +0,0 @@ -Run an Experiment on Remote Machines -==================================== - -NNI can run one experiment on multiple remote machines through SSH, called ``remote`` mode. It's like a lightweight training platform. In this mode, NNI can be started from your computer, and dispatch trials to remote machines in parallel. - -The OS of remote machines supports ``Linux``\ , ``Windows 10``\ , and ``Windows Server 2019``. - -Requirements ------------- - - -* - Make sure the default environment of remote machines meets requirements of your trial code. If the default environment does not meet the requirements, the setup script can be added into ``command`` field of NNI config. - -* - Make sure remote machines can be accessed through SSH from the machine which runs ``nnictl`` command. 
It supports both password and key authentication of SSH. For advanced usages, please refer to `machineList part of configuration <../Tutorial/ExperimentConfig.rst>`__. - -* - Make sure the NNI version on each machine is consistent. - -* - Make sure the command of Trial is compatible with remote OSes, if you want to use remote Linux and Windows together. For example, the default python 3.x executable called ``python3`` on Linux, and ``python`` on Windows. - -Linux -^^^^^ - - -* Follow `installation <../Tutorial/InstallationLinux.rst>`__ to install NNI on the remote machine. - -Windows -^^^^^^^ - - -* - Follow `installation <../Tutorial/InstallationWin.rst>`__ to install NNI on the remote machine. - -* - Install and start ``OpenSSH Server``. - - - #. - Open ``Settings`` app on Windows. - - #. - Click ``Apps``\ , then click ``Optional features``. - - #. - Click ``Add a feature``\ , search and select ``OpenSSH Server``\ , and then click ``Install``. - - #. - Once it's installed, run below command to start and set to automatic start. - - .. code-block:: bat - - sc config sshd start=auto - net start sshd - -* - Make sure remote account is administrator, so that it can stop running trials. - -* - Make sure there is no welcome message more than default, since it causes ssh2 failed in NodeJs. For example, if you're using Data Science VM on Azure, it needs to remove extra echo commands in ``C:\dsvm\tools\setup\welcome.bat``. - - The output like below is ok, when opening a new command window. - - .. code-block:: text - - Microsoft Windows [Version 10.0.17763.1192] - (c) 2018 Microsoft Corporation. All rights reserved. - - (py37_default) C:\Users\AzureUser> - -Run an experiment ------------------ - -e.g. there are three machines, which can be logged in with username and password. - -.. 
list-table:: - :header-rows: 1 - :widths: auto - - * - IP - - Username - - Password - * - 10.1.1.1 - - bob - - bob123 - * - 10.1.1.2 - - bob - - bob123 - * - 10.1.1.3 - - bob - - bob123 - - -Install and run NNI on one of those three machines or another machine, which has network access to them. - -Use ``examples/trials/mnist-pytorch`` as the example. Below is content of ``examples/trials/mnist-pytorch/config_remote.yml``\ : - -.. code-block:: yaml - - searchSpaceFile: search_space.json - trialCommand: python3 mnist.py - trialCodeDirectory: . # default value, can be omitted - trialGpuNumber: 0 - trialConcurrency: 4 - maxTrialNumber: 20 - tuner: - name: TPE - classArgs: - optimize_mode: maximize - trainingService: - platform: remote - machineList: - - host: 192.0.2.1 - user: alice - ssh_key_file: ~/.ssh/id_rsa - - host: 192.0.2.2 - port: 10022 - user: bob - password: bob123 - pythonPath: /usr/bin - -Files in ``trialCodeDirectory`` will be uploaded to remote machines automatically. You can run below command on Windows, Linux, or macOS to spawn trials on remote Linux machines: - -.. code-block:: bash - - nnictl create --config examples/trials/mnist-pytorch/config_remote.yml - -Configure python environment -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -By default, commands and scripts will be executed in the default environment in remote machine. If there are multiple python virtual environments in your remote machine, and you want to run experiments in a specific environment, then use **pythonPath** to specify a python environment on your remote machine. - -For example, with anaconda you can specify: - -.. code-block:: yaml - - pythonPath: /home/bob/.conda/envs/ENV-NAME/bin diff --git a/docs/source/experiment/remote.rst b/docs/source/experiment/remote.rst index 28ebd93157..b0a9b0ebb8 100644 --- a/docs/source/experiment/remote.rst +++ b/docs/source/experiment/remote.rst @@ -17,7 +17,7 @@ Prerequisite 4. 
Make sure the command of Trial is compatible with remote OSes, if you want to use remote Linux and Windows together. For example, the default python 3.x executable called ``python3`` on Linux, and ``python`` on Windows. -There are several steps for Windows server. +In addition, Windows servers require several extra steps. 1. Install and start ``OpenSSH Server``. @@ -50,13 +50,12 @@ There are several steps for Windows server. Usage ----- -Use ``examples/trials/mnist-pytorch`` as the example. Suppose there are two machines, which can be logged in with username and password or key authentication of SSH. Install and run NNI on one of those machines or another machine, which has network access to them. Here is a template configuration specification. +Use ``examples/trials/mnist-pytorch`` as the example. Suppose there are two machines that can be logged in to with either a username and password or SSH key authentication. Here is a template configuration specification. .. code-block:: yaml searchSpaceFile: search_space.json trialCommand: python3 mnist.py - trialCodeDirectory: . # default value, can be omitted trialGpuNumber: 0 trialConcurrency: 4 maxTrialNumber: 20 @@ -77,7 +76,7 @@ Use ``examples/trials/mnist-pytorch`` as the example. Suppose there are two mach The example configuration is saved in ``examples/trials/mnist-pytorch/config_remote.yml``. -Files in ``trialCodeDirectory`` will be uploaded to remote machines automatically. You can run below command on Windows, Linux, or macOS to spawn trials on remote Linux machines: +You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines: .. 
code-block:: bash @@ -97,11 +96,12 @@ For example, with anaconda you can specify: pythonPath: /home/bob/.conda/envs/ENV-NAME/bin -Configure distributed trial +Configure shared storage ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -(some training service, e.g., openpai, kubeflow, already supported distributed trial) - +Remote training service supports shared storage, so you can use your own storage with NNI. Follow the guide `here <./shared_storage.rst>`__ to learn how to use shared storage. Monitor via TensorBoard ^^^^^^^^^^^^^^^^^^^^^^^ + +Remote training service supports trial visualization via TensorBoard. Follow the guide `here <./tensorboard.rst>`__ to learn how to use TensorBoard. From 2d9b4a7da0058a02811a027497b349c252094157 Mon Sep 17 00:00:00 2001 From: jiahangxu Date: Tue, 22 Mar 2022 16:02:52 +0800 Subject: [PATCH 8/8] fix pylint --- docs/source/reference/experiment_config.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/source/reference/experiment_config.rst b/docs/source/reference/experiment_config.rst index 7a9437df89..e46a299acb 100644 --- a/docs/source/reference/experiment_config.rst +++ b/docs/source/reference/experiment_config.rst @@ -332,10 +332,12 @@ Introduction of the corresponding local training service can be found :doc:`../e If ``trialGpuNumber`` is less than the length of this value, only a subset will be visible to each trial. This will be used as ``CUDA_VISIBLE_DEVICES`` environment variable. +.. _reference-remote-config-label: + RemoteConfig ------------ -Detailed usage can be found `here <../TrainingService/RemoteMachineMode.rst>`__. +Detailed usage can be found :doc:`../experiment/remote`. .. list-table:: :widths: 10 10 80