Merge pull request #154 from Microsoft/v0.2

Merge V0.2 branch back to master
microsoft · Sep 30, 2018 · 2921e14
2 parents 2a28a57 + 35900e2
commit 2921e14

Showing 27 changed files with 403 additions and 62 deletions.
diff --git a/docs/PAIMode.md b/docs/PAIMode.md
@@ -48,6 +48,7 @@ Compared with LocalMode and [RemoteMachineMode](RemoteMachineMode.md), trial con
     * Required key. Should be a positive number based on your trial program's memory requirement
 * image
     * Required key. In pai mode, your trial program will be scheduled by OpenPAI to run in a [Docker container](https://www.docker.com/). This key specifies the Docker image used to create the container in which your trial will run.
+    * A prebuilt Docker image, [msranni/nni](https://hub.docker.com/r/msranni/nni/), is available on [Docker Hub](https://hub.docker.com/). It contains the NNI Python packages, the Node modules and JavaScript artifact files required to start an experiment, and all of NNI's dependencies. The Dockerfile used to build this image is [here](../deployment/Dockerfile.build.base). You can either use this image directly in your config file or build your own image based on it.
 * dataDir
     * Optional key. It specifies the HDFS data directory from which the trial downloads data. The format should be something like hdfs://{your HDFS host}:9000/{your data directory}
 * outputDir 
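
Reviewer note: the `hdfs://{host}:9000/{directory}` format described for `dataDir` can be checked before a job is submitted. The helper below is a hypothetical sketch and not part of NNI; the name `is_valid_hdfs_uri` is an assumption, and it only checks the shape of the URI, not whether the directory exists on the cluster.

```python
from urllib.parse import urlparse


def is_valid_hdfs_uri(uri: str) -> bool:
    """Return True if uri looks like hdfs://host:port/directory."""
    parsed = urlparse(uri)
    try:
        # Accessing .port raises ValueError for a non-numeric or out-of-range port.
        port = parsed.port
    except ValueError:
        return False
    return (
        parsed.scheme == "hdfs"
        and bool(parsed.hostname)
        and port is not None
        and len(parsed.path) > 1  # something must follow the leading "/"
    )
```

A value such as `hdfs://10.10.10.10:9000/username/nni` passes this check, while a URI without a port or without a directory does not.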

diff --git a/docs/RELEASE.md b/docs/RELEASE.md
@@ -1,9 +1,9 @@
 # Release 0.2.0 - 9/29/2018
 ## Major Features
-   * Support for [OpenPAI](https://github.com/Microsoft/pai) (aka pai) Training Service 
+   * Support [OpenPAI](https://github.com/Microsoft/pai) (aka pai) Training Service (see [here](./PAIMode.md) for instructions on how to submit an NNI job in pai mode)
       * Support training services on pai mode. NNI trials will be scheduled to run on OpenPAI cluster
       * NNI trial's output (including logs and model file) will be copied to OpenPAI HDFS for further debugging and checking
-   * Support [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) tuner
+   * Support [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) tuner (see [here](../src/sdk/pynni/nni/README.md) for instructions on how to use the SMAC tuner)
       * [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) is based on Sequential Model-Based Optimization (SMBO). It adapts the most prominent previously used model class (Gaussian stochastic process models) and introduces the model class of random forests to SMBO to handle categorical parameters. The SMAC supported by NNI is a wrapper on [SMAC3](https://github.com/automl/SMAC3)
    * Support NNI installation on [conda](https://conda.io/docs/index.html) and python virtual environment
    * Others

diff --git a/docs/StartExperiment.md b/docs/StartExperiment.md
@@ -0,0 +1,33 @@
+How to start an experiment
+===
+## 1. Introduction
+Starting a new NNI experiment takes a few steps. The process is shown below.
+<img src="./img/experiment_process.jpg" width="50%" height="50%" />
+## 2. Details
+### 2.1 Check environment
+The first step in starting an experiment is to check whether the environment is ready. nnictl checks whether an old experiment is still running and whether the port of the RESTful server is occupied.
+nnictl also validates the content of the config YAML file to ensure that the experiment config is in the correct format.
+
+### 2.2 Start restful server
+After checking the environment, nnictl starts a RESTful server process to manage the NNI experiment. The default port is 51188.
+
+### 2.3 Check restful server
+Before moving on, nnictl checks whether the RESTful server started successfully. If it did not, the startup process stops and shows an error message.
+
+### 2.4 Set experiment config
+nnictl needs to set the experiment config before starting an experiment. The experiment config includes the values in the config YAML file.
+
+### 2.5 Check experiment config
+nnictl ensures that the request to set the config was executed successfully.
+
+### 2.6 Start Web UI
+nnictl starts a Web UI process to serve the Web UI. The default port of the Web UI is 8080.
+
+### 2.7 Check Web UI
+If the Web UI does not start successfully, nnictl prints a warning and continues to start the experiment.
+
+### 2.8 Start Experiment
+This is the most important step in starting an NNI experiment: nnictl calls the RESTful server process to set up the experiment.
+
+### 2.9 Check experiment
+After starting the experiment, nnictl checks whether it was created correctly and shows more information about the experiment to the user.
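
Reviewer note: the environment check in step 2.1 includes testing whether the REST server port is occupied. A minimal Python sketch of such a check follows. It is an illustration under assumptions, not nnictl's actual implementation, and the helper name `is_port_in_use` is hypothetical; only the default ports 51188 and 8080 come from the document above.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds,
        # which means another process already holds the port.
        return sock.connect_ex((host, port)) == 0
```

Calling `is_port_in_use(51188)` before launching the REST server, and `is_port_in_use(8080)` before launching the Web UI, would cover the two default ports mentioned in this file.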
diff --git a/docs/img/experiment_process.jpg b/docs/img/experiment_process.jpg
diff --git a/examples/trials/auto-gbdt/config.yml b/examples/trials/auto-gbdt/config.yml
@@ -3,7 +3,7 @@ experimentName: example_auto-gbdt
 trialConcurrency: 1
 maxExecDuration: 10h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 searchSpacePath: search_space.json
 #choice: true, false

diff --git a/examples/trials/auto-gbdt/config_pai.yml b/examples/trials/auto-gbdt/config_pai.yml
@@ -0,0 +1,36 @@
+authorName: default
+experimentName: example_auto-gbdt
+trialConcurrency: 1
+maxExecDuration: 10h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+searchSpacePath: search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: minimize
+trial:
+  command: python3 main.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
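
Reviewer note: docs/PAIMode.md in this commit marks `memoryMB` (a positive number) and `image` as required keys of the `trial` section. The sketch below shows how those two rules could be checked against a config like the one above. It is a hypothetical helper, not NNI's validation code; the names `check_pai_trial`, `REQUIRED_KEYS`, and `EXAMPLE_TRIAL` are assumptions, and the dictionary simply mirrors the `trial` values from this file.

```python
# Keys that docs/PAIMode.md (as shown in this commit) describes as required.
REQUIRED_KEYS = ("memoryMB", "image")

# Mirrors the trial section of config_pai.yml above.
EXAMPLE_TRIAL = {
    "command": "python3 main.py",
    "codeDir": ".",
    "gpuNum": 0,
    "cpuNum": 1,
    "memoryMB": 8196,
    "image": "openpai/pai.example.tensorflow",
    "hdfsDataDir": "hdfs://10.10.10.10:9000/username/nni",
    "hdfsOutputDir": "hdfs://10.10.10.10:9000/username/nni",
}


def check_pai_trial(trial: dict) -> list:
    """Return a list of problems found in a pai-mode trial section."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in trial]
    memory = trial.get("memoryMB")
    if memory is not None and (not isinstance(memory, (int, float)) or memory <= 0):
        problems.append("memoryMB should be a positive number")
    return problems
```

An empty list means the two documented rules are satisfied; any other keys are left unchecked in this sketch.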
diff --git a/examples/trials/ga_squad/config.yml b/examples/trials/ga_squad/config.yml
@@ -3,7 +3,7 @@ experimentName: example_ga_squad
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 #choice: true, false
 useAnnotation: false

diff --git a/examples/trials/ga_squad/config_pai.yml b/examples/trials/ga_squad/config_pai.yml
@@ -0,0 +1,34 @@
+authorName: default
+experimentName: example_ga_squad
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+#choice: true, false
+useAnnotation: false
+tuner:
+  codeDir: ../tuners/ga_customer_tuner
+  classFileName: customer_tuner.py
+  className: CustomerTuner
+  classArgs:
+    optimize_mode: maximize
+trial:
+  command: python3 trial.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
diff --git a/examples/trials/mnist-annotation/config.yml b/examples/trials/mnist-annotation/config.yml
@@ -3,7 +3,7 @@ experimentName: example_mnist
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 #choice: true, false
 useAnnotation: true

diff --git a/examples/trials/mnist-annotation/config_pai.yml b/examples/trials/mnist-annotation/config_pai.yml
@@ -0,0 +1,35 @@
+authorName: default
+experimentName: example_mnist
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+#choice: true, false
+useAnnotation: true
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 mnist.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
diff --git a/examples/trials/mnist-batch-tune-keras/config.yml b/examples/trials/mnist-batch-tune-keras/config.yml
@@ -3,7 +3,7 @@ experimentName: example_mnist-keras
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 searchSpacePath: search_space.json
 #choice: true, false

diff --git a/examples/trials/mnist-batch-tune-keras/config_pai.yml b/examples/trials/mnist-batch-tune-keras/config_pai.yml
@@ -0,0 +1,36 @@
+authorName: default
+experimentName: example_mnist-keras
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+searchSpacePath: search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution, BatchTuner
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: BatchTuner
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 mnist-keras.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
diff --git a/examples/trials/mnist-keras/config.yml b/examples/trials/mnist-keras/config.yml
@@ -3,7 +3,7 @@ experimentName: example_mnist-keras
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 searchSpacePath: search_space.json
 #choice: true, false

diff --git a/examples/trials/mnist-keras/config_pai.yml b/examples/trials/mnist-keras/config_pai.yml
@@ -0,0 +1,36 @@
+authorName: default
+experimentName: example_mnist-keras
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+searchSpacePath: search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 mnist-keras.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
diff --git a/examples/trials/mnist-smartparam/config.yml b/examples/trials/mnist-smartparam/config.yml
@@ -3,7 +3,7 @@ experimentName: example_mnist-smartparam
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 #choice: true, false
 useAnnotation: true

diff --git a/examples/trials/mnist-smartparam/config_pai.yml b/examples/trials/mnist-smartparam/config_pai.yml
@@ -0,0 +1,35 @@
+authorName: default
+experimentName: example_mnist-smartparam
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+#choice: true, false
+useAnnotation: true
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 mnist.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
diff --git a/examples/trials/mnist/config.yml b/examples/trials/mnist/config.yml
@@ -3,7 +3,7 @@ experimentName: example_mnist
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 searchSpacePath: search_space.json
 #choice: true, false

diff --git a/examples/trials/mnist/config_pai.yml b/examples/trials/mnist/config_pai.yml
@@ -0,0 +1,36 @@
+authorName: default
+experimentName: example_mnist
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+searchSpacePath: search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 mnist.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
diff --git a/examples/trials/pytorch_cifar10/config.yml b/examples/trials/pytorch_cifar10/config.yml
@@ -3,7 +3,7 @@ experimentName: example_pytorch_cifar10
 trialConcurrency: 1
 maxExecDuration: 100h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 searchSpacePath: search_space.json
 #choice: true, false