This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Add documentation for NNI PAI mode experiment #141

Merged
merged 87 commits into from
Sep 29, 2018
9a8ac16
PAI Training service implementation, v1
Sep 19, 2018
8983045
update trial package directory in setup.py
Sep 19, 2018
248d0eb
Update setup.py package info
Sep 19, 2018
43fca76
Update trial keeper module, use IP adress for pai training service ma…
Sep 20, 2018
4fe49de
Update metrics file path in reader
Sep 20, 2018
66a54e1
Fix metrics file path issue
Sep 21, 2018
65709d3
Update pai integration, full implementation of pai training service
Sep 21, 2018
c1a3d34
Do not send metrics if it is empty
Sep 21, 2018
232d0e8
Update nnictl, to support pai configuration
Sep 21, 2018
a5d4a20
fix repo
Sep 24, 2018
cd64e5f
add hdfs_output_dir
Sep 24, 2018
889e066
add copy logic
Sep 24, 2018
de9c374
debug
Sep 24, 2018
e98b0ac
update hdfsUtility
Sep 24, 2018
272411a
debug
Sep 24, 2018
4cba4d1
debug
Sep 24, 2018
45d1031
fix setup.py bug
Sep 24, 2018
e63ffc0
fix bug
Sep 24, 2018
954d640
debug
Sep 24, 2018
0410d05
debug
Sep 24, 2018
e3788d2
add exception handler
Sep 25, 2018
793cbf1
fix bug
Sep 25, 2018
b14c108
debug
Sep 25, 2018
0ae9f6d
fix bug
Sep 25, 2018
5938310
fix bug
Sep 25, 2018
c756188
fix bug
Sep 25, 2018
b6ce813
split metrics into single line, and read metrics no matter if subproc…
Sep 25, 2018
dc0f96b
add unit test for hdfsClientUtility
Sep 25, 2018
55b6e08
fix bug
Sep 25, 2018
2529376
Add experiment id in update metrics url to differ trials
Sep 25, 2018
0f7d40c
add default outputdir
Sep 25, 2018
43d7ab7
update
Sep 25, 2018
9c53f47
fix trial_keeper
Sep 25, 2018
beac29c
fix bug
Sep 25, 2018
c54ad7a
add default value for nnioutputdir
Sep 25, 2018
c214362
fix bug
Sep 25, 2018
60bf770
remove unused code
Sep 25, 2018
fad2ba3
PAI Training service implementation, v1 (#1)
yds05 Sep 25, 2018
7f06762
fix conflict
Sep 25, 2018
aa4f306
fix conflict
Sep 25, 2018
45c9600
fix conflict
Sep 25, 2018
24dd1b6
Remove unused import and paiTrialConfig file
Sep 25, 2018
84d278c
Merge branch 'master' into dev-pai
yds05 Sep 25, 2018
9febbd3
Merge pull request #3 from yds05/dev-pai
yds05 Sep 25, 2018
7a43c54
fix conflict
Sep 25, 2018
3e0cce2
refactor code
Sep 26, 2018
ef1eaf8
fix comments
Sep 26, 2018
7f9baea
fix comment
Sep 26, 2018
4af5c60
Implement cancel job API for pai training service
Sep 26, 2018
eb548cf
fix default value for outputDir
Sep 26, 2018
4d24e87
fix comments
Sep 26, 2018
6325bd3
Merge pull request #4 from yds05/dev-pai-desy
yds05 Sep 26, 2018
1db913c
Merge pull request #2 from yds05/dev-pai-t-shya2
SparkSnail Sep 26, 2018
5487975
Merge branch 'master' of https://github.com/Microsoft/nni into Micros…
Sep 26, 2018
88b1876
Merge branch 'Microsoft-master'
Sep 26, 2018
90c9e69
Merge pull request #6 from yds05/master
SparkSnail Sep 26, 2018
b714a8f
fix pip install to master
Sep 26, 2018
b6a233a
change pip install branch in paiData.ts
Sep 27, 2018
9511174
fix conflict
Sep 27, 2018
52b1cc8
fix log path
Sep 27, 2018
76bd378
fix conflict
Sep 27, 2018
c27d146
add logpath logic
Sep 27, 2018
449a4f3
add log path
Sep 27, 2018
1d9f23e
refactor schema
Sep 27, 2018
aa552c0
Fix bug that all trials use the same hdfs log path
Sep 27, 2018
9edfb34
Merge pull request #7 from yds05/dev-pai-t-shya2
yds05 Sep 27, 2018
cb46266
Update PAI training service PR comments
Sep 27, 2018
f09a651
Remove unused nnits-tool in uninstallation
Sep 27, 2018
94c92c3
Remove unused trianing_service_tool package in setup.py
Sep 27, 2018
2eca5d9
Update setup.py version to 0.2.0
Sep 27, 2018
717856e
Change pip install repo to Microsoft/nni
Sep 27, 2018
c32cd52
Update NNI v0.2 release notes
Sep 27, 2018
a549a16
Merge pull request #8 from Microsoft/master
yds05 Sep 27, 2018
76c10e8
Fix typo based on PR comments
Sep 27, 2018
5442251
Add NNI installation scripts
Sep 28, 2018
09fd234
Merge pull request #9 from Microsoft/v0.2
yds05 Sep 28, 2018
73556cc
Update pai script, update NNI_out_dir
Sep 28, 2018
a0da600
Update NNI dir in nni sdk local.py
Sep 28, 2018
8b644d8
Create .nni folder in nni sdk local.py
Sep 28, 2018
fb6e57f
Add check before creating .nni folder
Sep 28, 2018
567bf09
Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT
Sep 28, 2018
8d7a474
Add documentation for NNI PAI mode
Sep 29, 2018
ea792fd
Merge pull request #10 from Microsoft/v0.2
yds05 Sep 29, 2018
01de6b7
Fix typo based on PR comments
Sep 29, 2018
a4acc22
Exit with subprocess return code of trial keeper
Sep 29, 2018
8467ac5
Remove additional exit code
Sep 29, 2018
39981a7
Fix typo based on PR comments
Sep 29, 2018
2 changes: 1 addition & 1 deletion README.md
@@ -35,7 +35,7 @@ pip Installation Prerequisites
* git, wget

```
-python3 -m pip install -v --user git+https://github.com/Microsoft/nni.git@v0.1
+python3 -m pip install -v --user git+https://github.com/Microsoft/nni.git@v0.2
source ~/.bashrc
```

4 changes: 2 additions & 2 deletions docs/GetStarted.md
@@ -14,12 +14,12 @@

* __Install NNI through pip__

-python3 -m pip install -v --user git+https://github.com/Microsoft/nni.git@v0.1
+python3 -m pip install -v --user git+https://github.com/Microsoft/nni.git@v0.2
source ~/.bashrc

* __Install NNI through source code__

-git clone -b v0.1 https://github.com/Microsoft/nni.git
+git clone -b v0.2 https://github.com/Microsoft/nni.git
cd nni
chmod +x install.sh
source install.sh
79 changes: 79 additions & 0 deletions docs/PAIMode.md
@@ -0,0 +1,79 @@
**Run an Experiment on OpenPAI**
===
NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai); this is called pai mode. Before using NNI pai mode, you need an account that can access an [OpenPAI](https://github.com/Microsoft/pai) cluster. If you don't have an OpenPAI account and want to deploy an OpenPAI cluster, see [here](https://github.com/Microsoft/pai#how-to-deploy). In pai mode, your trial program runs in a container that pai creates with Docker.
Review comment (Member): How about changing the first sentence to: "Starting from v0.2.0, NNI supports one more Training Service Mode: pai mode, which enables users to run an experiment on OpenPAI."


## Setup environment
Install NNI by following the [installation guide](GetStarted.md).

## Run an experiment
Take `examples/trials/mnist-annotation` as an example. The NNI config YAML file looks like this:
```
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote, pai
trainingServicePlatform: pai
# choice: true, false
useAnnotation: true
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: ~/nni/examples/trials/mnist-annotation
  gpuNum: 0
  cpuNum: 1
  memoryMB: 8196
  image: openpai/pai.example.tensorflow
  dataDir: hdfs://10.1.1.1:9000/nni
  outputDir: hdfs://10.1.1.1:9000/nni
# Configuration to access OpenPAI Cluster
paiConfig:
  userName: your_pai_nni_user
  passWord: your_pai_password
  host: 10.1.1.1
```
Review comment (Member, on the `passWord` line): we have to expose the password here?

Note: You should set `trainingServicePlatform: pai` in the NNI config YAML file if you want to start the experiment in pai mode.

Compared with LocalMode and [RemoteMachineMode](RemoteMachineMode.md), the trial configuration in pai mode has five additional keys:
* cpuNum
    * Required key. Should be a positive number based on your trial program's CPU requirement.
* memoryMB
    * Required key. Should be a positive number based on your trial program's memory requirement.
* image
    * Required key. In pai mode, OpenPAI schedules your trial program to run in a [Docker container](https://www.docker.com/). This key specifies the Docker image used to create the container in which your trial will run.
* dataDir
    * Optional key. It specifies the HDFS data directory from which the trial downloads data. The format should be something like `hdfs://{your HDFS host}:9000/{your data directory}`.
* outputDir
    * Optional key. It specifies the HDFS output directory for the trial. Once the trial completes (whether it succeeds or fails), the trial's stdout and stderr are copied to this directory automatically by the NNI SDK. The format should be something like `hdfs://{your HDFS host}:9000/{your output directory}`.
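
The key rules above can be checked before an experiment is submitted. The following is a minimal sketch, not part of NNI: `check_pai_trial_config` is a hypothetical helper, and it assumes the `trial` section has already been loaded into a Python dict. It only encodes what the list above states (three required keys, two optional HDFS URIs).

```python
from urllib.parse import urlparse

# Keys taken from the list above; the helper itself is hypothetical.
REQUIRED_KEYS = ("cpuNum", "memoryMB", "image")
OPTIONAL_HDFS_KEYS = ("dataDir", "outputDir")


def check_pai_trial_config(trial):
    """Return a list of problems found in a pai-mode `trial` section."""
    problems = []

    for key in REQUIRED_KEYS:
        if key not in trial:
            problems.append(f"missing required key: {key}")

    # cpuNum and memoryMB should be positive numbers (bool is excluded on purpose).
    for key in ("cpuNum", "memoryMB"):
        if key in trial:
            value = trial[key]
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not (is_number and value > 0):
                problems.append(f"{key} should be a positive number")

    if "image" in trial and not (isinstance(trial["image"], str) and trial["image"]):
        problems.append("image should be a non-empty string")

    # dataDir and outputDir are optional, but if present should be HDFS URIs.
    for key in OPTIONAL_HDFS_KEYS:
        if key in trial:
            parsed = urlparse(str(trial[key]))
            if parsed.scheme != "hdfs" or not parsed.hostname:
                problems.append(f"{key} should look like hdfs://host:port/path")

    return problems


example_trial = {
    "command": "python3 mnist.py",
    "codeDir": "~/nni/examples/trials/mnist-annotation",
    "gpuNum": 0,
    "cpuNum": 1,
    "memoryMB": 8196,
    "image": "openpai/pai.example.tensorflow",
    "dataDir": "hdfs://10.1.1.1:9000/nni",
    "outputDir": "hdfs://10.1.1.1:9000/nni",
}
print(check_pai_trial_config(example_trial))
```

NNI performs its own validation when `nnictl` reads the config; this sketch is only meant to make the required/optional split concrete.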

Once you have filled in the NNI experiment config file and saved it (for example, as exp_pai.yaml), run the following command
```
nnictl create --config exp_pai.yaml
```
to start the experiment in pai mode. NNI will create an OpenPAI job for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
Review comment (Member): "OpanPAI" is a typo.

You can see the pai jobs created by NNI in your OpenPAI cluster's web portal, like:
![](./nni_pai_joblist.jpg)
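
To find a specific trial in the job list, it can help to build or parse job names following the pattern mentioned above. This is an illustrative sketch only: the functions are hypothetical, the IDs are made up, and the document describes the format as "something like" this pattern, so the exact naming may differ.

```python
import re

# Pattern taken from the text above: nni_exp_{experiment_id}_trial_{trial_id}
_JOB_NAME_RE = re.compile(r"^nni_exp_(.+?)_trial_(.+)$")


def pai_job_name(experiment_id, trial_id):
    """Build a job name following the documented pattern (hypothetical helper)."""
    return f"nni_exp_{experiment_id}_trial_{trial_id}"


def parse_pai_job_name(name):
    """Return (experiment_id, trial_id), or None if the name does not match."""
    match = _JOB_NAME_RE.match(name)
    return match.groups() if match else None


name = pai_job_name("abc123", "xyz9")  # made-up IDs for illustration
print(name)
print(parse_pai_job_name(name))
```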

Notice: In pai mode, NNIManager starts a REST server that listens on port `51189` to receive metrics from trial jobs running in PAI containers. You should therefore open TCP port `51189` in your firewall rules to allow incoming traffic.
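
One way to confirm that the firewall rule is in place is to attempt a TCP connection to the NNIManager machine from a host inside the cluster. The sketch below uses only the Python standard library; `can_connect` is a hypothetical helper, not an NNI command, and a successful connection only shows that the port is reachable, not that metrics are being accepted.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, calling `can_connect(nnimanager_host, 51189)` from a cluster node, where `nnimanager_host` is the address of the machine running NNIManager, should return `True` once the port is open and the experiment is running.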

Once a trial job is completed, you can go to the NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.

Expand a trial's information in the trial list view and click the logPath link, like:
![](./nni_webui_joblist.jpg)

And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS:
![](./nni_trial_hdfs_output.jpg)

You can see there are three files in the output folder: stderr, stdout, and trial.log.
Review comment (Member): "fils" is a typo.


If you also want to save other trial output to HDFS, such as model files, use the `NNI_OUTPUT_DIR` environment variable in your trial code as the directory for your own output files. The NNI SDK will copy all the files in `NNI_OUTPUT_DIR` from the trial's container to HDFS.
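
As a rough illustration of the paragraph above, trial code might write extra files like this. The sketch makes some assumptions: `save_to_nni_output` is a hypothetical helper, the fallback to the current directory when `NNI_OUTPUT_DIR` is unset is added here so the code also runs outside NNI, and JSON stands in for whatever model format you actually use.

```python
import json
import os


def save_to_nni_output(filename, payload, default_dir="."):
    """Write `payload` as JSON into NNI_OUTPUT_DIR and return the file path.

    Falls back to `default_dir` when NNI_OUTPUT_DIR is not set (an assumption
    made for this sketch so it can be tried locally).
    """
    output_dir = os.environ.get("NNI_OUTPUT_DIR", default_dir)
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w") as f:
        json.dump(payload, f)
    return path
```

A trial could call `save_to_nni_output("final_metrics.json", {"accuracy": acc})` at the end of training; according to the text above, files placed in `NNI_OUTPUT_DIR` are then copied to HDFS by the NNI SDK.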

If you run into any problems when using NNI in pai mode, please create an issue on the [NNI GitHub repo](https://github.com/Microsoft/nni) or send mail to nni@microsoft.com.

Binary file added docs/nni_pai_joblist.jpg
Binary file added docs/nni_trial_hdfs_output.jpg
Binary file added docs/nni_webui_joblist.jpg