This repository has been archived by the owner on Sep 18, 2024. It is now read-only.
Add documentation for NNI PAI mode experiment #141
Merged
87 commits
9a8ac16
PAI Training service implementation, v1
8983045
update trial package directory in setup.py
248d0eb
Update setup.py package info
43fca76
Update trial keeper module, use IP adress for pai training service ma…
4fe49de
Update metrics file path in reader
66a54e1
Fix metrics file path issue
65709d3
Update pai integration, full implementation of pai training service
c1a3d34
Do not send metrics if it is empty
232d0e8
Update nnictl, to support pai configuration
a5d4a20
fix repo
cd64e5f
add hdfs_output_dir
889e066
add copy logic
de9c374
debug
e98b0ac
update hdfsUtility
272411a
debug
4cba4d1
debug
45d1031
fix setup.py bug
e63ffc0
fix bug
954d640
debug
0410d05
debug
e3788d2
add exception handler
793cbf1
fix bug
b14c108
debug
0ae9f6d
fix bug
5938310
fix bug
c756188
fix bug
b6ce813
split metrics into single line, and read metrics no matter if subproc…
dc0f96b
add unit test for hdfsClientUtility
55b6e08
fix bug
2529376
Add experiment id in update metrics url to differ trials
0f7d40c
add default outputdir
43d7ab7
update
9c53f47
fix trial_keeper
beac29c
fix bug
c54ad7a
add default value for nnioutputdir
c214362
fix bug
60bf770
remove unused code
fad2ba3
PAI Training service implementation, v1 (#1)
yds05 7f06762
fix conflict
aa4f306
fix conflict
45c9600
fix conflict
24dd1b6
Remove unused import and paiTrialConfig file
84d278c
Merge branch 'master' into dev-pai
yds05 9febbd3
Merge pull request #3 from yds05/dev-pai
yds05 7a43c54
fix conflict
3e0cce2
refactor code
ef1eaf8
fix comments
7f9baea
fix comment
4af5c60
Implement cancel job API for pai training service
eb548cf
fix default value for outputDir
4d24e87
fix comments
6325bd3
Merge pull request #4 from yds05/dev-pai-desy
yds05 1db913c
Merge pull request #2 from yds05/dev-pai-t-shya2
SparkSnail 5487975
Merge branch 'master' of https://github.com/Microsoft/nni into Micros…
88b1876
Merge branch 'Microsoft-master'
90c9e69
Merge pull request #6 from yds05/master
SparkSnail b714a8f
fix pip install to master
b6a233a
change pip install branch in paiData.ts
9511174
fix conflict
52b1cc8
fix log path
76bd378
fix conflict
c27d146
add logpath logic
449a4f3
add log path
1d9f23e
refactor schema
aa552c0
Fix bug that all trials use the same hdfs log path
9edfb34
Merge pull request #7 from yds05/dev-pai-t-shya2
yds05 cb46266
Update PAI training service PR comments
f09a651
Remove unused nnits-tool in uninstallation
94c92c3
Remove unused trianing_service_tool package in setup.py
2eca5d9
Update setup.py version to 0.2.0
717856e
Change pip install repo to Microsoft/nni
c32cd52
Update NNI v0.2 release notes
a549a16
Merge pull request #8 from Microsoft/master
yds05 76c10e8
Fix typo based on PR comments
5442251
Add NNI installation scripts
09fd234
Merge pull request #9 from Microsoft/v0.2
yds05 73556cc
Update pai script, update NNI_out_dir
a0da600
Update NNI dir in nni sdk local.py
8b644d8
Create .nni folder in nni sdk local.py
fb6e57f
Add check before creating .nni folder
567bf09
Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT
8d7a474
Add documentation for NNI PAI mode
ea792fd
Merge pull request #10 from Microsoft/v0.2
yds05 01de6b7
Fix typo based on PR comments
a4acc22
Exit with subprocess return code of trial keeper
8467ac5
Remove additional exit code
39981a7
Fix typo based on PR comments
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
**Run an Experiment on OpenPAI** | ||
=== | ||
NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker. | ||
|
||
## Setup environment | ||
Install NNI by following the install guide [here](GetStarted.md). | ||
|
||
## Run an experiment | ||
Use `examples/trials/mnist-annotation` as an example. The NNI config YAML file looks like this: | ||
``` | ||
authorName: your_name | ||
experimentName: auto_mnist | ||
# how many trials could be concurrently running | ||
trialConcurrency: 2 | ||
# maximum experiment running duration | ||
maxExecDuration: 3h | ||
# empty means never stop | ||
maxTrialNum: 100 | ||
# choice: local, remote, pai | ||
trainingServicePlatform: pai | ||
# choice: true, false | ||
useAnnotation: true | ||
tuner: | ||
  builtinTunerName: TPE | ||
  classArgs: | ||
    optimize_mode: maximize | ||
trial: | ||
  command: python3 mnist.py | ||
  codeDir: ~/nni/examples/trials/mnist-annotation | ||
  gpuNum: 0 | ||
  cpuNum: 1 | ||
  memoryMB: 8196 | ||
  image: openpai/pai.example.tensorflow | ||
  dataDir: hdfs://10.1.1.1:9000/nni | ||
  outputDir: hdfs://10.1.1.1:9000/nni | ||
# Configuration to access OpenPAI Cluster | ||
paiConfig: | ||
  userName: your_pai_nni_user | ||
  passWord: your_pai_password | ||
Review comment: we have to expose the password here? |
||
  host: 10.1.1.1 | ||
``` | ||
Note: Set `trainingServicePlatform: pai` in the NNI config YAML file to start an experiment in pai mode. | ||
|
||
Compared with LocalMode and [RemoteMachineMode](RemoteMachineMode.md), the trial configuration in pai mode has five additional keys: | ||
* cpuNum | ||
    * Required key. Should be a positive number based on your trial program's CPU requirement. | ||
* memoryMB | ||
    * Required key. Should be a positive number based on your trial program's memory requirement. | ||
* image | ||
    * Required key. In pai mode, your trial program is scheduled by OpenPAI to run in a [Docker container](https://www.docker.com/). This key specifies the Docker image used to create the container in which your trial will run. | ||
* dataDir | ||
    * Optional key. It specifies the HDFS data directory from which the trial downloads data. The format should be something like hdfs://{your HDFS host}:9000/{your data directory} | ||
* outputDir | ||
    * Optional key. It specifies the HDFS output directory for the trial. Once the trial is completed (whether it succeeds or fails), the trial's stdout and stderr are copied to this directory automatically by the NNI SDK. The format should be something like hdfs://{your HDFS host}:9000/{your output directory} | ||
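The key rules listed above can be checked before an experiment is submitted. The following is a minimal sketch, not NNI's own validation: `check_pai_trial_config` is a hypothetical helper, and treating `image` as a non-empty string is an assumption. It only mirrors the rules stated in the list (required keys, positive numbers, `hdfs://host:port/dir` style paths).

```python
from urllib.parse import urlparse

# Keys named in the list above; the grouping into tuples is for illustration only.
REQUIRED_KEYS = ("cpuNum", "memoryMB", "image")
POSITIVE_NUMBER_KEYS = ("cpuNum", "memoryMB")
OPTIONAL_HDFS_KEYS = ("dataDir", "outputDir")


def check_pai_trial_config(trial):
    """Return a list of problems found in a trial section (hypothetical helper)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in trial:
            problems.append(f"missing required key: {key}")
    for key in POSITIVE_NUMBER_KEYS:
        if key in trial:
            value = trial[key]
            # bool is a subclass of int in Python, so exclude it explicitly.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or value <= 0:
                problems.append(f"{key} should be a positive number")
    if "image" in trial and not (isinstance(trial["image"], str) and trial["image"]):
        # Assumption: an image name is a non-empty string.
        problems.append("image should be a non-empty Docker image name")
    for key in OPTIONAL_HDFS_KEYS:
        if key in trial:
            parsed = urlparse(str(trial[key]))
            if parsed.scheme != "hdfs" or not parsed.hostname:
                problems.append(f"{key} should look like hdfs://host:9000/dir")
    return problems
```

An empty list means the trial section satisfies the rules described above; it does not guarantee that the OpenPAI cluster will accept the job.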
|
||
Once you have filled in the NNI experiment config file and saved it (for example, as exp_pai.yaml), run the following command | ||
``` | ||
nnictl create --config exp_pai.yaml | ||
``` | ||
to start the experiment in pai mode. NNI will create OpanPAI job for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`. | ||
Review comment: OpanPAI is a typo. |
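To find NNI-created jobs in an OpenPAI job list, the job name format quoted above can be built and parsed with a small helper. This is a sketch only: `make_job_name` and `parse_job_name` are hypothetical names, and the pattern assumes that neither ID contains an underscore, which the text above does not state.

```python
import re

# Assumption: experiment and trial IDs contain no underscores.
JOB_NAME_PATTERN = re.compile(
    r"^nni_exp_(?P<experiment_id>[^_]+)_trial_(?P<trial_id>[^_]+)$"
)


def make_job_name(experiment_id, trial_id):
    """Build a job name in the nni_exp_{experiment_id}_trial_{trial_id} format."""
    return f"nni_exp_{experiment_id}_trial_{trial_id}"


def parse_job_name(job_name):
    """Return (experiment_id, trial_id), or None if the name does not match."""
    match = JOB_NAME_PATTERN.match(job_name)
    if match is None:
        return None
    return match["experiment_id"], match["trial_id"]
```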
||
You can see the pai jobs created by NNI in your OpenPAI cluster's web portal, like: | ||
![](./nni_pai_joblist.jpg) | ||
|
||
Notice: In pai mode, NNIManager starts a REST server that listens on port `51189` to receive metrics from trial jobs running in PAI containers, so you should open TCP port `51189` in your firewall rules to allow incoming traffic. | ||
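One way to confirm that the port is reachable is to attempt a TCP connection from a machine inside the cluster to the machine running NNIManager. The sketch below uses only the Python standard library; `can_connect` is a hypothetical helper, and the 3-second timeout is an arbitrary choice.

```python
import socket

NNI_METRICS_PORT = 51189  # port named in the notice above


def can_connect(host, port=NNI_METRICS_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A `False` result while the experiment is running suggests that a firewall or routing rule is blocking the port, although it cannot say which one.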
|
||
Once a trial job is completed, you can go to the NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information. | ||
|
||
Expand a trial's information in the trial list view and click the logPath link, like: | ||
![](./nni_webui_joblist.jpg) | ||
|
||
You will then be redirected to the HDFS web portal to browse that trial's output files in HDFS: | ||
![](./nni_trial_hdfs_output.jpg) | ||
|
||
You can see there're three fils in output folder: stderr, stdout, and trial.log | ||
Review comment: "fils" is a typo. |
||
|
||
If you also want to save other trial output, such as model files, to HDFS, write those files to the directory given by the environment variable `NNI_OUTPUT_DIR` in your trial code. The NNI SDK will copy all files in `NNI_OUTPUT_DIR` from the trial's container to HDFS. | ||
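In trial code, this could look like the sketch below. `save_artifact` is a hypothetical helper, not part of the NNI SDK, and the fallback directory used when `NNI_OUTPUT_DIR` is unset (for example, when running the script outside NNI) is an assumption.

```python
import os


def save_artifact(filename, data, default_dir="./nni_output"):
    """Write bytes under NNI_OUTPUT_DIR and return the path of the written file.

    Falls back to default_dir when the variable is unset (an assumption for
    running the trial script outside NNI).
    """
    output_dir = os.environ.get("NNI_OUTPUT_DIR", default_dir)
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

A trial would call something like `save_artifact("model.bin", serialized_model)` before it exits, where `serialized_model` stands for whatever bytes your framework produces.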
|
||
If you run into any problems when using NNI in pai mode, please create an issue on the [NNI github repo](https://github.com/Microsoft/nni), or send mail to nni@microsoft.com | ||
|
Review comment: How about changing the first sentence to:
Starting from v0.2.0, NNI supports one more Training Service Mode: pai mode, which enables users to run an experiment on OpenPAI.