This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Add version check document in PAI, remote, kubeflow and frameworkcontroller #947

Merged
merged 79 commits into from
Apr 2, 2019
Commits
d77a99c
fix remote bug
Dec 25, 2018
695d866
Merge pull request #106 from Microsoft/master
SparkSnail Dec 25, 2018
b7e9799
Merge pull request #107 from Microsoft/master
SparkSnail Dec 27, 2018
7cb03f9
add document
Dec 27, 2018
44d1565
add document
Dec 27, 2018
7ab7386
update
Dec 27, 2018
d9e1ea8
update
Dec 27, 2018
2c225a8
update
Dec 27, 2018
be23f55
update
Dec 29, 2018
6f760ab
Merge pull request #108 from Microsoft/master
SparkSnail Jan 2, 2019
9161209
fix remote issue
Jan 3, 2019
e661c55
fix forEach
Jan 3, 2019
4e5d836
Merge pull request #109 from Microsoft/master
SparkSnail Jan 3, 2019
f80e737
fix conflict
Jan 4, 2019
aefc219
Merge branch 'Microsoft-master'
Jan 4, 2019
4fec2cc
update doc according to comments
Jan 7, 2019
dc45661
Merge pull request #111 from Microsoft/master
SparkSnail Jan 7, 2019
11fec6f
update
Jan 7, 2019
a03a191
update
Jan 7, 2019
7c7832c
update
Jan 7, 2019
2c862dc
Merge pull request #112 from Microsoft/master
SparkSnail Jan 8, 2019
85c015d
remove 'any more'
Jan 8, 2019
85cb472
Merge branch 'master' of https://github.com/SparkSnail/nni
Jan 8, 2019
3784355
Merge pull request #113 from Microsoft/master
SparkSnail Jan 9, 2019
d91c980
Merge pull request #114 from Microsoft/master
SparkSnail Jan 14, 2019
9786650
Merge pull request #115 from Microsoft/master
SparkSnail Jan 17, 2019
ef176d2
Merge pull request #116 from Microsoft/master
SparkSnail Jan 22, 2019
1089e80
Merge pull request #117 from Microsoft/master
SparkSnail Jan 23, 2019
627e823
Merge pull request #119 from Microsoft/master
SparkSnail Jan 24, 2019
b633c26
Merge pull request #120 from Microsoft/master
SparkSnail Jan 25, 2019
035d58b
Merge pull request #121 from Microsoft/master
SparkSnail Feb 11, 2019
cd549df
Merge pull request #122 from Microsoft/master
SparkSnail Feb 12, 2019
964743a
Merge pull request #123 from Microsoft/master
SparkSnail Feb 12, 2019
8422992
Merge pull request #124 from Microsoft/master
SparkSnail Feb 13, 2019
40391ec
Merge pull request #125 from Microsoft/master
SparkSnail Feb 18, 2019
1d84526
Merge pull request #126 from Microsoft/master
SparkSnail Feb 20, 2019
1852457
Merge pull request #127 from Microsoft/master
SparkSnail Feb 23, 2019
754a354
Merge pull request #128 from Microsoft/master
SparkSnail Feb 24, 2019
1ee9735
Merge pull request #129 from Microsoft/master
SparkSnail Feb 25, 2019
9f4485c
Merge pull request #130 from Microsoft/master
SparkSnail Feb 25, 2019
b1c3774
Merge pull request #131 from Microsoft/master
SparkSnail Feb 25, 2019
5d7923e
Merge pull request #132 from Microsoft/master
SparkSnail Feb 25, 2019
281f3dc
Merge pull request #133 from Microsoft/master
SparkSnail Feb 26, 2019
2ce9157
Merge pull request #134 from Microsoft/master
SparkSnail Feb 26, 2019
571a7af
Merge pull request #135 from Microsoft/master
SparkSnail Feb 28, 2019
f09d51a
Merge pull request #136 from Microsoft/master
SparkSnail Mar 1, 2019
41a9a59
Merge pull request #137 from Microsoft/master
SparkSnail Mar 5, 2019
21165b5
Merge pull request #138 from Microsoft/master
SparkSnail Mar 7, 2019
d25f7b5
Merge pull request #139 from Microsoft/master
SparkSnail Mar 11, 2019
17e719e
Merge pull request #140 from Microsoft/master
SparkSnail Mar 12, 2019
e25ffbd
Merge pull request #141 from Microsoft/master
SparkSnail Mar 13, 2019
5e777d2
Merge pull request #142 from Microsoft/master
SparkSnail Mar 14, 2019
6ff24a5
Merge pull request #143 from Microsoft/master
SparkSnail Mar 18, 2019
ccf6c04
Merge pull request #144 from Microsoft/master
SparkSnail Mar 20, 2019
eb5e21c
Merge pull request #145 from Microsoft/master
SparkSnail Mar 20, 2019
f796c60
Merge pull request #146 from Microsoft/master
SparkSnail Mar 21, 2019
e1ae623
Merge pull request #147 from Microsoft/master
SparkSnail Mar 22, 2019
ec41d56
Merge pull request #148 from Microsoft/master
SparkSnail Mar 25, 2019
080ae00
Merge pull request #149 from Microsoft/master
SparkSnail Mar 26, 2019
f0a2d39
Merge pull request #150 from Microsoft/master
SparkSnail Mar 26, 2019
2792098
fix nnictl experiment list command
Mar 28, 2019
d3adab4
fix comments
Mar 29, 2019
b4cb815
fix gpu scheduler bug
Mar 30, 2019
3c5ec02
Merge pull request #151 from Microsoft/v0.6
SparkSnail Mar 30, 2019
79a3bb5
Merge branch 'v0.6' of https://github.com/SparkSnail/nni into v0.6
Mar 30, 2019
0be85b0
remove unused blank line
Mar 30, 2019
ab4b416
rename variable
Mar 30, 2019
20f5aed
remove usused variable
Mar 30, 2019
905b6eb
refactor pkill
Mar 30, 2019
fdba019
fix stop logic
Mar 30, 2019
ed56f82
remove console
Mar 30, 2019
8a22c66
fix comments
Mar 30, 2019
8f17953
fix comments
Apr 1, 2019
dc432b4
add version check document
Apr 1, 2019
f2946f7
fix words
Apr 1, 2019
d926e85
Merge pull request #154 from Microsoft/v0.6
SparkSnail Apr 1, 2019
398b2c6
fix comments
Apr 1, 2019
8d3c3f1
fix gpu scheduler
Apr 1, 2019
d70481a
remove empty line
Apr 1, 2019
5 changes: 4 additions & 1 deletion docs/en_US/FrameworkControllerMode.md
@@ -97,4 +97,7 @@ Trial configuration in frameworkcontroller mode have the following configuration
* frameworkAttemptCompletionPolicy: the policy for running the framework; please refer to the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy) for the specific information. Users can use the policy to control the pods. For example, if the ps does not stop and only the worker stops, this completion policy can help stop the ps.

## How to run example
After you prepare a config file, you can run your experiment with nnictl. The way to start an experiment on frameworkcontroller is similar to kubeflow; please refer to the [document](./KubeflowMode.md) for more information.

## version check
NNI has supported version checking since version 0.6. See [PAI mode](PAIMode.md) for details.
3 changes: 3 additions & 0 deletions docs/en_US/KubeflowMode.md
@@ -196,4 +196,7 @@ Notice: In kubeflow mode, NNIManager will start a rest server and listen on a po

Once a trial job is completed, you can go to the NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.

## version check
NNI has supported version checking since version 0.6. See [PAI mode](PAIMode.md) for details.

If you have any problems when using NNI in kubeflow mode, please create issues on the [NNI Github repo](https://github.com/Microsoft/nni).
10 changes: 10 additions & 0 deletions docs/en_US/PAIMode.md
@@ -83,3 +83,13 @@ You can see there're three fils in output folder: stderr, stdout, and trial.log
If you also want to save trial's other output into HDFS, like model files, you can use environment variable `NNI_OUTPUT_DIR` in your trial code to save your own output files, and NNI SDK will copy all the files in `NNI_OUTPUT_DIR` from trial's container to HDFS.

If you have any problems when using NNI in pai mode, please create issues on the [NNI github repo](https://github.com/Microsoft/nni).

## version check
NNI has supported version checking since version 0.6. The check ensures that the version of NNIManager is consistent with the version of trialKeeper, and avoids errors caused by version incompatibility.
Check policy:
1. NNIManager before v0.6 can run any version of trialKeeper; trialKeeper supports backward compatibility.
2. Since version 0.6, the NNIManager version must be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version must be 0.6 too.
3. Note that the version check only compares the first two digits of the version. For example, NNIManager v0.6.1 can use trialKeeper v0.6 or trialKeeper v0.6.2, but cannot use trialKeeper v0.5.1 or trialKeeper v0.7.
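
The three rules above can be sketched roughly as follows. This is only an illustration of the policy as described, not NNI's actual implementation: the names `parseMajorMinor` and `isVersionCompatible` are hypothetical, and treating an unparseable version string as incompatible is an assumption.

```typescript
// Illustrative sketch only; these helpers are hypothetical and not part of NNI.
interface MajorMinor {
    major: number;
    minor: number;
}

// Extract the first two digits of a version such as "v0.6.1" or "0.6".
function parseMajorMinor(version: string): MajorMinor | undefined {
    const match = /^v?(\d+)\.(\d+)/.exec(version.trim());
    if (match === null) {
        return undefined;
    }
    return { major: Number(match[1]), minor: Number(match[2]) };
}

function isVersionCompatible(nniManagerVersion: string, trialKeeperVersion: string): boolean {
    const manager = parseMajorMinor(nniManagerVersion);
    if (manager === undefined) {
        // Assumption: an unparseable NNIManager version is rejected.
        return false;
    }
    // Rule 1: NNIManager before v0.6 can run any version of trialKeeper.
    const checkEnforced = manager.major > 0 || manager.minor >= 6;
    if (!checkEnforced) {
        return true;
    }
    const keeper = parseMajorMinor(trialKeeperVersion);
    if (keeper === undefined) {
        // Assumption: an unparseable trialKeeper version is rejected.
        return false;
    }
    // Rules 2 and 3: only the first two digits have to match.
    return manager.major === keeper.major && manager.minor === keeper.minor;
}

console.log(isVersionCompatible('v0.6.1', 'v0.6.2'));
console.log(isVersionCompatible('v0.6.1', 'v0.7'));
```

Comparing parsed numbers rather than raw strings keeps the check independent of a leading `v` or a trailing patch digit.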

If you cannot run your experiment and want to know whether the version check is the cause, look at your WebUI: it will show an error message about the version check.
![](../img/version_check.png)
3 changes: 3 additions & 0 deletions docs/en_US/RemoteMachineMode.md
@@ -63,3 +63,6 @@ nnictl create --config ~/nni/examples/trials/mnist-annotation/config_remote.yml
```

to start the experiment.

## version check
NNI has supported version checking since version 0.6. See [PAI mode](PAIMode.md) for details.
Binary file added docs/img/version_check.png
10 changes: 7 additions & 3 deletions src/nni_manager/training_service/local/gpuScheduler.ts
@@ -85,9 +85,13 @@ class GPUScheduler {

    public async stop() {
        this.stopping = true;
        const pid: string = await fs.promises.readFile(path.join(this.gpuMetricCollectorScriptFolder, 'pid'), 'utf8');
        await cpp.exec(`pkill -P ${pid}`);
        await cpp.exec(`rm -rf ${this.gpuMetricCollectorScriptFolder}`);
        try {
            const pid: string = await fs.promises.readFile(path.join(this.gpuMetricCollectorScriptFolder, 'pid'), 'utf8');
Review comment (Contributor): Suggest considering how to show errors from NNI cleanup to the end user; you can improve this later, after the v0.6 release.

Reply (Contributor Author): OK, will refactor the logic in the next release.


            await cpp.exec(`pkill -P ${pid}`);
            await cpp.exec(`rm -rf ${this.gpuMetricCollectorScriptFolder}`);
        } catch (error){
            this.log.error(`GPU scheduler error: ${error}`);
        }
    }

    private async updateGPUSummary() {