diff --git a/docs/en_US/FrameworkControllerMode.md b/docs/en_US/FrameworkControllerMode.md
index ce6cd3fd11..9d4c410786 100644
--- a/docs/en_US/FrameworkControllerMode.md
+++ b/docs/en_US/FrameworkControllerMode.md
@@ -97,4 +97,7 @@ Trial configuration in frameworkcontroller mode have the following configuration
 * frameworkAttemptCompletionPolicy: the policy to run framework, please refer the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy) to get the specific information. Users could use the policy to control the pod, for example, if ps does not stop, only worker stops, this completionpolicy could helps stop ps.
 
 ## How to run example
-After you prepare a config file, you could run your experiment by nnictl. The way to start an experiment on frameworkcontroller is similar to kubeflow, please refer the [document](./KubeflowMode.md) for more information.
\ No newline at end of file
+After you prepare a config file, you could run your experiment by nnictl. The way to start an experiment on frameworkcontroller is similar to kubeflow, please refer the [document](./KubeflowMode.md) for more information.
+
+## version check
+NNI support version check feature in since version 0.6, [refer](PAIMode.md)
\ No newline at end of file
diff --git a/docs/en_US/KubeflowMode.md b/docs/en_US/KubeflowMode.md
index 5a7760c0b9..44ceb7dffb 100644
--- a/docs/en_US/KubeflowMode.md
+++ b/docs/en_US/KubeflowMode.md
@@ -196,4 +196,7 @@ Notice: In kubeflow mode, NNIManager will start a rest server and listen on a po
 
 Once a trial job is completed, you can goto NNI WebUI's overview page (like http://localhost:8080/oview) to check trial's information.
 
+## version check
+NNI support version check feature in since version 0.6, [refer](PAIMode.md)
+
 Any problems when using NNI in kubeflow mode, please create issues on [NNI Github repo](https://github.com/Microsoft/nni).
diff --git a/docs/en_US/PAIMode.md b/docs/en_US/PAIMode.md
index 07cf53e25f..028798fc58 100644
--- a/docs/en_US/PAIMode.md
+++ b/docs/en_US/PAIMode.md
@@ -83,3 +83,13 @@ You can see there're three fils in output folder: stderr, stdout, and trial.log
 If you also want to save trial's other output into HDFS, like model files, you can use environment variable `NNI_OUTPUT_DIR` in your trial code to save your own output files, and NNI SDK will copy all the files in `NNI_OUTPUT_DIR` from trial's container to HDFS.
 
 Any problems when using NNI in pai mode, please create issues on [NNI github repo](https://github.com/Microsoft/nni).
+
+## version check
+NNI support version check feature in since version 0.6. It is a policy to insure the version of NNIManager is consistent with trialKeeper, and avoid errors caused by version incompatibility.
+Check policy:
+1. NNIManager before v0.6 could run any version of trialKeeper, trialKeeper support backward compatibility.
+2. Since version 0.6, NNIManager version should keep same with triakKeeper version. For example, if NNIManager version is 0.6, trialKeeper version should be 0.6 too.
+3. Note that the version check feature only check first two digits of version.For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
+
+If you could not run your experiment and want to know if it is caused by version check, you could check your webUI, and there will be an error message about version check.
+![](../img/version_check.png)
\ No newline at end of file
diff --git a/docs/en_US/RemoteMachineMode.md b/docs/en_US/RemoteMachineMode.md
index 46c21153d4..2d18dc7c71 100644
--- a/docs/en_US/RemoteMachineMode.md
+++ b/docs/en_US/RemoteMachineMode.md
@@ -63,3 +63,6 @@ nnictl create --config ~/nni/examples/trials/mnist-annotation/config_remote.yml
 ```
 
 to start the experiment.
+
+## version check
+NNI support version check feature in since version 0.6, [refer](PAIMode.md)
\ No newline at end of file
diff --git a/docs/img/version_check.png b/docs/img/version_check.png
new file mode 100644
index 0000000000..3ebb516b2a
Binary files /dev/null and b/docs/img/version_check.png differ
diff --git a/src/nni_manager/training_service/local/gpuScheduler.ts b/src/nni_manager/training_service/local/gpuScheduler.ts
index 2c790b019c..58afb31e85 100644
--- a/src/nni_manager/training_service/local/gpuScheduler.ts
+++ b/src/nni_manager/training_service/local/gpuScheduler.ts
@@ -85,9 +85,13 @@ class GPUScheduler {
 
     public async stop() {
         this.stopping = true;
-        const pid: string = await fs.promises.readFile(path.join(this.gpuMetricCollectorScriptFolder, 'pid'), 'utf8');
-        await cpp.exec(`pkill -P ${pid}`);
-        await cpp.exec(`rm -rf ${this.gpuMetricCollectorScriptFolder}`);
+        try {
+            const pid: string = await fs.promises.readFile(path.join(this.gpuMetricCollectorScriptFolder, 'pid'), 'utf8');
+            await cpp.exec(`pkill -P ${pid}`);
+            await cpp.exec(`rm -rf ${this.gpuMetricCollectorScriptFolder}`);
+        } catch (error){
+            this.log.error(`GPU scheduler error: ${error}`);
+        }
     }
 
     private async updateGPUSummary() {
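
Reviewer note: the three-rule check policy that this change adds to PAIMode.md could be sketched roughly as below. This is only an illustration of the documented rules, not the code NNI actually runs. The function names `parseMajorMinor` and `versionCheckPasses` are hypothetical, and the sketch assumes version strings look like `0.6`, `0.6.1`, or `v0.6.2`.

```typescript
// Hypothetical helper (not from the NNI source): extract the first two
// numeric components of a version string such as "0.6", "0.6.1" or "v0.6.2".
function parseMajorMinor(version: string): [number, number] {
    const match = /^v?(\d+)\.(\d+)/i.exec(version.trim());
    if (match === null) {
        throw new Error(`Unrecognized version string: ${version}`);
    }
    return [Number(match[1]), Number(match[2])];
}

// Hypothetical sketch of the documented policy:
//   1. NNIManager before v0.6 may run any trialKeeper version.
//   2. From v0.6 on, the two versions must match.
//   3. Only the first two components are compared.
function versionCheckPasses(nniManagerVersion: string, trialKeeperVersion: string): boolean {
    const [managerMajor, managerMinor] = parseMajorMinor(nniManagerVersion);
    // Rule 1: managers older than 0.6 skip the check entirely.
    if (managerMajor === 0 && managerMinor < 6) {
        return true;
    }
    // Rules 2 and 3: compare major and minor only, ignoring the patch level.
    const [keeperMajor, keeperMinor] = parseMajorMinor(trialKeeperVersion);
    return managerMajor === keeperMajor && managerMinor === keeperMinor;
}
```

With the examples from the documentation, `versionCheckPasses('0.6.1', '0.6.2')` passes while `versionCheckPasses('0.6.1', '0.5.1')` and `versionCheckPasses('0.6.1', '0.7')` do not.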