This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Optimized layout of PaiMode.md (#3085)
98may authored Nov 24, 2020
1 parent 52e40cb commit e9c62ba
Showing 1 changed file with 29 additions and 13 deletions.
42 changes: 29 additions & 13 deletions docs/en_US/TrainingService/PaiMode.md
@@ -2,25 +2,27 @@
===
NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.

[toc]

## Setup environment

Step 1. Install NNI, follow the install guide [here](../Tutorial/QuickStart.md).
**Step 1. Install NNI, follow the install guide [here](../Tutorial/QuickStart.md).**

Step 2. Get token.
**Step 2. Get token.**

Open the OpenPAI web portal and click the `My profile` button in the top-right corner.
![](../../img/pai_profile.jpg)
<img src="../../img/pai_profile.jpg" style="zoom: 80%;" />

Click the `copy` button on the page to copy a JWT token.
![](../../img/pai_token.jpg)
<img src="../../img/pai_token.jpg" style="zoom:67%;" />

Step 3. Mount NFS storage to local machine.
**Step 3. Mount NFS storage to local machine.**

Click the `Submit job` button in the web portal.
![](../../img/pai_job_submission_page.jpg)
<img src="../../img/pai_job_submission_page.jpg" style="zoom: 50%;" />

Find the data management region on the job submission page.
![](../../img/pai_data_management_page.jpg)
<img src="../../img/pai_data_management_page.jpg" style="zoom: 33%;" />

`Preview container paths` shows the NFS host and path that OpenPAI provides. You need to mount the corresponding host and path to your local machine first; NNI can then use OpenPAI's NFS storage.
For example, use the following command:
@@ -33,9 +35,9 @@ You could use the following configuration in your NNI's config file:

```yaml
nniManagerNFSMountPath: /local/mnt
```
```
Step 4. Get OpenPAI's storage config name and nniManagerMountPath
**Step 4. Get OpenPAI's storage config name and nniManagerMountPath**
The `Team share storage` field is the storage configuration used to specify the storage value in OpenPAI. You can get the `paiStorageConfigName` and `containerNFSMountPath` fields from `Team share storage`, for example:

@@ -44,7 +46,10 @@ paiStorageConfigName: confignfs-data
containerNFSMountPath: /mnt/confignfs-data
```



## Run an experiment

Use `examples/trials/mnist-annotation` as an example. The NNI config YAML file's content is like:

```yaml
@@ -88,9 +93,11 @@ paiConfig:
```

Note: Set `trainingServicePlatform: pai` in the NNI config YAML file to start an experiment in pai mode. The `host` field in the configuration file is the URI of PAI's job submission page, such as `10.10.5.1`. NNI uses `http` by default; if your PAI cluster has https enabled, use the `https://10.10.5.1` format.
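
The protocol rule in this note can be sketched in Python. The helper name `pai_base_url` is hypothetical and not part of NNI; it only illustrates, under that assumption, how a bare host falls back to `http` while an explicit scheme is left unchanged.

```python
def pai_base_url(host: str) -> str:
    """Return a base URL for a PAI host (hypothetical helper, not NNI's API).

    A host without a scheme is treated as http, mirroring the default
    described in the note; an explicit http:// or https:// prefix is kept.
    """
    host = host.strip().rstrip("/")
    if host.startswith(("http://", "https://")):
        return host
    return "http://" + host


if __name__ == "__main__":
    print(pai_base_url("10.10.5.1"))
    print(pai_base_url("https://10.10.5.1"))
```

A cluster with https enabled would therefore need the full `https://10.10.5.1` form in the `host` field, since a bare address is read as plain http.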



### Trial configurations

Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMode.md), `trial` configuration in pai mode have these additional keys:
Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMode.md), `trial` configuration in pai mode has the following additional keys:

* cpuNum

@@ -136,6 +143,8 @@ Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMod

2. If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job. Users should ensure that only one taskRole reports metrics to NNI; otherwise, there might be conflict errors.



### OpenPAI configurations

`paiConfig` includes OpenPAI specific configurations,
@@ -171,17 +180,23 @@ Notice: In pai mode, NNIManager will start a rest server and listen on a port wh
Once a trial job is completed, you can go to the NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.

Expand a trial's information in the trial list view, then click the logPath link like:
![](../../img/nni_webui_joblist.jpg)
<img src="../../img/nni_webui_joblist.jpg" style="zoom: 30%;" />

And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS:
![](../../img/nni_trial_hdfs_output.jpg)
<img src="../../img/nni_trial_hdfs_output.jpg" style="zoom: 80%;" />

You can see there are three files in the output folder: stderr, stdout, and trial.log.



## Data management

Before using NNI to start an experiment, set up the corresponding data mount path on your nniManager machine. OpenPAI has its own storage (NFS, AzureBlob, ...), and the storage used in OpenPAI is mounted to the container when a job starts. Set the `paiStorageConfigName` field to choose a storage in OpenPAI. Then mount that storage on your nniManager machine and set the `nniManagerNFSMountPath` field in the configuration file. NNI generates bash files and copies the data in `codeDir` to the `nniManagerNFSMountPath` folder, then starts a trial job. The data in `nniManagerNFSMountPath` is synced to OpenPAI storage and mounted to OpenPAI's container. The data path in the container is set by `containerNFSMountPath`; NNI enters this folder first, and then runs scripts to start a trial job.
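
The relationship between the two mount paths can be sketched as follows. This is only an illustration, assuming the same share is mounted at `nniManagerNFSMountPath` on the nniManager machine and at `containerNFSMountPath` in the container; the function `container_path` and the `exp1/run_trial.sh` file are hypothetical and do not come from NNI.

```python
from pathlib import PurePosixPath


def container_path(local_path: str, nni_manager_mount: str, container_mount: str) -> str:
    """Map a file under the local mount to its path inside the container.

    Hypothetical helper: assumes both mount points expose the same share root.
    Raises ValueError if local_path is not under nni_manager_mount.
    """
    relative = PurePosixPath(local_path).relative_to(nni_manager_mount)
    return str(PurePosixPath(container_mount) / relative)


if __name__ == "__main__":
    # Values taken from the configuration examples in this document.
    print(container_path("/local/mnt/exp1/run_trial.sh",
                         "/local/mnt",
                         "/mnt/confignfs-data"))
```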



## Version check

NNI has supported a version check feature since version 0.6. This policy ensures that the version of NNIManager is consistent with trialKeeper, and avoids errors caused by version incompatibility.
Check policy:

@@ -190,4 +205,5 @@ Check policy:
3. Note that the version check feature only checks the first two digits of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
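
The two-digit comparison in item 3 can be sketched in Python. This is not NNI's actual implementation; `major_minor` and `versions_compatible` are hypothetical names, and the sketch assumes versions look like `v0.6.1` or `0.6`.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract the first two numeric components, e.g. 'v0.6.1' -> (0, 6)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def versions_compatible(nni_manager: str, trial_keeper: str) -> bool:
    """Return True when both versions share the same first two digits."""
    return major_minor(nni_manager) == major_minor(trial_keeper)


if __name__ == "__main__":
    for keeper in ("v0.6", "v0.6.2", "v0.5.1", "v0.7"):
        print(keeper, versions_compatible("v0.6.1", keeper))
```

The loop reuses the version pairs from the example in item 3, so the patch-level digit never affects the result.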

If you cannot run your experiment and want to know whether the version check caused it, check your WebUI, where an error message about the version check will be shown.
![](../../img/version_check.png)

<img src="../../img/version_check.png" style="zoom: 80%;" />
