This repository has been archived by the owner on Sep 18, 2024. It is now read-only.
HPO: Alibaba DSW+DLC support #4055
Merged
5 commits
64a59c4 dlc: init dlc & sumit dlc & start trial_runner (weidankong)
5818644 DLC: support file command channel used in dlcEnvironment (weidankong)
45eb3a5 DLC: remove redundant script dir param (weidankong)
b30c1a2 DLC: add document && add job_type in config (weidankong)
891994f DLC: use storage service & update doc (weidankong)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
**Run an Experiment on Aliyun PAI-DSW + PAI-DLC** | ||
=================================================== | ||
|
||
NNI supports running an experiment on `PAI-DSW <https://help.aliyun.com/document_detail/194831.html>`__ and submitting its trials to `PAI-DLC <https://help.aliyun.com/document_detail/165137.html>`__. This is called dlc mode. | ||
|
||
The PAI-DSW server submits the jobs, while PAI-DLC is where the training jobs run. | ||
|
||
Setup environment | ||
----------------- | ||
|
||
Step 1. Install NNI by following the install guide `here <../Tutorial/QuickStart.rst>`__. | ||
|
||
Step 2. Create a PAI-DSW server by following this `link <https://help.aliyun.com/document_detail/163684.html?section-2cw-lsi-es9#title-ji9-re9-88x>`__. Note that because the training itself runs on PAI-DLC, the PAI-DSW server needs few resources, and a CPU-only PAI-DSW server is usually enough. | ||
|
||
Step 3. Open PAI-DLC `here <https://pai-dlc.console.aliyun.com/#/guide>`__ and select the same region as your PAI-DSW server. Go to ``dataset configuration`` and mount the same NAS disk that the PAI-DSW server mounts. (Note: currently only the PAI-DLC public cluster is supported.) | ||
|
||
Step 4. Open the command line of your PAI-DSW server, then download and install the PAI-DLC Python SDK, which is used to submit DLC tasks; see `this link <https://help.aliyun.com/document_detail/203290.html>`__ for details. Skip this step if the SDK is already installed. | ||
|
||
|
||
.. code-block:: bash | ||
|
||
   wget https://sdk-portal-cluster-prod.oss-cn-zhangjiakou.aliyuncs.com/downloads/u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip | ||
   unzip u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip | ||
   pip install ./pai-dlc-20201203 # pai-dlc-20201203 is the name of the unzipped SDK directory; replace it accordingly. | ||
|
||
|
||
Run an experiment | ||
----------------- | ||
|
||
This section uses ``examples/trials/mnist-pytorch`` as an example. The NNI config YAML file looks like this: | ||
|
||
.. code-block:: yaml | ||
|
||
   # working directory on DSW, please provide the FULL path | ||
   experimentWorkingDirectory: /home/admin/workspace/{your_working_dir} | ||
   searchSpaceFile: search_space.json | ||
   # the command run on the trial runner (i.e. the DLC container); be aware of data_dir | ||
   trialCommand: python mnist.py --data_dir /root/data/{your_data_dir} | ||
   trialConcurrency: 1 # NOTE: please provide a number <= 3 due to the DLC system limit. | ||
   maxTrialNumber: 10 | ||
   tuner: | ||
     name: TPE | ||
     classArgs: | ||
       optimize_mode: maximize | ||
   # ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x | ||
   trainingService: | ||
     platform: dlc | ||
     type: Worker | ||
     image: registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04 | ||
     jobType: PyTorchJob # choices: [TFJob, PyTorchJob] | ||
     podCount: 1 | ||
     ecsSpec: ecs.c6.large | ||
     region: cn-hangzhou | ||
     accessKeyId: ${your_ak_id} | ||
     accessKeySecret: ${your_ak_key} | ||
     nasDataSourceId: ${your_nas_data_source_id} # NAS data source ID, e.g. datat56by9n1xt0a | ||
     localStorageMountPoint: /home/admin/workspace/ # default NAS path on DSW | ||
     containerStorageMountPoint: /root/data/ # default NAS path on the DLC container; change it according to your setting | ||
|
||
Note: You should set ``platform: dlc`` in the NNI config YAML file if you want to start an experiment in dlc mode. | ||
|
||
Compared with `LocalMode <LocalMode.rst>`__, the training service configuration in dlc mode has these additional keys: ``type``, ``image``, ``jobType``, ``podCount``, ``ecsSpec``, ``region``, ``nasDataSourceId``, ``accessKeyId`` and ``accessKeySecret``. For a detailed explanation, refer to this `link <https://help.aliyun.com/document_detail/203111.html#h2-url-3>`__. | ||
|
||
Also, because dlc mode requires DSW and DLC to mount the same NAS disk in order to share information, there are two extra keys for this: ``localStorageMountPoint`` and ``containerStorageMountPoint``. | ||
|
||
Run the following commands to start the example experiment: | ||
|
||
.. code-block:: bash | ||
|
||
   git clone -b ${NNI_VERSION} https://github.com/microsoft/nni | ||
   cd nni/examples/trials/mnist-pytorch | ||
|
||
   # modify config_dlc.yml ... | ||
|
||
   nnictl create --config config_dlc.yml | ||
|
||
Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v2.3``. | ||
|
||
Monitor your job | ||
---------------- | ||
|
||
To monitor your job on DLC, visit `DLC <https://pai-dlc.console.aliyun.com/#/jobs>`__ and check the job status. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
# working directory on DSW, please provide the FULL path | ||
searchSpaceFile: search_space.json | ||
# the command run on the trial runner (i.e. the DLC container); be aware of data_dir | ||
trialCommand: python mnist.py --data_dir /root/data/{your_data_dir} | ||
trialConcurrency: 1 # NOTE: please provide a number <= 3 due to the DLC system limit. | ||
maxTrialNumber: 10 | ||
tuner: | ||
  name: TPE | ||
  classArgs: | ||
    optimize_mode: maximize | ||
# ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x | ||
trainingService: | ||
  platform: dlc | ||
  type: Worker | ||
  image: registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04 | ||
  jobType: PyTorchJob # choices: [TFJob, PyTorchJob] | ||
  podCount: 1 | ||
  ecsSpec: ecs.c6.large | ||
  region: cn-hangzhou | ||
  accessKeyId: ${your_ak_id} | ||
  accessKeySecret: ${your_ak_key} | ||
  nasDataSourceId: ${your_nas_data_source_id} # NAS data source ID, e.g. datat56by9n1xt0a | ||
  localStorageMountPoint: /home/admin/workspace/ # default NAS path on DSW, MUST provide the full path. | ||
  containerStorageMountPoint: /root/data/ # default NAS path on the DLC container; change it according to your setting |
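The two mount-point keys in the config above describe the same NAS disk as seen from two machines. As a rough illustration only (this is not NNI's implementation, and the helper name `to_container_path` is hypothetical), a path under `localStorageMountPoint` on DSW could be translated to the path the DLC container sees like this, assuming the same NAS directory is mounted at both points:

```python
from pathlib import PurePosixPath


def to_container_path(local_path: str,
                      local_mount: str = "/home/admin/workspace/",
                      container_mount: str = "/root/data/") -> str:
    """Translate a DSW path under the NAS mount to the DLC container's view.

    Hypothetical helper for illustration; the defaults mirror the values of
    localStorageMountPoint and containerStorageMountPoint in the config.
    """
    # relative_to raises ValueError if local_path is not under local_mount.
    relative = PurePosixPath(local_path).relative_to(PurePosixPath(local_mount))
    return str(PurePosixPath(container_mount) / relative)


print(to_container_path("/home/admin/workspace/mnist/data"))
# prints: /root/data/mnist/data
```

This is why `--data_dir` in `trialCommand` starts with `/root/data/`: the command runs inside the DLC container, not on DSW.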
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
# Copyright (c) Microsoft Corporation. | ||
# Licensed under the MIT license. | ||
|
||
from dataclasses import dataclass | ||
|
||
from .common import TrainingServiceConfig | ||
|
||
__all__ = ['DlcConfig'] | ||
|
||
@dataclass(init=False) | ||
class DlcConfig(TrainingServiceConfig): | ||
    platform: str = 'dlc' | ||
    type: str = 'Worker' | ||
    image: str # 'registry-vpc.{region}.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0-cpu-py36-ubuntu18.04', | ||
    job_type: str = 'TFJob' | ||
    pod_count: int | ||
    ecs_spec: str # e.g.,'ecs.c6.large' | ||
    region: str | ||
    nas_data_source_id: str | ||
    access_key_id: str | ||
    access_key_secret: str | ||
    local_storage_mount_point: str | ||
    container_storage_mount_point: str | ||
|
||
    _validation_rules = { | ||
        'platform': lambda value: (value == 'dlc', 'cannot be modified') | ||
    } |
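The fields of `DlcConfig` are written in snake_case, while the corresponding keys in `config_dlc.yml` are camelCase (`job_type` and `jobType`, for example). The sketch below only demonstrates that naming correspondence; how NNI converts between the two internally is not shown in this diff, and `to_camel_case` is a hypothetical helper.

```python
def to_camel_case(field_name: str) -> str:
    """Convert a snake_case field name to the camelCase form used in YAML."""
    head, *tail = field_name.split("_")
    return head + "".join(part.capitalize() for part in tail)


# Field names taken from DlcConfig above.
for name in ["job_type", "pod_count", "ecs_spec", "nas_data_source_id",
             "access_key_id", "local_storage_mount_point"]:
    print(f"{name} -> {to_camel_case(name)}")
```

Each result matches a key that appears under `trainingService` in the YAML file.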
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,6 +9,7 @@ | |
'remote', | ||
'openpai', 'pai', | ||
'aml', | ||
'dlc', | ||
'kubeflow', | ||
'frameworkcontroller', | ||
'adl', | ||
|
Providing `accessKeySecret` directly in the YAML file is not recommended.
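One possible alternative, as a sketch only: keep the key pair out of the file and read it from the environment when the config is built. The variable names `DLC_ACCESS_KEY_ID` and `DLC_ACCESS_KEY_SECRET` and the helper `load_dlc_credentials` are assumptions made for this example, not something this PR or NNI defines.

```python
import os


def load_dlc_credentials(environ=os.environ) -> dict:
    """Read the access key pair from environment variables.

    The variable names are assumptions for illustration only.
    """
    try:
        return {
            "accessKeyId": environ["DLC_ACCESS_KEY_ID"],
            "accessKeySecret": environ["DLC_ACCESS_KEY_SECRET"],
        }
    except KeyError as missing:
        # Fail early with a clear message instead of submitting a bad config.
        raise RuntimeError(f"missing environment variable: {missing}") from None


# Example: merge the credentials into the trainingService section.
training_service = {"platform": "dlc", "region": "cn-hangzhou"}
fake_env = {"DLC_ACCESS_KEY_ID": "example-id", "DLC_ACCESS_KEY_SECRET": "example-secret"}
training_service.update(load_dlc_credentials(fake_env))
print(sorted(training_service))
```

With this approach the secret never has to be committed alongside `config_dlc.yml`.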