Please add documentation for DLC mode; see https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/AMLMode.rst for reference.
8cf25e0 to d57d738
const deferred: Deferred<string> = new Deferred<string>();
this.pythonShellClient = new PythonShell('dlcUtil.py', {
    scriptPath: './config/dlc',
    pythonPath: 'python',
Detect the OS: use `python` on Windows, otherwise use `python3`.
In DLC mode only the Linux environment is available, so there is no Windows case.
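For reference, the OS check suggested in this thread could be sketched as below. `pickPythonCommand` is a hypothetical helper, not NNI code, and the platform string is passed in as an argument (in Node it would typically come from `process.platform`, which reports Windows as `win32`). Since DLC mode only targets Linux, the PR keeps plain `python`.

```typescript
// Hypothetical helper, not part of NNI: choose the interpreter name by platform.
function pickPythonCommand(platform: string): string {
    // Node's process.platform is 'win32' on Windows (32- and 64-bit alike).
    return platform === 'win32' ? 'python' : 'python3';
}

// Example: in a Node environment the argument would be process.platform.
const linuxCommand: string = pickPythonCommand('linux');
```

Taking the platform as a parameter keeps the helper easy to test without mocking the runtime.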
private getScript(): string[] {
    const script: string[] = [];
    script.push(
        `python ./config/dlc/dlcUtil.py --type ${this.type} --image ${this.image} --pod_count ${this.podCount} ` +
Detect the OS: use `python` on Windows, otherwise use `python3`.
In DLC mode only the Linux environment is available, so there is no Windows case.
this.monitorError(this.pythonShellClient, deferred);
return deferred.promise;

return deferred.promise;
duplicated
fixed.
this.experimentId = info.experimentId;
this.experimentRootDir = info.logDir;
this.config = flattenConfig(config, 'dlc');
component.Container.bind(StorageService).to(MountedStorageService).scope(Scope.Singleton);
You should initialize the storage service with your own `localStorageMountPoint`.
}

public get hasStorageService(): boolean {
    return false;
This should return `true`.
    await fs.promises.mkdir(environmentLocalTempFolder, {recursive: true});
}

const dlcFolder: string = this.experimentRootDir.replace(
It seems `this.experimentRootDir` is assigned from `info.logDir`, so why would it contain `this.config.localStorageMountPoint`? Why not use `dlcFolder = this.config.containerStorageMountPoint` directly?
Updated to use `storageService`, similar to OpenPAI.
}

const prepare = `cd ${dlcEnvironment.runnerWorkingFolder} && cp -r ../../environment-temp/envs/* ../`;
const startrun = `sh ../install_nni.sh && python -m nni.tools.trial_tool.trial_runner`;
`environment.command` is already initialized with this command: https://github.com/microsoft/nni/blob/master/ts/nni_manager/training_service/reusable/trialDispatcher.ts#L660. By the way, why use `python` instead of `python3`?
In the default DLC container image, `python` is `python3`.
nni/tools/nnictl/config_schema.py
Outdated
@@ -130,7 +130,7 @@ def validate(self, data):
 Optional('maxTrialDuration'): And(Regex(r'^[1-9][0-9]*[s|m|h|d]$', error='ERROR: maxTrialDuration format is [digit]{s,m,h,d}')),
 Optional('maxTrialNum'): setNumberRange('maxTrialNum', int, 1, 99999),
 'trainingServicePlatform': setChoice(
-    'trainingServicePlatform', 'remote', 'local', 'pai', 'kubeflow', 'frameworkcontroller', 'dlts', 'aml', 'adl', 'hybrid'),
+    'trainingServicePlatform', 'remote', 'local', 'pai', 'kubeflow', 'frameworkcontroller', 'dlts', 'aml', 'adl', 'hybrid', 'dlc'),
If you provide a v1 configuration, please add the corresponding conversion in https://github.com/microsoft/nni/blob/master/nni/experiment/config/convert.py#L16.
Removed v1 support.
region: cn-hangzhou
nasDataSourceId: ${your_nas_data_source_id}
accessKeyId: ${your_ak_id}
accessKeySecret: ${your_ak_key}
Providing `accessKeySecret` directly in the YAML file is not recommended.
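One common alternative to a secret stored in the YAML file is to read it from the process environment. The sketch below is an assumption for illustration, not what this PR implements: the `DlcCredentials` shape, the `loadDlcCredentials` helper, and the variable names `DLC_ACCESS_KEY_ID` and `DLC_ACCESS_KEY_SECRET` are all hypothetical.

```typescript
// Hypothetical shape and variable names, for illustration only.
interface DlcCredentials {
    accessKeyId: string;
    accessKeySecret: string;
}

// Reads both credentials from an environment-like map and fails loudly if either is missing.
function loadDlcCredentials(env: Record<string, string | undefined>): DlcCredentials {
    const accessKeyId = env['DLC_ACCESS_KEY_ID'];
    const accessKeySecret = env['DLC_ACCESS_KEY_SECRET'];
    if (!accessKeyId || !accessKeySecret) {
        throw new Error('DLC_ACCESS_KEY_ID and DLC_ACCESS_KEY_SECRET must both be set');
    }
    return { accessKeyId, accessKeySecret };
}
```

In a Node process the argument would be `process.env`; passing the map explicitly keeps the helper testable.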
ts/nni_manager/config/dlc/dlcUtil.py
Outdated
    ecs_spec=args.ecs_spec,
)

job_type = 'TFJob' if args.image.find('tensorflow') >= 0 else 'PyTorchJob'
Provide the job type through the NNI job configuration; detecting it from the image name is not a good approach.
Fixed; added a new parameter in the config.
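A minimal sketch of taking the job type from configuration instead of guessing it from the image name. The thread does not name the new parameter, so the `jobType` field and the `parseJobType` helper are hypothetical; only the two values `TFJob` and `PyTorchJob` come from the snippet under review.

```typescript
// The two job types that appear in the reviewed snippet.
type DlcJobType = 'TFJob' | 'PyTorchJob';

// Hypothetical validator for an assumed `jobType` config field.
function parseJobType(value: string): DlcJobType {
    if (value === 'TFJob' || value === 'PyTorchJob') {
        return value;
    }
    throw new Error(`Unsupported jobType '${value}', expected 'TFJob' or 'PyTorchJob'`);
}
```

Rejecting unknown values up front means a typo in the config fails at validation time rather than silently falling back to a default job type.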
@@ -933,3 +1039,5 @@ containerName
 AzureBlob container name.

 type: ``str``
+
+ion
what is this used for?
fixed.
nni/tools/nnictl/config_schema.py
Outdated
@@ -477,6 +499,7 @@ def validate(self, data):
 'dlts': Schema({**common_schema, **dlts_trial_schema, **dlts_config_schema}),
 'hybrid': Schema({**common_schema, **hybrid_trial_schema, **hybrid_config_schema, **machine_list_schema,
                   **pai_config_schema, **aml_config_schema, **remote_config_schema}),
+'dlc': Schema({**common_schema, **dlc_trial_schema, **dlc_config_schema}),
This config schema file is for v1 configs. If you don't support v1 config, this should be removed.
fixed.
}

environment.command = `cd ${environmentRoot} && ${environment.command}`;
environment.command = `${environment.command} 1>${environment.runnerWorkingFolder}/trialrunner_stdout 2>${environment.runnerWorkingFolder}/trialrunner_stderr`;
These two lines can be merged.
fixed.
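One way the two assignments could be merged into a single expression is sketched below. `buildRunnerCommand` is a hypothetical standalone helper written for illustration; the actual fix in the PR may look different.

```typescript
// Hypothetical helper: builds the final runner command in one step.
function buildRunnerCommand(environmentRoot: string, command: string, runnerWorkingFolder: string): string {
    // Change into the environment root, run the command, and redirect stdout/stderr to files.
    return `cd ${environmentRoot} && ${command} ` +
        `1>${runnerWorkingFolder}/trialrunner_stdout 2>${runnerWorkingFolder}/trialrunner_stderr`;
}
```

The resulting string is the same as the one the two original assignments produce in sequence.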
Step 2. Create PAI-DSW server following this `link <https://help.aliyun.com/document_detail/163684.html?section-2cw-lsi-es9#title-ji9-re9-88x>`__. Note as the training service will be run on PAI-DLC, it won't cost many resources to run and you may just need a PAI-DSW server with CPU.

Step 3. Open PAI-DLC `here <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__, select the same region as your PAI-DSW server. Move to ``dataset configuration`` and mount the same NAS disk as the PAI-DSW server does. (Note currently only PAI-DLC public-cluster is supported.)
wrong link
thanks, updated!
    this.userCommand = userCommand;
}

private getScript(): string[] {
Seems this function is replaced by PythonShell.
removed.
import json
from argparse import ArgumentParser
# ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
from alibabacloud_pai_dlc20201203.client import Client
Who is responsible for installing these dependencies?
On Ali PAI-DSW they are installed by default. Otherwise, follow "Step 4" in DLCMode.rst.
    return 'dlc';
}

public async refreshEnvironmentsStatus(environments: EnvironmentInformation[]): Promise<void> {
I suppose the returned promise should resolve only when refreshing is done. @SparkSnail If that's the case, this function should return `Promise.all(...)`.
This keeps it consistent with `refreshEnvironmentsStatus` in KubernetesEnvironmentService/AMLEnvironmentService.
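For context, the `Promise.all` pattern the reviewer describes can be sketched as follows. Everything here is a stand-in: `EnvInfo` is a simplified substitute for NNI's `EnvironmentInformation`, and `refreshOne` is a hypothetical per-environment refresh that just sets a status.

```typescript
// Minimal stand-in for NNI's EnvironmentInformation; the fields are assumed.
interface EnvInfo {
    id: string;
    status: string;
}

// Hypothetical per-environment refresh; a real one would query the job status remotely.
function refreshOne(env: EnvInfo): Promise<void> {
    return Promise.resolve().then(() => {
        env.status = 'RUNNING';
    });
}

// The returned promise resolves only after every environment has been refreshed.
function refreshAll(envs: EnvInfo[]): Promise<void> {
    return Promise.all(envs.map((env) => refreshOne(env))).then(() => undefined);
}
```

A caller that awaits `refreshAll` can rely on every status being current, which is the guarantee the reviewer is asking about. Fire-and-forget refreshes, as in the existing services, do not give that guarantee.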
HPO: DSW (Data Science Workshop) as the client and DLC (Deep Learning Containers) as the computing resource.
Please find the config in /nni/examples/trials/mnist-pytorch/config_dlc.yml.