This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

HPO: Alibaba DSW+DLC support #4055

Merged 5 commits on Aug 23, 2021

weidankong
Contributor

HPO: use DSW (Data Science Workshop) as the client and DLC (Deep Learning Containers) as the computing resource.

An example config is at /nni/examples/trials/mnist-pytorch/config_dlc.yml

ghost commented Aug 10, 2021

CLA assistant check
All CLA requirements met.

@weidankong force-pushed the dlc_test branch 3 times, most recently from 8cf25e0 to d57d738 (August 11, 2021 19:17)

const deferred: Deferred<string> = new Deferred<string>();
this.pythonShellClient = new PythonShell('dlcUtil.py', {
    scriptPath: './config/dlc',
    pythonPath: 'python',

Contributor @SparkSnail, Aug 12, 2021

Detect the OS: use python on Windows and python3 otherwise.

Contributor Author

In DLC mode only a Linux environment is available, so there is no Windows case.
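
For reference, the reviewer's suggestion amounts to a one-line platform check. A minimal sketch in Python follows (the PR code itself is TypeScript, where process.platform would play the same role). The helper name python_command is hypothetical and not part of NNI.

```python
import sys


def python_command(platform: str = sys.platform) -> str:
    """Return the interpreter name to invoke for the given platform string."""
    # On Windows the interpreter is usually just "python"; elsewhere prefer "python3".
    return "python" if platform.startswith("win") else "python3"


print(python_command("win32"))  # Windows
print(python_command("linux"))  # Linux, the only platform in DLC mode
```

Because DLC mode only targets Linux, a fixed interpreter name is enough for this PR; the check would matter only if the code path were reused on other platforms.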

private getScript(): string[] {
    const script: string[] = [];
    script.push(
        `python ./config/dlc/dlcUtil.py --type ${this.type} --image ${this.image} --pod_count ${this.podCount} ` +

Contributor @SparkSnail, Aug 12, 2021

Detect the OS: use python on Windows and python3 otherwise.

Contributor Author

In DLC mode only a Linux environment is available, so there is no Windows case.

this.monitorError(this.pythonShellClient, deferred);
return deferred.promise;

return deferred.promise;

Contributor

This return statement is duplicated.

Contributor Author

Fixed.

this.experimentId = info.experimentId;
this.experimentRootDir = info.logDir;
this.config = flattenConfig(config, 'dlc');
component.Container.bind(StorageService).to(MountedStorageService).scope(Scope.Singleton);

Contributor

The storage service should be initialized with your own localStorageMountPoint.

}

public get hasStorageService(): boolean {
return false;

Contributor

This should return true.

await fs.promises.mkdir(environmentLocalTempFolder, {recursive: true});
}

const dlcFolder: string = this.experimentRootDir.replace(

Contributor

this.experimentRootDir is assigned from info.logDir, so why would it contain this.config.localStorageMountPoint? Why not set dlcFolder = this.config.containerStorageMountPoint directly?

Contributor Author

Updated to use storageService, similar to OpenPAI.
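
The question in this thread is how a path under the local NAS mount maps to the same location inside the DLC container. A minimal sketch of that mapping, assuming both sides mount the same NAS share. The function name and the example mount points are hypothetical, and the author's update goes through the storage service rather than a helper like this.

```python
from pathlib import PurePosixPath


def to_container_path(local_path: str, local_mount: str, container_mount: str) -> str:
    """Translate a path under local_mount to the equivalent path under container_mount."""
    # relative_to raises ValueError if local_path is not inside local_mount,
    # whereas a plain string replace would silently return the path unchanged.
    relative = PurePosixPath(local_path).relative_to(local_mount)
    return str(PurePosixPath(container_mount) / relative)


# Hypothetical mount points, for illustration only.
print(to_container_path("/home/admin/nas/nni/exp1", "/home/admin/nas", "/root/data"))
```

This also illustrates the reviewer's concern: if the experiment root is not under the local mount point, a string replace has nothing to replace, while the explicit mapping fails loudly.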

}

const prepare = `cd ${dlcEnvironment.runnerWorkingFolder} && cp -r ../../environment-temp/envs/* ../`;
const startrun = `sh ../install_nni.sh && python -m nni.tools.trial_tool.trial_runner`;

Contributor @SparkSnail, Aug 12, 2021

environment.command already initializes this command: https://github.com/microsoft/nni/blob/master/ts/nni_manager/training_service/reusable/trialDispatcher.ts#L660. By the way, why use python instead of python3?

Contributor Author

In the default DLC container image, python is python3.

@@ -130,7 +130,7 @@ def validate(self, data):
Optional('maxTrialDuration'): And(Regex(r'^[1-9][0-9]*[s|m|h|d]$', error='ERROR: maxTrialDuration format is [digit]{s,m,h,d}')),
Optional('maxTrialNum'): setNumberRange('maxTrialNum', int, 1, 99999),
'trainingServicePlatform': setChoice(
-'trainingServicePlatform', 'remote', 'local', 'pai', 'kubeflow', 'frameworkcontroller', 'dlts', 'aml', 'adl', 'hybrid'),
+'trainingServicePlatform', 'remote', 'local', 'pai', 'kubeflow', 'frameworkcontroller', 'dlts', 'aml', 'adl', 'hybrid', 'dlc'),

Contributor

If you provide a v1 configuration, please add the corresponding conversion in https://github.com/microsoft/nni/blob/master/nni/experiment/config/convert.py#L16.

Contributor Author

Removed v1 support.

region: cn-hangzhou
nasDataSourceId: ${your_nas_data_source_id}
accessKeyId: ${your_ak_id}
accessKeySecret: ${your_ak_key}

Contributor

Providing accessKeySecret directly in the YAML file is not recommended.
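
One common way to address this is to read the secret from the environment when the YAML leaves it out. The sketch below shows the idea only. The variable name DLC_ACCESS_KEY_SECRET and the helper resolve_secret are hypothetical, and nothing in this PR indicates that NNI reads such a variable.

```python
import os
from typing import Mapping, Optional


def resolve_secret(config: Mapping[str, str],
                   key: str = "accessKeySecret",
                   env_var: str = "DLC_ACCESS_KEY_SECRET",  # hypothetical name
                   environ: Optional[Mapping[str, str]] = None) -> str:
    """Prefer the environment variable; fall back to the value in the config."""
    environ = os.environ if environ is None else environ
    value = environ.get(env_var) or config.get(key)
    if not value:
        raise ValueError(f"{key} is not set in the config or in ${env_var}")
    return value


# The config carries no secret; it comes from the (simulated) environment.
print(resolve_secret({"region": "cn-hangzhou"},
                     environ={"DLC_ACCESS_KEY_SECRET": "example-secret"}))
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.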

ecs_spec=args.ecs_spec,
)

job_type = 'TFJob' if args.image.find('tensorflow') >= 0 else 'PyTorchJob'

Contributor

Provide the job type in the NNI job configuration; detecting it from the image name is not a good approach.

Contributor Author

Fixed; added a new parameter to the config.
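
To illustrate why the reviewer objects to the heuristic, the sketch below contrasts it with an explicit argument. The flag name --job_type and the example image name are assumptions made for illustration; the thread does not state what the new config parameter is called.

```python
from argparse import ArgumentParser

JOB_TYPES = ("TFJob", "PyTorchJob")


def guess_job_type(image: str) -> str:
    """The heuristic from the diff: infer the job type from the image name."""
    return "TFJob" if image.find("tensorflow") >= 0 else "PyTorchJob"


def build_parser() -> ArgumentParser:
    """Take the job type as an explicit argument (flag name is hypothetical)."""
    parser = ArgumentParser()
    parser.add_argument("--image", required=True)
    parser.add_argument("--job_type", choices=JOB_TYPES, required=True)
    return parser


# A TensorFlow image whose name lacks "tensorflow" is classified as PyTorchJob.
image = "registry.example.com/team/tf-train:2.5"
print(guess_job_type(image))
args = build_parser().parse_args(["--image", image, "--job_type", "TFJob"])
print(args.job_type)
```

Restricting the argument with choices also rejects typos at parse time instead of submitting a job of the wrong kind.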

@@ -933,3 +1039,5 @@ containerName
AzureBlob container name.

type: ``str``
+
+ion

Contributor

What is this used for?

Contributor Author

Fixed.

@@ -477,6 +499,7 @@ def validate(self, data):
'dlts': Schema({**common_schema, **dlts_trial_schema, **dlts_config_schema}),
'hybrid': Schema({**common_schema, **hybrid_trial_schema, **hybrid_config_schema, **machine_list_schema,
**pai_config_schema, **aml_config_schema, **remote_config_schema}),
+'dlc': Schema({**common_schema, **dlc_trial_schema, **dlc_config_schema}),

Contributor @SparkSnail, Aug 16, 2021

This config schema file is for the v1 config; if you don't support v1, this should be removed.

Contributor Author

Fixed.

}

environment.command = `cd ${environmentRoot} && ${environment.command}`;
environment.command = `${environment.command} 1>${environment.runnerWorkingFolder}/trialrunner_stdout 2>${environment.runnerWorkingFolder}/trialrunner_stderr`;

Contributor

These two lines can be merged.

Contributor Author

Fixed.
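
The merge the reviewer asks for can be sketched as below, here in Python f-strings rather than the TypeScript template strings of the diff. The function names and example paths are hypothetical; only the command layout is taken from the two lines under review.

```python
def build_command(environment_root: str, command: str, runner_folder: str) -> str:
    """Build the full shell command in a single expression."""
    return (f"cd {environment_root} && {command} "
            f"1>{runner_folder}/trialrunner_stdout 2>{runner_folder}/trialrunner_stderr")


def build_command_two_steps(environment_root: str, command: str, runner_folder: str) -> str:
    """The two-assignment form from the diff, kept for comparison."""
    command = f"cd {environment_root} && {command}"
    command = f"{command} 1>{runner_folder}/trialrunner_stdout 2>{runner_folder}/trialrunner_stderr"
    return command


# Hypothetical paths, for illustration only.
print(build_command("/mnt/exp/envs", "sh run.sh", "/mnt/exp/envs/env1"))
```

Both forms produce the same string, so the merge is purely a readability change.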


Step 2. Create PAI-DSW server following this `link <https://help.aliyun.com/document_detail/163684.html?section-2cw-lsi-es9#title-ji9-re9-88x>`__. Note as the training service will be run on PAI-DLC, it won't cost many resources to run and you may just need a PAI-DSW server with CPU.

Step 3. Open PAI-DLC `here <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__, select the same region as your PAI-DSW server. Move to ``dataset configuration`` and mount the same NAS disk as the PAI-DSW server does. (Note currently only PAI-DLC public-cluster is supported.)

Contributor

Wrong link.

Contributor Author

Thanks, updated!

this.userCommand = userCommand;
}

private getScript(): string[] {

Contributor @cruiseliu, Aug 20, 2021

It seems this function has been replaced by PythonShell.

Contributor Author

Removed.

import json
from argparse import ArgumentParser
# ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
from alibabacloud_pai_dlc20201203.client import Client

Contributor

Who is responsible for installing these dependencies?

Contributor Author

On Ali PAI-DSW they are installed by default.
Otherwise, follow Step 4 in DLCMode.rst.
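
A script like dlcUtil.py could check for its dependencies up front and point the user to the documentation instead of failing with a bare ImportError. A minimal sketch, assuming only the top-level package name visible in the import above; the helper missing_modules is hypothetical.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level package imported by dlcUtil.py in this PR.
missing = missing_modules(["alibabacloud_pai_dlc20201203"])
if missing:
    print("Missing packages, see Step 4 of DLCMode.rst:", ", ".join(missing))
```

Only top-level names are checked here, because find_spec imports parent packages when given a dotted name.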

return 'dlc';
}

public async refreshEnvironmentsStatus(environments: EnvironmentInformation[]): Promise<void> {

Contributor @cruiseliu, Aug 20, 2021

I suppose the returned promise should be resolved when refreshing is done. @SparkSnail
If that's the case, this function should return Promise.all(...).

Contributor Author

This keeps it consistent with refreshEnvironmentsStatus in KubernetesEnvironmentService/AMLEnvironmentService.
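
The pattern the reviewer describes, a returned promise that resolves only after every per-environment refresh has finished, can be sketched as follows. This uses Python's asyncio.gather as an analogue of Promise.all in TypeScript. The Environment class and refresh_one are hypothetical stand-ins, not NNI APIs.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Environment:
    name: str
    status: str = "UNKNOWN"


async def refresh_one(env: Environment) -> None:
    """Pretend to query the job status for one environment."""
    await asyncio.sleep(0)  # stands in for a network call
    env.status = "RUNNING"


async def refresh_environments_status(envs: List[Environment]) -> None:
    # gather completes only once every refresh has completed,
    # so callers that await this function see up-to-date statuses.
    await asyncio.gather(*(refresh_one(env) for env in envs))


envs = [Environment("env-a"), Environment("env-b")]
asyncio.run(refresh_environments_status(envs))
print([env.status for env in envs])
```

Without the gather (or Promise.all), the function would return before the individual refreshes finish, and a caller could read stale statuses.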

@SparkSnail merged commit e9c21fd into microsoft:master on Aug 23, 2021
@weidankong deleted the dlc_test branch on September 11, 2021 03:30
4 participants