This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

DLTS integration #1945

Merged: 28 commits, Mar 2, 2020

Changes from 21 commits

52 changes: 52 additions & 0 deletions docs/en_US/TrainingService/DLTSMode.md
@@ -0,0 +1,52 @@
**Run an Experiment on Deep Learning Training Service**
Member: Shall we use the official name of the project (DLWorkspace) here, or the well-known alias DLTS?

Member Author: Ping @hongzhili for suggestion.

Member Author (@Gerhut, Feb 21, 2020): Changed to DLTS.

===
NNI supports running an experiment on [Deep Learning Training Service](https://github.com/microsoft/DLWorkspace.git) (aka DLTS), called dlts mode. Before starting to use NNI dlts mode, you should have an account to access the DLTS dashboard.
Member: "[Deep Learning Training Service]": same comment about this name usage.

Member Author (@Gerhut, Feb 21, 2020): Changed to DLTS.


## Setup Environment

Step 1. Choose a cluster from the DLTS dashboard, and ask the administrator for the cluster dashboard URL.

![Choose Cluster](../../img/dlts-step1.png)

Step 2. Prepare an NNI config YAML like the following:

Member: Instead of directly posting a YAML example, you might want to outline the new fields, like `trainingServicePlatform: dlts`, or the additional keys compared to LocalMode and RemoteMachineMode; refer to https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md

Member Author (@Gerhut, Feb 21, 2020): Removed all comments here other than the DLTS-specific ones.

```yaml
trainingServicePlatform: dlts
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 1
  image: msranni/nni
dltsConfig:
  dashboard: # Ask administrator for the cluster dashboard URL
```

Remember to fill in the cluster dashboard URL on the last line.

Step 3. Open your working directory on the cluster, and paste the NNI config as well as the related code into a directory.

![Copy Config](../../img/dlts-step3.png)

Step 4. Submit an NNI manager job to the specified cluster.

![Submit Job](../../img/dlts-step4.png)

Step 5. Go to the Endpoints tab of the newly created job and click the Port 40000 link to check the trial's information.
Member: "theck": typo?

Member Author: Fixed


![View NNI WebUI](../../img/dlts-step5.png)
Binary file added docs/img/dlts-step1.png
Binary file added docs/img/dlts-step3.png
Binary file added docs/img/dlts-step4.png
Binary file added docs/img/dlts-step5.png
34 changes: 34 additions & 0 deletions examples/trials/mnist-tfv1/config_dlts.yml
@@ -0,0 +1,34 @@
debug: true
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: dlts
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 1
  #The docker image to run nni job on dlts
  image: msranni/nni:latest
dltsConfig:
  dashboard: http://azure-eastus-p40-dev1-infra01.eastus.cloudapp.azure.com/

  # The following fields are all optional and could be retrieved from environment
  # variables if running in DLTS job container.

  # cluster: .default
  # team: platform
  # email: example@microsoft.com
  # password: # Paste from DLTS dashboard
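The commented-out fields above can be left out when NNI itself runs inside a DLTS job container. Below is a minimal sketch of such an environment fallback; the variable names (DLTS_DASHBOARD, DLTS_CLUSTER, DLTS_TEAM, DLTS_EMAIL, DLTS_PASSWORD) are illustrative assumptions, not necessarily what the training service actually reads.

```typescript
// Sketch only: fills missing dltsConfig fields from environment variables.
// The variable names below are assumptions for this example.
interface ResolvedDltsConfig {
    dashboard: string;
    cluster: string;
    team: string;
    email: string;
    password: string;
}

function resolveDltsConfig(partial: Partial<ResolvedDltsConfig>): ResolvedDltsConfig {
    const fromEnv = (key: string): string | undefined => process.env[key];
    const config: ResolvedDltsConfig = {
        dashboard: partial.dashboard ?? fromEnv('DLTS_DASHBOARD') ?? '',
        cluster: partial.cluster ?? fromEnv('DLTS_CLUSTER') ?? '.default',
        team: partial.team ?? fromEnv('DLTS_TEAM') ?? '',
        email: partial.email ?? fromEnv('DLTS_EMAIL') ?? '',
        password: partial.password ?? fromEnv('DLTS_PASSWORD') ?? ''
    };
    if (config.dashboard === '') {
        throw new Error('dashboard must be set in dltsConfig or provided by the environment');
    }
    return config;
}
```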
7 changes: 6 additions & 1 deletion src/nni_manager/main.ts
@@ -26,6 +26,7 @@ import { PAIYarnTrainingService } from './training_service/pai/paiYarn/paiYarnTr
import {
    RemoteMachineTrainingService
} from './training_service/remote_machine/remoteMachineTrainingService';
import { DLTSTrainingService } from './training_service/dlts/dltsTrainingService';

function initStartupInfo(
    startExpMode: string, resumeExperimentId: string, basePort: number,
@@ -60,6 +61,10 @@ async function initContainer(foreground: boolean, platformMode: string, logFileN
        Container.bind(TrainingService)
            .to(FrameworkControllerTrainingService)
            .scope(Scope.Singleton);
    } else if (platformMode === 'dlts') {
        Container.bind(TrainingService)
            .to(DLTSTrainingService)
            .scope(Scope.Singleton);
    } else {
        throw new Error(`Error: unsupported mode: ${platformMode}`);
    }
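Once the binding above is registered, the rest of nni_manager resolves the training service through the IoC container rather than referencing DLTSTrainingService directly. A small sketch of that lookup, mirroring the `component.get` helper used elsewhere in this PR; the import paths here are assumptions:

```typescript
// Sketch only: resolves whatever TrainingService implementation was bound for the
// current mode. After initContainer(..., 'dlts', ...) has run, this yields the
// DLTSTrainingService singleton.
import * as component from './common/component';
import { TrainingService } from './common/trainingService';

const trainingService: TrainingService = component.get(TrainingService);
```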
@@ -108,7 +113,7 @@ const foreground: boolean = foregroundArg.toLowerCase() === 'true' ? true : fals
const port: number = parseInt(strPort, 10);

const mode: string = parseArg(['--mode', '-m']);
if (!['local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn'].includes(mode)) {
if (!['local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts'].includes(mode)) {
    console.log(`FATAL: unknown mode: ${mode}`);
    usage();
    process.exit(1);
9 changes: 9 additions & 0 deletions src/nni_manager/rest_server/restValidationSchemas.ts
@@ -140,6 +140,15 @@ export namespace ValidationSchemas {
            }),
            uploadRetryCount: joi.number().min(1)
        }),
        dlts_config: joi.object({ // eslint-disable-line @typescript-eslint/camelcase
            dashboard: joi.string().min(1),

            cluster: joi.string().min(1),
            team: joi.string().min(1),

            email: joi.string().min(1),
            password: joi.string().min(1)
        }),
        nni_manager_ip: joi.object({ // eslint-disable-line @typescript-eslint/camelcase
            nniManagerIp: joi.string().min(1)
        })
@@ -18,6 +18,7 @@ export enum TrialConfigMetadataKey {
    KUBEFLOW_CLUSTER_CONFIG = 'kubeflow_config',
    NNI_MANAGER_IP = 'nni_manager_ip',
    FRAMEWORKCONTROLLER_CLUSTER_CONFIG = 'frameworkcontroller_config',
    DLTS_CLUSTER_CONFIG = 'dlts_config',
    VERSION_CHECK = 'version_check',
    LOG_COLLECTION = 'log_collection'
}
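The new `dlts_config` key reaches the training service as serialized JSON through the usual cluster-metadata path. A hedged sketch of how that key might be consumed, modeled on the common NNI setClusterMetadata pattern rather than the actual DLTSTrainingService code; the import path is illustrative:

```typescript
// Sketch only: shows the metadata value being parsed into a DLTSClusterConfig.
import { DLTSClusterConfig } from './dltsClusterConfig';

class DltsMetadataSketch {
    private clusterConfig?: DLTSClusterConfig;

    public async setClusterMetadata(key: string, value: string): Promise<void> {
        // 'dlts_config' is the value of TrialConfigMetadataKey.DLTS_CLUSTER_CONFIG
        if (key === 'dlts_config') {
            this.clusterConfig = JSON.parse(value) as DLTSClusterConfig;
        }
        // Other keys (trial config, version check, log collection, ...) are handled elsewhere.
    }
}
```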
14 changes: 14 additions & 0 deletions src/nni_manager/training_service/dlts/dltsClusterConfig.ts
@@ -0,0 +1,14 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.

export interface DLTSClusterConfig {
    dashboard: string;

    cluster: string;
    team: string;

    email: string;
    password: string;

    gpuType?: string;
}
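For illustration, a literal satisfying this interface might look like the following; every value is a placeholder:

```typescript
import { DLTSClusterConfig } from './dltsClusterConfig';

// Placeholder values only; gpuType is optional and may be filled in later
// by querying the cluster.
const exampleClusterConfig: DLTSClusterConfig = {
    dashboard: 'http://dlts-dashboard.example.com/',
    cluster: '.default',
    team: 'platform',
    email: 'user@example.com',
    password: '<paste from the DLTS dashboard>',
    gpuType: 'P40'
};
```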
8 changes: 8 additions & 0 deletions src/nni_manager/training_service/dlts/dltsData.ts
@@ -0,0 +1,8 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.

export const DLTS_TRIAL_COMMAND_FORMAT: string =
`export NNI_PLATFORM=dlts NNI_SYS_DIR={0} NNI_OUTPUT_DIR={1} NNI_TRIAL_JOB_ID={2} NNI_EXP_ID={3} NNI_TRIAL_SEQ_ID={4} MULTI_PHASE={5} \
&& cd $NNI_SYS_DIR && sh install_nni.sh \
&& cd '{6}' && python3 -m nni_trial_tool.trial_keeper --trial_command '{7}' \
--nnimanager_ip '{8}' --nnimanager_port '{9}' --nni_manager_version '{10}' --log_collection '{11}'`;
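The numbered placeholders are substituted at job-submission time. A hedged sketch of that substitution using a local helper; NNI's own code uses a string-format utility, and all of the sample values below are purely illustrative:

```typescript
// Sketch only: fills the {0}..{11} placeholders of DLTS_TRIAL_COMMAND_FORMAT.
import { DLTS_TRIAL_COMMAND_FORMAT } from './dltsData';

function format(template: string, ...args: Array<string | number>): string {
    return template.replace(/\{(\d+)\}/g, (_match: string, index: string) => String(args[Number(index)]));
}

const trialCommand: string = format(
    DLTS_TRIAL_COMMAND_FORMAT,
    '/nni/trials/abc',         // {0}  NNI_SYS_DIR
    '/nni/trials/abc/output',  // {1}  NNI_OUTPUT_DIR
    'abc',                     // {2}  NNI_TRIAL_JOB_ID
    'exp123',                  // {3}  NNI_EXP_ID
    0,                         // {4}  NNI_TRIAL_SEQ_ID
    'false',                   // {5}  MULTI_PHASE
    '/nni/trials/abc',         // {6}  directory where the trial keeper is started
    'python3 mnist.py',        // {7}  user trial command
    '10.0.0.1',                // {8}  NNI manager IP
    8081,                      // {9}  NNI manager port
    '999.0.0',                 // {10} NNI manager version
    'none'                     // {11} log collection setting
);
```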
45 changes: 45 additions & 0 deletions src/nni_manager/training_service/dlts/dltsJobConfig.ts
@@ -0,0 +1,45 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.

import { DLTSClusterConfig } from "./dltsClusterConfig";
Contributor: Would you please add the license at the beginning of the file and an empty line at the end?

Member Author: Done


export class DLTSJobConfig {
    public readonly team: string;
    public readonly userName: string;
    public readonly vcName: string;
    public readonly gpuType: string;
    public readonly jobType = "training";
    public readonly jobtrainingtype = "RegularJob";
    public readonly ssh = false;
    public readonly ipython = false;
    public readonly tensorboard = false;
    public readonly workPath = '';
    public readonly enableworkpath = true;
    public readonly dataPath = '';
    public readonly enabledatapath = false;
    public readonly jobPath = '';
    public readonly enablejobpath = true;
    public readonly mountpoints = [];
    public readonly env = []
    public readonly hostNetwork = false;
    public readonly useGPUTopology = false;
    public readonly isPrivileged = false;
    public readonly hostIPC = false;
    public readonly preemptionAllowed = "False"

    public constructor(
        clusterConfig: DLTSClusterConfig,
        public readonly jobName: string,
        public readonly resourcegpu: number,
        public readonly image: string,
        public readonly cmd: string,
        public readonly interactivePorts: number[],
    ) {
        if (clusterConfig.gpuType === undefined) {
            throw Error('GPU type not fetched')
        }
        this.vcName = this.team = clusterConfig.team
        this.gpuType = clusterConfig.gpuType
        this.userName = clusterConfig.email
    }
}
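A hedged usage sketch for the constructor above; note that `gpuType` must already be present on the cluster config or the constructor throws. All values are placeholders:

```typescript
import { DLTSClusterConfig } from './dltsClusterConfig';
import { DLTSJobConfig } from './dltsJobConfig';

// Placeholder cluster config; gpuType has to be resolved before a job config is built.
const clusterConfig: DLTSClusterConfig = {
    dashboard: 'http://dlts-dashboard.example.com/',
    cluster: '.default',
    team: 'platform',
    email: 'user@example.com',
    password: '<token>',
    gpuType: 'P40'
};

const jobConfig = new DLTSJobConfig(
    clusterConfig,
    'nni-exp123-trial-abc',  // jobName (illustrative naming)
    1,                       // resourcegpu: GPUs requested for the trial
    'msranni/nni:latest',    // image
    'sh run.sh',             // cmd executed inside the container (illustrative)
    [40000]                  // interactivePorts (illustrative)
);
// team, vcName and userName on jobConfig are derived from the cluster config.
```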
77 changes: 77 additions & 0 deletions src/nni_manager/training_service/dlts/dltsJobRestServer.ts
@@ -0,0 +1,77 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.

'use strict';

import { Request, Response, Router } from 'express';
import { Inject } from 'typescript-ioc';
import * as component from '../../common/component';
import { ClusterJobRestServer } from '../common/clusterJobRestServer';
import { DLTSTrainingService } from './dltsTrainingService';

export interface ParameterFileMeta {
    readonly experimentId: string;
    readonly trialId: string;
    readonly filePath: string;
}

/**
 * DLTS Training service Rest server, provides rest API to support DLTS job metrics update
 *
 */
@component.Singleton
export class DLTSJobRestServer extends ClusterJobRestServer {
    private parameterFileMetaList: ParameterFileMeta[] = [];

    @Inject
    private readonly dltsTrainingService: DLTSTrainingService;

    /**
     * constructor to provide NNIRestServer's own rest property, e.g. port
     */
    constructor() {
        super();
        this.dltsTrainingService = component.get(DLTSTrainingService);
    }

    // tslint:disable-next-line:no-any
    protected handleTrialMetrics(jobId: string, metrics: any[]): void {
        // Split metrics array into single metric, then emit
        // Warning: If not split metrics into single ones, the behavior will be UNKNOWN
        for (const singleMetric of metrics) {
            this.dltsTrainingService.MetricsEmitter.emit('metric', {
                id : jobId,
                data : singleMetric
            });
        }
    }

    protected createRestHandler(): Router {
        const router: Router = super.createRestHandler();

        router.post(`/parameter-file-meta`, (req: Request, res: Response) => {
            try {
                this.log.info(`POST /parameter-file-meta, body is ${JSON.stringify(req.body)}`);
                this.parameterFileMetaList.push(req.body);
                res.send();
            } catch (err) {
                this.log.error(`POST parameter-file-meta error: ${err}`);
                res.status(500);
                res.send(err.message);
            }
        });

        router.get(`/parameter-file-meta`, (req: Request, res: Response) => {
            try {
                this.log.info(`GET /parameter-file-meta`);
                res.send(this.parameterFileMetaList);
            } catch (err) {
                this.log.error(`GET parameter-file-meta error: ${err}`);
                res.status(500);
                res.send(err.message);
            }
        });

        return router;
    }
}
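For completeness, a sketch of a client call against the `/parameter-file-meta` route registered above. The base URL is an assumption: the actual server picks its own port, and ClusterJobRestServer may add an API prefix to the route.

```typescript
// Sketch only: registers one ParameterFileMeta record with the REST server above.
import * as http from 'http';

interface ParameterFileMeta {
    readonly experimentId: string;
    readonly trialId: string;
    readonly filePath: string;
}

function postParameterFileMeta(baseUrl: string, meta: ParameterFileMeta): void {
    const body: string = JSON.stringify(meta);
    const request = http.request(`${baseUrl}/parameter-file-meta`, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'Content-Length': Buffer.byteLength(body)
        }
    }, (response) => {
        console.log(`POST /parameter-file-meta responded with status ${response.statusCode}`);
    });
    request.write(body);
    request.end();
}

// Illustrative base URL; replace with the address the rest server actually listens on.
postParameterFileMeta('http://localhost:51188', {
    experimentId: 'exp123',
    trialId: 'abc',
    filePath: '/nni/experiments/exp123/trials/abc/parameter.cfg'
});
```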