merge master #142

Merged: 3 commits, Mar 14, 2019
91 changes: 91 additions & 0 deletions docs/en_US/HowToUseDocker.md
@@ -0,0 +1,91 @@
**How to Use Docker in NNI**
===

## Overview

[Docker](https://www.docker.com/) is a tool that makes it easier to deploy and run applications on your own operating system by starting containers. Docker is not a virtual machine: it does not create a virtual operating system, but it lets different applications share the same OS kernel while isolating them from one another in containers.

You can start NNI experiments in Docker, and NNI provides an official Docker image, [msranni/nni](https://hub.docker.com/r/msranni/nni), on Docker Hub.

## Using Docker on a local machine

### Step 1: Install Docker
Before you use Docker to start NNI experiments, install Docker on your local machine. See the [Docker installation guide for Ubuntu](https://docs.docker.com/install/linux/docker-ce/ubuntu/).

### Step 2: Start a Docker container
Once Docker is installed on your local machine, you can start a container instance to run NNI examples. Because NNI starts a web UI process inside the container that keeps listening on a port, you need to specify a port mapping between your host machine and the container so that the web UI is reachable from outside the container. Visiting the host's IP address and mapped port then forwards you to the web UI process running in the container.

For example, you can start a new Docker container with the following command:
```
docker run -i -t -p [hostPort]:[containerPort] [image]
```
`-i:` Keep STDIN open so the container runs in interactive mode.

`-t:` Allocate a pseudo-terminal (TTY) for the container.

`-p:` Port mapping; maps a host port to a container port.

For more information about the `docker run` command, see the [Docker run reference](https://docs.docker.com/v17.09/edge/engine/reference/run/).
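
As a rough illustration of how the parts of the command above fit together, the sketch below assembles the argument list for `docker run` from a host port, a container port, and an image name. The helper `buildDockerRunArgs` and the port `8080` are assumptions made for this example (they are not part of NNI or Docker); use whichever port your web UI actually listens on.

```typescript
// Hypothetical helper (not part of NNI or Docker): builds the argument list
// for `docker run -i -t -p [hostPort]:[containerPort] [image]`.
interface DockerRunOptions {
    hostPort: number;       // port opened on the host machine
    containerPort: number;  // port the web UI listens on inside the container
    image: string;          // image name, e.g. the official msranni/nni image
}

function isValidPort(port: number): boolean {
    return Number.isInteger(port) && port > 0 && port <= 65535;
}

function buildDockerRunArgs(options: DockerRunOptions): string[] {
    if (!isValidPort(options.hostPort) || !isValidPort(options.containerPort)) {
        throw new Error('hostPort and containerPort must be integers in 1..65535');
    }
    return ['run', '-i', '-t', '-p', `${options.hostPort}:${options.containerPort}`, options.image];
}

// Assumed example values: 8080 on both sides is only an illustration.
const localArgs: string[] = buildDockerRunArgs({ hostPort: 8080, containerPort: 8080, image: 'msranni/nni' });
console.log(`docker ${localArgs.join(' ')}`); // prints "docker run -i -t -p 8080:8080 msranni/nni"
```

With these example values, the web UI would be reachable at port `8080` of the host once an experiment is running in the container.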

Note:
```
NNI only supports Ubuntu and macOS in local mode for the moment, so please use the correct Docker image type. If you want to use a GPU in the Docker container, please use nvidia-docker.
```
### Step 3: Run NNI in the Docker container

If you start a container from NNI's official image `msranni/nni`, you can start NNI experiments directly with the `nnictl` command. The official image includes NNI's runtime environment along with a basic Python environment and deep learning frameworks.

If you start a container from your own image, you may need to install the NNI package first; see the [installation guide](https://github.com/Microsoft/nni/blob/master/docs/en_US/Installation.md).

If you want to run NNI's official examples, you may need to clone the NNI repository from GitHub:
```
git clone https://github.com/Microsoft/nni.git
```
Then enter `nni/examples/trials` to start an experiment.

Once the NNI environment is ready, you can start a new experiment with the `nnictl` command; see the [quick start guide](https://github.com/Microsoft/nni/blob/master/docs/en_US/QuickStart.md).

## Using Docker on a remote platform

NNI supports starting experiments in [remoteTrainingService](https://github.com/Microsoft/nni/blob/master/docs/en_US/RemoteMachineMode.md) and running trial jobs on remote machines. Because a Docker container can run an independent Ubuntu system with an SSH server, a container can serve as the remote machine in NNI's remote mode.

### Step 1: Set up the Docker environment

First install Docker on your remote machine; see the [Docker installation guide for Ubuntu](https://docs.docker.com/install/linux/docker-ce/ubuntu/).

To make sure NNI experiments can connect to your Docker container, build your own Docker image with an SSH server configured, or use an image that already has one. If you use a Docker container as an SSH server, configure either password login or private key login; see the [Docker SSH service example](https://docs.docker.com/engine/examples/running_ssh_service/).

Note:
```
NNI's official image msranni/nni does not support an SSH server for the time being; build your own Docker image with SSH configured, or use another image as the remote server.
```

### Step 2: Start the Docker container on the remote machine

An SSH server needs a port, so you must expose the container's SSH port to NNI as the connection port. For example, if the SSH port inside your container is **`A`**, map container port **`A`** to another port **`B`** on the remote host machine. NNI connects to port **`B`** as the SSH port, the host machine forwards the connection from port **`B`** to port **`A`**, and NNI can then reach your Docker container.

For example, you can start your Docker container with the following command:
```
docker run -dit -p [hostPort]:[containerPort] [image]
```
`containerPort` is the SSH port used inside your Docker container, and `hostPort` is the host machine's port exposed to NNI. Set your NNI config file to connect to `hostPort`, and the connection will be forwarded to your Docker container.
For more information about the `docker run` command, see the [Docker run reference](https://docs.docker.com/v17.09/edge/engine/reference/run/).

Note:
```
If you use your own Docker image as the remote server, please make sure the image has a basic Python environment and the NNI SDK runtime environment. If you want to use a GPU in the Docker container, please use nvidia-docker.
```

### Step 3: Run NNI experiments

Set the training platform in your config file to remote, and set the `machineList` configuration to connect to your Docker SSH server; see [remote machine mode](https://github.com/Microsoft/nni/blob/master/docs/en_US/RemoteMachineMode.md). Make sure you set the correct `port`, `username`, and either `passwd` or `sshKeyPath`.

`port:` The host machine's port that is mapped to the container's SSH port.

`username:` The username in the Docker container.

`passwd:` The password for that user in the Docker container.

`sshKeyPath:` The path to the private key for the Docker container.
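
To make the fields above more concrete, here is a minimal sketch of one `machineList` entry written as a TypeScript object, together with a check that mirrors the rule that either `passwd` or `sshKeyPath` must be set. The interface `MachineEntry`, the function `describeAuth`, and all values (IP address, port `10022`, user name, key path) are hypothetical placeholders for illustration; the actual config file format is described in the remote machine mode documentation.

```typescript
// Hypothetical shape of one machineList entry; field names follow the
// RemoteMachineMeta usage in this PR (ip, port, username, passwd, sshKeyPath).
interface MachineEntry {
    ip: string;          // address of the remote host that runs the container
    port: number;        // host port mapped to the container's SSH port
    username: string;    // user inside the container
    passwd?: string;     // password login
    sshKeyPath?: string; // private key login
}

// Returns which login method an entry would use; the password is checked
// first, matching the order of checks in SSHClientManager.initNewSSHClient.
function describeAuth(entry: MachineEntry): 'password' | 'privateKey' {
    if (entry.passwd) {
        return 'password';
    }
    if (entry.sshKeyPath) {
        return 'privateKey';
    }
    throw new Error('Either passwd or sshKeyPath must be set.');
}

// Placeholder values only: replace them with your own host and credentials.
const dockerMachine: MachineEntry = {
    ip: '10.0.0.5',
    port: 10022,
    username: 'root',
    sshKeyPath: '/home/user/.ssh/id_rsa'
};
console.log(describeAuth(dockerMachine)); // prints "privateKey"
```

Note that `port` here is the host-side port (port **`B`** in Step 2), not the SSH port inside the container.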

After completing the config file, you can start an experiment; see the [quick start guide](https://github.com/Microsoft/nni/blob/master/docs/en_US/QuickStart.md).
1 change: 1 addition & 0 deletions docs/en_US/Tutorials.rst
@@ -11,4 +11,5 @@ Tutorials
Assessors<assessors>
WebUI
Training Platform<training_services>
How to use docker<HowToUseDocker>
advanced
@@ -24,21 +24,21 @@ import { Client } from 'ssh2';
import { getLogger, Logger } from '../../common/log';
import { randomSelect } from '../../common/utils';
import { GPUInfo } from '../common/gpuData';
-import { RemoteMachineMeta, RemoteMachineScheduleResult, ScheduleResultType } from './remoteMachineData';
+import { RemoteMachineMeta, RemoteMachineScheduleResult, ScheduleResultType, SSHClientManager } from './remoteMachineData';

/**
* A simple GPU scheduler implementation
*/
export class GPUScheduler {

-private readonly machineSSHClientMap : Map<RemoteMachineMeta, Client>;
+private readonly machineSSHClientMap : Map<RemoteMachineMeta, SSHClientManager>;
private log: Logger = getLogger();

/**
* Constructor
* @param machineSSHClientMap map from remote machine to sshClient
*/
-constructor(machineSSHClientMap : Map<RemoteMachineMeta, Client>) {
+constructor(machineSSHClientMap : Map<RemoteMachineMeta, SSHClientManager>) {
this.machineSSHClientMap = machineSSHClientMap;
}

@@ -113,7 +113,7 @@ export class GPUScheduler {
*/
private gpuResourceDetection() : Map<RemoteMachineMeta, GPUInfo[]> {
const totalResourceMap : Map<RemoteMachineMeta, GPUInfo[]> = new Map<RemoteMachineMeta, GPUInfo[]>();
-this.machineSSHClientMap.forEach((client: Client, rmMeta: RemoteMachineMeta) => {
+this.machineSSHClientMap.forEach((sshClientManager: SSHClientManager, rmMeta: RemoteMachineMeta) => {
// Assign the total GPU count as the initial number of available GPUs
if (rmMeta.gpuSummary !== undefined) {
const availableGPUs: GPUInfo[] = [];
135 changes: 135 additions & 0 deletions src/nni_manager/training_service/remote_machine/remoteMachineData.ts
@@ -21,6 +21,9 @@

import { JobApplicationForm, TrialJobDetail, TrialJobStatus } from '../../common/trainingService';
import { GPUSummary } from '../common/gpuData';
import { Client, ConnectConfig } from 'ssh2';
import { Deferred } from 'ts-deferred';
import * as fs from 'fs';


/**
Expand Down Expand Up @@ -94,6 +97,138 @@ export class RemoteMachineTrialJobDetail implements TrialJobDetail {
}
}

/**
 * The remote machine ssh client used for trial and gpu detector
 */
export class SSHClient {
    private readonly sshClient: Client;
    private usedConnectionNumber: number; // count of connections in use on this client
    constructor(sshClient: Client, usedConnectionNumber: number) {
        this.sshClient = sshClient;
        this.usedConnectionNumber = usedConnectionNumber;
    }

    public get getSSHClientInstance(): Client {
        return this.sshClient;
    }

    public get getUsedConnectionNumber(): number {
        return this.usedConnectionNumber;
    }

    public addUsedConnectionNumber(): void {
        this.usedConnectionNumber += 1;
    }

    public minusUsedConnectionNumber(): void {
        this.usedConnectionNumber -= 1;
    }
}

export class SSHClientManager {
    private sshClientArray: SSHClient[];
    private readonly maxTrialNumberPerConnection: number;
    private readonly rmMeta: RemoteMachineMeta;
    constructor(sshClientArray: SSHClient[], maxTrialNumberPerConnection: number, rmMeta: RemoteMachineMeta) {
        this.rmMeta = rmMeta;
        this.sshClientArray = sshClientArray;
        this.maxTrialNumberPerConnection = maxTrialNumberPerConnection;
    }

    /**
     * Create a new ssh connection client and initialize it
     */
    private initNewSSHClient(): Promise<Client> {
        const deferred: Deferred<Client> = new Deferred<Client>();
        const conn: Client = new Client();
        const connectConfig: ConnectConfig = {
            host: this.rmMeta.ip,
            port: this.rmMeta.port,
            username: this.rmMeta.username };
        if (this.rmMeta.passwd) {
            connectConfig.password = this.rmMeta.passwd;
        } else if (this.rmMeta.sshKeyPath) {
            if (!fs.existsSync(this.rmMeta.sshKeyPath)) {
                // SSH key path is not a valid file: reject and stop here,
                // otherwise readFileSync below would throw
                deferred.reject(new Error(`${this.rmMeta.sshKeyPath} does not exist.`));

                return deferred.promise;
            }
            const privateKey: string = fs.readFileSync(this.rmMeta.sshKeyPath, 'utf8');

            connectConfig.privateKey = privateKey;
            connectConfig.passphrase = this.rmMeta.passphrase;
        } else {
            // No credentials configured: reject and stop instead of trying to connect
            deferred.reject(new Error(`No valid passwd or sshKeyPath is configured.`));

            return deferred.promise;
        }
        conn.on('ready', () => {
            this.addNewSSHClient(conn);
            deferred.resolve(conn);
        }).on('error', (err: Error) => {
            // SSH connection error, reject with error message
            deferred.reject(new Error(err.message));
        }).connect(connectConfig);

        return deferred.promise;
    }

    /**
     * Find an available ssh client in the ssh array; if none is available, create a new one
     */
    public async getAvailableSSHClient(): Promise<Client> {
        for (const sshClient of this.sshClientArray) {
            if (sshClient.getUsedConnectionNumber < this.maxTrialNumberPerConnection) {
                sshClient.addUsedConnectionNumber();

                return sshClient.getSSHClientInstance;
            }
        }

        // Init a new ssh client if no available one was found
        return this.initNewSSHClient();
    }

    /**
     * Add a new ssh client to sshClientArray
     * @param client
     */
    public addNewSSHClient(client: Client): void {
        this.sshClientArray.push(new SSHClient(client, 1));
    }

    /**
     * The first ssh client instance is used for the gpu collector and host job
     */
    public getFirstSSHClient(): Client {
        return this.sshClientArray[0].getSSHClientInstance;
    }

    /**
     * Close all ssh clients
     */
    public closeAllSSHClient(): void {
        for (const sshClient of this.sshClientArray) {
            sshClient.getSSHClientInstance.end();
        }
    }

    /**
     * Release a resource: decrement the used connection count of the given ssh client
     * @param client
     */
    public releaseConnection(client: Client | undefined): void {
        if (!client) {
            throw new Error(`could not release an undefined ssh client`);
        }
        for (const sshClient of this.sshClientArray) {
            if (sshClient.getSSHClientInstance === client) {
                sshClient.minusUsedConnectionNumber();
                break;
            }
        }
    }
}


export type RemoteMachineScheduleResult = { scheduleInfo : RemoteMachineScheduleInfo | undefined; resultType : ScheduleResultType};

export type RemoteMachineScheduleInfo = { rmMeta : RemoteMachineMeta; cuda_visible_device : string};
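
The connection reuse policy of `SSHClientManager` can be illustrated without a real SSH server. The sketch below is a simplified, self-contained model and not the code in this PR: `FakeConnection` and `ConnectionPool` are hypothetical stand-ins for the `ssh2` `Client` and for `SSHClientManager`, and connections are created synchronously. It shows an existing connection being shared until it reaches `maxPerConnection` users, after which a new connection is opened.

```typescript
// Hypothetical stand-in for an ssh2 Client; it only carries an id.
class FakeConnection {
    constructor(public readonly id: number) {}
}

// Simplified model of the pooling policy: share a connection until it has
// maxPerConnection users, then open another one.
class ConnectionPool {
    private readonly entries: { conn: FakeConnection; used: number }[] = [];

    constructor(private readonly maxPerConnection: number) {}

    public acquire(): FakeConnection {
        for (const entry of this.entries) {
            if (entry.used < this.maxPerConnection) {
                entry.used += 1;
                return entry.conn;
            }
        }
        // No connection has spare capacity: open a new one with one user.
        const conn: FakeConnection = new FakeConnection(this.entries.length);
        this.entries.push({ conn, used: 1 });
        return conn;
    }

    public release(conn: FakeConnection): void {
        const entry = this.entries.find((e) => e.conn === conn);
        if (entry === undefined) {
            throw new Error('Cannot release a connection that is not in the pool.');
        }
        if (entry.used > 0) {
            entry.used -= 1;
        }
    }

    public get size(): number {
        return this.entries.length;
    }
}

const pool: ConnectionPool = new ConnectionPool(2);
const first: FakeConnection = pool.acquire();   // opens connection 0
const second: FakeConnection = pool.acquire();  // shares connection 0
const third: FakeConnection = pool.acquire();   // connection 0 is full, opens connection 1
pool.release(first);
const fourth: FakeConnection = pool.acquire();  // connection 0 has a free slot again
console.log(first.id, second.id, third.id, fourth.id, pool.size); // prints "0 0 1 0 2"
```

Unlike `releaseConnection` in the PR, this model never lets the usage counter drop below zero and throws for an unknown connection; whether the manager should do the same is a design choice left open here. In either case, each `getAvailableSSHClient` call presumably needs a matching `releaseConnection` call once the trial no longer uses the client, otherwise the counters only grow and new connections keep being opened.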