adding mindspore example #845

Merged · 2 commits · Jun 2, 2020
55 changes: 55 additions & 0 deletions example/MindSpore-example/README.md
@@ -0,0 +1,55 @@
# MindSpore Volcano Example

#### These examples show how to run MindSpore via Volcano. Since MindSpore itself is relatively new, these examples may be oversimplified, but they will evolve along with both communities.

## Introduction to MindSpore

MindSpore is a new open-source deep learning training/inference framework that
can be used in mobile, edge, and cloud scenarios. MindSpore aims to provide a
friendly development experience and efficient execution for data scientists and
algorithm engineers, with native support for the Ascend AI processor and
software-hardware co-optimization.

MindSpore is open sourced on both [GitHub](https://github.com/mindspore-ai/mindspore) and [Gitee](https://gitee.com/mindspore/mindspore).
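
As a quick illustration of the API (not part of this PR), the sketch below adds two tensors on CPU. It mirrors the `gpu-test.py` script included later in this change, but needs no GPU, MPI, or Kubernetes setup:

```python
# Minimal MindSpore smoke test; assumes only that the `mindspore` package is
# installed locally. It mirrors gpu-test.py from this PR, but runs on CPU.
import numpy as np
import mindspore.context as context
from mindspore import Tensor
from mindspore.ops import functional as F

context.set_context(device_target="CPU")  # no accelerator required

x = Tensor(np.ones([2, 2]).astype(np.float32))
y = Tensor(np.ones([2, 2]).astype(np.float32))
print(F.tensor_add(x, y))  # prints a 2x2 tensor in which every element is 2.0
```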

## Prerequisites

These two examples were tested in the environment below; a small optional script for checking that the GPU device plugin is working is sketched after the list.

- Ubuntu: `16.04.6 LTS`
- docker: `v18.06.1-ce`
- Kubernetes: `v1.16.6`
- NVIDIA Docker: `2.3.0`
- NVIDIA/k8s-device-plugin: `1.0.0-beta6`
- NVIDIA drivers: `418.39`
- CUDA: `10.1`
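
Before submitting the GPU example, it can help to confirm that the NVIDIA device plugin is actually advertising GPUs to the cluster. A minimal check, assuming the `kubernetes` Python client and a working kubeconfig (both are assumptions, not part of this example):

```python
# List how many nvidia.com/gpu resources each node advertises as allocatable.
from kubernetes import client, config

config.load_kube_config()          # uses the local kubeconfig
v1 = client.CoreV1Api()
for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: nvidia.com/gpu = {gpus}")
```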

## MindSpore CPU example

This example uses a modified MindSpore CPU image as the container image, which
trains LeNet on the MNIST dataset. The training script ships inside the image
and is invoked as `/tmp/lenet.py` by the Job spec (a rough sketch of such a
script follows the commands below).

- Pull the image: `docker pull lyd911/mindspore-cpu-example:0.2.0`
- Run the job: `kubectl apply -f mindspore-cpu.yaml`
- Check the result: `kubectl logs mindspore-cpu-pod-0`
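
The training script itself is not part of this PR. Purely as an illustration, a LeNet-5 network definition in MindSpore might look like the following sketch; the actual `/tmp/lenet.py` inside the image may differ:

```python
# Hypothetical sketch of the kind of network the CPU example trains.
import mindspore.nn as nn

class LeNet5(nn.Cell):
    """Classic LeNet-5 for 32x32 single-channel MNIST images."""
    def __init__(self, num_classes=10):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, pad_mode='valid')
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc1 = nn.Dense(16 * 5 * 5, 120)
        self.fc2 = nn.Dense(120, 84)
        self.fc3 = nn.Dense(84, num_classes)

    def construct(self, x):
        x = self.max_pool2d(self.relu(self.conv1(x)))
        x = self.max_pool2d(self.relu(self.conv2(x)))
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)
```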

## MindSpore GPU example

This example uses an image built from the official MindSpore GPU image with
openssh-server installed. To check that MindSpore GPU processes can communicate
with one another, we leverage the mpimaster and mpiworker task specs of Volcano.
In this example we launch one mpimaster and two mpiworkers; the Python script is
taken from the [MindSpore Gitee README](https://gitee.com/mindspore/mindspore/blob/master/README.md)
and modified so that it can run in parallel.

- Pull the image: `docker pull lyd911/mindspore-gpu-example:0.2.0`
- Run the job: `kubectl apply -f mindspore-gpu.yaml`
- Check the result: `kubectl logs mindspore-gpu-mpimaster-0`

The expected output is a multi-dimensional array in which every element is 2 (illustrated below).
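
Concretely, `gpu-test.py` adds two all-ones tensors of shape [1, 3, 3, 4], so each MPI rank prints a tensor filled with 2s. The equivalent computation in plain NumPy, shown only to make the expected values explicit:

```python
# Equivalent of the tensor_add in gpu-test.py, expressed with NumPy.
import numpy as np

x = np.ones([1, 3, 3, 4], dtype=np.float32)
y = np.ones([1, 3, 3, 4], dtype=np.float32)
print(x + y)  # a [1, 3, 3, 4] array in which every element is 2.0
```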

## Future

An end-to-end example of training a network with MindSpore on
distributed GPUs via Volcano is planned for the future.
49 changes: 49 additions & 0 deletions example/MindSpore-example/mindspore_cpu/mindspore-cpu.yaml
@@ -0,0 +1,49 @@
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindspore-cpu
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  # Uncomment the following section to enable volumes for job input/output.
  #volumes:
  #  - mountPath: "/myinput"
  #  - mountPath: "/myoutput"
  #    volumeClaimName: "testvolumeclaimname"
  #    volumeClaim:
  #      accessModes: [ "ReadWriteOnce" ]
  #      storageClassName: "my-storage-class"
  #      resources:
  #        requests:
  #          storage: 1Gi
  tasks:
    - replicas: 8
      name: "pod"
      template:
        spec:
          containers:
            - command: ["/bin/bash", "-c", "python /tmp/lenet.py"]
              image: lyd911/mindspore-cpu-example:0.2.0
              imagePullPolicy: IfNotPresent
              name: mindspore-cpu-job
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
              volumeMounts:
                - name: training-result
                  mountPath: /tmp/result
          restartPolicy: OnFailure
          volumes:
            - name: training-result
              emptyDir: {}
13 changes: 13 additions & 0 deletions example/MindSpore-example/mindspore_gpu/gpu-test.py
@@ -0,0 +1,13 @@
import numpy as np
import mindspore.context as context
from mindspore import Tensor
from mindspore.ops import functional as F
from mindspore.communication.management import init, get_rank, get_group_size

# Initialize NCCL-based collective communication; each MPI process becomes one rank.
init('nccl')
context.set_context(device_target="GPU")
# Enable data-parallel execution across all ranks launched by mpiexec.
context.set_auto_parallel_context(parallel_mode="data_parallel", mirror_mean=True, device_num=get_group_size())

# Each rank adds two all-ones tensors and prints a [1, 3, 3, 4] tensor of 2s.
x = Tensor(np.ones([1,3,3,4]).astype(np.float32))
y = Tensor(np.ones([1,3,3,4]).astype(np.float32))
print(F.tensor_add(x, y))
54 changes: 54 additions & 0 deletions example/MindSpore-example/mindspore_gpu/mindspore-gpu.yaml
@@ -0,0 +1,54 @@
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindspore-gpu
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 1
      name: mpimaster
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
                  sleep 10;
                  mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 --prefix /usr/local/openmpi-3.1.5 python /tmp/gpu-test.py;
                  sleep 3600;
              image: lyd911/mindspore-gpu-example:0.2.0
              name: mpimaster
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /home
          restartPolicy: OnFailure
    - replicas: 2
      name: mpiworker
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: lyd911/mindspore-gpu-example:0.2.0
              name: mpiworker
              resources:
                limits:
                  nvidia.com/gpu: "1"
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /home
          restartPolicy: OnFailure

---