adding mindspore example #845

Merged · 2 commits · Jun 2, 2020
55 changes: 55 additions & 0 deletions example/MindSpore-example/README.md
@@ -0,0 +1,55 @@
# MindSpore Volcano Example

#### These examples show how to run MindSpore via Volcano. Since MindSpore itself is relatively new, these examples may be oversimplified, but they will evolve along with both communities.

## Introduction to MindSpore

MindSpore is a new open-source deep learning training/inference framework that
can be used in mobile, edge, and cloud scenarios. MindSpore aims to provide a
friendly development experience and efficient execution for data scientists and
algorithm engineers, with native support for the Ascend AI processor and
software-hardware co-optimization.

MindSpore is open sourced on both [GitHub](https://github.com/mindspore-ai/mindspore) and [Gitee](https://gitee.com/mindspore/mindspore).
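
As a quick illustration of the API (not part of this PR), the sketch below adds two tensors on CPU. It mirrors the `gpu-test.py` script included later in this change, but needs no GPU, MPI, or Kubernetes setup:

```python
# Minimal MindSpore smoke test; assumes only that the `mindspore` package is
# installed locally. It mirrors gpu-test.py from this PR, but runs on CPU.
import numpy as np
import mindspore.context as context
from mindspore import Tensor
from mindspore.ops import functional as F

context.set_context(device_target="CPU")  # no accelerator required

x = Tensor(np.ones([2, 2]).astype(np.float32))
y = Tensor(np.ones([2, 2]).astype(np.float32))
print(F.tensor_add(x, y))  # prints a 2x2 tensor in which every element is 2.0
```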

## Prerequisites

These two examples were tested in the environment below; a small optional script for checking that the GPU device plugin is working is sketched after the list.

- Ubuntu: `16.04.6 LTS`
- docker: `v18.06.1-ce`
- Kubernetes: `v1.16.6`
- NVIDIA Docker: `2.3.0`
- NVIDIA/k8s-device-plugin: `1.0.0-beta6`
- NVIDIA drivers: `418.39`
- CUDA: `10.1`
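
Before submitting the GPU example, it can help to confirm that the NVIDIA device plugin is actually advertising GPUs to the cluster. A minimal check, assuming the `kubernetes` Python client and a working kubeconfig (both are assumptions, not part of this example):

```python
# List how many nvidia.com/gpu resources each node advertises as allocatable.
from kubernetes import client, config

config.load_kube_config()          # uses the local kubeconfig
v1 = client.CoreV1Api()
for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: nvidia.com/gpu = {gpus}")
```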

## MindSpore CPU example

This example uses a modified MindSpore CPU image as the container image, which
trains LeNet on the MNIST dataset. The training script ships inside the image
and is invoked as `/tmp/lenet.py` by the Job spec (a rough sketch of such a
script follows the commands below).

- Pull the image: `docker pull lyd911/mindspore-cpu-example:0.2.0`
- Run the job: `kubectl apply -f mindspore-cpu.yaml`
- Check the result: `kubectl logs mindspore-cpu-pod-0`
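
The training script itself is not part of this PR. Purely as an illustration, a LeNet-5 network definition in MindSpore might look like the following sketch; the actual `/tmp/lenet.py` inside the image may differ:

```python
# Hypothetical sketch of the kind of network the CPU example trains.
import mindspore.nn as nn

class LeNet5(nn.Cell):
    """Classic LeNet-5 for 32x32 single-channel MNIST images."""
    def __init__(self, num_classes=10):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, pad_mode='valid')
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc1 = nn.Dense(16 * 5 * 5, 120)
        self.fc2 = nn.Dense(120, 84)
        self.fc3 = nn.Dense(84, num_classes)

    def construct(self, x):
        x = self.max_pool2d(self.relu(self.conv1(x)))
        x = self.max_pool2d(self.relu(self.conv2(x)))
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)
```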

## MindSpore GPU example

This example uses an image built from the official MindSpore GPU image with
openssh-server installed. To check that MindSpore GPU processes can communicate
with one another, we leverage the mpimaster and mpiworker task specs of Volcano.
In this example we launch one mpimaster and two mpiworkers; the Python script is
taken from the [MindSpore Gitee README](https://gitee.com/mindspore/mindspore/blob/master/README.md)
and modified so that it can run in parallel.

- Pull the image: `docker pull lyd911/mindspore-gpu-example:0.2.0`
- Run the job: `kubectl apply -f mindspore-gpu.yaml`
- Check the result: `kubectl logs mindspore-gpu-mpimaster-0`

The expected output is a multi-dimensional array in which every element is 2 (illustrated below).
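
Concretely, `gpu-test.py` adds two all-ones tensors of shape [1, 3, 3, 4], so each MPI rank prints a tensor filled with 2s. The equivalent computation in plain NumPy, shown only to make the expected values explicit:

```python
# Equivalent of the tensor_add in gpu-test.py, expressed with NumPy.
import numpy as np

x = np.ones([1, 3, 3, 4], dtype=np.float32)
y = np.ones([1, 3, 3, 4], dtype=np.float32)
print(x + y)  # a [1, 3, 3, 4] array in which every element is 2.0
```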

## Future

An end-to-end example of training a network with MindSpore on
distributed GPUs via Volcano is planned for the future.
49 changes: 49 additions & 0 deletions example/MindSpore-example/mindspore_cpu/mindspore-cpu.yaml
@@ -0,0 +1,49 @@
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindspore-cpu
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  # Uncomment the following section to enable volumes for job input/output.
  #volumes:
  #  - mountPath: "/myinput"
  #  - mountPath: "/myoutput"
  #    volumeClaimName: "testvolumeclaimname"
  #    volumeClaim:
  #      accessModes: [ "ReadWriteOnce" ]
  #      storageClassName: "my-storage-class"
  #      resources:
  #        requests:
  #          storage: 1Gi
  tasks:
    - replicas: 8
      name: "pod"
      template:
        spec:
          containers:
            - command: ["/bin/bash", "-c", "python /tmp/lenet.py"]
              image: lyd911/mindspore-cpu-example:0.2.0
              imagePullPolicy: IfNotPresent
              name: mindspore-cpu-job
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
              volumeMounts:
                - name: training-result
                  mountPath: /tmp/result
          restartPolicy: OnFailure
          volumes:
            - name: training-result
              emptyDir: {}
13 changes: 13 additions & 0 deletions example/MindSpore-example/mindspore_gpu/gpu-test.py
@@ -0,0 +1,13 @@
import numpy as np
import mindspore.context as context
from mindspore import Tensor
from mindspore.ops import functional as F
from mindspore.communication.management import init, get_rank, get_group_size

# Initialize NCCL-based collective communication; each MPI process becomes one rank.
init('nccl')
context.set_context(device_target="GPU")
# Enable data-parallel execution across all ranks launched by mpiexec.
context.set_auto_parallel_context(parallel_mode="data_parallel", mirror_mean=True, device_num=get_group_size())

# Each rank adds two all-ones tensors and prints a [1, 3, 3, 4] tensor of 2s.
x = Tensor(np.ones([1,3,3,4]).astype(np.float32))
y = Tensor(np.ones([1,3,3,4]).astype(np.float32))
print(F.tensor_add(x, y))
54 changes: 54 additions & 0 deletions example/MindSpore-example/mindspore_gpu/mindspore-gpu.yaml
@@ -0,0 +1,54 @@
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindspore-gpu
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 1
      name: mpimaster
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
                  sleep 10;
                  mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 --prefix /usr/local/openmpi-3.1.5 python /tmp/gpu-test.py;
                  sleep 3600;
              image: lyd911/mindspore-gpu-example:0.2.0
              name: mpimaster
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /home
          restartPolicy: OnFailure
    - replicas: 2
      name: mpiworker
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: lyd911/mindspore-gpu-example:0.2.0
              name: mpiworker
              resources:
                limits:
                  nvidia.com/gpu: "1"
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /home
          restartPolicy: OnFailure

---