Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add ml example #76

Merged
merged 3 commits into from
Nov 18, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
72 changes: 72 additions & 0 deletions examples/tests/dlio-ml/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# DLIO ML Example

This is an example of using the IO tool[DLIO](https://dlio-profiler.readthedocs.io/en/latest/build.html#build-dlio-profiler-with-pip-recommended) that can
be added on the fly with pip.

## Usage

Create a cluster and install JobSet to it.

```bash
kind create cluster
VERSION=v0.2.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
```

Install the operator (from the development manifest here):

```bash
$ kubectl apply -f ../../dist/metrics-operator-dev.yaml
```

How to see metrics operator logs:

```bash
$ kubectl logs -n metrics-system metrics-controller-manager-859c66464c-7rpbw
```

Then create the metrics set. This is going to run a single run of LAMMPS over MPI.
as lammps runs.

```bash
kubectl apply -f metrics.yaml
```

Wait until you see pods created by the job and then running (there should be two - a launcher and worker for LAMMPS):

```bash
kubectl get pods
```
```diff
NAME READY STATUS RESTARTS AGE
metricset-sample-l-0-0-lt782 1/1 Running 0 3s
metricset-sample-w-0-0-4s5p9 1/1 Running 0 3s
```

In the above, "l" is a launcher pod, and "w" is a worker node.
If you inspect the log for the launcher you'll see a short sleep (the network isn't up immediately)
and then LAMMPS running, and the log is printed to the console.

```bash
kubectl logs metricset-sample-l-0-0-lt782 -f
```

There is purposefully a sleep infinity at the end to give you a chance to copy over data.

```bash
mkdir -p ./data ./output
# Only if you are interested in the ML data
kubectl cp metricset-sample-m-0-0-xfg6r:/dlio/data ./data/
kubectl cp metricset-sample-m-0-0-xfg6r:/dlio/output ./output
```

You can open the tiny file in [https://ui.perfetto.dev/](https://ui.perfetto.dev/).

![img/trace.png](img/trace.png)

Other applications of interest might be related to AI/ML - we will try more soon!
Cleanup when you are done.

```bash
kubectl delete -f metrics.yaml
```
Binary file added examples/tests/dlio-ml/img/trace.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
37 changes: 37 additions & 0 deletions examples/tests/dlio-ml/metrics.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
labels:
app.kubernetes.io/name: metricset
app.kubernetes.io/instance: metricset-sample
name: metricset-sample
spec:
# kubectl apply -f metrics.yaml
# kubectl logs <launcher-pod> -f
pods: 1

metrics:
- name: io-ior
options:
command: mpirun --allow-run-as-root -np 10 dlio_benchmark workload=resnet50 ++workload.dataset.data_folder=/dlio/data ++workload.output.folder=/dlio/output
workdir: /dlio/data
addons:
- name: commands
options:
preBlock: |
apt-get update && apt-get install -y python3 python3-pip openmpi-bin openmpi-common libopenmpi-dev hwloc libhwloc-dev default-jre
#python3 -m pip install git+https://github.com/hariharan-devarajan/dlio-profiler.git
#python3 -m pip install git+https://github.com/argonne-lcf/dlio_benchmark.git
python3 -m pip install "dlio_benchmark[dlio_profiler] @ git+https://github.com/argonne-lcf/dlio_benchmark.git"
mkdir -p /dlio/data /dlio/output /dlio/logs
export DLIO_PROFILER_ENABLE=0
mpirun -np 10 --allow-run-as-root dlio_benchmark workload=resnet50 ++workload.dataset.data_folder=/dlio/data ++workload.output.folder=/dlio/output ++workload.workflow.generate_data=True ++workload.workflow.train=False
export DLIO_PROFILER_LOG_LEVEL=ERROR
export DLIO_PROFILER_ENABLE=1
export DLIO_PROFILER_INC_METADATA=1
cd /dlio/data
postBlock: |
gzip -d /dlio/output/.trace*.pfw.gz
cat /dlio/output/.trace*.pfw
sleep infinity