Add Helm Chart (horovod#3546)
* Add helm chart from github.com/helm/charts, commit 41312e665542de109088139ac64f695031d2bd11
* Mark docker helm chart as non-code for CI, as it is not tested

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
EnricoMi authored Jun 15, 2022
1 parent 1b3452f commit a304c81
Showing 16 changed files with 670 additions and 2 deletions.
1 change: 1 addition & 0 deletions .buildkite/get_changed_code_files.py
@@ -16,6 +16,7 @@
r'^.buildkite/get_changed_code_files.py$',
r'^.github/',
r'^docs/',
r'^docker/helm/',
r'^.*\.md',
r'^.*\.rst'
]
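For illustration, the effect of the patterns added above can be sketched in a few lines of Python (the helper name `is_non_code` is hypothetical; the actual CI script may structure this differently):

```python
import re

# Path patterns treated as non-code for CI purposes (from the diff above)
NON_CODE_PATTERNS = [
    r'^.buildkite/get_changed_code_files.py$',
    r'^.github/',
    r'^docs/',
    r'^docker/helm/',
    r'^.*\.md',
    r'^.*\.rst',
]

def is_non_code(path: str) -> bool:
    # A changed file is non-code if any pattern matches from the start of the path
    return any(re.match(pattern, path) for pattern in NON_CODE_PATTERNS)

print(is_non_code('docker/helm/Chart.yaml'))        # True
print(is_non_code('horovod/common/operations.cc'))  # False
```

With this classification, a commit touching only the Helm chart, docs, or Markdown/reST files triggers no test builds.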
1 change: 1 addition & 0 deletions .github/get-changed-code-files.py
@@ -16,6 +16,7 @@
r'^.buildkite/get_changed_code_files.py$',
r'^.github/',
r'^docs/',
r'^docker/helm/',
r'^.*\.md',
r'^.*\.rst'
]
2 changes: 1 addition & 1 deletion README.rst
@@ -297,7 +297,7 @@ See `Run Horovod <docs/running.rst>`_ for more details, including RoCE/InfiniBan

4. To run in Docker, see `Horovod in Docker <docs/docker.rst>`_.

5. To run on Kubernetes, see `Kubeflow MPI Operator <https://github.com/kubeflow/mpi-operator/>`_, `Helm Chart <https://github.com/kubernetes/charts/tree/master/stable/horovod/>`_, `FfDL <https://github.com/IBM/FfDL/tree/master/etc/examples/horovod/>`_, and `Polyaxon <https://docs.polyaxon.com/integrations/horovod/>`_.
5. To run on Kubernetes, see `Helm Chart <https://github.com/horovod/horovod/tree/master/docker/helm/>`_, `Kubeflow MPI Operator <https://github.com/kubeflow/mpi-operator/>`_, `FfDL <https://github.com/IBM/FfDL/tree/master/etc/examples/horovod/>`_, and `Polyaxon <https://docs.polyaxon.com/integrations/horovod/>`_.

6. To run on Spark, see `Horovod on Spark <docs/spark.rst>`_.

6 changes: 6 additions & 0 deletions docker/README.md
@@ -53,3 +53,9 @@ docker build \

See the [Horovod in Docker](../docs/docker.rst) documentation for guidance on running these Docker images, and
[Horovod on Ray](../docs/ray.rst) for usage with Ray.

## Running in Kubernetes

See the [Horovod Helm Chart](helm/README.md), [Kubeflow MPI Operator](https://github.com/kubeflow/mpi-operator/),
[FfDL](https://github.com/IBM/FfDL/tree/master/etc/examples/horovod/), and [Polyaxon](https://docs.polyaxon.com/integrations/horovod/)
for guidance on running these Docker images in Kubernetes.
9 changes: 9 additions & 0 deletions docker/helm/Chart.yaml
@@ -0,0 +1,9 @@
apiVersion: v1
description: A Helm chart for deploying Horovod
name: horovod
version: 1.0.3
appVersion: 0.24.3
sources:
- https://github.com/horovod/horovod
- https://github.com/horovod/horovod/blob/master/docs/docker.rst
home: https://horovod.ai
158 changes: 158 additions & 0 deletions docker/helm/README.md
@@ -0,0 +1,158 @@
# Horovod Helm Chart

## Introduction

This chart bootstraps Horovod, a distributed deep learning training framework, on a Kubernetes cluster using the Helm package manager. It deploys the Horovod workers as a StatefulSet and the Horovod driver as a Job, and discovers the host list automatically.

## Prerequisites

- Kubernetes cluster v1.8+

## Build Docker Image

You can download the [official Horovod Dockerfile](https://github.com/horovod/horovod/blob/master/docker/horovod/Dockerfile) and modify it to your requirements, e.g. to select a different CUDA, TensorFlow, or Python version.

```
# mkdir horovod-docker
# wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/docker/horovod/Dockerfile
# docker build -t horovod:latest horovod-docker
```

## Prepare ssh keys

```
# Setup ssh key
export SSH_KEY_DIR=`mktemp -d`
cd $SSH_KEY_DIR
yes | ssh-keygen -N "" -f id_rsa
```

## Create the values.yaml

To run Horovod with GPUs, create a `values.yaml` like the one below:

```
# cat << EOF > ~/values.yaml
---
ssh:
  useSecrets: true
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')
  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
worker:
  number: 2
  image:
    repository: horovod/horovod
    tag: 0.24.3
driver:
  image:
    repository: horovod/horovod
    tag: 0.24.3
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
```

In most cases the overlay network significantly impacts Horovod performance, so host networking is recommended. To run Horovod with host networking and GPUs, create a `values.yaml` like the one below:

```
# cat << EOF > ~/values.yaml
---
useHostNetwork: true
ssh:
  useSecrets: true
  port: 32222
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')
  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
worker:
  number: 2
  image:
    repository: horovod/horovod
    tag: 0.24.3
driver:
  image:
    repository: horovod/horovod
    tag: 0.24.3
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
```

> Note: the difference is that `useHostNetwork` is set to `true`, and another SSH port is used instead of the default `22`.

## Installing the Chart

To install the chart with the release name `mnist`:

```bash
$ helm install --values ~/values.yaml mnist stable/horovod
```

## Uninstalling the Chart

To uninstall/delete the `mnist` deployment:

```bash
$ helm delete mnist
```

The command removes all the Kubernetes components associated with the chart and
deletes the release.

## Upgrading an existing Release to a new major version

A major chart version change (like v1.2.3 -> v2.0.0) indicates an incompatible
breaking change that requires manual action.

### 1.0.0

This version removes the `chart` label from `spec.selector.matchLabels`, which
has been immutable since `StatefulSet apps/v1beta2`. The label had been added
inadvertently, causing any subsequent upgrade to fail. See https://github.com/helm/charts/issues/7726.

To upgrade, first delete the Horovod StatefulSet. Assuming your release is named `my-release`:

```bash
$ kubectl delete statefulsets.apps --cascade=false my-release
```

## Configuration

The following table lists the configurable parameters of the Horovod
chart and their default values.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `useHostNetwork` | Use host networking | `false` |
| `ssh.port` | SSH port | `22` |
| `ssh.useSecrets` | Whether to use secrets for SSH | `false` |
| `worker.number` | Number of workers | `5` |
| `worker.image.repository` | Worker image repository | `horovod/horovod` |
| `worker.image.pullPolicy` | Worker image `pullPolicy` | `IfNotPresent` |
| `worker.image.tag` | Worker image `tag` | `0.24.3` |
| `resources` | Pod resource requests & limits | `{}` |
| `worker.env` | Worker environment variables | `{}` |
| `driver.image.repository` | Driver image repository | `horovod/horovod` |
| `driver.image.tag` | Driver image `tag` | `0.24.3` |
| `driver.image.pullPolicy` | Driver image `pullPolicy` | `IfNotPresent` |
| `driver.args` | Driver arguments | `{}` |
| `driver.env` | Driver environment variables | `{}` |
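As an illustration of the table above, a minimal override file might look like this (the values are examples only; any parameter left unset keeps its default):

```yaml
# Hypothetical minimal override: three workers, secrets-based SSH,
# default images and ports for everything else
worker:
  number: 3
ssh:
  useSecrets: true
```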
5 changes: 5 additions & 0 deletions docker/helm/templates/NOTES.txt
@@ -0,0 +1,5 @@
1. Get the application URL by running these commands:

*** NOTE: It may take a few minutes for the statefulset to be available

*** you can watch the status of statefulset by running 'kubectl get sts --namespace {{ .Release.Namespace }} -w {{ template "horovod.fullname" . }}' ***
32 changes: 32 additions & 0 deletions docker/helm/templates/_helpers.tpl
@@ -0,0 +1,32 @@
{{/* vim: set filetype=mustache: */}}
{{/*
Expand the name of the chart.
*/}}
{{- define "horovod.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
{{- end -}}

{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "horovod.fullname" -}}
{{- if .Values.fullnameOverride -}}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- $name := default .Chart.Name .Values.nameOverride -}}
{{- if contains $name .Release.Name -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{- end -}}
{{- end -}}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "horovod.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" -}}
{{- end -}}
129 changes: 129 additions & 0 deletions docker/helm/templates/config.yaml
@@ -0,0 +1,129 @@
{{- $workerNum := .Values.worker.number -}}
{{- $name := include "horovod.fullname" . }}
{{- $slots := 1 }}
{{- if index .Values.resources "nvidia.com/gpu" }}
{{- $slots := index .Values.resources "nvidia.com/gpu" }}
{{- end }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ template "horovod.fullname" . }}
  labels:
    heritage: {{ .Release.Service | quote }}
    release: {{ .Release.Name | quote }}
    chart: {{ template "horovod.chart" . }}
    app: {{ template "horovod.fullname" . }}
data:
  hostfile.config: |
    {{ $name }}-driver slots={{ $slots }}
    {{- range $i, $none := until (int $workerNum) }}
    {{ $name }}-{{ $i }}.{{ $name }} slots={{ $slots }}
    {{- end }}
  ssh.readiness: |
    #!/bin/bash
    set -xev
    ssh localhost ls
  driver.run: |
    #!/bin/bash
    set -x
    sleep 5
    mkdir -p /root/.ssh
    rm -f /root/.ssh/config
    touch /root/.ssh/config
    if [ "$USESECRETS" == "true" ];then
      set +e
      yes | cp /etc/secret-volume/id_rsa /root/.ssh/id_rsa
      yes | cp /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
      set -e
    fi
    if [ -n "$SSHPORT" ]; then
      echo "Port $SSHPORT" > /root/.ssh/config
      sed -i "s/^Port.*/Port $SSHPORT /g" /etc/ssh/sshd_config
    fi
    echo "StrictHostKeyChecking no" >> /root/.ssh/config
    /usr/sbin/sshd
    if [ $# -eq 0 ]; then
      sleep infinity
    else
      bash -c "$*"
    fi
  driver.waitWorkerReady: |
    #!/bin/bash
    set -xev
    function updateSSHPort() {
      mkdir -p /root/.ssh
      rm -f /root/.ssh/config
      touch /root/.ssh/config
      if [ -n "$SSHPORT" ]; then
        echo "Port $SSHPORT" > /root/.ssh/config
        echo "StrictHostKeyChecking no" >> /root/.ssh/config
      fi
    }
    function runCheckSSH() {
      if [[ "$USESECRETS" == "true" ]];then
        set +e
        yes | cp /etc/secret-volume/id_rsa /root/.ssh/id_rsa
        yes | cp /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
        set -e
      fi
      for i in `cat $1 | awk '{print $(1)}'`;do
        if [[ "$i" != *"driver" ]];then
          retry 30 ssh -o ConnectTimeout=2 -q $i exit
        fi
      done
    }
    function retry()
    {
      local n=0;local try=$1
      local cmd="${@: 2}"
      [[ $# -le 1 ]] && {
        echo "Usage $0 <retry_number> <Command>";
      }
      set +e
      until [[ $n -ge $try ]]
      do
        $cmd && break || {
          echo "Command Fail.."
          ((n++))
          echo "retry $n :: [$cmd]"
          sleep 1;
        }
      done
      $cmd
      if [ $? -ne 0 ]; then
        exit 1
      fi
      set -e
    }
    updateSSHPort
    runCheckSSH $1
  worker.run: |
    #!/bin/bash
    set -x
    mkdir -p /root/.ssh
    rm -f /root/.ssh/config
    touch /root/.ssh/config
    if [[ "$USESECRETS" == "true" ]];then
      set +e
      yes | cp /etc/secret-volume/id_rsa /root/.ssh/id_rsa
      yes | cp /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
      set -e
    fi
    if [ -n "$SSHPORT" ]; then
      echo "Port $SSHPORT" > /root/.ssh/config
      sed -i "s/^Port.*/Port $SSHPORT /g" /etc/ssh/sshd_config
    fi
    echo "StrictHostKeyChecking no" >> /root/.ssh/config
    /usr/sbin/sshd -D
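For context, with two workers and one slot per host, the `hostfile.config` entry above renders to something like the following (the release name `mnist-horovod` is assumed for illustration):

```
mnist-horovod-driver slots=1
mnist-horovod-0.mnist-horovod slots=1
mnist-horovod-1.mnist-horovod slots=1
```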
19 changes: 19 additions & 0 deletions docker/helm/templates/job-service.yaml
@@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
  name: {{ template "horovod.fullname" . }}-driver
  labels:
    app: {{ template "horovod.name" . }}
    chart: {{ template "horovod.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  clusterIP: None
  ports:
  - name: ssh
    port: {{ .Values.ssh.port }}
    targetPort: {{ .Values.ssh.port }}
  selector:
    app: {{ template "horovod.name" . }}
    release: {{ .Release.Name }}
    role: driver