forked from horovod/horovod
Commit
* Add helm chart from github.com/helm/charts, commit 41312e665542de109088139ac64f695031d2bd11
* Mark docker helm chart as non-code for CI, as it is not tested

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Showing 16 changed files with 670 additions and 2 deletions.
@@ -0,0 +1,9 @@
apiVersion: v1
description: A Helm chart for deploying Horovod
name: horovod
version: 1.0.3
appVersion: 0.24.3
sources:
- https://github.com/horovod/horovod
- https://github.com/horovod/horovod/blob/master/docs/docker.rst
home: https://horovod.ai
@@ -0,0 +1,158 @@
# Horovod Helm Chart

## Introduction

This chart bootstraps Horovod, a distributed TensorFlow framework, on a Kubernetes cluster using the Helm package manager. It deploys the Horovod workers as a StatefulSet and the Horovod driver as a Job, and then discovers the host list automatically.

## Prerequisites

- Kubernetes cluster v1.8+

## Build Docker Image

You can download the [official Horovod Dockerfile](https://github.com/horovod/horovod/blob/master/docker/horovod/Dockerfile) and modify it to suit your requirements, e.g. to select a different CUDA, TensorFlow or Python version.

```
# mkdir horovod-docker
# wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/docker/horovod/Dockerfile
# docker build -t horovod:latest horovod-docker
```

## Prepare ssh keys

```
# Set up an ssh key pair without a passphrase in a temporary directory
export SSH_KEY_DIR=$(mktemp -d)
cd "$SSH_KEY_DIR"
yes | ssh-keygen -N "" -f id_rsa
```
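
Before the keys are embedded into `values.yaml`, it can help to confirm that both files were actually written. The following is a minimal sketch; `check_key_dir` is a hypothetical helper introduced here for illustration and is not part of the chart.

```bash
#!/usr/bin/env bash
# check_key_dir DIR
# Succeeds only if DIR holds a non-empty id_rsa and id_rsa.pub.
# Hypothetical helper for illustration; the chart does not ship it.
check_key_dir() {
  local dir="$1" f
  for f in id_rsa id_rsa.pub; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      return 1
    fi
  done
  echo "ssh key pair found in $dir"
}
```

After running the key generation step above, it would be called as `check_key_dir "$SSH_KEY_DIR"`.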

## Create the values.yaml

To run Horovod with GPUs, create a `values.yaml` like the one below:

```
# cat << EOF > ~/values.yaml
---
ssh:
  useSecrets: true
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')
  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
worker:
  number: 2
  image:
    repository: horovod/horovod
    tag: 0.24.3
driver:
  image:
    repository: horovod/horovod
    tag: 0.24.3
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
```
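
The `-np 3` in the `mpirun` arguments matches the host file that the chart's ConfigMap template generates: one entry for the driver plus one per worker, each with a `slots` count. The sketch below reproduces that layout in plain shell for the example values. The full name `mnist-horovod` assumes the release is called `mnist`, and `render_hostfile` and `total_processes` are hypothetical helpers, not part of the chart.

```bash
#!/usr/bin/env bash
# render_hostfile FULLNAME WORKERS SLOTS
# Prints the host list in the layout of the chart's hostfile.config:
# the driver first, then one line per worker pod.
render_hostfile() {
  local name="$1" workers="$2" slots="$3" i
  echo "${name}-driver slots=${slots}"
  for ((i = 0; i < workers; i++)); do
    echo "${name}-${i}.${name} slots=${slots}"
  done
}

# total_processes WORKERS SLOTS
# Number of MPI processes if every slot is used: (driver + workers) * slots.
total_processes() {
  local workers="$1" slots="$2"
  echo $(( (workers + 1) * slots ))
}

render_hostfile mnist-horovod 2 1
echo "np=$(total_processes 2 1)"
# Prints:
#   mnist-horovod-driver slots=1
#   mnist-horovod-0.mnist-horovod slots=1
#   mnist-horovod-1.mnist-horovod slots=1
#   np=3
```

If `worker.number` is changed, the `-np` value in `driver.args` would need to be adjusted to match.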

In most cases the overlay network significantly degrades Horovod performance, so we recommend using the host network instead. To run Horovod with the host network and GPUs, create a `values.yaml` like the one below:

```
# cat << EOF > ~/values.yaml
---
useHostNetwork: true
ssh:
  useSecrets: true
  port: 32222
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')
  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
worker:
  number: 2
  image:
    repository: horovod/horovod
    tag: 0.24.3
driver:
  image:
    repository: horovod/horovod
    tag: 0.24.3
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
```
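
The chart's start-up scripts read the port from an `SSHPORT` environment variable (presumably populated from `ssh.port`) and write it into the SSH client configuration. Below is a minimal sketch of that step, writing to a temporary directory rather than `/root/.ssh`; `write_ssh_config` is a hypothetical name used only for this illustration.

```bash
#!/usr/bin/env bash
# write_ssh_config DIR PORT
# Mirrors what the chart's scripts do with $SSHPORT: write an ssh client
# config that sets the port (if given) and disables host key checking.
write_ssh_config() {
  local dir="$1" port="$2"
  mkdir -p "$dir"
  : > "$dir/config"                      # start from an empty file
  if [ -n "$port" ]; then
    echo "Port $port" > "$dir/config"
  fi
  echo "StrictHostKeyChecking no" >> "$dir/config"
}

demo_dir=$(mktemp -d)
write_ssh_config "$demo_dir" 32222
cat "$demo_dir/config"
# Prints:
#   Port 32222
#   StrictHostKeyChecking no
```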

> Note: compared with the previous example, `useHostNetwork` is set to `true` and `ssh.port` is set to a port other than `22`, since on the host network port 22 is usually already taken by the node's own SSH daemon.

## Installing the Chart

To install the chart with the release name `mnist`:

```bash
$ helm install --values ~/values.yaml mnist stable/horovod
```

## Uninstalling the Chart

To uninstall/delete the `mnist` deployment:

```bash
$ helm delete mnist
```

The command removes all the Kubernetes components associated with the chart and
deletes the release.

## Upgrading an existing Release to a new major version

A major chart version change (like v1.2.3 -> v2.0.0) indicates an incompatible,
breaking change that requires manual action.

### 1.0.0

This version removes the `chart` label from `spec.selector.matchLabels`,
which has been immutable since `StatefulSet apps/v1beta2`. The label had been added
inadvertently, causing any subsequent upgrade to fail. See https://github.com/helm/charts/issues/7726.

To upgrade, first delete the Horovod StatefulSet. Assuming your release is named `my-release`:

```bash
$ kubectl delete statefulsets.apps --cascade=false my-release
```

## Configuration

The following table lists the configurable parameters of the Horovod
chart and their default values.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `useHostNetwork` | Whether to use the host network | `false` |
| `ssh.port` | The ssh port | `22` |
| `ssh.useSecrets` | Whether to use Secrets for the ssh keys | `false` |
| `worker.number` | Number of workers | `5` |
| `worker.image.repository` | Horovod worker image | `horovod/horovod` |
| `worker.image.pullPolicy` | `pullPolicy` for the worker image | `IfNotPresent` |
| `worker.image.tag` | `tag` for the worker image | `0.24.3` |
| `resources` | Pod resource requests and limits | `{}` |
| `worker.env` | Worker environment variables | `{}` |
| `driver.image.repository` | Horovod driver image | `horovod/horovod` |
| `driver.image.tag` | `tag` for the driver image | `0.24.3` |
| `driver.image.pullPolicy` | `pullPolicy` for the driver image | `IfNotPresent` |
| `driver.args` | Driver arguments | `{}` |
| `driver.env` | Driver environment variables | `{}` |
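
Individual parameters from the table can also be overridden on the command line with Helm's `--set` flag instead of editing `values.yaml`. The sketch below only assembles and prints such a command. The release name `mnist` and the chart reference `stable/horovod` are carried over from the install example, and the chosen values are purely illustrative.

```bash
#!/usr/bin/env bash
# Build a `helm install` command that overrides a few chart parameters.
helm_cmd=(helm install
  --values "$HOME/values.yaml"
  --set worker.number=4
  --set ssh.port=32222
  --set useHostNetwork=true
  mnist stable/horovod)

# Print the command for review; run it with "${helm_cmd[@]}" once it looks right.
printf '%s ' "${helm_cmd[@]}"
echo
```

Note that `driver.args` carries its own `-np` value, so changing `worker.number` this way usually means adjusting the `mpirun` arguments as well.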
@@ -0,0 +1,5 @@
1. Get the application URL by running these commands:

*** NOTE: It may take a few minutes for the statefulset to be available

*** you can watch the status of statefulset by running 'kubectl get sts --namespace {{ .Release.Namespace }} -w {{ template "horovod.fullname" . }}' ***
@@ -0,0 +1,32 @@
{{/* vim: set filetype=mustache: */}}
{{/*
Expand the name of the chart.
*/}}
{{- define "horovod.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
{{- end -}}

{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "horovod.fullname" -}}
{{- if .Values.fullnameOverride -}}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- $name := default .Chart.Name .Values.nameOverride -}}
{{- if contains $name .Release.Name -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{- end -}}
{{- end -}}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "horovod.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" -}}
{{- end -}}
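
To make the naming rule in `horovod.fullname` concrete, here is a rough shell rendition of its default branch. It is only a sketch: it ignores `fullnameOverride` and `nameOverride`, takes the chart name `horovod` as a default argument, and the function name `fullname` is made up for this illustration.

```bash
#!/usr/bin/env bash
# fullname RELEASE [CHART]
# Approximates the default branch of the "horovod.fullname" template.
fullname() {
  local release="$1" chart="${2:-horovod}" name
  if [[ "$release" == *"$chart"* ]]; then
    name="$release"               # release name already contains the chart name
  else
    name="${release}-${chart}"    # otherwise join them with a dash
  fi
  name="${name:0:63}"             # trunc 63
  echo "${name%-}"                # trimSuffix "-": drop one trailing dash, if any
}

fullname mnist        # prints: mnist-horovod
fullname my-horovod   # prints: my-horovod
```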
@@ -0,0 +1,129 @@
{{- $workerNum := .Values.worker.number -}}
{{- $name := include "horovod.fullname" . }}
{{- $slots := 1 }}
{{- /* Use the GPU limit, if set, as the slot count per host. Assign with "=" so the outer $slots is updated rather than shadowed. */ -}}
{{- if .Values.resources.limits }}
{{- with index .Values.resources.limits "nvidia.com/gpu" }}
{{- $slots = . }}
{{- end }}
{{- end }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ template "horovod.fullname" . }}
  labels:
    heritage: {{ .Release.Service | quote }}
    release: {{ .Release.Name | quote }}
    chart: {{ template "horovod.chart" . }}
    app: {{ template "horovod.fullname" . }}
data:
  hostfile.config: |
    {{ $name }}-driver slots={{ $slots }}
    {{- range $i, $none := until (int $workerNum) }}
    {{ $name }}-{{ $i }}.{{ $name }} slots={{ $slots }}
    {{- end }}
  ssh.readiness: |
    #!/bin/bash
    set -xev
    ssh localhost ls
  driver.run: |
    #!/bin/bash
    set -x
    sleep 5
    mkdir -p /root/.ssh
    rm -f /root/.ssh/config
    touch /root/.ssh/config
    if [ "$USESECRETS" == "true" ];then
      set +e
      yes | cp /etc/secret-volume/id_rsa /root/.ssh/id_rsa
      yes | cp /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
      set -e
    fi
    if [ -n "$SSHPORT" ]; then
      echo "Port $SSHPORT" > /root/.ssh/config
      sed -i "s/^Port.*/Port $SSHPORT /g" /etc/ssh/sshd_config
    fi
    echo "StrictHostKeyChecking no" >> /root/.ssh/config
    /usr/sbin/sshd
    if [ $# -eq 0 ]; then
      sleep infinity
    else
      bash -c "$*"
    fi
  driver.waitWorkerReady: |
    #!/bin/bash
    set -xev
    function updateSSHPort() {
      mkdir -p /root/.ssh
      rm -f /root/.ssh/config
      touch /root/.ssh/config
      if [ -n "$SSHPORT" ]; then
        echo "Port $SSHPORT" > /root/.ssh/config
        echo "StrictHostKeyChecking no" >> /root/.ssh/config
      fi
    }
    function runCheckSSH() {
      if [[ "$USESECRETS" == "true" ]];then
        set +e
        yes | cp /etc/secret-volume/id_rsa /root/.ssh/id_rsa
        yes | cp /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
        set -e
      fi
      for i in `cat $1 | awk '{print $(1)}'`;do
        if [[ "$i" != *"driver" ]];then
          retry 30 ssh -o ConnectTimeout=2 -q $i exit
        fi
      done
    }
    function retry()
    {
      local n=0;local try=$1
      local cmd="${@: 2}"
      [[ $# -le 1 ]] && {
        echo "Usage $0 <retry_number> <Command>";
      }
      set +e
      until [[ $n -ge $try ]]
      do
        $cmd && break || {
          echo "Command Fail.."
          ((n++))
          echo "retry $n :: [$cmd]"
          sleep 1;
        }
      done
      $cmd
      if [ $? -ne 0 ]; then
        exit 1
      fi
      set -e
    }
    updateSSHPort
    runCheckSSH $1
  worker.run: |
    #!/bin/bash
    set -x
    mkdir -p /root/.ssh
    rm -f /root/.ssh/config
    touch /root/.ssh/config
    if [[ "$USESECRETS" == "true" ]];then
      set +e
      yes | cp /etc/secret-volume/id_rsa /root/.ssh/id_rsa
      yes | cp /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
      set -e
    fi
    if [ -n "$SSHPORT" ]; then
      echo "Port $SSHPORT" > /root/.ssh/config
      sed -i "s/^Port.*/Port $SSHPORT /g" /etc/ssh/sshd_config
    fi
    echo "StrictHostKeyChecking no" >> /root/.ssh/config
    /usr/sbin/sshd -D
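
The `retry` helper in `driver.waitWorkerReady` polls each worker over ssh until it answers. For comparison, a more compact variant of the same idea might look like the sketch below. It is not the chart's implementation: it treats the first argument as the total number of attempts, keeps the command's arguments as separate words instead of flattening them into one string, and returns a status rather than calling `exit`.

```bash
#!/usr/bin/env bash
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, waiting one second between attempts.
# Returns 0 on success, 1 once MAX_ATTEMPTS attempts have failed.
retry() {
  local max="$1" attempt=1
  shift
  until "$@"; do
    if (( attempt >= max )); then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying: $*" >&2
    attempt=$((attempt + 1))
    sleep 1
  done
}
```

In the readiness check it would be used as `retry 30 ssh -o ConnectTimeout=2 -q "$host" exit`, where `$host` stands for one entry of the host file.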
@@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
  name: {{ template "horovod.fullname" . }}-driver
  labels:
    app: {{ template "horovod.name" . }}
    chart: {{ template "horovod.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  clusterIP: None
  ports:
  - name: ssh
    port: {{ .Values.ssh.port }}
    targetPort: {{ .Values.ssh.port }}
  selector:
    app: {{ template "horovod.name" . }}
    release: {{ .Release.Name }}
    role: driver