add chainer to the website. #200

Merged
merged 5 commits into from Sep 12, 2018
149 changes: 149 additions & 0 deletions content/docs/guides/components/chainer.md
@@ -0,0 +1,149 @@
+++
title = "Chainer Training"
description = "Instructions for using Chainer for training"
weight = 10
toc = true
bref = "This guide will walk you through using Chainer for training"
[menu]
[menu.docs]
parent = "components"
weight = 4
+++

## What is Chainer?

[Chainer](https://chainer.org/) is a powerful, flexible and intuitive deep learning framework.

- Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort.
- Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures.
- Forward computation can include any control flow statements of Python without losing the ability of backpropagation. This makes code intuitive and easy to debug (see the sketch right after this list).
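
As a quick illustration of this define-by-run style, here is a minimal sketch (not part of the original guide; the model and sizes are made up): the network below picks its depth with an ordinary Python loop, and backpropagation follows whichever path was actually executed.

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class DynamicMLP(chainer.Chain):
    """Toy net whose depth is decided by a plain Python argument."""

    def __init__(self):
        super(DynamicMLP, self).__init__()
        with self.init_scope():
            self.l_in = L.Linear(None, 100)   # input size inferred at first call
            self.l_mid = L.Linear(100, 100)
            self.l_out = L.Linear(100, 10)

    def __call__(self, x, n_mid_layers=1):
        h = F.relu(self.l_in(x))
        for _ in range(n_mid_layers):         # ordinary Python control flow
            h = F.relu(self.l_mid(h))
        return self.l_out(h)

model = DynamicMLP()
x = np.zeros((8, 784), dtype=np.float32)      # dummy batch of 8 examples
t = np.zeros(8, dtype=np.int32)               # dummy labels
loss = F.softmax_cross_entropy(model(x, n_mid_layers=2), t)
loss.backward()   # gradients flow through the path that actually ran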

[ChainerMN](https://github.com/chainer/chainermn) is an additional package for Chainer that enables multi-node distributed deep learning with the following features:

- Scalable --- it makes full use of the latest technologies such as NVIDIA NCCL and CUDA-Aware MPI,
- Flexible --- even dynamic neural networks can be trained in parallel thanks to Chainer's flexibility, and
- Easy --- minimal changes to existing user code are required (see the sketch after this list).
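
To make that last point concrete, the sketch below (not from the original guide; the tiny model is a placeholder) shows the typical ChainerMN additions to a single-node script: create a communicator, wrap the optimizer, and shard the dataset.

```python
import chainer
import chainer.links as L
import chainermn

# One MPI process per worker, launched with e.g. `mpiexec -n 4 python train.py`.
comm = chainermn.create_communicator('naive')   # use 'pure_nccl' on GPU clusters

model = L.Classifier(L.Linear(784, 10))         # placeholder for a real model

# Wrapping the optimizer is the key change: gradients are all-reduced
# across all workers after every backward pass.
optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.Adam(), comm)
optimizer.setup(model)

# Each worker receives its own shard of the training data.
train, _ = chainer.datasets.get_mnist()
train = chainermn.scatter_dataset(train, comm, shuffle=True)
# ...the rest of the training loop (iterator, updater, trainer) is unchanged.
```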

[This blog post](https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html) presents benchmark results using up to 128 GPUs.

## Installing the Chainer Operator

If you haven't already done so, please follow the [Getting Started Guide](/docs/started/getting-started/) to deploy Kubeflow.

An **alpha** version of [Chainer](https://chainer.org/) support was introduced in Kubeflow 0.3.0, so you must be using Kubeflow 0.3.0 or later.

## Verify that Chainer support is included in your Kubeflow deployment

Check that the ChainerJob custom resource is installed:

```shell
kubectl get crd
```

The output should include `chainerjobs.kubeflow.org`:

```
NAME AGE
...
chainerjobs.kubeflow.org 4d
...
```

If it is not included, you can add it as follows:

```shell
cd ${KSONNET_APP}
ks pkg install kubeflow/chainer-job
ks generate chainer-operator chainer-operator
ks apply ${ENVIRONMENT} -c chainer-operator
```
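
After the apply completes, you can re-run the CRD check to confirm that the resource is now registered:

```shell
kubectl get crd chainerjobs.kubeflow.org
```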

## Creating a Chainer Job

You can create a Chainer job by defining a ChainerJob config file. First, create a file named `example-job-mn.yaml` like the one below:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
spec:
  backend: mpi
  master:
    mpiConfig:
      slots: 1
    activeDeadlineSeconds: 6000
    backoffLimit: 60
    template:
      spec:
        containers:
        - name: chainer
          image: everpeace/chainermn:1.3.0
          command:
          - sh
          - -c
          - |
            mpiexec -n 3 -N 1 --allow-run-as-root --display-map --mca mpi_cuda_support 0 \
            python3 /train_mnist.py -e 2 -b 1000 -u 100
  workerSets:
    ws0:
      replicas: 2
      mpiConfig:
        slots: 1
      template:
        spec:
          containers:
          - name: chainer
            image: everpeace/chainermn:1.3.0
            command:
            - sh
            - -c
            - |
              while true; do sleep 1 & wait; done
```

See [examples/chainerjob-reference.yaml](https://github.com/kubeflow/chainer-operator/blob/master/examples/chainerjob-reference.yaml) for definitions of each attribute, and adjust the config to your requirements. By default, the example runs distributed training across 3 nodes (1 master and 2 workers), which is why the command passes `-n 3` to `mpiexec`; if you change `replicas` or `slots`, keep the `mpiexec` process count in sync.

Deploy the ChainerJob resource to start training:

```shell
kubectl create -f example-job-mn.yaml
```

You should now be able to see the pods that make up the Chainer job:

```shell
kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn
```

The training runs for only 2 epochs and should finish within a few minutes, even on a CPU-only cluster. Inspect the logs to follow the training progress:

```shell
PODNAME=$(kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn,chainerjob.kubeflow.org/role=master -o name)
kubectl logs -f ${PODNAME}
```

## Monitoring a Chainer Job

Retrieve the full ChainerJob object:

```shell
kubectl get -o yaml chainerjobs example-job-mn
```

Watch the `status` section to monitor the job. Here is sample output after the job has completed successfully:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
  ...
status:
  completionTime: 2018-09-01T16:42:35Z
  conditions:
  - lastProbeTime: 2018-09-01T16:42:35Z
    lastTransitionTime: 2018-09-01T16:42:35Z
    status: "True"
    type: Complete
  startTime: 2018-09-01T16:34:04Z
  succeeded: 1
```
2 changes: 1 addition & 1 deletion themes/kf/layouts/index.html
@@ -63,7 +63,7 @@ <h4>Model serving</h4>
</div>
<div class="text">
<h4>Multi-framework</h4>
<p>Our development plans go beyond TensorFlow, and we are working hard to include PyTorch, MXNet, and more. We also integrate with Ambassador for ingress and Pachyderm for managing your data science pipelines.</p>
<p>Our development plans go beyond TensorFlow, and we are working hard to include PyTorch, MXNet, Chainer, and more. We also integrate with Ambassador for ingress and Pachyderm for managing your data science pipelines.</p>
</div>
</div>

2 changes: 1 addition & 1 deletion themes/kf/sass/src/index.html
@@ -169,7 +169,7 @@ <h4>Model serving</h4>
</div>
<div class="text">
<h4>Multi-framework</h4>
<p>Our development plans go beyond TensorFlow, and we are working hard to include PyTorch, MXNet, and more. We also integrate with Ambassador for ingress and Pachyderm for managing your data science pipelines.</p>
<p>Our development plans go beyond TensorFlow, and we are working hard to include PyTorch, MXNet, Chainer, and more. We also integrate with Ambassador for ingress and Pachyderm for managing your data science pipelines.</p>
</div>
</div>
