Refactor Kustomize manifests for Katib #1464

Merged: 18 commits merged on Mar 12, 2021
4 changes: 2 additions & 2 deletions Makefile
@@ -23,11 +23,11 @@ vet:
update:
hack/update-gofmt.sh

# Deploy Katib v1beta1 manifests into a k8s cluster
# Deploy Katib v1beta1 manifests using Kustomize into a k8s cluster.
deploy:
bash scripts/v1beta1/deploy.sh

# Undeploy Katib v1beta1 manifests from a k8s cluster
# Undeploy Katib v1beta1 manifests using Kustomize from a k8s cluster
undeploy:
bash scripts/v1beta1/undeploy.sh

5 changes: 3 additions & 2 deletions README.md
@@ -202,7 +202,7 @@ kubectl create namespace kubeflow
Clone Kubeflow manifest repository:

```
git clone git@github.com:kubeflow/manifests.git
git clone -b v1.2-branch git@github.com:kubeflow/manifests.git
Set `MANIFESTS_DIR` to the cloned folder.
export MANIFESTS_DIR=<cloned-folder>
```
@@ -231,7 +231,8 @@ kustomize build . | kubectl apply -f -

### Katib

Finally, you can install Katib:
Note that your [kustomize](https://kustomize.io/) version should be >= 3.2.
To install Katib run:

```
git clone git@github.com:kubeflow/katib.git
21 changes: 11 additions & 10 deletions docs/developer-guide.md
@@ -30,6 +30,7 @@ see the following user guides:

- [Go](https://golang.org/) (1.13 or later)
- [Docker](https://docs.docker.com/) (17.05 or later.)
- [kustomize](https://kustomize.io/) (3.2 or later)
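
The "3.2 or later" requirement for kustomize can be checked mechanically. The following is a minimal sketch, not part of this change: the helper names `parse_major_minor` and `meets_minimum` are hypothetical, and it assumes you already have the version as a string such as `3.2.0` or `v3.8.1` (how you obtain that string from your installation is left open).

```python
import re
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '3.2.0' or 'v3.8.1'."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"no version number found in {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: Tuple[int, int] = (3, 2)) -> bool:
    """Return True if the version is at least the required (major, minor)."""
    # Tuples compare element by element, so (3, 10) >= (3, 2) holds,
    # which a plain string comparison would get wrong.
    return parse_major_minor(version) >= minimum
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where `"3.10"` sorts before `"3.2"`.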

## Build from source code

@@ -65,16 +66,16 @@ make generate

Below is a list of command-line flags accepted by Katib controller:

| Name | Type | Default | Description |
| ------------------------------- | --------------------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| cert-localfs | bool | false | Store the webhook cert in local file system |
| enable-grpc-probe-in-suggestion | bool | true | Enable grpc probe in suggestions |
| experiment-suggestion-name | string | "default" | The implementation of suggestion interface in experiment controller |
| metrics-addr | string | ":8080" | The address the metric endpoint binds to |
| trial-resources | []schema.GroupVersionKind | null | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
| webhook-inject-securitycontext | bool | false | Inject the securityContext of container[0] in the sidecar |
| webhook-port | int | 8443 | The port number to be used for admission webhook server |
| webhook-service-name | string | "katib-controller" | The service name which will be used in webhook |
| Name | Type | Default | Description |
| ------------------------------- | ------------------------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------- |
| cert-localfs | bool | false | Store the webhook cert in local file system |
| enable-grpc-probe-in-suggestion | bool | true | Enable grpc probe in suggestions |
| experiment-suggestion-name | string | "default" | The implementation of suggestion interface in experiment controller |
| metrics-addr | string | ":8080" | The address the metric endpoint binds to |
| trial-resources | []schema.GroupVersionKind | null | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
| webhook-inject-securitycontext | bool | false | Inject the securityContext of container[0] in the sidecar |
| webhook-port | int | 8443 | The port number to be used for admission webhook server |
| webhook-service-name | string | "katib-controller" | The service name which will be used in webhook |
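
To illustrate the `Kind.version.group` form that `trial-resources` expects, here is a small Python sketch. It is an illustration only and not the controller's own parser (the controller is written in Go); the `GroupVersionKind` tuple and the `parse_trial_resource` function are hypothetical names introduced for this example.

```python
from typing import NamedTuple


class GroupVersionKind(NamedTuple):
    group: str
    version: str
    kind: str


def parse_trial_resource(value: str) -> GroupVersionKind:
    """Split 'Kind.version.group' into its three parts.

    Only the first two dots are separators; the group itself may
    contain dots (for example 'kubeflow.org').
    """
    parts = value.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected Kind.version.group, got {value!r}"
        )
    kind, version, group = parts
    return GroupVersionKind(group=group, version=version, kind=kind)
```

With the example from the table, `parse_trial_resource("TFJob.v1.kubeflow.org")` yields kind `TFJob`, version `v1`, and group `kubeflow.org`.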

## Workflow design

2 changes: 1 addition & 1 deletion examples/v1beta1/bayesianoptimization-example.yaml
@@ -56,7 +56,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/cmaes-example.yaml
@@ -53,7 +53,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
4 changes: 1 addition & 3 deletions examples/v1beta1/custom-metricscollector-example.yaml
@@ -66,9 +66,7 @@ spec:
spec:
containers:
- name: training-container
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/early-stopping/median-stop.yaml
@@ -53,7 +53,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
4 changes: 1 addition & 3 deletions examples/v1beta1/file-metricscollector-example.yaml
@@ -53,9 +53,7 @@ spec:
spec:
containers:
- name: training-container
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/grid-example.yaml
@@ -54,7 +54,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/hyperband-example.yaml
@@ -68,7 +68,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/metric-strategy-example.yaml
@@ -58,7 +58,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
39 changes: 9 additions & 30 deletions examples/v1beta1/mxnet-mnist/mnist.py
@@ -37,40 +37,19 @@
level=logging.DEBUG)


def read_data(label, image):
"""
download and read data into numpy
"""
base_url = 'http://yann.lecun.com/exdb/mnist/'
with gzip.open(utils.download_file(base_url+label, os.path.join('data', label))) as flbl:
magic, num = struct.unpack(">II", flbl.read(8))
label = np.fromstring(flbl.read(), dtype=np.int8)
with gzip.open(utils.download_file(base_url+image, os.path.join('data', image)), 'rb') as fimg:
magic, num, rows, cols = struct.unpack(">IIII", fimg.read(16))
image = np.fromstring(fimg.read(), dtype=np.uint8).reshape(len(label), rows, cols)
return (label, image)


def to4d(img):
def get_mnist_iter(args, kv):
"""
reshape to 4D arrays
Create data iterator with NDArrayIter
"""
return img.reshape(img.shape[0], 1, 28, 28).astype(np.float32)/255
mnist = mx.test_utils.get_mnist()

# Get MNIST data.
train_data = mx.io.NDArrayIter(
mnist['train_data'], mnist['train_label'], args.batch_size, shuffle=True)
val_data = mx.io.NDArrayIter(
mnist['test_data'], mnist['test_label'], args.batch_size)

def get_mnist_iter(args, kv):
"""
create data iterator with NDArrayIter
"""
(train_lbl, train_img) = read_data(
'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
(val_lbl, val_img) = read_data(
't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz')
train = mx.io.NDArrayIter(
to4d(train_img), train_lbl, args.batch_size, shuffle=True)
val = mx.io.NDArrayIter(
to4d(val_img), val_lbl, args.batch_size)
return (train, val)
return (train_data, val_data)


if __name__ == '__main__':
3 changes: 1 addition & 2 deletions examples/v1beta1/nas/darts-example-cpu.yaml
@@ -59,8 +59,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-e294a90
imagePullPolicy: Always
image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-c6c9172
command:
- python3
- run_trial.py
3 changes: 1 addition & 2 deletions examples/v1beta1/nas/darts-example-gpu.yaml
@@ -76,8 +76,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-e294a90
imagePullPolicy: Always
image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-c6c9172
command:
- python3
- run_trial.py
2 changes: 1 addition & 1 deletion examples/v1beta1/nas/enas-example-cpu.yaml
@@ -139,7 +139,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:v1beta1-e294a90
image: docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:v1beta1-c6c9172
command:
- python3
- -u
2 changes: 1 addition & 1 deletion examples/v1beta1/nas/enas-example-gpu.yaml
@@ -136,7 +136,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/enas-cnn-cifar10-gpu:v1beta1-e294a90
image: docker.io/kubeflowkatib/enas-cnn-cifar10-gpu:v1beta1-c6c9172
command:
- python3
- -u
28 changes: 13 additions & 15 deletions examples/v1beta1/pytorch-mnist/mnist.py
@@ -11,12 +11,6 @@
import torch.nn.functional as F
import torch.optim as optim

# To fix this issue: https://github.com/pytorch/vision/issues/1938.
from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [("User-agent", "Mozilla/5.0")]
urllib.request.install_opener(opener)

WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))


@@ -138,18 +132,22 @@ def main():
dist.init_process_group(backend=args.backend)

kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}

train_loader = torch.utils.data.DataLoader(
datasets.MNIST("../data", train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])),
datasets.FashionMNIST("./data",
train=True,
download=True,
transform=transforms.Compose([
transforms.ToTensor()
])),
batch_size=args.batch_size, shuffle=True, **kwargs)

test_loader = torch.utils.data.DataLoader(
datasets.MNIST("../data", train=False, transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])),
datasets.FashionMNIST("./data",
train=False,
transform=transforms.Compose([
transforms.ToTensor()
])),
batch_size=args.test_batch_size, shuffle=False, **kwargs)

model = Net().to(device)
8 changes: 2 additions & 6 deletions examples/v1beta1/pytorchjob-example.yaml
@@ -45,9 +45,7 @@ spec:
spec:
containers:
- name: pytorch
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
@@ -61,9 +59,7 @@ spec:
spec:
containers:
- name: pytorch
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/random-example.yaml
@@ -53,7 +53,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/resume-experiment/from-volume-resume.yaml
@@ -54,7 +54,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/resume-experiment/never-resume.yaml
@@ -56,7 +56,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/tekton/pipeline-run.yaml
@@ -88,7 +88,7 @@ spec:
description: Number of training examples
steps:
- name: model-training
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
1 change: 0 additions & 1 deletion examples/v1beta1/tfjob-example.yaml
@@ -53,7 +53,6 @@ spec:
containers:
- name: tensorflow
image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
imagePullPolicy: Always
command:
- "python"
- "/var/tf_mnist/mnist_with_summaries.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/tpe-example.yaml
@@ -53,7 +53,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/trial-metadata-substitution.yaml
@@ -59,7 +59,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
@@ -3,6 +3,7 @@ kind: Deployment
metadata:
name: katib-controller
namespace: kubeflow
# TODO (andreyvelich): Modify labels to follow k8s guidelines.
labels:
app: katib-controller
spec:
@@ -21,7 +22,6 @@ spec:
containers:
- name: katib-controller
image: docker.io/kubeflowkatib/katib-controller
imagePullPolicy: Always
command: ["./katib-controller"]
args:
- "--webhook-port=8443"