diff --git a/README.md b/README.md index 7a761cf7282..11b71a999b2 100644 --- a/README.md +++ b/README.md @@ -3,43 +3,61 @@
-[![Build Status](https://travis-ci.org/kubeflow/katib.svg?branch=master)](https://travis-ci.org/kubeflow/katib) +[![Build Status](https://travis-ci.com/kubeflow/katib.svg?branch=master)](https://travis-ci.com/kubeflow/katib) [![Coverage Status](https://coveralls.io/repos/github/kubeflow/katib/badge.svg?branch=master)](https://coveralls.io/github/kubeflow/katib?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/katib)](https://goreportcard.com/report/github.com/kubeflow/katib) - -Katib is a Kubernetes-based system for [Hyperparameter Tuning][1] and [Neural Architecture Search][2]. Katib supports a number of ML frameworks, including TensorFlow, Apache MXNet, PyTorch, XGBoost, and others. - -Table of Contents -================= - - * [Getting Started](#getting-started) - * [Name](#name) - * [Concepts in Katib](#concepts-in-katib) - * [Experiment](#experiment) - * [Suggestion](#suggestion) - * [Trial](#trial) - * [Worker Job](#worker-job) - * [Components in Katib](#components-in-katib) - * [Web UI](#web-ui) - * [API documentation](#api-documentation) - * [Installation](#installation) - * [TF operator](#tf-operator) - * [PyTorch operator](#pytorch-operator) - * [Katib](#katib) - * [Running examples](#running-examples) - * [Cleanups](#cleanups) - * [Katib SDK](#katib-sdk) - * [Quick Start](#quick-start) - * [Who are using Katib?](#who-are-using-katib) - * [Citation](#citation) - * [CONTRIBUTING](#contributing) - -Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc) +[![Releases](https://img.shields.io/github/release-pre/kubeflow/katib.svg?sort=semver)](https://github.com/kubeflow/katib/releases) +[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://kubeflow.slack.com/archives/C018PMV53NW) + +Katib is a Kubernetes-native project for automated machine learning (AutoML). 
+Katib supports +[Hyperparameter Tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization), +[Early Stopping](https://en.wikipedia.org/wiki/Early_stopping), and +[Neural Architecture Search](https://en.wikipedia.org/wiki/Neural_architecture_search). + +Katib is agnostic to machine learning (ML) frameworks. +It can tune hyperparameters of applications written in any language of the +users’ choice and natively supports many ML frameworks, such as TensorFlow, +MXNet, PyTorch, XGBoost, and others. + + + + +# Table of Contents + +- [Getting Started](#getting-started) +- [Name](#name) +- [Concepts in Katib](#concepts-in-katib) + - [Experiment](#experiment) + - [Suggestion](#suggestion) + - [Trial](#trial) + - [Worker Job](#worker-job) + - [Search Algorithms](#search-algorithms) + - [Hyperparameter Tuning](#hyperparameter-tuning) + - [Neural Architecture Search](#neural-architecture-search) +- [Components in Katib](#components-in-katib) +- [Web UI](#web-ui) +- [GRPC API documentation](#grpc-api-documentation) +- [Installation](#installation) + - [TF operator](#tf-operator) + - [PyTorch operator](#pytorch-operator) + - [Katib](#katib) + - [Running examples](#running-examples) + - [Katib SDK](#katib-sdk) + - [Cleanups](#cleanups) +- [Quick Start](#quick-start) +- [Who are using Katib?](#who-are-using-katib) +- [CONTRIBUTING](#contributing) +- [Citation](#citation) + + + +Created by [doctoc](https://github.com/thlorenz/doctoc). ## Getting Started -See the [getting-started -guide](https://www.kubeflow.org/docs/components/hyperparameter-tuning/hyperparameter/) +Follow the +[getting-started guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/) on the Kubeflow website. ## Name @@ -48,101 +66,132 @@ Katib stands for `secretary` in Arabic.
## Concepts in Katib -For a detailed description of the concepts in Katib, hyperparameter tuning, and -neural architecture search, see the [Kubeflow -documentation](https://www.kubeflow.org/docs/components/hyperparameter-tuning/overview/). +For a detailed description of the concepts in Katib and AutoML, check the +[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/overview/). -Katib has the concepts of Experiment, Trial, Job and Suggestion. +Katib has the concepts of `Experiment`, `Suggestion`, `Trial` and `Worker Job`. ### Experiment -`Experiment` represents a single optimization run over a feasible space. +An `Experiment` represents a single optimization run over a feasible space. Each `Experiment` contains a configuration: -1. Objective: What we are trying to optimize. -2. Search Space: Constraints for configurations describing the feasible space. -3. Search Algorithm: How to find the optimal configurations. +1. **Objective**: What you want to optimize. +2. **Search Space**: Constraints for configurations describing the feasible space. +3. **Search Algorithm**: How to find the optimal configurations. -`Experiment` is defined as a CRD. See the detailed guide to [configuring and running a Katib -experiment](https://kubeflow.org/docs/components/hyperparameter-tuning/experiment/) +Katib `Experiment` is defined as a CRD. Check the detailed guide to +[configuring and running a Katib `Experiment`](https://kubeflow.org/docs/components/katib/experiment/) in the Kubeflow docs. ### Suggestion -A Suggestion is a proposed solution to the optimization problem which is one set of hyperparameter values or a list of parameter assignments. Then a `Trial` will be created to evaluate the parameter assignments. +A `Suggestion` is a set of hyperparameter values that the hyperparameter tuning +process has proposed. Katib creates a `Trial` to evaluate +the suggested set of values. -`Suggestion` is defined as a CRD. +Katib `Suggestion` is defined as a CRD. 
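As an illustration, the `Suggestion` that Katib generates for the `random-example` `Experiment` (trimmed from the walk-through in `docs/workflow-design.md`) carries the algorithm name, the number of requested hyperparameter sets, and the proposed parameter assignments:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Suggestion
metadata:
  name: random-example
  namespace: kubeflow
spec:
  algorithm:
    algorithmName: random
  # how many sets of hyperparameter values the Experiment has asked for
  requests: 12
status:
  suggestions:
    # one proposed set of hyperparameter values; Katib creates a Trial for each
    - name: random-example-2fpnqfv8
      parameterAssignments:
        - name: lr
          value: "0.021135228357807213"
        - name: num-layers
          value: "4"
        - name: optimizer
          value: sgd
```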
### Trial -A `Trial` is one iteration of the optimization process, which is one `worker job` instance with a list of parameter assignments(corresponding to a suggestion). +A `Trial` is one iteration of the hyperparameter tuning process. +A `Trial` corresponds to one worker job instance with a list of parameter +assignments. The list of parameter assignments corresponds to a `Suggestion`. + +Each `Experiment` runs several `Trials`. The `Experiment` runs the `Trials` until +it reaches either the objective or the configured maximum number of `Trials`. + +Katib `Trial` is defined as a CRD. + +### Worker Job + +The `Worker Job` is the process that runs to evaluate a `Trial` and calculate +its objective value. + +The `Worker Job` can be any type of Kubernetes resource or +[Kubernetes CRD](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/). +Follow the [`Trial` template guide](https://www.kubeflow.org/docs/components/katib/trial-template/#custom-resource) +to support your own Kubernetes resource in Katib. + +Out of the box, Katib supports these job kinds: + +- [Kubernetes `Job`](https://kubernetes.io/docs/concepts/workloads/controllers/job/) -`Trial` is defined as a CRD. +- [Kubeflow `TFJob`](https://www.kubeflow.org/docs/components/training/tftraining/) -### Worker Job +- [Kubeflow `PyTorchJob`](https://www.kubeflow.org/docs/components/training/pytorch/) -A `Worker Job` refers to a process responsible for evaluating a `Trial` and calculating its objective value. +- [Kubeflow `MPIJob`](https://www.kubeflow.org/docs/components/training/mpi/) -The worker kind can be [Kubernetes Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) which is a non distributed execution, [Kubeflow TFJob](https://www.kubeflow.org/docs/guides/components/tftraining/) or [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/guides/components/pytorch/) which are distributed executions.
-Thus, Katib supports multiple frameworks with the help of different job kinds. +- [Tekton `Pipeline`](https://github.com/tektoncd/pipeline) -Currently Katib supports the following exploration algorithms: +Thus, Katib supports multiple frameworks with the help of different job kinds. + +### Search Algorithms + +Katib currently supports several search algorithms. Follow the +[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail) +to learn more about each algorithm. #### Hyperparameter Tuning -* [Random Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Random_search) -* [Tree of Parzen Estimators (TPE)](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf) -* [Grid Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search) -* [Hyperband](https://arxiv.org/pdf/1603.06560.pdf) -* [Bayesian Optimization](https://arxiv.org/pdf/1012.2599.pdf) -* [CMA Evolution Strategy](https://arxiv.org/abs/1604.00772) +- [Random Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Random_search) +- [Tree of Parzen Estimators (TPE)](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf) +- [Grid Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search) +- [Hyperband](https://arxiv.org/pdf/1603.06560.pdf) +- [Bayesian Optimization](https://arxiv.org/pdf/1012.2599.pdf) +- [Covariance Matrix Adaptation Evolution Strategy (CMA-ES)](https://arxiv.org/abs/1604.00772) #### Neural Architecture Search -* [Efficient Neural Architecture Search (ENAS)](https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1beta1/nas/enas) -* [Differentiable Architecture Search (DARTS)](https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1beta1/nas/darts) - +- [Efficient Neural Architecture Search (ENAS)](https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1beta1/nas/enas) +- [Differentiable Architecture
Search (DARTS)](https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1beta1/nas/darts) ## Components in Katib -Katib consists of several components as shown below. Each component is running on k8s as a deployment. -Each component communicates with others via GRPC and the API is defined at `pkg/apis/manager/v1beta1/api.proto` -for v1beta1 version and `pkg/apis/manager/v1alpha3/api.proto` for v1alpha3 version. +Katib consists of several components as shown below. Each component runs +on Kubernetes as a deployment. The components communicate with each other via gRPC, +and the API is defined in `pkg/apis/manager/v1beta1/api.proto`. - Katib main components: - - katib-db-manager: GRPC API server of Katib which is the DB Interface. - - katib-mysql: Data storage backend of Katib using mysql. - - katib-ui: User interface of Katib. - - katib-controller: Controller for Katib CRDs in Kubernetes. + - `katib-db-manager` - the gRPC API server of Katib, which serves as the DB interface. + - `katib-mysql` - the data storage backend of Katib using MySQL. + - `katib-ui` - the user interface of Katib. + - `katib-controller` - the controller for the Katib CRDs in Kubernetes. ## Web UI Katib provides a Web UI. -You can visualize general trend of Hyper parameter space and each training history. You can use -[random-example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/random-example.yaml) or -[other examples](https://github.com/kubeflow/katib/blob/master/examples/v1beta1) to generate a similar UI. +You can visualize the general trend of the hyperparameter space and +each training history. You can use +[random-example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/random-example.yaml) +or +[other examples](https://github.com/kubeflow/katib/blob/master/examples/v1beta1) +to generate a similar UI. Follow the +[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-ui) +to access the Katib UI.
![katibui](./docs/images/katib-ui.png) ## GRPC API documentation -See the [Katib v1beta1 API reference docs](https://github.com/kubeflow/katib/blob/master/pkg/apis/manager/v1beta1/gen-doc/api.md). - -See the [Katib v1alpha3 API reference docs](https://www.kubeflow.org/docs/reference/katib/). +Check the [Katib v1beta1 API reference docs](https://www.kubeflow.org/docs/reference/katib/v1beta1/katib/). ## Installation -For standard installation of Katib with support for all job operators, -install Kubeflow. Current official Katib version in Kubeflow latest release is v1alpha3. -See the documentation: +For a standard installation of Katib with support for all job operators, +install Kubeflow. +Follow the documentation: -* [Kubeflow installation -guide](https://www.kubeflow.org/docs/started/getting-started/) -* [Kubeflow hyperparameter tuning -guides](https://www.kubeflow.org/docs/components/hyperparameter-tuning/). +- [Kubeflow installation guide](https://www.kubeflow.org/docs/started/getting-started/) +- [Kubeflow Katib guides](https://www.kubeflow.org/docs/components/katib/). -If you install Katib with other Kubeflow components, you can't submit Katib jobs in Kubeflow namespace. +If you install Katib with other Kubeflow components, +you can't submit Katib jobs in the Kubeflow namespace. Check the +[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-algorithm) +for more details. -Alternatively, if you want to install Katib manually with TF and PyTorch operators support, follow these steps: +Alternatively, if you want to install Katib manually with TF and PyTorch +operator support, follow these steps: Create Kubeflow namespace: @@ -166,7 +215,7 @@ For installing TF operator, run the following: cd "${MANIFESTS_DIR}/tf-training/tf-job-crds/base" kustomize build . | kubectl apply -f - cd "${MANIFESTS_DIR}/tf-training/tf-job-operator/base" -kustomize build . | kubectl apply -n kubeflow -f - +kustomize build .
| kubectl apply -f - ``` ### PyTorch operator @@ -177,54 +226,18 @@ For installing PyTorch operator, run the following: cd "${MANIFESTS_DIR}/pytorch-job/pytorch-job-crds/base" kustomize build . | kubectl apply -f - cd "${MANIFESTS_DIR}/pytorch-job/pytorch-operator/base/" -kustomize build . | kubectl apply -n kubeflow -f - +kustomize build . | kubectl apply -f - ``` ### Katib -Finally, you can install Katib. - -For v1beta1 version, run the following: +Finally, you can install Katib: ``` git clone git@github.com:kubeflow/katib.git -bash katib/scripts/v1beta1/deploy.sh +make deploy ``` -For v1alpha3 version, run the following: - -``` -cd "${MANIFESTS_DIR}/katib/katib-crds/base" -kustomize build . | kubectl apply -f - -cd "${MANIFESTS_DIR}/katib/katib-controller/base" -kustomize build . | kubectl apply -f - - -``` - -If you install Katib from Kubeflow manifest repository and you want to use Katib in a cluster that doesn't have a StorageClass for dynamic volume provisioning, you have to create persistent volume manually to bound your persistent volume claim. - -This is sample yaml file for creating a persistent volume with local storage: - -```yaml -apiVersion: v1 -kind: PersistentVolume -metadata: - name: katib-mysql - labels: - type: local - app: katib -spec: - storageClassName: katib - capacity: - storage: 10Gi - accessModes: - - ReadWriteOnce - hostPath: - path: /tmp/katib -``` - -Create this PV after deploying Katib package - Check if all components are running successfully: ``` @@ -246,7 +259,6 @@ tf-job-operator-796b4747d8-4fh82 1/1 Running 0 21m ### Running examples After deploy everything, you can run examples to verify the installation. -Examples bellow are for v1beta1 version. 
This is an example for TF operator: @@ -260,161 +272,40 @@ This is an example for PyTorch operator: kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/pytorchjob-example.yaml ``` -You can check status of experiment - -```yaml -$ kubectl describe experiment tfjob-example -n kubeflow - -Name: tfjob-example -Namespace: kubeflow -Labels: -Annotations: -API Version: kubeflow.org/v1beta1 -Kind: Experiment -Metadata: - Creation Timestamp: 2020-07-15T14:27:53Z - Finalizers: - update-prometheus-metrics - Generation: 1 - Resource Version: 100380029 - Self Link: /apis/kubeflow.org/v1beta1/namespaces/kubeflow/experiments/tfjob-example - UID: 5e3cf1f5-c6a7-11ea-90dd-42010a9a0020 -Spec: - Algorithm: - Algorithm Name: random - Max Failed Trial Count: 3 - Max Trial Count: 12 - Metrics Collector Spec: - Collector: - Kind: TensorFlowEvent - Source: - File System Path: - Kind: Directory - Path: /train - Objective: - Goal: 0.99 - Metric Strategies: - Name: accuracy_1 - Value: max - Objective Metric Name: accuracy_1 - Type: maximize - Parallel Trial Count: 3 - Parameters: - Feasible Space: - Max: 0.05 - Min: 0.01 - Name: learning_rate - Parameter Type: double - Feasible Space: - Max: 200 - Min: 100 - Name: batch_size - Parameter Type: int - Resume Policy: LongRunning - Trial Template: - Trial Parameters: - Description: Learning rate for the training model - Name: learningRate - Reference: learning_rate - Description: Batch Size - Name: batchSize - Reference: batch_size - Trial Spec: - API Version: kubeflow.org/v1 - Kind: TFJob - Spec: - Tf Replica Specs: - Worker: - Replicas: 2 - Restart Policy: OnFailure - Template: - Spec: - Containers: - Command: - python - /var/tf_mnist/mnist_with_summaries.py - --log_dir=/train/metrics - --learning_rate=${trialParameters.learningRate} - --batch_size=${trialParameters.batchSize} - Image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0 - Image Pull Policy: Always - Name: tensorflow -Status: - Completion Time: 
2020-07-15T14:30:52Z - Conditions: - Last Transition Time: 2020-07-15T14:27:53Z - Last Update Time: 2020-07-15T14:27:53Z - Message: Experiment is created - Reason: ExperimentCreated - Status: True - Type: Created - Last Transition Time: 2020-07-15T14:30:52Z - Last Update Time: 2020-07-15T14:30:52Z - Message: Experiment is running - Reason: ExperimentRunning - Status: False - Type: Running - Last Transition Time: 2020-07-15T14:30:52Z - Last Update Time: 2020-07-15T14:30:52Z - Message: Experiment has succeeded because Objective goal has reached - Reason: ExperimentGoalReached - Status: True - Type: Succeeded - Current Optimal Trial: - Best Trial Name: tfjob-example-gjxn54vl - Observation: - Metrics: - Latest: 0.966300010681 - Max: 1.0 - Min: 0.103260867298 - Name: accuracy_1 - Parameter Assignments: - Name: learning_rate - Value: 0.015945204040626416 - Name: batch_size - Value: 184 - Start Time: 2020-07-15T14:27:53Z - Succeeded Trial List: - tfjob-example-5jd8nnjg - tfjob-example-bgjfpd5t - tfjob-example-gjxn54vl - tfjob-example-vpdqxkch - tfjob-example-wvptx7gt - Trials: 5 - Trials Succeeded: 5 -Events: -``` +Check the +[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-algorithm) +to learn how to monitor your `Experiment` status. -When the spec.Status.Condition becomes ```Succeeded```, the experiment is finished. - -You can monitor your results in Katib UI. -Access Katib UI via Kubeflow dashboard if you have used standard installation or port-forward the `katib-ui` service if you have installed manually. +You can view your results in the Katib UI. +If you used the standard installation, access the Katib UI via the Kubeflow dashboard. +Otherwise, port-forward the `katib-ui` service: ``` kubectl -n kubeflow port-forward svc/katib-ui 8080:80 ``` -You can access the Katib UI using this URL: ```http://localhost:8080/katib/```. +You can access the Katib UI using this URL: `http://localhost:8080/katib/`.
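If you prefer the command line, you can also inspect results with `kubectl`; a sketch, assuming the `tfjob-example` `Experiment` above is running in the `kubeflow` namespace and [`jq`](https://stedolan.github.io/jq/download/) is installed:

```shell
# Show the Experiment's status conditions; type "Succeeded" with status "True"
# means the objective goal or the maximum Trial count was reached
kubectl -n kubeflow get experiment tfjob-example \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# List every Trial's parameter assignments and observed metrics
kubectl -n kubeflow get trials -o json \
  | jq '.items[] | {assignments: .spec.parameterAssignments, observation: .status.observation}'
```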
### Katib SDK -Katib supports Python SDK for v1beta1 and v1alpha3 version. - -* See the [Katib v1beta1 SDK documentation](https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1). +Katib supports a Python SDK: -* See the [Katib v1alpha3 SDK documentation](https://github.com/kubeflow/katib/tree/master/sdk/python/v1alpha3). +- Check the [Katib v1beta1 SDK documentation](https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1). -Run [`gen-sdk.sh`](https://github.com/kubeflow/katib/blob/master/hack/gen-python-sdk/gen-sdk.sh) to update SDK. +Run `make generate` to update the Katib SDK. ### Cleanups -To delete installed TF and PyTorch operator run `kubectl delete -f` on the respective folders. +To delete the installed TF and PyTorch operators, run `kubectl delete -f` +on the respective folders. -To delete Katib for v1beta1 version run `bash katib/scripts/v1beta1/undeploy.sh`. +To delete Katib, run `make undeploy`. ## Quick Start -Please see [Quick Start Guide](./docs/quick-start.md). +Please follow the +[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/hyperparameter/#examples) +to submit your first Katib experiment. ## Who are using Katib? @@ -422,18 +313,16 @@ Please see [ADOPTERS.md](ADOPTERS.md). ## CONTRIBUTING -Please feel free to test the system! [developer-guide.md](./docs/developer-guide.md) is a good starting point for developers. -[1]: https://en.wikipedia.org/wiki/Hyperparameter_optimization -[2]: https://en.wikipedia.org/wiki/Neural_architecture_search -[3]: https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/bcb15507f4b52991a0783013df4222240e942381.pdf +Please feel free to test the system! +[developer-guide.md](./docs/developer-guide.md) is a good starting point +for developers.
## Citation If you use Katib in a scientific publication, we would appreciate citations to the following paper: -[A Scalable and Cloud-Native Hyperparameter Tuning System](https://arxiv.org/abs/2006.02085), George *et al.*, arXiv:2006.02085, 2020. +[A Scalable and Cloud-Native Hyperparameter Tuning System](https://arxiv.org/abs/2006.02085), George _et al._, arXiv:2006.02085, 2020. Bibtex entry: diff --git a/docs/algorithm-settings.md b/docs/algorithm-settings.md deleted file mode 100644 index e2ab437247e..00000000000 --- a/docs/algorithm-settings.md +++ /dev/null @@ -1,40 +0,0 @@ -# Hyperparameter Tuning Algorithms - -Table of Contents -================= - - * [Hyperparameter Tuning Algorithms](#hyperparameter-tuning-algorithms) - * [Table of Contents](#table-of-contents) - * [Grid Search](#grid-search) - * [Chocolate](#chocolate) - * [Random Search](#random-search) - * [Hyperopt](#hyperopt) - * [TPE](#tpe) - * [Hyperopt](#hyperopt-1) - * [Bayesian Optimization](#bayesian-optimization) - * [scikit-optimize](#scikit-optimize) - * [References](#references) - -Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc) - - - - - -For information about the hyperparameter tuning algorithms and neural -architecture search implemented or integrated in Katib, see the detailed guide -to [configuring and running a Katib -experiment](https://kubeflow.org/docs/components/hyperparameter-tuning/experiment/). -For information about supported algorithms in Katib, see the [Katib configuration settings](https://kubeflow.org/docs/components/hyperparameter-tuning/katib-config/#suggestion-settings). 
diff --git a/docs/images/quickstart-trial.png b/docs/images/quickstart-trial.png deleted file mode 100644 index e763ce030ec..00000000000 Binary files a/docs/images/quickstart-trial.png and /dev/null differ diff --git a/docs/images/quickstart.png b/docs/images/quickstart.png deleted file mode 100644 index 5ecae64d65a..00000000000 Binary files a/docs/images/quickstart.png and /dev/null differ diff --git a/docs/quick-start.md b/docs/quick-start.md deleted file mode 100644 index d619cbbc442..00000000000 --- a/docs/quick-start.md +++ /dev/null @@ -1,176 +0,0 @@ -# Quick Start - -Katib is a Kubernetes Native System for [Hyperparameter Tuning][1] and [Neural Architecture Search][2]. This short introduction illustrates how to use Katib to: - -- Define a hyperparameter tuning experiment. -- Evaluate it using the resources in Kubernetes. -- Get the best hyperparameter combination in all these trials. - -## Requirements - -Before you run the hyperparameter tuning experiment, you need to have: - -- A Kubernetes cluster with [installed TF operator and Katib](https://github.com/kubeflow/katib#installation) - -## Katib in Kubeflow - -See the following guides in the Kubeflow documentation: - -* [Concepts](https://www.kubeflow.org/docs/components/hyperparameter-tuning/overview/) - in Katib, hyperparameter tuning, and neural architecture search. -* [Getting started with Katib](https://kubeflow.org/docs/components/hyperparameter-tuning/hyperparameter/). -* Detailed guide to [configuring and running a Katib - experiment](https://kubeflow.org/docs/components/hyperparameter-tuning/experiment/). - -## Hyperparameter Tuning on MNIST - -Katib supports multiple [Machine Learning Frameworks](https://en.wikipedia.org/wiki/Comparison_of_deep-learning_software) (e.g. TensorFlow, PyTorch, MXNet, and XGBoost). - -In this quick start guide, we demonstrate how to use TensorFlow in Katib, which is one of the most popular framework among the world, to run a hyperparameter tuning job on MNIST. 
- -### Package Training Code - -The first thing we need to do is to package the training code to a docker image. We use the [example code](https://github.com/kubeflow/tf-operator/blob/master/examples/v1/mnist_with_summaries/mnist_with_summaries.py), which builds a simple neural network, to train on MNIST. The code trains the network and outputs the TFEvents to `/tmp` by default. - -You can use our prebuilt image `gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0`. Thus we can skip it. - -### Create the Experiment - -If you want to use Katib to automatically tune hyperparameters, you need to define the `Experiment`, which is a CRD that represents a single optimization run over a feasible space. Each `Experiment` contains: - -1. Configuration about parallelism: The configuration about the parallelism. -1. Objective: The metric that we want to optimize. -1. Search space: The name and the distribution (discrete valued or continuous valued) of all the hyperparameters you need to search. -1. Search algorithm: The algorithm (e.g. Random Search, Grid Search, TPE, Bayesian Optimization) used to find the best hyperparameters. -1. Trial Template: The template used to define the trial. -1. Metrics Collection: Definition about how to collect the metrics (e.g. accuracy, loss). - -The `Experiment`'s definition is defined here: - -
- Click here to get YAML configuration - -```yaml -apiVersion: "kubeflow.org/v1beta1" -kind: Experiment -metadata: - namespace: kubeflow - name: tfjob-example -spec: - parallelTrialCount: 3 - maxTrialCount: 12 - maxFailedTrialCount: 3 - objective: - type: maximize - goal: 0.99 - objectiveMetricName: accuracy_1 - algorithm: - algorithmName: random - metricsCollectorSpec: - source: - fileSystemPath: - path: /train - kind: Directory - collector: - kind: TensorFlowEvent - parameters: - - name: learning_rate - parameterType: double - feasibleSpace: - min: "0.01" - max: "0.05" - - name: batch_size - parameterType: int - feasibleSpace: - min: "100" - max: "200" - trialTemplate: - trialParameters: - - name: learningRate - description: Learning rate for the training model - reference: learning_rate - - name: batchSize - description: Batch Size - reference: batch_size - trialSpec: - apiVersion: "kubeflow.org/v1" - kind: TFJob - spec: - tfReplicaSpecs: - Worker: - replicas: 2 - restartPolicy: OnFailure - template: - spec: - containers: - - name: tensorflow - image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0 - imagePullPolicy: Always - command: - - "python" - - "/var/tf_mnist/mnist_with_summaries.py" - - "--log_dir=/train/metrics" - - "--learning_rate=${trialParameters.learningRate}" - - "--batch_size=${trialParameters.batchSize}" - -``` - -The experiment has two hyperparameters defined in `parameters`: `learning_rate` and `batch_size`. We decide to use random search algorithm, and collect metrics from the TF Events. - -
- -Or you could just run: - -```bash -kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/tfjob-example.yaml -``` - -### Get trial results - -You can get the trial results using the command (Need to install [`jq`](https://stedolan.github.io/jq/download/) to parse JSON): - -```bash -kubectl -n kubeflow get trials -o json | jq ".items[] | {assignments: .spec.parameterAssignments, observation: .status.observation}" -``` - -You should get the output: - -```json -... -{ - "assignments": [ - { - "name": "learning_rate", - "value": "0.01156268890324629" - }, - { - "name": "batch_size", - "value": "196" - } - ], - "observation": { - "metrics": [ - { - "latest": "0.968200027943", - "max": "1.0", - "min": "0.0714285746217", - "name": "accuracy_1" - } - ] - } -} -... -``` - -Or you could get the result in UI: `/katib/#/katib/hp_monitor/kubeflow/tfjob-example`. - -![](./images/quickstart.png) - -When you click the trial name, you should get the details about metrics: - -![](./images/quickstart-trial.png) - - - -[1]: https://en.wikipedia.org/wiki/Hyperparameter_optimization -[2]: https://en.wikipedia.org/wiki/Neural_architecture_search \ No newline at end of file diff --git a/docs/user-guide.md b/docs/user-guide.md deleted file mode 100644 index 4f2ef3e38d1..00000000000 --- a/docs/user-guide.md +++ /dev/null @@ -1,3 +0,0 @@ -See the detailed guide to [configuring and running a Katib -experiment](https://kubeflow.org/docs/components/hyperparameter-tuning/experiment/) -in the Kubeflow docs. 
diff --git a/docs/workflow-design.md b/docs/workflow-design.md index 89075f4be13..d9e6194f2d9 100644 --- a/docs/workflow-design.md +++ b/docs/workflow-design.md @@ -1,17 +1,28 @@ # How Katib v1beta1 tunes hyperparameter automatically in a Kubernetes native way -See the following guides in the Kubeflow documentation: +Follow the Kubeflow documentation guides: -* [Concepts](https://www.kubeflow.org/docs/components/hyperparameter-tuning/overview/) +- [Concepts](https://www.kubeflow.org/docs/components/katib/overview/) in Katib, hyperparameter tuning, and neural architecture search. -* [Getting started with Katib](https://kubeflow.org/docs/components/hyperparameter-tuning/hyperparameter/). -* Detailed guide to - [configuring and running a Katib - experiment](https://kubeflow.org/docs/components/hyperparameter-tuning/experiment/). +- [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/). +- Detailed guide to + [configuring and running a Katib `Experiment`](https://kubeflow.org/docs/components/katib/experiment/). ## Example and Illustration -After install Katib v1beta1, you can run `kubectl apply -f katib/examples/v1beta1/random-example.yaml` to try the first example of Katib. -Then you can get the new `Experiment` as below. Katib concepts will be introduced based on this example. +After installing Katib v1beta1, you can run +`kubectl apply -f katib/examples/v1beta1/random-example.yaml` to try the first +example of Katib. + +### Experiment + +When you want to tune hyperparameters for your machine learning model before +training it further, you just need to create an `Experiment` CR. To +learn what fields are included in the `Experiment.spec`, follow +the detailed guide to +[configuring and running a Katib `Experiment`](https://kubeflow.org/docs/components/katib/experiment/). +Then you can get the new `Experiment` as below. +Katib concepts are introduced based on this example.
```yaml $ kubectl get experiment random-example -n kubeflow -o yaml @@ -63,6 +74,9 @@ spec: parameterType: categorical resumePolicy: LongRunning trialTemplate: + failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")# + primaryContainerName: training-container + successCondition: status.conditions.#(type=="Complete")#|#(status=="True")# trialParameters: - description: Learning rate for the training model name: learningRate @@ -87,48 +101,180 @@ spec: - --lr=${trialParameters.learningRate} - --num-layers=${trialParameters.numberLayers} - --optimizer=${trialParameters.optimizer} - image: docker.io/kubeflowkatib/mxnet-mnist + image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90 name: training-container restartPolicy: Never status: - ... + completionTime: "2020-11-16T20:13:02Z" + conditions: + - lastTransitionTime: "2020-11-16T20:00:15Z" + lastUpdateTime: "2020-11-16T20:00:15Z" + message: Experiment is created + reason: ExperimentCreated + status: "True" + type: Created + - lastTransitionTime: "2020-11-16T20:13:02Z" + lastUpdateTime: "2020-11-16T20:13:02Z" + message: Experiment is running + reason: ExperimentRunning + status: "False" + type: Running + - lastTransitionTime: "2020-11-16T20:13:02Z" + lastUpdateTime: "2020-11-16T20:13:02Z" + message: Experiment has succeeded because max trial count has reached + reason: ExperimentMaxTrialsReached + status: "True" + type: Succeeded + currentOptimalTrial: + bestTrialName: random-example-gnz5nccf + observation: + metrics: + - latest: "0.979299" + max: "0.979299" + min: "0.955115" + name: Validation-accuracy + - latest: "0.993503" + max: "0.993503" + min: "0.912413" + name: Train-accuracy + parameterAssignments: + - name: lr + value: "0.01874909352953323" + - name: num-layers + value: "5" + - name: optimizer + value: sgd + startTime: "2020-11-16T20:00:15Z" + succeededTrialList: + - random-example-2fpnqfv8 + - random-example-2s9vfb9s + - random-example-5hxm45x4 + - random-example-8xmpj4gv + - 
random-example-b6gnl4cs + - random-example-ftm2v84q + - random-example-gnz5nccf + - random-example-p74tn9gk + - random-example-q6jrlshx + - random-example-tkk46c4x + - random-example-w5qgblgk + - random-example-xcnrpx4x + trials: 12 + trialsSucceeded: 12 ``` -#### Experiment -When you want to tune hyperparameters for your machine learning model before -training it further, you just need to create an `Experiment` CR like above. To -learn what fields are included in the `Experiment.spec`, see -the detailed guide to [configuring and running a Katib -experiment](https://kubeflow.org/docs/components/hyperparameter-tuning/experiment/). +### Suggestion -#### Trial +Katib internally creates a `Suggestion` CR for each `Experiment` CR. The +`Suggestion` CR specifies the hyperparameter tuning algorithm in the +`algorithmName` field and how many sets of hyperparameters Katib should +generate in the `requests` field. The `Suggestion` also tracks all generated +hyperparameter sets in `status.suggestions`. The `Suggestion` CR is used for +internal logic control and the end user can usually ignore it. -For each set of hyperparameters, Katib will internally generate a `Trial` CR with the hyperparameters key-value pairs, job manifest string with parameters instantiated and some other fields like below. `Trial` CR is used for internal logic control, and end user can even ignore it. +```yaml +$ kubectl get suggestion random-example -n kubeflow -o yaml + +apiVersion: kubeflow.org/v1beta1 +kind: Suggestion +metadata: + ... + name: random-example + namespace: kubeflow + ownerReferences: + - apiVersion: kubeflow.org/v1beta1 + blockOwnerDeletion: true + controller: true + kind: Experiment + name: random-example + uid: 302e79ae-8659-4679-9e2d-461209619883 + ...
+spec: + algorithm: + algorithmName: random + requests: 12 + resumePolicy: LongRunning +status: + conditions: + - lastTransitionTime: "2020-11-16T20:00:15Z" + lastUpdateTime: "2020-11-16T20:00:15Z" + message: Suggestion is created + reason: SuggestionCreated + status: "True" + type: Created + - lastTransitionTime: "2020-11-16T20:00:36Z" + lastUpdateTime: "2020-11-16T20:00:36Z" + message: Deployment is ready + reason: DeploymentReady + status: "True" + type: DeploymentReady + - lastTransitionTime: "2020-11-16T20:00:38Z" + lastUpdateTime: "2020-11-16T20:00:38Z" + message: Suggestion is running + reason: SuggestionRunning + status: "True" + type: Running + startTime: "2020-11-16T20:00:15Z" + suggestionCount: 12 + suggestions: + ... + - name: random-example-2fpnqfv8 + parameterAssignments: + - name: lr + value: "0.021135228357807213" + - name: num-layers + value: "4" + - name: optimizer + value: sgd + - name: random-example-xcnrpx4x + parameterAssignments: + - name: lr + value: "0.02414696373094622" + - name: num-layers + value: "3" + - name: optimizer + value: adam + - name: random-example-8xmpj4gv + parameterAssignments: + - name: lr + value: "0.02471053882990492" + - name: num-layers + value: "4" + - name: optimizer + value: sgd + ... +``` + +### Trial + +For each set of hyperparameters, Katib internally generates a `Trial` CR +with the hyperparameter key-value pairs, a `Worker Job` run specification with +the parameters instantiated, and some other fields like below. The `Trial` CR +is used for internal logic control and the end user can usually ignore it.
```yaml $ kubectl get trial -n kubeflow NAME TYPE STATUS AGE -random-example-58tbx6xc Succeeded True 14m -random-example-5nkb2gz2 Succeeded True 21m -random-example-88bdbkzr Succeeded True 20m -random-example-9tgjl9nt Succeeded True 17m -random-example-dqzjb2r9 Succeeded True 19m -random-example-gjfdgxxn Succeeded True 20m -random-example-nhrx8tb8 Succeeded True 15m -random-example-nkv76z8z Succeeded True 18m -random-example-pcnmzl76 Succeeded True 21m -random-example-spmk57dw Succeeded True 14m -random-example-tvxz667x Succeeded True 16m -random-example-xpw8wnjc Succeeded True 21m - -$ kubectl get trial random-example-gjfdgxxn -o yaml -n kubeflow +random-example-2fpnqfv8 Succeeded True 10m +random-example-2s9vfb9s Succeeded True 8m15s +random-example-5hxm45x4 Succeeded True 17m +random-example-8xmpj4gv Succeeded True 8m44s +random-example-b6gnl4cs Succeeded True 12m +random-example-ftm2v84q Succeeded True 17m +random-example-gnz5nccf Succeeded True 14m +random-example-p74tn9gk Succeeded True 11m +random-example-q6jrlshx Succeeded True 17m +random-example-tkk46c4x Succeeded True 12m +random-example-w5qgblgk Succeeded True 12m +random-example-xcnrpx4x Succeeded True 10m + +$ kubectl get trial random-example-2fpnqfv8 -o yaml -n kubeflow apiVersion: kubeflow.org/v1beta1 kind: Trial metadata: ... - name: random-example-gjfdgxxn + name: random-example-2fpnqfv8 namespace: kubeflow ownerReferences: - apiVersion: kubeflow.org/v1beta1 @@ -136,9 +282,10 @@ metadata: controller: true kind: Experiment name: random-example - uid: 34349cb7-c6af-11ea-90dd-42010a9a0020 + uid: 302e79ae-8659-4679-9e2d-461209619883 ... 
spec: + failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")# metricsCollector: collector: kind: StdOut @@ -155,16 +302,17 @@ spec: type: maximize parameterAssignments: - name: lr - value: "0.012171302435678337" + value: "0.021135228357807213" - name: num-layers - value: "3" + value: "4" - name: optimizer - value: adam + value: sgd + primaryContainerName: training-container runSpec: apiVersion: batch/v1 kind: Job metadata: - name: random-example-gjfdgxxn + name: random-example-2fpnqfv8 namespace: kubeflow spec: template: @@ -174,117 +322,95 @@ spec: - python3 - /opt/mxnet-mnist/mnist.py - --batch-size=64 - - --lr=0.012171302435678337 - - --num-layers=3 - - --optimizer=adam - image: docker.io/kubeflowkatib/mxnet-mnist + - --lr=0.021135228357807213 + - --num-layers=4 + - --optimizer=sgd + image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90 name: training-container restartPolicy: Never + successCondition: status.conditions.#(type=="Complete")#|#(status=="True")# status: - completionTime: "2020-07-15T15:29:00Z" + completionTime: "2020-11-16T20:09:33Z" conditions: - - lastTransitionTime: "2020-07-15T15:25:16Z" - lastUpdateTime: "2020-07-15T15:25:16Z" + - lastTransitionTime: "2020-11-16T20:07:48Z" + lastUpdateTime: "2020-11-16T20:07:48Z" message: Trial is created reason: TrialCreated status: "True" type: Created - - lastTransitionTime: "2020-07-15T15:29:00Z" - lastUpdateTime: "2020-07-15T15:29:00Z" + - lastTransitionTime: "2020-11-16T20:09:33Z" + lastUpdateTime: "2020-11-16T20:09:33Z" message: Trial is running reason: TrialRunning status: "False" type: Running - - lastTransitionTime: "2020-07-15T15:29:00Z" - lastUpdateTime: "2020-07-15T15:29:00Z" + - lastTransitionTime: "2020-11-16T20:09:33Z" + lastUpdateTime: "2020-11-16T20:09:33Z" message: Trial has succeeded reason: TrialSucceeded status: "True" type: Succeeded observation: metrics: - - latest: "0.959594" - max: "0.960490" - min: "0.940585" + - latest: "0.977309" + max: "0.978105" + min: 
"0.958002" name: Validation-accuracy - - latest: "0.959022" - max: "0.959188" - min: "0.921658" + - latest: "0.993820" + max: "0.993820" + min: "0.916611" name: Train-accuracy - startTime: "2020-07-15T15:25:16Z" + startTime: "2020-11-16T20:07:48Z" ``` -#### Suggestion +## What happens after an `Experiment` CR is created -Katib will internally create a `Suggestion` CR for each `Experiment` CR. `Suggestion` CR includes the hyperparameter algorithm name by `algorithmName` field and how many sets of hyperparameter Katib asks to be generated by `requests` field. The CR also traces all already generated sets of hyperparameter in `status.suggestions`. Same as `Trial`, `Suggestion` CR is used for internal logic control and end user can even ignore it. +When a user creates an `Experiment` CR, the Katib `Experiment` controller, +`Suggestion` controller, and `Trial` controller work together to tune +hyperparameters for the user's machine learning model. The `Experiment` +workflow looks as follows: -```yaml -$ kubectl get suggestion random-example -n kubeflow -o yaml - -apiVersion: kubeflow.org/v1beta1 -kind: Suggestion -metadata: - ... - name: random-example - namespace: kubeflow - ownerReferences: - - apiVersion: kubeflow.org/v1beta1 - blockOwnerDeletion: true - controller: true - kind: Experiment - name: random-example - uid: 34349cb7-c6af-11ea-90dd-42010a9a0020 - ... -spec: - algorithmName: random - requests: 12 -status: - suggestionCount: 12 - suggestions: - ...
- - name: random-example-gjfdgxxn - parameterAssignments: - - name: lr - value: "0.012171302435678337" - - name: num-layers - value: "3" - - name: optimizer - value: adam - - name: random-example-88bdbkzr - parameterAssignments: - - name: lr - value: "0.013408352284328112" - - name: num-layers - value: "4" - - name: optimizer - value: ftrl - - name: random-example-dqzjb2r9 - parameterAssignments: - - name: lr - value: "0.028873905258692753" - - name: num-layers - value: "3" - - name: optimizer - value: adam - ... -``` - -## What happens after an `Experiment` CR created - -When a user created an `Experiment` CR, Katib controllers including experiment controller, trial controller and suggestion controller will work together to achieve hyperparameters tuning for user Machine learning model.
(Image: Katib workflow diagram)
-1. A `Experiment` CR is submitted to Kubernetes API server, Katib experiment mutating and validating webhook will be called to set default value for the `Experiment` CR and validate the CR separately. -2. Experiment controller creates a `Suggestion` CR. -3. Suggestion controller creates the algorithm deployment and service based on the new `Suggestion` CR. -4. When Suggestion controller verifies that the algorithm service is ready, it calls the service to generate `spec.request - len(status.suggestions)` sets of hyperparamters and append them into `status.suggestions` -5. Experiment controller finds that `Suggestion` CR had been updated, then generate each `Trial` for each new hyperparamters set. -6. Trial controller generates job based on `trialSpec` manifest with the new hyperparamters set. -7. Related job controller (Kubernetes batch Job, Kubeflow PyTorchJob or Kubeflow TFJob) generates Pods. -8. Katib Pod mutating webhook is called to inject metrics collector sidecar container to the candidate Pod. -9. During the ML model container runs, metrics collector container in the same Pod tries to collect metrics from it and persists them into Katib DB backend. -10. When the ML model Job ends, Trial controller will update status of the corresponding `Trial` CR. -11. When a `Trial` CR goes to end, Experiment controller will increase `request` field of corresponding -`Suggestion` CR if it is needed, then everything goes to `step 4` again. Of course, if `Trial` CRs meet one of `end` condition (exceeds `maxTrialCount`, `maxFailedTrialCount` or `goal`), Experiment controller will take everything done. +1. The `Experiment` CR is submitted to the Kubernetes API server. The Katib + `Experiment` mutating and validating webhooks are called to set default + values for the `Experiment` CR and to validate it. + +1. The `Experiment` controller creates the `Suggestion` CR. + +1. 
The `Suggestion` controller creates the algorithm deployment and service + based on the new `Suggestion` CR. + +1. When the `Suggestion` controller verifies that the algorithm service is + ready, it calls the service to generate + `spec.requests - len(status.suggestions)` sets of hyperparameters and + appends them to `status.suggestions`. + +1. The `Experiment` controller finds that the `Suggestion` CR has been + updated and generates a `Trial` for each new hyperparameter set. + +1. The `Trial` controller generates a `Worker Job` based on the `runSpec` + from the `Trial` CR with the new hyperparameter set. + +1. The related job controller + (Kubernetes batch Job, Kubeflow TFJob, Tekton Pipeline, etc.) generates + Kubernetes Pods. + +1. The Katib Pod mutating webhook is called to inject the metrics collector + sidecar container into the candidate Pods. + +1. While the ML model container runs, the metrics collector container + collects metrics from the injected Pod and persists them to the Katib + DB backend. + +1. When the ML model training ends, the `Trial` controller updates the + status of the corresponding `Trial` CR. + +1. When a `Trial` CR finishes, the `Experiment` controller increases the + `requests` field of the corresponding `Suggestion` CR if needed, + and the flow returns to step 4. + If the `Trial` CRs meet one of the end conditions + (exceeding `maxTrialCount` or `maxFailedTrialCount`, or reaching the + `goal`), the `Experiment` controller completes the `Experiment`.
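+The request/generate/trial loop in steps 4, 5, and 11 can be sketched as a small simulation. This is a hypothetical Python sketch, not Katib's actual controller code (the real controllers are written in Go and reconcile through the Kubernetes API); `generate_suggestions` stands in for the algorithm service call, and all names and the +1 `requests` increment are illustrative simplifications.

```python
import random

def generate_suggestions(count):
    """Stand-in for the algorithm service call in step 4 (random search)."""
    return [
        {"lr": random.uniform(0.01, 0.03),
         "num-layers": random.choice(["2", "3", "4", "5"]),
         "optimizer": random.choice(["sgd", "adam", "ftrl"])}
        for _ in range(count)
    ]

def reconcile(experiment):
    suggestion = experiment["suggestion"]
    # Step 4: ask the algorithm service only for the missing sets:
    # spec.requests - len(status.suggestions).
    missing = suggestion["requests"] - len(suggestion["suggestions"])
    if missing > 0:
        suggestion["suggestions"].extend(generate_suggestions(missing))
    # Step 5: create one Trial per newly generated hyperparameter set.
    while len(experiment["trials"]) < len(suggestion["suggestions"]):
        params = suggestion["suggestions"][len(experiment["trials"])]
        experiment["trials"].append({"params": params, "status": "Succeeded"})
    # Step 11: finish once maxTrialCount is reached, otherwise ask for more.
    # (Real Katib grows `requests` in batches sized by parallelTrialCount.)
    if len(experiment["trials"]) >= experiment["maxTrialCount"]:
        experiment["status"] = "Succeeded"
    else:
        suggestion["requests"] += 1

experiment = {
    "maxTrialCount": 12,
    "suggestion": {"requests": 3, "suggestions": []},
    "trials": [],
    "status": "Running",
}
while experiment["status"] == "Running":
    reconcile(experiment)

print(experiment["status"], len(experiment["trials"]))  # Succeeded 12
```

The key invariant mirrors the random-example above: the loop stops with 12 succeeded trials because `maxTrialCount` is reached, which corresponds to the `ExperimentMaxTrialsReached` condition in the `Experiment` status.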