NAME                                READY   STATUS    RESTARTS   AGE
katib-controller-858d6cc48c-df9jc   1/1     Running   1          20m
katib-db-manager-7966fbdf9b-w2tn8   1/1     Running   0          20m
katib-mysql-7f8bc6956f-898f9        1/1     Running   0          20m
katib-ui-7cf9f967bf-nm72p           1/1     Running   0          20m
pytorch-operator-55f966b548-9gq9v   1/1     Running   0          20m
tf-job-operator-796b4747d8-4fh82    1/1     Running   0          21m

Running examples

After deploying everything, you can run the examples to verify the installation. The examples below are for the v1beta1 version.

This is an example for the TF operator:

kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/tfjob-example.yaml

This is an example for the PyTorch operator:

kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/pytorchjob-example.yaml

You can check the status of the experiment:

$ kubectl describe experiment tfjob-example -n kubeflow

Name:         tfjob-example
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1beta1
Kind:         Experiment
Metadata:
  Creation Timestamp:  2020-07-15T14:27:53Z
  Finalizers:
    update-prometheus-metrics
  Generation:        1
  Resource Version:  100380029
  Self Link:         /apis/kubeflow.org/v1beta1/namespaces/kubeflow/experiments/tfjob-example
  UID:               5e3cf1f5-c6a7-11ea-90dd-42010a9a0020
Spec:
  Algorithm:
    Algorithm Name:        random
  Max Failed Trial Count:  3
  Max Trial Count:         12
  Metrics Collector Spec:
    Collector:
      Kind:  TensorFlowEvent
    Source:
      File System Path:
        Kind:  Directory
        Path:  /train
  Objective:
    Goal:  0.99
    Metric Strategies:
      Name:                 accuracy_1
      Value:                max
    Objective Metric Name:  accuracy_1
    Type:                   maximize
  Parallel Trial Count:     3
  Parameters:
    Feasible Space:
      Max:           0.05
      Min:           0.01
    Name:            learning_rate
    Parameter Type:  double
    Feasible Space:
      Max:           200
      Min:           100
    Name:            batch_size
    Parameter Type:  int
  Resume Policy:     LongRunning
  Trial Template:
    Trial Parameters:
      Description:  Learning rate for the training model
      Name:         learningRate
      Reference:    learning_rate
      Description:  Batch Size
      Name:         batchSize
      Reference:    batch_size
    Trial Spec:
      API Version:  kubeflow.org/v1
      Kind:         TFJob
      Spec:
        Tf Replica Specs:
          Worker:
            Replicas:        2
            Restart Policy:  OnFailure
            Template:
              Spec:
                Containers:
                  Command:
                    python
                    /var/tf_mnist/mnist_with_summaries.py
                    --log_dir=/train/metrics
                    --learning_rate=${trialParameters.learningRate}
                    --batch_size=${trialParameters.batchSize}
                  Image:              gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                  Image Pull Policy:  Always
                  Name:               tensorflow
Status:
  Completion Time:  2020-07-15T14:30:52Z
  Conditions:
    Last Transition Time:  2020-07-15T14:27:53Z
    Last Update Time:      2020-07-15T14:27:53Z
    Message:               Experiment is created
    Reason:                ExperimentCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-07-15T14:30:52Z
    Last Update Time:      2020-07-15T14:30:52Z
    Message:               Experiment is running
    Reason:                ExperimentRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2020-07-15T14:30:52Z
    Last Update Time:      2020-07-15T14:30:52Z
    Message:               Experiment has succeeded because Objective goal has reached
    Reason:                ExperimentGoalReached
    Status:                True
    Type:                  Succeeded
  Current Optimal Trial:
    Best Trial Name:  tfjob-example-gjxn54vl
    Observation:
      Metrics:
        Latest:  0.966300010681
        Max:     1.0
        Min:     0.103260867298
        Name:    accuracy_1
    Parameter Assignments:
      Name:    learning_rate
      Value:   0.015945204040626416
      Name:    batch_size
      Value:   184
  Start Time:  2020-07-15T14:27:53Z
  Succeeded Trial List:
    tfjob-example-5jd8nnjg
    tfjob-example-bgjfpd5t
    tfjob-example-gjxn54vl
    tfjob-example-vpdqxkch
    tfjob-example-wvptx7gt
  Trials:            5
  Trials Succeeded:  5
Events:              <none>
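
The Current Optimal Trial block in the output above can also be read from a script. The following is a minimal sketch, not part of Katib itself: the helper name best_trial_summary is hypothetical, and the JSON field names (currentOptimalTrial, bestTrialName, parameterAssignments, observation.metrics) are assumed to be the camelCase forms of the labels printed by kubectl describe. The sample values are copied from the output above.

```python
def best_trial_summary(experiment: dict) -> dict:
    """Collect the best trial name, its parameters, and its latest metrics.

    `experiment` is assumed to be the parsed JSON of an Experiment object,
    for example from `kubectl get experiment <name> -n kubeflow -o json`.
    Field names are an assumption based on the describe output.
    """
    optimal = experiment.get("status", {}).get("currentOptimalTrial", {})
    parameters = {
        p["name"]: p["value"] for p in optimal.get("parameterAssignments", [])
    }
    metrics = {
        m["name"]: m.get("latest")
        for m in optimal.get("observation", {}).get("metrics", [])
    }
    return {
        "trial": optimal.get("bestTrialName"),
        "parameters": parameters,
        "metrics": metrics,
    }


# Sample data mirroring the describe output above (values kept as strings).
sample_experiment = {
    "status": {
        "currentOptimalTrial": {
            "bestTrialName": "tfjob-example-gjxn54vl",
            "observation": {
                "metrics": [
                    {
                        "name": "accuracy_1",
                        "latest": "0.966300010681",
                        "max": "1.0",
                        "min": "0.103260867298",
                    }
                ]
            },
            "parameterAssignments": [
                {"name": "learning_rate", "value": "0.015945204040626416"},
                {"name": "batch_size", "value": "184"},
            ],
        }
    }
}

summary = best_trial_summary(sample_experiment)
print(summary["trial"])  # tfjob-example-gjxn54vl
```

Missing fields fall back to empty values, so the helper can be called while the experiment is still running and no optimal trial has been recorded yet.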

When the Succeeded condition under status.conditions has status True, the experiment is finished.
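
This completion check can be scripted instead of read by eye. The sketch below is an illustration under assumptions: the helper names experiment_succeeded and fetch_experiment are hypothetical, and the condition fields (type, status) are assumed to match the labels shown by kubectl describe, with status stored as the string "True" or "False". Only the Succeeded condition is checked, as in the example output.

```python
import json
import subprocess


def experiment_succeeded(experiment: dict) -> bool:
    """Return True if a condition of type Succeeded has status "True"."""
    conditions = experiment.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Succeeded" and c.get("status") == "True"
        for c in conditions
    )


def fetch_experiment(name: str, namespace: str = "kubeflow") -> dict:
    """Fetch an Experiment as parsed JSON via kubectl (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "experiment", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Conditions taken from the tfjob-example describe output, in JSON form.
sample = {
    "status": {
        "conditions": [
            {"type": "Created", "status": "True", "reason": "ExperimentCreated"},
            {"type": "Running", "status": "False", "reason": "ExperimentRunning"},
            {"type": "Succeeded", "status": "True", "reason": "ExperimentGoalReached"},
        ]
    }
}

print(experiment_succeeded(sample))  # True
```

With access to a cluster, experiment_succeeded(fetch_experiment("tfjob-example")) would run the same check against the live object.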

You can monitor your results in the Katib UI. Access the Katib UI via the Kubeflow dashboard if you used the standard installation, or port-forward the katib-ui service if you installed Katib manually.

kubectl -n kubeflow port-forward svc/katib-ui 8080:80

You can access the Katib UI using this URL: http://localhost:8080/katib/.

Katib SDK

Katib provides a Python SDK for the v1beta1 and v1alpha3 versions.

See the Katib v1beta1 SDK documentation.
See the Katib v1alpha3 SDK documentation.

Run gen-sdk.sh to update the SDK.

Cleanups

To delete the installed TF and PyTorch operators, run kubectl delete -f on the respective folders.

To delete Katib for the v1beta1 version, run bash katib/scripts/v1beta1/undeploy.sh.

Quick Start

Please see Quick Start Guide.

Who is using Katib?

Please see adopters.md.

CONTRIBUTING

Please feel free to test the system! developer-guide.md is a good starting point for developers.

Citation

If you use Katib in a scientific publication, we would appreciate citations to the following paper:

A Scalable and Cloud-Native Hyperparameter Tuning System, George et al., arXiv:2006.02085, 2020.

Bibtex entry:

@misc{george2020katib,
    title={A Scalable and Cloud-Native Hyperparameter Tuning System},
    author={Johnu George and Ce Gao and Richard Liu and Hou Gang Liu and Yuan Tang and Ramdoot Pydipaty and Amit Kumar Saha},
    year={2020},
    eprint={2006.02085},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}
