
ERROR:grpc._server:Exception calling application: Method not implemented! #981

Closed
devxoxo opened this issue Dec 20, 2019 · 38 comments

@devxoxo

devxoxo commented Dec 20, 2019

/kind bug

Hi, I'm having trouble using katib v1alpha3.
First, I installed Katib with the following steps:

  1. git clone https://github.com/kubeflow/katib
  2. sh katib/scripts/v1alpha3/deploy.sh

Then I tried to apply random-example.yaml (the example in katib/examples/v1alpha3):
kubectl apply -f random-example.yaml

Results:
kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
katib-controller-6c6974678d-zsnlc 1/1 Running 1 24m
katib-db-558f649dc6-8cd9t 1/1 Running 0 24m
katib-manager-5f74bdff84-4d78z 1/1 Running 0 24m
katib-ui-6568bd6b44-qbq5k 1/1 Running 0 24m
random-example-random-846dc99654-bxb8j 1/1 Running 0 23m

kubectl get trials -n kubeflow
NAME TYPE STATUS AGE
random-example-drpkvb4b Running True 23m
random-example-k7xv6ktt Running True 23m
random-example-w6jlwdp2 Running True 23m

kubectl get experiment -n kubeflow -oyaml
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha3
  kind: Experiment
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"kubeflow.org/v1alpha3","kind":"Experiment","metadata":{"annotations":{},"labels":{"controller-tools.k8s.io":"1.0"},"name":"random-example","namespace":"kubeflow"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":12,"objective":{"additionalMetricNames":["accuracy"],"goal":0.99,"objectiveMetricName":"Validation-accuracy","type":"maximize"},"parallelTrialCount":3,"parameters":[{"feasibleSpace":{"max":"0.03","min":"0.01"},"name":"--lr","parameterType":"double"},{"feasibleSpace":{"max":"5","min":"2"},"name":"--num-layers","parameterType":"int"},{"feasibleSpace":{"list":["sgd","adam","ftrl"]},"name":"--optimizer","parameterType":"categorical"}],"trialTemplate":{"goTemplate":{"rawTemplate":"apiVersion: batch/v1\nkind: Job\nmetadata:\n name: {{.Trial}}\n namespace: {{.NameSpace}}\nspec:\n template:\n spec:\n containers:\n - name: {{.Trial}}\n image: docker.io/kubeflowkatib/mxnet-mnist-example\n command:\n - "python"\n - "/mxnet/example/image-classification/train_mnist.py"\n - "--batch-size=64"\n {{- with .HyperParameters}}\n {{- range .}}\n - "{{.Name}}={{.Value}}"\n {{- end}}\n {{- end}}\n restartPolicy: Never"}}}}
    creationTimestamp: "2019-12-20T07:58:52Z"
    finalizers:
    - update-prometheus-metrics
    generation: 2
    labels:
      controller-tools.k8s.io: "1.0"
    name: random-example
    namespace: kubeflow
    resourceVersion: "11682124"
    selfLink: /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/experiments/random-example
    uid: 9005bab0-22fe-11ea-8cf0-0679676001a5
  spec:
    algorithm:
      algorithmName: random
      algorithmSettings: null
    maxFailedTrialCount: 3
    maxTrialCount: 12
    metricsCollectorSpec:
      collector:
        kind: StdOut
    objective:
      additionalMetricNames:
      - accuracy
      goal: 0.99
      objectiveMetricName: Validation-accuracy
      type: maximize
    parallelTrialCount: 3
    parameters:
    - feasibleSpace:
        max: "0.03"
        min: "0.01"
      name: --lr
      parameterType: double
    - feasibleSpace:
        max: "5"
        min: "2"
      name: --num-layers
      parameterType: int
    - feasibleSpace:
        list:
        - sgd
        - adam
        - ftrl
      name: --optimizer
      parameterType: categorical
    trialTemplate:
      goTemplate:
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            template:
              spec:
                containers:
                - name: {{.Trial}}
                  image: docker.io/kubeflowkatib/mxnet-mnist-example
                  command:
                  - "python"
                  - "/mxnet/example/image-classification/train_mnist.py"
                  - "--batch-size=64"
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                restartPolicy: Never
  status:
    conditions:
    - lastTransitionTime: "2019-12-20T07:58:52Z"
      lastUpdateTime: "2019-12-20T07:58:52Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2019-12-20T08:00:22Z"
      lastUpdateTime: "2019-12-20T08:00:22Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "True"
      type: Running
    currentOptimalTrial:
      observation:
        metrics: null
      parameterAssignments: null
    startTime: "2019-12-20T07:58:52Z"
    trials: 3
    trialsRunning: 3
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

kubectl logs -n kubeflow random-example-random-846dc99654-bxb8j
INFO:hyperopt.utils:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
INFO:hyperopt.fmin:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
ERROR:grpc._server:Exception calling application: Method not implemented!
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/grpc/_server.py", line 434, in _call_behavior
response_or_iterator = behavior(argument, context)
File "/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1alpha3/python/api_pb2_grpc.py", line 135, in ValidateAlgorithmSettings
raise NotImplementedError('Method not implemented!')
NotImplementedError: Method not implemented!

What can I do to fix it?
Thank you for your help in solving this problem.

  • Kubernetes version: (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5+icp", GitCommit:"903c3b31caddc675ce2d8bddf62aa0f875c2a3bc", GitTreeState:"clean", BuildDate:"2019-05-08T06:16:32Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5+icp", GitCommit:"903c3b31caddc675ce2d8bddf62aa0f875c2a3bc", GitTreeState:"clean", BuildDate:"2019-05-08T06:16:32Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

  • OS (e.g. from /etc/os-release): CentOS Linux release 7.7.1908 (Core)

@nrchakradhar

We are also facing the same error and have not been able to find the actual cause.
/cc @nrchakradhar

@nrchakradhar

/cc @johnugeorge @richardsliu
Can you please help? We may be making some basic mistake.

@johnugeorge
Member

Just wondering: is this a redeployment of Katib, or a fresh deployment?

@johnugeorge
Member

The suggestion algorithms use images with the latest tag (in the master branch): https://github.com/kubeflow/katib/blob/master/manifests/v1alpha3/katib-controller/katib-config.yaml
I am wondering if there are any older images with the latest tag. Can you check the SHA of the existing image in your cluster?

@nrchakradhar

We tried the random example and the TF Job example in a fresh installation of Kubeflow v0.7, and both are working fine. We are not able to reproduce the error now.

@shaowei-su
Contributor

shaowei-su commented Mar 5, 2020

We ran into exactly the same issue after upgrading from v1alpha2 to v1alpha3. Any suggestion for a fix? Is this related to a stale Docker image?

@nrchakradhar

@shaowei-su Can you check if the experiment is running in the kubeflow namespace? If yes, create a profile and run the experiment in the profile namespace.

@shaowei-su
Contributor

@nrchakradhar Yes, it's running in the kubeflow namespace; I tried to run it in a different namespace, but that requires katib-metricscollector-injection: enabled.

@nrchakradhar

Yes. You can edit the namespace with kubectl edit and add the annotation.
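For reference, a minimal sketch of what the namespace should end up looking like after that edit (the examples later in this thread apply katib-metricscollector-injection: enabled as a namespace label; the name below is a placeholder):

apiVersion: v1
kind: Namespace
metadata:
  name: <your-namespace>
  labels:
    katib-metricscollector-injection: enabled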

@andreyvelich
Member

@shaowei-su If you are using the default Kubeflow profile controller to create namespaces, it should add the katib-metricscollector-injection: enabled annotation to the created namespace by default.

If you deploy only the Katib components without Kubeflow, you can also submit Experiments in the kubeflow namespace.
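For anyone unfamiliar with profiles, a hedged sketch of a Kubeflow Profile manifest is below; the apiVersion and owner fields vary between Kubeflow releases, so verify them against your installation. The profile controller then creates and labels the matching namespace for you:

apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: my-katib-profile          # namespace that will be created (hypothetical name)
spec:
  owner:
    kind: User
    name: user@example.com        # hypothetical owner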

@shaowei-su
Contributor

We decided to roll back to v1alpha2 at this point. Thanks for the help though! @nrchakradhar @andreyvelich

@andrewlarimer

I am experiencing this issue with v1alpha3 in both the kubeflow namespace and a personal profile namespace.

@andreyvelich
Member

@andrewlarimer Which Katib image versions are you using?

@kunalyogenshah

@nrchakradhar: I tried creating a new namespace to run the experiment, but I'm still running into:

INFO:hyperopt.utils:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
INFO:hyperopt.fmin:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
ERROR:grpc._server:Exception calling application: Method not implemented!
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/grpc/_server.py", line 434, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1alpha3/python/api_pb2_grpc.py", line 135, in ValidateAlgorithmSettings
    raise NotImplementedError('Method not implemented!')
NotImplementedError: Method not implemented!

I installed Katib using the manifests repo instructions.

@kunalyogenshah

kunalyogenshah commented Mar 15, 2020

To be more precise, the higher-level problem is that when I kick off the random-example experiment, the trial jobs finish running, but the trials are stuck at MetricsUnavailable in the trial-controller with the message Metrics are not available for Job random-example-dh7tw95g, so the experiment never ends.

@nrchakradhar

@kunalyogenshah As mentioned above, I hope you are creating profiles instead of plain namespaces.
Is the metrics sidecar getting added to your trial pods?
You can also check the katib manager logs for any clues.

@kunalyogenshah

kunalyogenshah commented Mar 16, 2020

@nrchakradhar: Not sure what you mean by profile... This is how the kubeflow namespace looks:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Namespace","metadata":{"annotations":{},"labels":{"katib-metricscollector-injection":"enabled"},"name":"kubeflow"}}
  creationTimestamp: "2020-03-14T23:02:25Z"
  labels:
    katib-metricscollector-injection: enabled
  name: kubeflow
  resourceVersion: "90742282"
  selfLink: /api/v1/namespaces/kubeflow
  uid: dec5f5b4-6647-11ea-ae47-12818b53d7c7
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

The running components in this namespace are :

NAME                                    READY   STATUS    RESTARTS   AGE
pod/katib-controller-68cdfb7856-z55mb   1/1     Running   1          27h
pod/katib-db-manager-7568b44bbb-9p52q   1/1     Running   0          27h
pod/katib-mysql-7f45b96999-t7dqj        1/1     Running   0          27h
pod/katib-ui-778d5b7479-hkfb8           1/1     Running   0          27h
pod/pytorch-operator-5bcb87c97f-blcv9   1/1     Running   0          26h
pod/tf-job-operator-d79b446c5-gf5kr     1/1     Running   0          26h

NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
service/katib-controller   ClusterIP   100.68.172.201   <none>        443/TCP,8080/TCP   27h
service/katib-db-manager   ClusterIP   100.68.13.0      <none>        6789/TCP           27h
service/katib-mysql        ClusterIP   100.68.191.234   <none>        3306/TCP           27h
service/katib-ui           ClusterIP   100.68.47.186    <none>        80/TCP             27h
service/pytorch-operator   ClusterIP   100.68.9.25      <none>        8443/TCP           26h
service/tf-job-operator    ClusterIP   100.68.138.222   <none>        8443/TCP           26h

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/katib-controller   1/1     1            1           27h
deployment.apps/katib-db-manager   1/1     1            1           27h
deployment.apps/katib-mysql        1/1     1            1           27h
deployment.apps/katib-ui           1/1     1            1           27h
deployment.apps/pytorch-operator   1/1     1            1           26h
deployment.apps/tf-job-operator    1/1     1            1           26h

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/katib-controller-68cdfb7856   1         1         1       27h
replicaset.apps/katib-db-manager-7568b44bbb   1         1         1       27h
replicaset.apps/katib-mysql-7f45b96999        1         1         1       27h
replicaset.apps/katib-ui-778d5b7479           1         1         1       27h
replicaset.apps/pytorch-operator-5bcb87c97f   1         1         1       26h
replicaset.apps/tf-job-operator-d79b446c5     1         1         1       26h

When you say katib manager logs, I presume you mean the controller? The created pods do not have a sidecar in them. The logs from the controller have some info messages and a few errors like this one:
{"level":"error","ts":1584229199.2961588,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"experiment-controller","request":"katib-jobs/random-example","error":"Operation cannot be fulfilled on experiments.kubeflow.org \"random-example\": the object has been modified; please apply your changes to the latest version and try again"

There are some other info logs about "level":"info","ts":1584229199.2966504,"logger":"provider-job","msg":"NestedFieldCopy","err":"status cannot be found in job" and "level":"info","ts":1584229199.4742026,"logger":"trial-controller","msg":"Creating Job","Trial":"katib-jobs/random-example-dh7tw95g","kind":"Job","name":"random-example-dh7tw95g"

PS: The alternate namespace I tried using looks like this:

apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2020-03-14T23:36:06Z"
  labels:
    katib-metricscollector-injection: enabled
  name: katib-jobs
  resourceVersion: "90754713"
  selfLink: /api/v1/namespaces/katib-jobs
  uid: 93860ce2-664c-11ea-97d3-0ef4b6e8fa69
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

@andreyvelich
Member

@kunalyogenshah From the information above, I can see that you deployed Katib without the other Kubeflow components, which means without the profile controller. In that case, you can submit Katib jobs in any namespace, including the kubeflow namespace.
The NotImplementedError: Method not implemented! error is fine. It just means the random Suggestion does not implement the validation call (ValidateAlgorithmSettings).

Can you describe one of your Trials, please? Are you using the latest version of the Katib images?

@kunalyogenshah

kunalyogenshah commented Mar 16, 2020

Thanks @andreyvelich. This is one of the trials:

Name:         random-example-pqdtthf6
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
              experiment=random-example
Annotations:  <none>
API Version:  kubeflow.org/v1alpha3
Kind:         Trial
Metadata:
  Creation Timestamp:  2020-03-16T14:42:59Z
  Finalizers:
    clean-metrics-in-db
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v1alpha3
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Experiment
    Name:                  random-example
    UID:                   5638b8ee-6794-11ea-97d3-0ef4b6e8fa69
  Resource Version:        91609828
  Self Link:               /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/trials/random-example-pqdtthf6
  UID:                     6e3c97c3-6794-11ea-97d3-0ef4b6e8fa69
Spec:
  Metrics Collector:
  Objective:
    Additional Metric Names:
      Train-accuracy
    Goal:                   0.99
    Objective Metric Name:  Validation-accuracy
    Type:                   maximize
  Parameter Assignments:
    Name:    --lr
    Value:   0.028367224838114588
    Name:    --num-layers
    Value:   4
    Name:    --optimizer
    Value:   ftrl
  Run Spec:  apiVersion: batch/v1
kind: Job
metadata:
  name: random-example-pqdtthf6
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: random-example-pqdtthf6
        image: docker.io/kubeflowkatib/mxnet-mnist
        imagePullPolicy: Always
        command:
        - "python3"
        - "/opt/mxnet-mnist/mnist.py"
        - "--batch-size=64"
        - "--lr=0.028367224838114588"
        - "--num-layers=4"
        - "--optimizer=ftrl"
      restartPolicy: Never
Status:
  Conditions:
    Last Transition Time:  2020-03-16T14:42:59Z
    Last Update Time:      2020-03-16T14:42:59Z
    Message:               Trial is created
    Reason:                TrialCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-03-16T14:43:17Z
    Last Update Time:      2020-03-16T14:43:17Z
    Message:               Trial is running
    Reason:                TrialRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2020-03-16T14:43:17Z
    Last Update Time:      2020-03-16T14:43:17Z
    Message:               Metrics are not available
    Reason:                MetricsUnavailable
    Status:                False
    Type:                  Succeeded
  Start Time:              2020-03-16T14:42:59Z
Events:
  Type     Reason              Age                From              Message
  ----     ------              ----               ----              -------
  Normal   JobCreated          33s                trial-controller  Job random-example-pqdtthf6 has been created
  Normal   JobRunning          33s (x2 over 33s)  trial-controller  Job random-example-pqdtthf6 is running:
  Warning  MetricsUnavailable  15s (x2 over 15s)  trial-controller  Metrics are not available for Job random-example-pqdtthf6

The corresponding job is :

Name:           random-example-pqdtthf6
Namespace:      kubeflow
Selector:       controller-uid=6e421275-6794-11ea-97d3-0ef4b6e8fa69
Labels:         controller-uid=6e421275-6794-11ea-97d3-0ef4b6e8fa69
                job-name=random-example-pqdtthf6
Annotations:    <none>
Controlled By:  Trial/random-example-pqdtthf6
Parallelism:    1
Completions:    1
Start Time:     Mon, 16 Mar 2020 07:42:59 -0700
Completed At:   Mon, 16 Mar 2020 07:43:17 -0700
Duration:       18s
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:       controller-uid=6e421275-6794-11ea-97d3-0ef4b6e8fa69
                job-name=random-example-pqdtthf6
  Annotations:  sidecar.istio.io/inject: false
  Containers:
   random-example-pqdtthf6:
    Image:      docker.io/kubeflowkatib/mxnet-mnist
    Port:       <none>
    Host Port:  <none>
    Command:
      python3
      /opt/mxnet-mnist/mnist.py
      --batch-size=64
      --lr=0.028367224838114588
      --num-layers=4
      --optimizer=ftrl
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  83s   job-controller  Created pod: random-example-pqdtthf6-4r4l4

and the nested pod :

Name:           random-example-pqdtthf6-4r4l4
Namespace:      kubeflow
Priority:       0
Node:           ip-172-21-244-90.ec2.internal/172.21.244.90
Start Time:     Mon, 16 Mar 2020 07:42:59 -0700
Labels:         controller-uid=6e421275-6794-11ea-97d3-0ef4b6e8fa69
                job-name=random-example-pqdtthf6
Annotations:    cni.projectcalico.org/podIP: 100.64.21.16/32
                sidecar.istio.io/inject: false
Status:         Succeeded
IP:             100.64.21.16
IPs:            <none>
Controlled By:  Job/random-example-pqdtthf6
Containers:
  random-example-pqdtthf6:
    Container ID:  docker://de0d3b77bedc833c7a3effa291208407a0cd056e80c727b5dd1852538b151c40
    Image:         docker.io/kubeflowkatib/mxnet-mnist
    Image ID:      docker-pullable://kubeflowkatib/mxnet-mnist@sha256:85e62e489033dd327e5db7322b636db15b7fe6b380c5846093926c66afb39d8a
    Port:          <none>
    Host Port:     <none>
    Command:
      python3
      /opt/mxnet-mnist/mnist.py
      --batch-size=64
      --lr=0.028367224838114588
      --num-layers=4
      --optimizer=ftrl
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 16 Mar 2020 07:43:00 -0700
      Finished:     Mon, 16 Mar 2020 07:43:17 -0700
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nt2th (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-nt2th:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nt2th
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From                                    Message
  ----     ------            ----                 ----                                    -------
  Normal   Scheduled         2m3s                 default-scheduler                       Successfully assigned kubeflow/random-example-pqdtthf6-4r4l4 to ip-172-21-244-90.ec2.internal
  Normal   Pulling           2m2s                 kubelet, ip-172-21-244-90.ec2.internal  pulling image "docker.io/kubeflowkatib/mxnet-mnist"
  Normal   Pulled            2m2s                 kubelet, ip-172-21-244-90.ec2.internal  Successfully pulled image "docker.io/kubeflowkatib/mxnet-mnist"
  Normal   Created           2m2s                 kubelet, ip-172-21-244-90.ec2.internal  Created container
  Normal   Started           2m2s                 kubelet, ip-172-21-244-90.ec2.internal  Started container
  Warning  DNSConfigForming  104s (x6 over 2m3s)  kubelet, ip-172-21-244-90.ec2.internal  Search Line limits were exceeded, some search paths have been omitted, the applied search line is: kubeflow.svc.cluster.local svc.cluster.local cluster.local

@kunalyogenshah

Oh, and the image versions are v0.8.0

@nrchakradhar

@kunalyogenshah I suspect you have Istio installed. If so, please set istio-injection to false:

trialTemplate:
  goTemplate:
    rawTemplate: |-
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: {{.Trial}}
        namespace: {{.NameSpace}}
        annotations:
          sidecar.istio.io/inject: "false"

Also, your controller logs do not indicate that the metrics sidecar is being injected.
I can see the following in our test setup:

{"level":"info","ts":1584376334.2574615,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-pslfp945"}

Can you try upgrading to the 1.0 images, if feasible?

@andreyvelich
Member

@kunalyogenshah Thank you for this information.
As @nrchakradhar noticed, the metrics collector mutating webhook didn't work properly, and your Trial and training pod don't have the metrics collector container.

Try deleting everything: the Katib CRDs and also the validating and mutating webhooks.
You can delete the webhooks by running:
kubectl delete MutatingWebhookConfiguration katib-mutating-webhook-config
kubectl delete ValidatingWebhookConfiguration katib-validating-webhook-config

After that, deploy Katib again with the latest images from the manifests folder. On startup, the Katib controller will recreate the webhooks.
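A hedged recap of that cleanup as commands, assuming the usual v1alpha3 resource and CRD names (verify with kubectl get crd | grep kubeflow.org before deleting anything):

kubectl delete experiments --all -n kubeflow
kubectl delete crd experiments.kubeflow.org trials.kubeflow.org suggestions.kubeflow.org
kubectl delete MutatingWebhookConfiguration katib-mutating-webhook-config
kubectl delete ValidatingWebhookConfiguration katib-validating-webhook-config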

@kunalyogenshah

kunalyogenshah commented Mar 16, 2020

Thank you so much @andreyvelich, @nrchakradhar :
I created a fresh install, and then those entries showed up in my logs :

{"level":"info","ts":1584390104.491147,"logger":"trial-controller","msg":"Creating Job","Trial":"kubeflow/random-example-jls977hp","kind":"Job","name":"random-example-jls977hp"}
{"level":"info","ts":1584390104.4973269,"logger":"provider-job","msg":"NestedFieldCopy","err":"status cannot be found in job"}
{"level":"info","ts":1584390104.5008252,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-jls977hp"}
{"level":"info","ts":1584390104.5029964,"logger":"trial-controller","msg":"Creating Job","Trial":"kubeflow/random-example-vk7km9vh","kind":"Job","name":"random-example-vk7km9vh"}
{"level":"info","ts":1584390104.5067983,"logger":"provider-job","msg":"NestedFieldCopy","err":"status cannot be found in job"}
{"level":"info","ts":1584390104.5083642,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-vk7km9vh"}
{"level":"info","ts":1584390104.6708071,"logger":"trial-controller","msg":"Creating Job","Trial":"kubeflow/random-example-slh4f89c","kind":"Job","name":"random-example-slh4f89c"}
{"level":"info","ts":1584390104.6747541,"logger":"provider-job","msg":"NestedFieldCopy","err":"status cannot be found in job"}
{"level":"info","ts":1584390104.676306,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-slh4f89c"}
{"level":"info","ts":1584390114.5079634,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-jls977hp"}
{"level":"info","ts":1584390114.5140254,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-vk7km9vh"}
{"level":"info","ts":1584390114.6812024,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-slh4f89c"}
{"level":"info","ts":1584390134.5161057,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-jls977hp"}
{"level":"info","ts":1584390134.5202854,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-vk7km9vh"}
{"level":"info","ts":1584390134.6859908,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-slh4f89c"}
{"level":"info","ts":1584390174.5252905,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-jls977hp"}
{"level":"info","ts":1584390174.5253968,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-vk7km9vh"}
{"level":"info","ts":1584390174.6906996,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-slh4f89c"}
{"level":"info","ts":1584390254.5345287,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-jls977hp"}
{"level":"info","ts":1584390254.5345314,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-vk7km9vh"}
{"level":"info","ts":1584390254.6964347,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod":"","Trial":"random-example-slh4f89c"}

As you can see, it loops trying to inject the metrics sidecar, but it seems to be failing, with these logs in the Job describe output:

Warning  FailedCreate  2m31s  job-controller  Error creating: pods "random-example-jls977hp-gq92c" is forbidden: spec.containers[1].imagePullPolicy: Unsupported value: "IfNotPresent": supported values: "Always"

Does this point to an issue with my cluster rather than Katib?

@andreyvelich
Member

andreyvelich commented Mar 16, 2020

@kunalyogenshah Did you delete all Experiments before reinstalling Katib? Maybe your cluster doesn't support this imagePullPolicy for the metrics collector container. In the Katib config (https://github.com/kubeflow/katib/blob/master/manifests/v1alpha3/katib-controller/katib-config.yaml) you can specify "imagePullPolicy": "Always" for it, like in this suggestion entry: https://github.com/kubeflow/katib/blob/master/manifests/v1alpha3/katib-controller/katib-config.yaml#L43
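For illustration, a hedged sketch of that change in the katib-config ConfigMap; the data key and collector kind below follow the v1alpha3 layout as an assumption, and the image value is a placeholder for whatever your manifests already use:

apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  metrics-collector-sidecar: |-
    {
      "StdOut": {
        "image": "<your-metrics-collector-image>",
        "imagePullPolicy": "Always"
      }
    }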

@kunalyogenshah

Bingo! That was it. Thank you @andreyvelich .

@kunalyogenshah

kunalyogenshah commented Mar 17, 2020

Before I close the door on this conversation though, there is one last question. Every time an experiment is created, it creates a suggestion deployment (and its ReplicaSet), which is not torn down when the experiment succeeds. Is this expected? It would mean an ever-growing pool of deployments as I create more experiments. cc @andreyvelich

NAME                                                              READY   STATUS    RESTARTS   AGE
bayesianoptimization-example-bayesianoptimization-68c444cdbzqx6   1/1     Running   0          6m32s
bayesianoptimization-example2-bayesianoptimization-585d5d57npbw   1/1     Running   0          3m11s
katib-controller-854d787c7c-2hq6j                                 1/1     Running   1          5h19m
katib-db-manager-5df457875f-fj2dq                                 1/1     Running   1          5h19m
katib-mysql-7f45b96999-629cp                                      1/1     Running   0          5h19m
katib-ui-675c5b5cb6-9zg8r                                         1/1     Running   0          5h19m
NAME                            STATUS      AGE
bayesianoptimization-example    Succeeded   6m
bayesianoptimization-example2   Succeeded   3m

@andreyvelich
Member

@kunalyogenshah Yes, the current Katib implementation works like this; it was done to support resuming experiments.
We have issues #1062 and #1061 related to this, for users who don't want an always-on Suggestion deployment but still want to resume an experiment.

@kunalyogenshah

@andreyvelich: I see. So is the resume-experiment feature coming soon? If we are not looking to resume experiments, we would end up doing manual deployment cleanups after completion for now. Our use case would be a few hundred experiments a day, which would leave a considerable number of dead resources.

@andreyvelich
Member

@kunalyogenshah Yes, this feature will come in the next releases.
For now, if you don't want to waste resources, you need to delete Experiments after they finish and after you have analysed the results.
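For example, assuming the experiment names shown above, something like:

kubectl delete experiment bayesianoptimization-example -n kubeflow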

@kunalyogenshah

Got it, thanks @andreyvelich . Just to be clear, is it ok if we delete just the suggestion deployment and leave the experiment as is? Or will that cause issues and we need to delete both?

@andreyvelich
Member

@kunalyogenshah Right now I don't think you can delete only the Suggestion deployment; the Katib controller will deploy it again unless you delete the Experiment.

@kunalyogenshah

Thanks for being patient with my questions @andreyvelich. Greatly appreciated. I'll throw in one more (hopefully the last)...

I have created a Katib deployment with a custom database service endpoint. I initialized the controller container with the following env variables:

- name: KATIB_DB_MANAGER_PORT_6789_TCP_ADDR
  value: <svc name>.<NS>
- name: KATIB_CORE_NAMESPACE
  value: NS

However, the injected sidecar containers are created with the -s katib-db-manager.<NS>:6789 argument, i.e. the default svc name. I looked at the code, and it seems to use katibmanagerv1alpha3.GetDBManagerAddr() to populate this. So why doesn't it use the custom svc name I set up? Am I doing something wrong here?

@kunalyogenshah

Oh, it seems I found the cause. It expects the length of the namespace to be zero, otherwise it won't use the custom arguments. But if I set the namespace to '', that will break the other components, right?

@andreyvelich
Member

andreyvelich commented Mar 17, 2020

@kunalyogenshah You are welcome :)

Interesting question. @johnugeorge @gaocegege Why here:

can a user specify custom KATIB_DB_MANAGER_PORT_6789_TCP_ADDR and KATIB_DB_MANAGER_PORT_6789_TCP_PORT only when KATIB_CORE_NAMESPACE is empty?

We use KATIB_CORE_NAMESPACE in more places than that one, and it is not correct to specify it as an empty string.

@johnugeorge
Member

I think this is legacy code that doesn't seem correct. We should use a separate DB namespace env variable instead of the KATIB_CORE_NAMESPACE env variable.

@johnugeorge
Member

Closing this issue, as the fix was merged in #1102.

@johnugeorge
Member

/close

@k8s-ci-robot

@johnugeorge: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
