ERROR:grpc._server:Exception calling application: Method not implemented! #981
Comments
We are also facing the same error and cannot work out what the actual cause is. |
/cc @johnugeorge @richardsliu |
Is this a redeployment of Katib, or a fresh deployment? |
Suggestion algorithms use images with the latest tag (in the master branch): https://github.com/kubeflow/katib/blob/master/manifests/v1alpha3/katib-controller/katib-config.yaml |
We tried random_example and the tf job example in a fresh installation of Kubeflow v0.7, and both are working fine. We are not able to reproduce the error now. |
We ran into exactly the same issue after upgrading from v1alpha2 to v1alpha3. Any suggestions for a fix? Is this related to a stale Docker image? |
@shaowei-su can you check if the experiment is running in the kubeflow namespace? If yes, then create a profile and run the experiment in the profile namespace. |
@nrchakradhar yes it's running in |
Yes. You can edit the namespace with kubectl edit and add the annotation. |
@shaowei-su If you are using default Kubeflow profile controller to create namespaces, it should by default add If you deploy only Katib components without Kubeflow, you can submit Experiment in kubeflow namespace. |
We decided to roll back to v1alpha2 at this point. Thanks for the help, though! @nrchakradhar @andreyvelich |
I am experiencing this issue using version v1alpha3 on both the kubeflow and personal profile namespaces. |
@andrewlarimer Which Katib image versions are you using? |
@nrchakradhar : I tried creating a new namespace to run the experiment, still running into
I installed Katib using the |
To be more precise, the higher level error is that when I kick off the |
@kunalyogenshah as mentioned above, I hope you are creating profiles instead of just a namespace. |
@nrchakradhar : Not sure what you mean by profile... This is how the kubeflow namespace looks :
The running components in this namespace are :
When you say katib manager logs I presume you mean the controller? The created pods do not have a sidecar in them. The logs from the controller have some infos, and a few errors like this one :
There are some other info logs about
PS: The alternate namespace I tried using looks like this :
|
@kunalyogenshah From your information above, I can see that you deployed Katib without other Kubeflow components, which means without the profile-controller. In that case, you can submit Katib jobs in any namespace, including the kubeflow namespace. Can you describe one of your Trials, please? Are you using the latest version of the Katib images? |
Thanks @andreyvelich . This is one of the trials :
The corresponding job is :
and the nested pod :
|
Oh, and the image versions are |
@kunalyogenshah I suspect you have Istio installed. If so, please set istio-injection to false
Also, your controller logs do not indicate that the metrics sidecar is being injected.
Can you try upgrading to the 1.0 images, if feasible? |
@kunalyogenshah Thank you for this information. Try deleting everything: the Katib CRDs and also the validating and mutating webhooks. After that, deploy Katib again with the latest images from |
Thank you so much @andreyvelich, @nrchakradhar :
As you can see, it goes on in a loop trying to inject the metrics sidecar, but it seems to be failing, with these logs in the Job describe :
Does this point to an issue with my cluster rather than Katib? |
@kunalyogenshah Did you delete all Experiments before reinstalling Katib? Maybe your cluster doesn't support this imagePullPolicy. For the metrics collector container, in the Katib config (https://github.com/kubeflow/katib/blob/master/manifests/v1alpha3/katib-controller/katib-config.yaml) you can specify |
Bingo! That was it. Thank you @andreyvelich . |
Before I close the door on this conversation though, there is one last question. Every time an experiment is created, it creates a replicaset deployment, which is not turned off when the experiment succeeds. Is this expected? This would mean an ever growing pool of deployments every time I create an experiment. cc @andreyvelich
|
@kunalyogenshah Yes, the current Katib implementation works like this. It was done to support resuming experiments. |
@andreyvelich : I see. So is the resume experiment feature coming soon? Because if we are not looking to resume an experiment, we could end up with manual deployment cleanups post completion for now. Our use case would be a few hundred experiments a day, which is a reasonable number of dead resources. |
@kunalyogenshah Yes, this feature will come in upcoming releases. |
Got it, thanks @andreyvelich . Just to be clear, is it ok if we delete just the suggestion deployment and leave the experiment as is? Or will that cause issues and we need to delete both? |
@kunalyogenshah Right now, I am not sure that you can delete only the suggestion deployment: the Katib controller will deploy it again unless you delete the Experiment. |
Thanks for being patient with my questions @andreyvelich . Greatly appreciated. I'll throw in one more (hopefully the last)... I have created a Katib deployment with a custom Database service endpoint. I initialized the controller container using the following env arguments :
However, the injected sidecar containers are created with the |
Oh, it seems I found the cause. It expects the length of the namespace to be zero; otherwise it won't use the custom arguments. But if I set the namespace to '', then it will break the other components, right? |
@kunalyogenshah You are welcome :) Interesting question. @johnugeorge @gaocegege Why here:
do we use KATIB_DB_MANAGER_PORT_6789_TCP_ADDR and KATIB_DB_MANAGER_PORT_6789_TCP_PORT only with an empty KATIB_CORE_NAMESPACE?
We use |
I think this is legacy code that doesn't seem correct. We should use a separate DB namespace env variable instead of KATIB_CORE_NAMESPACE. |
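The behavior discussed in the last few comments can be sketched as follows. This is an illustration in Python, not Katib's actual code: the function names, the katib-db-manager service name, the default namespace, and the EXAMPLE_DB_NAMESPACE variable are assumptions made up for the example. Only KATIB_CORE_NAMESPACE and the two KATIB_DB_MANAGER_PORT_6789_TCP_* variables come from the thread.

```python
import os

# Hypothetical names, used for illustration only.
DB_SERVICE = "katib-db-manager"
DB_PORT = "6789"


def _custom_endpoint(env):
    """Return the user-supplied endpoint, or None if it is not fully set."""
    addr = env.get("KATIB_DB_MANAGER_PORT_6789_TCP_ADDR", "")
    port = env.get("KATIB_DB_MANAGER_PORT_6789_TCP_PORT", "")
    return f"{addr}:{port}" if addr and port else None


def legacy_db_addr(env=None):
    """Address selection as described in the thread (simplified)."""
    env = os.environ if env is None else env
    # The "kubeflow" default is an assumption.
    ns = env.get("KATIB_CORE_NAMESPACE", "kubeflow")
    if len(ns) == 0:
        # Custom endpoint variables are honored only for an empty namespace.
        return _custom_endpoint(env) or f"{DB_SERVICE}:{DB_PORT}"
    # Non-empty namespace: the custom endpoint variables are ignored.
    return f"{DB_SERVICE}.{ns}:{DB_PORT}"


def separate_db_namespace_addr(env=None):
    """One possible reading of the suggestion: key the check on a DB-only variable."""
    env = os.environ if env is None else env
    core_ns = env.get("KATIB_CORE_NAMESPACE", "kubeflow")
    # EXAMPLE_DB_NAMESPACE is a hypothetical variable name.
    db_ns = env.get("EXAMPLE_DB_NAMESPACE", core_ns)
    if len(db_ns) == 0:
        return _custom_endpoint(env) or f"{DB_SERVICE}:{DB_PORT}"
    return f"{DB_SERVICE}.{db_ns}:{DB_PORT}"
```

With a separate variable, KATIB_CORE_NAMESPACE can stay set for the other components while the database endpoint is still overridden.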
Closing this issue, as the fix was merged in #1102. |
/close |
@johnugeorge: Closing this issue. In response to this:
/kind bug
Hi, I'm having trouble using katib v1alpha3.
First, I installed Katib as follows:
And I tried to apply random-example.yaml
kubectl apply -f random-example.yaml
(example in katib/examples/v1alpha3)
Results:
kubectl get pods -n kubeflow
NAME                                     READY   STATUS    RESTARTS   AGE
katib-controller-6c6974678d-zsnlc        1/1     Running   1          24m
katib-db-558f649dc6-8cd9t                1/1     Running   0          24m
katib-manager-5f74bdff84-4d78z           1/1     Running   0          24m
katib-ui-6568bd6b44-qbq5k                1/1     Running   0          24m
random-example-random-846dc99654-bxb8j   1/1     Running   0          23m
kubectl get trials -n kubeflow
NAME                      TYPE      STATUS   AGE
random-example-drpkvb4b   Running   True     23m
random-example-k7xv6ktt   Running   True     23m
random-example-w6jlwdp2   Running   True     23m
kubectl get experiment -n kubeflow -oyaml
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha3
  kind: Experiment
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"kubeflow.org/v1alpha3","kind":"Experiment","metadata":{"annotations":{},"labels":{"controller-tools.k8s.io":"1.0"},"name":"random-example","namespace":"kubeflow"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":12,"objective":{"additionalMetricNames":["accuracy"],"goal":0.99,"objectiveMetricName":"Validation-accuracy","type":"maximize"},"parallelTrialCount":3,"parameters":[{"feasibleSpace":{"max":"0.03","min":"0.01"},"name":"--lr","parameterType":"double"},{"feasibleSpace":{"max":"5","min":"2"},"name":"--num-layers","parameterType":"int"},{"feasibleSpace":{"list":["sgd","adam","ftrl"]},"name":"--optimizer","parameterType":"categorical"}],"trialTemplate":{"goTemplate":{"rawTemplate":"apiVersion: batch/v1\nkind: Job\nmetadata:\n name: {{.Trial}}\n namespace: {{.NameSpace}}\nspec:\n template:\n spec:\n containers:\n - name: {{.Trial}}\n image: docker.io/kubeflowkatib/mxnet-mnist-example\n command:\n - "python"\n - "/mxnet/example/image-classification/train_mnist.py"\n - "--batch-size=64"\n {{- with .HyperParameters}}\n {{- range .}}\n - "{{.Name}}={{.Value}}"\n {{- end}}\n {{- end}}\n restartPolicy: Never"}}}}
    creationTimestamp: "2019-12-20T07:58:52Z"
    finalizers:
    generation: 2
    labels:
      controller-tools.k8s.io: "1.0"
    name: random-example
    namespace: kubeflow
    resourceVersion: "11682124"
    selfLink: /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/experiments/random-example
    uid: 9005bab0-22fe-11ea-8cf0-0679676001a5
  spec:
    algorithm:
      algorithmName: random
      algorithmSettings: null
    maxFailedTrialCount: 3
    maxTrialCount: 12
    metricsCollectorSpec:
      collector:
        kind: StdOut
    objective:
      additionalMetricNames:
      - accuracy
      goal: 0.99
      objectiveMetricName: Validation-accuracy
      type: maximize
    parallelTrialCount: 3
    parameters:
    - feasibleSpace:
        max: "0.03"
        min: "0.01"
      name: --lr
      parameterType: double
    - feasibleSpace:
        max: "5"
        min: "2"
      name: --num-layers
      parameterType: int
    - feasibleSpace:
        list:
        - sgd
        - adam
        - ftrl
      name: --optimizer
      parameterType: categorical
    trialTemplate:
      goTemplate:
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            template:
              spec:
                containers:
                - name: {{.Trial}}
                  image: docker.io/kubeflowkatib/mxnet-mnist-example
                  command:
                  - "python"
                  - "/mxnet/example/image-classification/train_mnist.py"
                  - "--batch-size=64"
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                restartPolicy: Never
  status:
    conditions:
    - lastUpdateTime: "2019-12-20T07:58:52Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastUpdateTime: "2019-12-20T08:00:22Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "True"
      type: Running
    currentOptimalTrial:
      observation:
        metrics: null
      parameterAssignments: null
    startTime: "2019-12-20T07:58:52Z"
    trials: 3
    trialsRunning: 3
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
kubectl logs -n kubeflow random-example-random-846dc99654-bxb8j
INFO:hyperopt.utils:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
INFO:hyperopt.fmin:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
ERROR:grpc._server:Exception calling application: Method not implemented!
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/grpc/_server.py", line 434, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1alpha3/python/api_pb2_grpc.py", line 135, in ValidateAlgorithmSettings
    raise NotImplementedError('Method not implemented!')
NotImplementedError: Method not implemented!
What can I do to fix it?
Thank you for your help in solving this problem.
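For context on the traceback above: the error is raised from api_pb2_grpc.py, the generated gRPC module. Python gRPC code generation emits a base servicer class whose methods raise NotImplementedError('Method not implemented!') until a subclass overrides them. The sketch below imitates that pattern without importing grpc. The class names and return values are hypothetical, and a mismatch between component versions is only one possible explanation, not something the logs confirm.

```python
class SuggestionServicerBase:
    """Simplified stand-in for a generated servicer base class."""

    def GetSuggestions(self, request, context):
        raise NotImplementedError("Method not implemented!")

    def ValidateAlgorithmSettings(self, request, context):
        raise NotImplementedError("Method not implemented!")


class PartialSuggestionService(SuggestionServicerBase):
    """Hypothetical service that overrides only GetSuggestions."""

    def GetSuggestions(self, request, context):
        return {"assignments": []}


class FullSuggestionService(PartialSuggestionService):
    """Hypothetical service that also overrides ValidateAlgorithmSettings."""

    def ValidateAlgorithmSettings(self, request, context):
        return {}


def call(service, method, request=None):
    """Invoke an RPC handler by name and report whether it is implemented."""
    try:
        return "ok", getattr(service, method)(request, None)
    except NotImplementedError as exc:
        return "error", str(exc)


if __name__ == "__main__":
    print(call(PartialSuggestionService(), "ValidateAlgorithmSettings"))
    print(call(FullSuggestionService(), "ValidateAlgorithmSettings"))
```

A caller that invokes ValidateAlgorithmSettings on a service that never overrode it gets exactly this message, which is why the image versions of the caller and the suggestion service are worth comparing.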
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5+icp", GitCommit:"903c3b31caddc675ce2d8bddf62aa0f875c2a3bc", GitTreeState:"clean", BuildDate:"2019-05-08T06:16:32Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5+icp", GitCommit:"903c3b31caddc675ce2d8bddf62aa0f875c2a3bc", GitTreeState:"clean", BuildDate:"2019-05-08T06:16:32Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
OS (e.g. from /etc/os-release): CentOS Linux release 7.7.1908 (Core)