Invalid TFJob spec can cause the TFJob operator pod to crash repeatedly #813
Comments
Could you give me an example of the case? I think it should be fixed after #807 is merged.
/assign
This is my jsonnet file (note that container.name is misspelled):

    local params = std.extVar("__ksonnet/params").components.estimator_runconfig;
    local k = import "k.libsonnet";
    local parts(namespace, name, image) = {
    std.prune(k.core.v1.list.new([parts(params.namespace, params.name, params.image).job]))
Which version of tf-operator are you using? BTW, we do not treat the name error as an invalid spec, since it is still a string; we report that we cannot find the tensorflow container in the validation here: https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/validation/validation.go#L58
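For reference, a minimal sketch of the kind of check being described, assuming a hypothetical helper named hasTFContainer (this is not the exact code at the link above):

    // Hypothetical sketch of the container-name check described above; the
    // actual logic lives in pkg/apis/tensorflow/validation/validation.go.
    package validation

    import (
    	"fmt"

    	v1 "k8s.io/api/core/v1"
    )

    // defaultContainerName is the container name the operator looks for.
    const defaultContainerName = "tensorflow"

    // hasTFContainer returns an error when no container named "tensorflow"
    // exists, so a misspelled name such as "tensorlow" is reported as an
    // invalid spec rather than being silently accepted.
    func hasTFContainer(containers []v1.Container) error {
    	for _, c := range containers {
    		if c.Name == defaultContainerName {
    			return nil
    		}
    	}
    	return fmt.Errorf("spec does not contain a container named %q", defaultContainerName)
    }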
Does the "tensorlow" misspelling crash the operator?
I am using gcr.io/kubeflow-images-public/tf_operator:v20180809-d2509aa. I can reproduce the crash if the jsonnet file for the TFJob has a spelling error in the container name, for example "tensorlow". Some relevant logs:

    jsonPayload: {
I haven't observed crashes with the latest code at head. If it's crashing with an older version, is there a stack trace indicating the cause of the crash?
Going to mark this as fixed. |
We should log the invalid TFJob spec, but the job operator pod itself should not crash.
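A minimal sketch of that desired behavior, using illustrative names (tfJobSpec, validateSpec, reconcile) that are not the operator's actual types or API:

    // Sketch only: validate the spec, log and skip invalid TFJobs, and
    // never panic, so one bad spec cannot crash the operator pod.
    package main

    import (
    	"errors"
    	"log"
    )

    // tfJobSpec is a stand-in for the real TFJob spec type.
    type tfJobSpec struct {
    	ContainerNames []string
    }

    var errNoTFContainer = errors.New(`no container named "tensorflow" in spec`)

    // validateSpec mirrors the container-name check: a misspelled name
    // such as "tensorlow" yields an error instead of a panic.
    func validateSpec(spec tfJobSpec) error {
    	for _, name := range spec.ContainerNames {
    		if name == "tensorflow" {
    			return nil
    		}
    	}
    	return errNoTFContainer
    }

    // reconcile logs the invalid spec and returns, keeping the pod alive.
    func reconcile(jobName string, spec tfJobSpec) {
    	if err := validateSpec(spec); err != nil {
    		log.Printf("ignoring invalid TFJob %q: %v", jobName, err)
    		return
    	}
    	// Normal job handling would continue here.
    }

    func main() {
    	// A spec with the misspelled container name exercises the log path.
    	reconcile("estimator_runconfig", tfJobSpec{ContainerNames: []string{"tensorlow"}})
    }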