
Invalid TFJob spec can cause the TFJob operator pod to crash repeatedly #813

Closed
richardsliu opened this issue Aug 31, 2018 · 8 comments

Comments

@richardsliu
Contributor

We should log the invalid TFJob spec, but the job operator pod itself should not crash.

@gaocegege
Member

Could you give me an example of the case? I think it should be fixed after #807 is merged.

@gaocegege
Member

/assign

@richardsliu
Contributor Author

This is my jsonnet file (note that container.name is misspelled):

local params = std.extVar("__ksonnet/params").components.estimator_runconfig;

local k = import "k.libsonnet";

local parts(namespace, name, image) = {
  job:: {
    apiVersion: "kubeflow.org/v1alpha2",
    kind: "TFJob",
    metadata: {
      name: name,
      namespace: namespace,
    },
    spec: {
      cleanPodPolicy: "All",
      tfReplicaSpecs: {
        Chief: {
          replicas: 1,
          restartPolicy: "Never",
          template: {
            spec: {
              containers: [
                {
                  name: "tensorlow",
                  image: "gcr.io/kubeflow-images-staging/tf-operator-test-server:v20180830-867fcad8",
                },
              ],
            },
          },
        },
        PS: {
          replicas: 2,
          restartPolicy: "Never",
          template: {
            spec: {
              containers: [
                {
                  name: "tensorlow",
                  image: "gcr.io/kubeflow-images-staging/tf-operator-test-server:v20180830-867fcad8",
                },
              ],
            },
          },
        },
        Worker: {
          replicas: 2,
          restartPolicy: "Never",
          template: {
            spec: {
              containers: [
                {
                  name: "tensorlow",
                  image: "gcr.io/kubeflow-images-staging/tf-operator-test-server:v20180830-867fcad8",
                },
              ],
            },
          },
        },
      },
    },
  },
};

std.prune(k.core.v1.list.new([parts(params.namespace, params.name, params.image).job]))
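
(For reference, a ksonnet component like this is typically applied with something like ks apply default -c estimator_runconfig, assuming an environment named default.)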

@gaocegege
Member

Which version of tf-operator are you using?

BTW, we do not treat the name error as an invalid spec, since "tensorlow" is still a valid string. During validation we report that we cannot find the tensorflow container, here: https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/validation/validation.go#L58
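
For reference, the check at that line is roughly of this shape. This is a simplified Go sketch, not the operator's actual code; the type and function names here (ReplicaSpec, validateReplicas) are illustrative:

// Walk each replica's pod template and require a container literally named
// "tensorflow". A misspelled name such as "tensorlow" fails this check even
// though the spec is otherwise well-formed.
package validation

import "fmt"

const defaultContainerName = "tensorflow"

// ReplicaSpec is a trimmed-down stand-in for the operator's replica spec type.
type ReplicaSpec struct {
	Containers []struct{ Name string }
}

func validateReplicas(specs map[string]ReplicaSpec) error {
	for replicaType, spec := range specs {
		found := false
		for _, c := range spec.Containers {
			if c.Name == defaultContainerName {
				found = true
				break
			}
		}
		if !found {
			// Log and return an error; the operator itself should not crash on this.
			return fmt.Errorf("there is no container named %s in %s", defaultContainerName, replicaType)
		}
	}
	return nil
}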

@gaocegege
Member

Does the "tensorlow" typo crash the operator?

@richardsliu
Contributor Author

I am using gcr.io/kubeflow-images-public/tf_operator:v20180809-d2509aa.

I can repro the crash if the jsonnet file for the TFJob has a spelling error for the container name, for example "tensorlow". Some relevant logs:

jsonPayload: {
  filename: "validation/validation.go:50"
  level: "warning"
  msg: "There is no container named tensorflow in Chief"
}
jsonPayload: {
  filename: "tfcontroller/informer.go:106"
  job: "kubeflow.estimator-runconfig"
  level: "error"
  msg: "Failed to marshal the object to TFJob: TFJobSpec is not valid"
  uid: "c3c7820e-ae3d-11e8-8c72-42010af000da"
}
jsonPayload: {
  filename: "tfcontroller/controller_tfjob.go:32"
  job: "kubeflow.estimator-runconfig"
  level: "error"
  msg: "Failed to convert the TFJob: Failed to marshal the object to TFJob"
  uid: "c3c7820e-ae3d-11e8-8c72-42010af000da"
}
jsonPayload: {
  filename: "tfcontroller/controller_tfjob.go:36"
  job: "kubeflow.estimator-runconfig"
  level: "warning"
  msg: "Failed to unmarshal the object to TFJob object: Failed to marshal the object to TFJob"
  uid: "c3c7820e-ae3d-11e8-8c72-42010af000da"
}

@jlewi
Contributor

jlewi commented Sep 2, 2018

I haven't observed crashes with the latest code at head. If it's crashing with an older version, is there a stack trace indicating the cause of the crash?

@jlewi
Contributor

jlewi commented Sep 5, 2018

Going to mark this as fixed.

@jlewi jlewi closed this as completed Sep 5, 2018