
[v1alpha2] Invalid Job spec crashes operator #706

Closed
jlewi opened this issue Jun 29, 2018 · 10 comments
@jlewi
Contributor

jlewi commented Jun 29, 2018

The following spec (I assume it's invalid) is causing the TFJob operator to crash:

apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha2
  kind: TFJob
  metadata:
    annotations:
      ksonnet.io/managed: '{"pristine":"H4sIAAAAAAAA/4xST28aPxC9/z7GnPdPQL+odCUOKCURFQGUoF4itBq8w67Drm3Z400B7XevDGmCBFXr02jmvTcznncANPIHWSe1ggy2fk2bWr8l2pZp28PaVNiHCLZSFZDB8v67XkMEDTEWyAjZARQ2BBnw5lWv4zNKyDuDgs5koYvAGRKBx5snMrUU+GxIuJB5RMdkQ2RPFQdZ75MgtGKUiqyD7OUAaMsQQPqm7baQNmWLUiVmBxHEscPG1JQ7uadh7ya8Y1oq4zkPo+elcPnaiy3x8Pd4Mf080tw1rEGuhqXkyq9j6Zyn2PmmQSv3yFKrOADT87pL9tIclbTnINXoguq/971A/1Pnc1ZS3cIqAqGbBsPdXsDsuNIqJEm1x+97P9vDfP4wHeejxWI6uRstJ/NZfvc0/jaeLSej6TNE0GLtAzB1JCxxWgoTC0sFKZZYu3RLu+TVaQXdKgLZYBnApbCJ1OnHfgW1KW/i4JArC2Rt/6Y3uPm/fxvTl6+bwWDwbp9gK1JO26N3Imh17Rt61F7xyQNNCBfI1Z8H/JS6LFjCYq7qHWRsPXWrsIIlx2h5oWspdpDBXN2jrL2lj/7u/P8uRU9jBMOeotl1YNetuvD++wUAAP//AQAA//8pid8+ggMAAA=="}'
    clusterName: ""
    creationTimestamp: 2018-06-29T15:46:48Z
    labels:
      app.kubernetes.io/deploy-manager: ksonnet
    name: tfjob-v1alpha2
    namespace: kubeflow
    resourceVersion: "308756"
    selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/tfjob-v1alpha2
    uid: a238feee-7bb3-11e8-a525-42010a8e006c
  spec:
    tfReplicaSpecs:
      Master:
        replicas: 1
        spec:
          containers:
          - args:
            - /workdir/train.py
            - --sample_size=100000
            - --input_data_gcs_bucket=kubeflow-examples
            - --input_data_gcs_path=github-issue-summarization-data/github-issues.zip
            - --output_model_gcs_bucket=kubeflow-examples
            - --output_model_gcs_path=github-issue-summarization-data/output_model.h5
            command:
            - python
            env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /secret/gcp-credentials/key.json
            image: gcr.io/kubeflow-dev/tf-job-issue-summarization:v20180425-e79f888
            name: tensorflow
            volumeMounts:
            - mountPath: /secret/gcp-credentials
              name: gcp-credentials
              readOnly: true
          restartPolicy: OnFailure
          volumes:
          - name: gcp-credentials
            secret:
              secretName: gcp-credentials
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Here are the operator logs:

time="2018-06-29T15:47:04Z" level=info msg="EnvKubeflowNamespace not set, use default namespace" filename="app/server.go:65"
time="2018-06-29T15:47:04Z" level=info msg="[API Version: v1alpha2 Version: v0.1.0-alpha Git SHA: b2ac020 Go Version: go1.9.2 Go OS/Arch: linux/amd64]" filename="app/server.go:70"
...
E0629 15:47:04.345988       1 runtime.go:66] Observed a panic: "index out of range" (runtime error: index out of range)
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:509
/usr/local/go/src/runtime/panic.go:491
/usr/local/go/src/runtime/panic.go:28
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2/defaults.go:45
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2/defaults.go:91
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2/zz_generated.defaults.go:35
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2/zz_generated.defaults.go:29
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:394
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/controller.v2/controller_tfjob.go:33
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/controller.v2/controller.go:202
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache/controller.go:195
<autogenerated>:1
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:550
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:387
/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71
/usr/local/go/src/runtime/asm_amd64.s:2337
panic: runtime error: index out of range [recovered]
	panic: runtime error: index out of range

goroutine 57 [running]:
github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x111
panic(0x1185e00, 0x1be5bf0)
	/usr/local/go/src/runtime/panic.go:491 +0x283
github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2.setDefaultPort(0xc4200b4370)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2/defaults.go:45 +0x33a
github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2.SetDefaults_TFJob(0xc420292000)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2/defaults.go:91 +0x9e
github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2.SetObjectDefaults_TFJob(0xc420292000)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2/zz_generated.defaults.go:35 +0x2b
github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2.RegisterDefaults.func1(0x12c5b20, 0xc420292000)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1alpha2/zz_generated.defaults.go:29 +0x3c
github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/runtime.(*Scheme).Default(0xc4203f2700, 0x1b84200, 0xc420292000)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:394 +0xb7
github.com/kubeflow/tf-operator/pkg/controller%2ev2.(*TFJobController).addTFJob(0xc4203541c0, 0x12e6280, 0xc4204ae0d8)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/controller.v2/controller_tfjob.go:33 +0x2d0
github.com/kubeflow/tf-operator/pkg/controller%2ev2.(*TFJobController).(github.com/kubeflow/tf-operator/pkg/controller%2ev2.addTFJob)-fm(0x12e6280, 0xc4204ae0d8)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/pkg/controller.v2/controller.go:202 +0x3e
github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(0xc4203220b0, 0xc4203220c0, 0xc4203220d0, 0x12e6280, 0xc4204ae0d8)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache/controller.go:195 +0x49
github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnAdd(0xc4201961e0, 0x12e6280, 0xc4204ae0d8)
	<autogenerated>:1 +0x62
github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache.(*processorListener).run(0xc420362000)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:550 +0x272
github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache.(*processorListener).(github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache.run)-fm()
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:387 +0x2a
github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc4203fd168, 0xc420322950)
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x4f
created by github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/mnt/test-data-volume/kunming-tf-operator-release-b2ac020-4264/go/src/github.com/kubeflow/tf-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62

The Docker image is:

    image: gcr.io/kubeflow-images-public/tf_operator:v0.2.0
    imageID: docker-pullable://gcr.io/kubeflow-images-public/tf_operator@sha256:4f20e349f79059a009ef75aea158ca0c555fcc4a22e7c80a7cb9bff54fbab6c1
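The stack trace points at `setDefaultPort` in `defaults.go:45` panicking with an index-out-of-range, presumably because the invalid spec nests `spec.containers` directly under the replica spec rather than under `template.spec`, leaving the defaulting code with no containers to index. A plausible defensive fix, shown here as a hypothetical sketch with simplified stand-in types (not the actual tf-operator code), is to guard the container lookup and return an error instead of assuming `Containers[0]` exists:

```go
package main

import "fmt"

// Hypothetical, simplified mirrors of the v1alpha2 types involved in the crash.
type ContainerPort struct {
	Name          string
	ContainerPort int32
}

type Container struct {
	Name  string
	Ports []ContainerPort
}

type PodSpec struct {
	Containers []Container
}

type TFReplicaSpec struct {
	Template *PodSpec // nil or empty when the user omits `template:`
}

// setDefaultPort sketches a defensive version of the defaulting logic:
// instead of indexing Containers[0] unconditionally (which panics on the
// invalid spec above), it returns an error the controller can surface.
func setDefaultPort(spec *TFReplicaSpec) error {
	if spec.Template == nil || len(spec.Template.Containers) == 0 {
		return fmt.Errorf("replica spec has no containers; is `template` missing?")
	}
	c := &spec.Template.Containers[0]
	for _, p := range c.Ports {
		if p.Name == "tfjob-port" {
			return nil // port already set by the user
		}
	}
	c.Ports = append(c.Ports, ContainerPort{Name: "tfjob-port", ContainerPort: 2222})
	return nil
}

func main() {
	bad := &TFReplicaSpec{}
	if err := setDefaultPort(bad); err != nil {
		fmt.Println("rejected:", err)
	}

	good := &TFReplicaSpec{Template: &PodSpec{Containers: []Container{{Name: "tensorflow"}}}}
	if err := setDefaultPort(good); err == nil {
		fmt.Println("defaulted port:", good.Template.Containers[0].Ports[0].ContainerPort)
	}
}
```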
@jlewi jlewi changed the title [v1alpha2[ Job spec crashes operator [v1alpha2[ Invalid Job spec crashes operator Jun 29, 2018
@jlewi
Contributor Author

jlewi commented Jun 29, 2018

For comparison, here's the valid spec:

  spec:
    tfReplicaSpecs:
      Master:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - args:
              - /workdir/train.py
              - --sample_size=100000
              - --input_data_gcs_bucket=kubeflow-examples
              - --input_data_gcs_path=github-issue-summarization-data/github-issues.zip
              - --output_model_gcs_bucket=kubeflow-examples
              - --output_model_gcs_path=github-issue-summarization-data/output_model.h5
              command:
              - python
              env:
              - name: GOOGLE_APPLICATION_CREDENTIALS
                value: /secret/gcp-credentials/key.json
              image: gcr.io/kubeflow-dev/tf-job-issue-summarization:v20180425-e79f888
              name: tensorflow
              ports:
              - containerPort: 2222
                name: tfjob-port
              resources: {}
              volumeMounts:
              - mountPath: /secret/gcp-credentials
                name: gcp-credentials
                readOnly: true
            restartPolicy: OnFailure
            volumes:
            - name: gcp-credentials
              secret:
                secretName: gcp-credentials
  status:
    conditions: null
    tfReplicaStatuses:

Can we avoid crashing the operator and instead add an error message to the status?

I thought we added some handling in v1alpha2 to do that.

@jlewi jlewi changed the title [v1alpha2[ Invalid Job spec crashes operator [v1alpha2] Invalid Job spec crashes operator Jun 29, 2018
@gaocegege
Member

/assign @codeflitting

@k8s-ci-robot

@gaocegege: GitHub didn't allow me to assign the following users: codeflitting.

Note that only kubeflow members and repo collaborators can be assigned.

In response to this:

/assign @codeflitting

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gaocegege
Member

@jlewi Could you please invite @codeflitting and @yph152 as Kubeflow members? I cannot assign issues to them.

@jlewi
Contributor Author

jlewi commented Jun 30, 2018

Invites sent.

@gaocegege
Member

/assign @codeflitting

@codeflitting
Member

Hi @jlewi,
I have fixed it by adding some validation (#702). Can you try the master branch?
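Validation of this class typically checks the spec up front, before defaulting runs. The following is a hypothetical sketch of such a check (not the actual code from #702), using minimal stand-in types: every replica spec must carry a pod template containing a container named `tensorflow`:

```go
package main

import "fmt"

// Minimal, hypothetical stand-ins for the relevant spec types.
type Container struct{ Name string }
type PodSpec struct{ Containers []Container }
type PodTemplateSpec struct{ Spec PodSpec }
type TFReplicaSpec struct{ Template *PodTemplateSpec }

// validateTFReplicaSpecs rejects a malformed job before the operator
// dereferences anything: each replica spec needs a pod template with a
// container named "tensorflow".
func validateTFReplicaSpecs(specs map[string]*TFReplicaSpec) error {
	if len(specs) == 0 {
		return fmt.Errorf("tfReplicaSpecs must not be empty")
	}
	for name, rs := range specs {
		if rs.Template == nil {
			return fmt.Errorf("replica %q: missing pod template", name)
		}
		found := false
		for _, c := range rs.Template.Spec.Containers {
			if c.Name == "tensorflow" {
				found = true
			}
		}
		if !found {
			return fmt.Errorf("replica %q: no container named %q", name, "tensorflow")
		}
	}
	return nil
}

func main() {
	// The invalid spec from this issue put containers under `spec`, not
	// `template.spec`, so the pod template is effectively missing.
	bad := map[string]*TFReplicaSpec{"Master": {}}
	fmt.Println(validateTFReplicaSpecs(bad))
}
```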

@jlewi
Contributor Author

jlewi commented Jul 3, 2018

@codeflitting Thanks.

Rather than manually verifying it, do we have a test that checks that invalid specs are handled properly? If not, how about opening an issue to add one?

@codeflitting
Member

Yes, we have. I have written more test cases to cover the validation (#711).

@gaocegege
Member

Yeah, I think we can close the issue. If we encounter it again, we can reopen it.
