Unable to specify pod template metadata for TFJob #1403

Closed
andreyvelich opened this issue Sep 9, 2021 · 10 comments

@andreyvelich
Member

I tried to run this example with Pod Template metadata:

apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: test-tfjob
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            custom-label: "test"
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
                - "--log_dir=/train/metrics"

The metadata was not populated in the created TFJob: the labels and annotations under template.metadata were dropped.
This is the output of kubectl get tfjob test-tfjob -n kubeflow-user-example-com -o yaml:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  creationTimestamp: "2021-09-09T19:27:08Z"
  generation: 1
  name: test-tfjob
  namespace: kubeflow-user-example-com
  resourceVersion: "367770085"
  uid: de1f70f7-7a33-4ff7-a521-22fcc713635d
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata: {}
        spec:
          containers:
          - command:
            - python
            - /var/tf_mnist/mnist_with_summaries.py
            - --log_dir=/train/metrics
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            name: tensorflow
status:
  conditions:
  - lastTransitionTime: "2021-09-09T19:27:08Z"
    lastUpdateTime: "2021-09-09T19:27:08Z"
    message: TFJob test-tfjob is created.
    reason: TFJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-09-09T19:27:12Z"
    lastUpdateTime: "2021-09-09T19:27:12Z"
    message: TFJob kubeflow-user-example-com/test-tfjob is running.
    reason: TFJobRunning
    status: "True"
    type: Running
  replicaStatuses:
    Worker:
      active: 2
  startTime: "2021-09-09T19:27:09Z"

@andreyvelich
Member Author

/kind bug
/cc @kubeflow/wg-training-leads

@gaocegege
Member

I think it may be related to kubernetes/apiextensions-apiserver#50.

@shinytang6
Member

Upgrading controller-tools to >= 0.6 should solve this problem.

ref:

@andreyvelich
Member Author

Should we add x-kubernetes-preserve-unknown-fields: true to our CRD?
Similar to this: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/crd/experiment.yaml#L27.
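
For illustration, a minimal sketch of what such a CRD could look like. This is not copied from the tf-operator or Katib manifests: the CRD name, the names block, and the placement of the flag at the schema root are assumptions. With the flag set on an object node, the API server keeps fields under that node that the schema does not list, so template.metadata.labels and template.metadata.annotations would no longer be pruned.

```yaml
# Hypothetical CRD excerpt; names and layout are assumptions for illustration.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: tfjobs.kubeflow.org
spec:
  group: kubeflow.org
  scope: Namespaced
  names:
    kind: TFJob
    plural: tfjobs
    singular: tfjob
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          # Keep fields that the schema does not declare instead of pruning
          # them, including labels and annotations in the pod template metadata.
          x-kubernetes-preserve-unknown-fields: true
```

The trade-off is that unknown or misspelled fields are also stored without complaint, so this loosens validation compared to a fully specified schema.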

@gaocegege
Member

I think we can add it.

@andreyvelich
Member Author

/priority p0
cc @zijianjoy

@kubeflow/wg-training-leads Should we prioritise this task?
For example, our users can't disable the Istio sidecar container on TFJobs, so some examples will not work in a Kubeflow installation.

@Jeffwan
Member

Jeffwan commented Sep 20, 2021

@andreyvelich @zijianjoy

I filed a PR to address this issue and verified that it works now. I can publish a new release later.
https://github.com/kubeflow/tf-operator/pull/1409/files

➜  tf-operator git:(fix_pod_template) ✗ kubectl get tfjob test-tfjob -o yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"test-tfjob","namespace":"default"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"},"labels":{"custom-label":"test"}},"spec":{"containers":[{"command":["python","/var/tf_mnist/mnist_with_summaries.py","--log_dir=/train/metrics"],"image":"gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0","name":"tensorflow"}]}}}}}}
  creationTimestamp: "2021-09-20T18:48:43Z"
  generation: 1
  name: test-tfjob
  namespace: default
  resourceVersion: "9788695"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/tfjobs/test-tfjob
  uid: 335d2583-5726-4faa-a375-e42325f6f0f7
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            custom-label: test
        spec:
          containers:
          - command:
            - python
            - /var/tf_mnist/mnist_with_summaries.py
            - --log_dir=/train/metrics
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            name: tensorflow
status:
  conditions:
  - lastTransitionTime: "2021-09-20T18:48:43Z"
    lastUpdateTime: "2021-09-20T18:48:43Z"
    message: TFJob test-tfjob is created.
    reason: TFJobCreated
    status: "True"
    type: Created
  replicaStatuses:
    Worker: {}
  startTime: "2021-09-20T18:48:44Z"

Jeffwan self-assigned this Sep 20, 2021

@andreyvelich
Member Author

Thank you @Jeffwan!

@zijianjoy

Thank you @Jeffwan and @andreyvelich!

@Jeffwan
Member

Jeffwan commented Sep 24, 2021

This has been fixed.

/close

Jeffwan closed this as completed Sep 24, 2021