Error syncing tfjob: Failed to found the port #768

Closed
activatedgeek opened this issue Aug 2, 2018 · 2 comments

activatedgeek commented Aug 2, 2018

TFJob v1alpha2 fails to sync with an error I cannot place: "Failed to found the port" (full logs below). Here is the TFJob spec, which is based on the distributed MNIST example.

Version: 0.2.2 (as used with the deploy.sh script)

TFJob Spec

---
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  labels:
    ksonnet.io/component: t2t-code-search-trainer
  name: t2t-code-search-trainer
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          containers:
          - args:
            - /usr/local/sbin/t2t-entrypoint
            - t2t-trainer
            - --generate_data
            - --problem=github_function_docstring
            - --model=similarity_transformer
            - --hparams_set=transformer_tiny
            - --data_dir=gs://kubeflow-examples/t2t-code-search/20180801/data
            - --output_dir=gs://kubeflow-examples/t2t-code-search/20180801/output
            - --train_steps=100
            - --schedule=run_std_server
            - --ps_job=/job:ps
            env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /secret/gcp-credentials/key.json
            image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
            name: PS
            volumeMounts:
            - mountPath: /secret/gcp-credentials
              name: gcp-credentials
          imagePullSecrets:
          - name: gcp-registry-credentials
          volumes:
          - name: gcp-credentials
            secret:
              secretName: gcp-credentials
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - args:
            - /usr/local/sbin/t2t-entrypoint
            - t2t-trainer
            - --generate_data
            - --problem=github_function_docstring
            - --model=similarity_transformer
            - --hparams_set=transformer_tiny
            - --data_dir=gs://kubeflow-examples/t2t-code-search/20180801/data
            - --output_dir=gs://kubeflow-examples/t2t-code-search/20180801/output
            - --train_steps=100
            - --schedule=train
            - --ps_gpu=0
            - --worker_gpu=1
            - --worker_replicas=2
            - --ps_replicas=1
            - --eval_steps=10
            - --worker_job=/job:worker
            env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /secret/gcp-credentials/key.json
            image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
            name: WORKER
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - mountPath: /secret/gcp-credentials
              name: gcp-credentials
          imagePullSecrets:
          - name: gcp-registry-credentials
          volumes:
          - name: gcp-credentials
            secret:
              secretName: gcp-credentials

Logs

time="2018-08-02T00:31:38Z" level=info msg="Updating tfjob: t2t-code-search-trainer" filename="controller.v2/controller_tfjob.go:54"
time="2018-08-02T00:31:38Z" level=info msg="Reconcile TFJobs t2t-code-search-trainer" filename="controller.v2/controller.go:379"
time="2018-08-02T00:31:38Z" level=info msg="Need to create new pod: ps-0" filename="controller.v2/controller_pod.go:69" job=kubeflow/t2t-code-search-trainer replica-type=ps uid=9f5d4c8f-95e8-11e8-a7df-42010a80014e
time="2018-08-02T00:31:38Z" level=info msg="reconcilePods error Failed to found the port" filename="controller.v2/controller.go:399"
time="2018-08-02T00:31:38Z" level=info msg="Finished syncing tfjob \"kubeflow/t2t-code-search-trainer\" (883.508µs)" filename="controller.v2/controller.go:340"
E0802 00:31:38.332952       1 controller.go:318] Error syncing tfjob: Failed to found the port

@gaocegege (Member)

The name of the container should be "tensorflow".
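
For illustration, here is a minimal sketch of how the spec above might look with that change applied. It assumes the only required fix is renaming the two containers (PS and WORKER in the original) to tensorflow; my understanding, which I have not verified against the 0.2.2 source, is that the v1alpha2 controller looks for the job port on a container with that name and reports "Failed to found the port" when no such container exists. Fields not shown are meant to stay exactly as in the original spec.

```yaml
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: t2t-code-search-trainer
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow   # was: PS
            image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
            # args, env and volumeMounts unchanged from the original spec
          # imagePullSecrets and volumes unchanged from the original spec
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow   # was: WORKER
            image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
            resources:
              limits:
                nvidia.com/gpu: 1
            # args, env and volumeMounts unchanged from the original spec
          # imagePullSecrets and volumes unchanged from the original spec
```

Since both replica types use the same container name, the replica role is still carried by the tfReplicaSpecs key (PS or Worker), not by the container name.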

@activatedgeek (Author)

Thank you! Closing this and redirecting any future discussion back to #563.
