Error syncing tfjob: Failed to found the port #768

activatedgeek · 2018-08-02T00:34:07Z

TFJob v1alpha2 fails to sync with the error "Failed to found the port" and gives no further detail. Below is the TFJob spec, which is based on the distributed MNIST example.

Version: 0.2.2 (as used with the deploy.sh script)

TF Job Spec

---
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  labels:
    ksonnet.io/component: t2t-code-search-trainer
  name: t2t-code-search-trainer
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          containers:
          - args:
            - /usr/local/sbin/t2t-entrypoint
            - t2t-trainer
            - --generate_data
            - --problem=github_function_docstring
            - --model=similarity_transformer
            - --hparams_set=transformer_tiny
            - --data_dir=gs://kubeflow-examples/t2t-code-search/20180801/data
            - --output_dir=gs://kubeflow-examples/t2t-code-search/20180801/output
            - --train_steps=100
            - --schedule=run_std_server
            - --ps_job=/job:ps
            env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /secret/gcp-credentials/key.json
            image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
            name: PS
            volumeMounts:
            - mountPath: /secret/gcp-credentials
              name: gcp-credentials
          imagePullSecrets:
          - name: gcp-registry-credentials
          volumes:
          - name: gcp-credentials
            secret:
              secretName: gcp-credentials
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - args:
            - /usr/local/sbin/t2t-entrypoint
            - t2t-trainer
            - --generate_data
            - --problem=github_function_docstring
            - --model=similarity_transformer
            - --hparams_set=transformer_tiny
            - --data_dir=gs://kubeflow-examples/t2t-code-search/20180801/data
            - --output_dir=gs://kubeflow-examples/t2t-code-search/20180801/output
            - --train_steps=100
            - --schedule=train
            - --ps_gpu=0
            - --worker_gpu=1
            - --worker_replicas=2
            - --ps_replicas=1
            - --eval_steps=10
            - --worker_job=/job:worker
            env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /secret/gcp-credentials/key.json
            image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
            name: WORKER
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - mountPath: /secret/gcp-credentials
              name: gcp-credentials
          imagePullSecrets:
          - name: gcp-registry-credentials
          volumes:
          - name: gcp-credentials
            secret:
              secretName: gcp-credentials

Logs

time="2018-08-02T00:31:38Z" level=info msg="Updating tfjob: t2t-code-search-trainer" filename="controller.v2/controller_tfjob.go:54"
time="2018-08-02T00:31:38Z" level=info msg="Reconcile TFJobs t2t-code-search-trainer" filename="controller.v2/controller.go:379"
time="2018-08-02T00:31:38Z" level=info msg="Need to create new pod: ps-0" filename="controller.v2/controller_pod.go:69" job=kubeflow/t2t-code-search-trainer replica-type=ps uid=9f5d4c8f-95e8-11e8-a7df-42010a80014e
time="2018-08-02T00:31:38Z" level=info msg="reconcilePods error Failed to found the port" filename="controller.v2/controller.go:399"
time="2018-08-02T00:31:38Z" level=info msg="Finished syncing tfjob \"kubeflow/t2t-code-search-trainer\" (883.508µs)" filename="controller.v2/controller.go:340"
E0802 00:31:38.332952       1 controller.go:318] Error syncing tfjob: Failed to found the port

gaocegege · 2018-08-02T02:06:31Z

The name of the container should be tensorflow (the spec above uses PS and WORKER).
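
For anyone landing here with the same error, a minimal sketch of the change, assuming the rest of the spec stays as posted. The only edit is the container name in each replica template; the comments mark what was there before. My understanding, which is an assumption and not something confirmed in this thread, is that the v1alpha2 controller looks up the port on the container named tensorflow, so a differently named container produces the "Failed to found the port" error. Separately, Kubernetes requires container names to be lowercase DNS labels, so PS and WORKER would not be accepted as pod container names in any case.

```yaml
# Fragment of the TFJob spec from this issue; only the container names change.
# args, env, resources, volumeMounts, imagePullSecrets and volumes are omitted
# here and stay as in the original spec.
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: t2t-code-search-trainer
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow  # was: PS
            image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow  # was: WORKER
            image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
```

The replica type keys (PS, Worker) are left untouched; only the name field inside containers is changed.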

activatedgeek · 2018-08-02T18:23:44Z

Thank you! Closing this and redirecting any future discussion back to #563.

activatedgeek closed this as completed Aug 2, 2018
