TFJob v1alpha2 seems to be throwing unknown errors. Here is the TFJob spec which is based on the distributed MNIST example.
Version: 0.2.2 (as used with the deploy.sh script)
TF Job Spec

---
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  labels:
    ksonnet.io/component: t2t-code-search-trainer
  name: t2t-code-search-trainer
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          containers:
          - args:
            - /usr/local/sbin/t2t-entrypoint
            - t2t-trainer
            - --generate_data
            - --problem=github_function_docstring
            - --model=similarity_transformer
            - --hparams_set=transformer_tiny
            - --data_dir=gs://kubeflow-examples/t2t-code-search/20180801/data
            - --output_dir=gs://kubeflow-examples/t2t-code-search/20180801/output
            - --train_steps=100
            - --schedule=run_std_server
            - --ps_job=/job:ps
            env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /secret/gcp-credentials/key.json
            image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
            name: PS
            volumeMounts:
            - mountPath: /secret/gcp-credentials
              name: gcp-credentials
          imagePullSecrets:
          - name: gcp-registry-credentials
          volumes:
          - name: gcp-credentials
            secret:
              secretName: gcp-credentials
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - args:
            - /usr/local/sbin/t2t-entrypoint
            - t2t-trainer
            - --generate_data
            - --problem=github_function_docstring
            - --model=similarity_transformer
            - --hparams_set=transformer_tiny
            - --data_dir=gs://kubeflow-examples/t2t-code-search/20180801/data
            - --output_dir=gs://kubeflow-examples/t2t-code-search/20180801/output
            - --train_steps=100
            - --schedule=train
            - --ps_gpu=0
            - --worker_gpu=1
            - --worker_replicas=2
            - --ps_replicas=1
            - --eval_steps=10
            - --worker_job=/job:worker
            env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /secret/gcp-credentials/key.json
            image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
            name: WORKER
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - mountPath: /secret/gcp-credentials
              name: gcp-credentials
          imagePullSecrets:
          - name: gcp-registry-credentials
          volumes:
          - name: gcp-credentials
            secret:
              secretName: gcp-credentials
Logs

time="2018-08-02T00:31:38Z" level=info msg="Updating tfjob: t2t-code-search-trainer" filename="controller.v2/controller_tfjob.go:54"
time="2018-08-02T00:31:38Z" level=info msg="Reconcile TFJobs t2t-code-search-trainer" filename="controller.v2/controller.go:379"
time="2018-08-02T00:31:38Z" level=info msg="Need to create new pod: ps-0" filename="controller.v2/controller_pod.go:69" job=kubeflow/t2t-code-search-trainer replica-type=ps uid=9f5d4c8f-95e8-11e8-a7df-42010a80014e
time="2018-08-02T00:31:38Z" level=info msg="reconcilePods error Failed to found the port" filename="controller.v2/controller.go:399"
time="2018-08-02T00:31:38Z" level=info msg="Finished syncing tfjob \"kubeflow/t2t-code-search-trainer\" (883.508µs)" filename="controller.v2/controller.go:340"
E0802 00:31:38.332952 1 controller.go:318] Error syncing tfjob: Failed to found the port
The name of the container should be tensorflow (the spec above names the containers PS and WORKER).
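Applied to the spec in this issue, that suggestion would look roughly like the fragment below. This is only a sketch of the PS replica with the container `name` changed; the Worker replica would presumably need the same rename. It assumes, as the comment and the "Failed to found the port" log line suggest, that the operator looks for a container named `tensorflow`. Every other field is copied from the original spec, and the unchanged `args`, `env`, and `volumeMounts` entries are left out for brevity.

```yaml
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          containers:
          - image: gcr.io/kubeflow-dev/code-search:v20180801-784b560-gpu
            # Renamed from "PS"; assumption: the operator expects this exact name.
            name: tensorflow
            # args, env and volumeMounts stay as in the original spec.
          imagePullSecrets:
          - name: gcp-registry-credentials
```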
Thank you! Closing this and redirecting any future discussion back to #563.