GPU tests failing; ks env doesn't exist #640

Closed
jlewi opened this issue Jun 12, 2018 · 0 comments

jlewi commented Jun 12, 2018

The GPU tests have started failing with the following error:

INFO|2018-06-12T11:01:18|py/util.py|44| Running: ks param set --env=test-env-5e09 gpu_tfjob name gpu-tfjob
cwd=/mnt/test-data-volume/kubeflow-tf-operator-presubmit-tfjob-e2e-638-e47f9f6-584-f192/src/kubeflow/tf-operator/test/workflows
INFO|2018-06-12T11:01:18|py/util.py|83| Subprocess output:
level=error msg="environment \"test-env-5e09\" does not exist"

I suspect a race condition, because the non-GPU and GPU tests are both trying to add the ks environment at the same time.

This appears to have started when we upgraded to ksonnet 0.11 in the testing container.
kubeflow/kubeflow#727

My guess is that the behavior of ks might have changed in that release.

Adding retries might help, but it looks like we will also get an error if the environment already exists, so we will need to handle that case as well.
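
A rough sketch of what retrying while tolerating an existing environment could look like. This is not the actual fix: the helper names (`run_command`, `add_env_with_retries`), the retry count, the delay, and the "already exists" substring matched against the ks output are all assumptions made for illustration.

```python
import subprocess
import time

# Assumption: ks reports an existing environment with text containing this
# substring. The exact wording should be checked against real ks output.
ALREADY_EXISTS = "already exists"


def run_command(args, cwd=None):
    """Run a command and return (exit code, combined stdout/stderr)."""
    proc = subprocess.run(
        args,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def add_env_with_retries(env, cwd=None, attempts=3, delay_seconds=1.0,
                         run=run_command):
    """Try `ks env add`, tolerating an environment added by a concurrent test.

    Returns "created" if this call added the environment, or "exists" if
    ks reported that it was already there. Raises RuntimeError if every
    attempt fails for some other reason.
    """
    output = ""
    for attempt in range(1, attempts + 1):
        code, output = run(["ks", "env", "add", env], cwd)
        if code == 0:
            return "created"
        if ALREADY_EXISTS in output:
            # Another test won the race; keep going instead of failing.
            return "exists"
        if attempt < attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(
        "ks env add failed after %d attempts: %s" % (attempts, output))
```

The `run` parameter is only there so the retry logic can be exercised without a ks binary; in the test workflow it would be left at its default.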

/assign @jlewi

jlewi added a commit to jlewi/k8s that referenced this issue Jun 12, 2018
* Add retries for ksonnet errors because it looks like with 0.11 we start
  having problems because GPU and non GPU tests both try to add the environment

* If the ksonnet environment already exists this will cause an error;
  we should keep going.

Fix kubeflow#640
k8s-ci-robot pushed a commit that referenced this issue Jun 12, 2018
* Add proper error handling for deploying the tests.

* Add retries for ksonnet errors because it looks like with 0.11 we start
  having problems because GPU and non GPU tests both try to add the environment

* If the ksonnet environment already exists this will cause an error;
  we should keep going.

Fix #640

* Add retries to test_runner
* Fix lint

* Fix lint.

* Remove YAML files.
yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018