E2E tests leaking GKE clusters #80

Closed
jlewi opened this issue Oct 24, 2017 · 2 comments

jlewi commented Oct 24, 2017

E2E tests seem to be leaking GKE clusters

gcloud --project=mlkube-testing container clusters list
NAME            ZONE           MASTER_VERSION  MASTER_IP        MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
prow            us-central1-f  1.7.6-gke.1     35.202.163.166   n1-standard-4  1.7.6 *       1          RUNNING
v20171017-153b  us-central1-f  1.7.6-gke.1     35.202.214.30    n1-standard-8  1.7.6 *       1          STOPPING
v20171017-4e44  us-central1-f  1.7.6-gke.1     35.188.116.185   n1-standard-8  1.7.6 *       1          STOPPING
v20171017-bbc7  us-central1-f  1.7.6-gke.1     35.202.143.139   n1-standard-8  1.7.6 *       1          STOPPING
v20171017-efab  us-central1-f  1.7.6-gke.1     104.198.197.46   n1-standard-8  1.7.6 *       1          STOPPING
v20171024-083b  us-central1-f  1.7.6-gke.1     130.211.234.145  n1-standard-8  1.7.6 *       1          STOPPING
v20171024-1cfc  us-central1-f  1.7.6-gke.1     35.184.45.20     n1-standard-8  1.7.6 *       1          STOPPING

(Clusters are listed as STOPPING because I manually deleted them.)
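
In the meantime the leaked clusters can be garbage collected by hand. A rough sketch (a hypothetical helper, not part of the repo), assuming everything matching the v<date>-<hash> pattern above is a leaked test cluster; double-check the list first so prow doesn't get deleted:

import subprocess

def delete_leaked_clusters(project="mlkube-testing", zone="us-central1-f",
                           prefix="v2017"):
  """Delete GKE clusters left behind by the E2E runs."""
  names = subprocess.check_output([
      "gcloud", "--project=" + project, "container", "clusters", "list",
      "--zone=" + zone, "--format=value(name)"]).decode("utf-8").split()
  for name in names:
    if not name.startswith(prefix):
      continue  # skip long-lived clusters like "prow"
    subprocess.check_call([
        "gcloud", "--project=" + project, "container", "clusters",
        "delete", name, "--zone=" + zone, "--quiet"])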

jlewi commented Dec 25, 2017

gcloud --project=mlkube-testing container clusters list
NAME                ZONE           MASTER_VERSION                    MASTER_IP       MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
prow                us-central1-f  1.7.8-gke.0                       35.202.163.166  n1-standard-4  1.7.6 *       1          RUNNING
e2e-1222-2328-26f1  us-east1-d     1.8.1-gke.1 ALPHA (27 days left)  35.185.27.8     n1-standard-8  1.8.1-gke.1   1          RUNNING
e2e-1222-2332-5637  us-east1-d     1.8.1-gke.1 ALPHA (27 days left)  35.227.115.42   n1-standard-8  1.8.1-gke.1   1          RUNNING
e2e-1222-2336-d17d  us-east1-d     1.8.1-gke.1 ALPHA (27 days left)  35.185.105.93   n1-standard-8  1.8.1-gke.1   1          RUNNING
e2e-1223-0007-0cc6  us-east1-d     1.8.1-gke.1 ALPHA (27 days left)  35.196.75.120   n1-standard-8  1.8.1-gke.1   1          RUNNING
e2e-1223-0014-7a8e  us-east1-d     1.8.1-gke.1 ALPHA (27 days left)  35.196.88.129   n1-standard-8  1.8.1-gke.1   1          RUNNING
e2e-1225-0538-8810  us-east1-d     1.8.1-gke.1 ALPHA (29 days left)  35.196.245.103  n1-standard-8  1.8.1-gke.1   1          RUNNING
e2e-1225-0547-63be  us-east1-d     1.8.1-gke.1 ALPHA (29 days left)  35.196.206.104  n1-standard-8  1.8.1-gke.1   1          RUNNING
e2e-1225-0638-fcae  us-east1-d     1.8.1-gke.1 ALPHA (29 days left)  35.196.215.51   n1-standard-8  1.8.1-gke.1   1          RUNNING

jlewi commented Dec 25, 2017

Here's one failure mode I see:

DAG run 2017-12-25T05:33:44

setup_cluster runs. On attempt #1 the cluster is created and the Helm package for the operator is installed, but then reading the deployment's status while waiting for it to become ready fails:

[2017-12-25 05:42:29,585] {base_task_runner.py:98} INFO - Subtask: INFO:root:Creationg gs://kubernetes-jenkins/pr-logs/pull/tensorflow_k8s/243/tf-k8s-presubmit/273/artifacts/junit_setupcluster.xml
[2017-12-25 05:42:29,585] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2017-12-25 05:42:29,585] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
[2017-12-25 05:42:29,585] {base_task_runner.py:98} INFO - Subtask:     "__main__", fname, loader, pkg_name)
[2017-12-25 05:42:29,586] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
[2017-12-25 05:42:29,586] {base_task_runner.py:98} INFO - Subtask:     exec code in run_globals
[2017-12-25 05:42:29,586] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-25T05_33_44/tensorflow_k8s/py/deploy.py", line 210, in <module>
[2017-12-25 05:42:29,586] {base_task_runner.py:98} INFO - Subtask:     main()
[2017-12-25 05:42:29,586] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-25T05_33_44/tensorflow_k8s/py/deploy.py", line 207, in main
[2017-12-25 05:42:29,586] {base_task_runner.py:98} INFO - Subtask:     args.func(args)
[2017-12-25 05:42:29,587] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-25T05_33_44/tensorflow_k8s/py/deploy.py", line 89, in setup
[2017-12-25 05:42:29,587] {base_task_runner.py:98} INFO - Subtask:     util.wait_for_deployment(api_client, "default", "tf-job-operator")
[2017-12-25 05:42:29,587] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-25T05_33_44/tensorflow_k8s/py/util.py", line 268, in wait_for_deployment
[2017-12-25 05:42:29,587] {base_task_runner.py:98} INFO - Subtask:     deploy = ext_client.read_namespaced_deployment(name, namespace)
[2017-12-25 05:42:29,587] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 5538, in read_namespaced_deployment
[2017-12-25 05:42:29,588] {base_task_runner.py:98} INFO - Subtask:     (data) = self.read_namespaced_deployment_with_http_info(name, namespace, **kwargs)
[2017-12-25 05:42:29,588] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 5629, in read_namespaced_deployment_with_http_info
[2017-12-25 05:42:29,588] {base_task_runner.py:98} INFO - Subtask:     collection_formats=collection_formats)
[2017-12-25 05:42:29,588] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 321, in call_api
[2017-12-25 05:42:29,588] {base_task_runner.py:98} INFO - Subtask:     _return_http_data_only, collection_formats, _preload_content, _request_timeout)
[2017-12-25 05:42:29,588] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
[2017-12-25 05:42:29,589] {base_task_runner.py:98} INFO - Subtask:     _request_timeout=_request_timeout)
[2017-12-25 05:42:29,589] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 342, in request
[2017-12-25 05:42:29,589] {base_task_runner.py:98} INFO - Subtask:     headers=headers)
[2017-12-25 05:42:29,589] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 231, in GET
[2017-12-25 05:42:29,589] {base_task_runner.py:98} INFO - Subtask:     query_params=query_params)
[2017-12-25 05:42:29,590] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 222, in request
[2017-12-25 05:42:29,590] {base_task_runner.py:98} INFO - Subtask:     raise ApiException(http_resp=r)
[2017-12-25 05:42:29,590] {base_task_runner.py:98} INFO - Subtask: kubernetes.client.rest.ApiException: (404)
[2017-12-25 05:42:29,590] {base_task_runner.py:98} INFO - Subtask: Reason: Not Found
[2017-12-25 05:42:29,590] {base_task_runner.py:98} INFO - Subtask: HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 25 Dec 2017 05:42:29 GMT', 'Content-Length': '244', 'Content-Type': 'application/json'})
[2017-12-25 05:42:29,590] {base_task_runner.py:98} INFO - Subtask: HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.extensions \"tf-job-operator\" not found","reason":"NotFound","details":{"name":"tf-job-operator","group":"extensions","kind":"deployments"},"code":404}
[2017-12-25 05:42:29,591] {base_task_runner.py:98} INFO - Subtask: 
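
The 404 means wait_for_deployment gives up the moment the deployment isn't visible yet. A minimal sketch of a more tolerant poll loop, using the kubernetes client seen in the traceback; the timeout, sleep interval, and logging are my assumptions, and the real py/util.py may differ:

import logging
import time

from kubernetes import client
from kubernetes.client import rest

def wait_for_deployment(api_client, namespace, name, timeout_seconds=600):
  """Poll until the deployment is ready, tolerating transient 404s.

  A 404 right after `helm install` can simply mean the API server
  hasn't surfaced the deployment yet, so retry instead of failing.
  """
  ext_client = client.ExtensionsV1beta1Api(api_client)
  end_time = time.time() + timeout_seconds
  while time.time() < end_time:
    try:
      deploy = ext_client.read_namespaced_deployment(name, namespace)
      if deploy.status.ready_replicas:
        return deploy
    except rest.ApiException as e:
      if e.status != 404:
        raise
      logging.info("Deployment %s/%s not found yet; retrying.",
                   namespace, name)
    time.sleep(10)
  raise RuntimeError(
      "Timed out waiting for deployment {0}/{1}.".format(namespace, name))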

Airflow retries the setup_cluster task (attempt #2), and that attempt fails because the service account lacks the storage.objects.delete permission needed to overwrite the junit file that attempt #1 already uploaded to GCS.

[2017-12-25 05:51:56,955] {base_task_runner.py:98} INFO - Subtask: INFO:root:Creationg gs://kubernetes-jenkins/pr-logs/pull/tensorflow_k8s/243/tf-k8s-presubmit/273/artifacts/junit_setupcluster.xml
[2017-12-25 05:51:56,955] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2017-12-25 05:51:56,955] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
[2017-12-25 05:51:56,955] {base_task_runner.py:98} INFO - Subtask:     "__main__", fname, loader, pkg_name)
[2017-12-25 05:51:56,956] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
[2017-12-25 05:51:56,956] {base_task_runner.py:98} INFO - Subtask:     exec code in run_globals
[2017-12-25 05:51:56,956] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-25T05_33_44/tensorflow_k8s/py/deploy.py", line 210, in <module>
[2017-12-25 05:51:56,956] {base_task_runner.py:98} INFO - Subtask:     main()
[2017-12-25 05:51:56,957] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-25T05_33_44/tensorflow_k8s/py/deploy.py", line 207, in main
[2017-12-25 05:51:56,957] {base_task_runner.py:98} INFO - Subtask:     args.func(args)
[2017-12-25 05:51:56,957] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-25T05_33_44/tensorflow_k8s/py/deploy.py", line 98, in setup
[2017-12-25 05:51:56,957] {base_task_runner.py:98} INFO - Subtask:     test_util.create_junit_xml_file([t], args.junit_path, gcs_client)
[2017-12-25 05:51:56,957] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-25T05_33_44/tensorflow_k8s/py/test_util.py", line 58, in create_junit_xml_file
[2017-12-25 05:51:56,958] {base_task_runner.py:98} INFO - Subtask:     blob.upload_from_string(b.getvalue())
[2017-12-25 05:51:56,958] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 1028, in upload_from_string
[2017-12-25 05:51:56,958] {base_task_runner.py:98} INFO - Subtask:     content_type=content_type, client=client)
[2017-12-25 05:51:56,958] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 949, in upload_from_file
[2017-12-25 05:51:56,959] {base_task_runner.py:98} INFO - Subtask:     _raise_from_invalid_response(exc)
[2017-12-25 05:51:56,959] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 1735, in _raise_from_invalid_response
[2017-12-25 05:51:56,959] {base_task_runner.py:98} INFO - Subtask:     raise exceptions.from_http_response(error.response)
[2017-12-25 05:51:56,959] {base_task_runner.py:98} INFO - Subtask: google.api_core.exceptions.Forbidden: 403 POST https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?uploadType=multipart: airflow@mlkube-testing.iam.gserviceaccount.com does not have storage.objects.delete access to kubernetes-jenkins/pr-logs/pull/tensorflow_k8s/243/tf-k8s-presubmit/273/artifacts/junit_setupcluster.xml.
[2017-12-25 05:51:56,959] {base_task_runner.py:98} INFO - Subtask: 
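
Given that airflow@mlkube-testing can create objects but not delete (and therefore not overwrite) them, the upload needs to be safe to retry. A hedged sketch (upload_junit_once is a hypothetical helper, not the repo's test_util.create_junit_xml_file):

import logging

from google.cloud import storage

def upload_junit_once(gcs_client, bucket_name, object_path, xml_string):
  """Upload the junit XML only if it doesn't already exist.

  The test service account has storage.objects.create but not
  storage.objects.delete, so overwriting an object from a retried
  task would 403.
  """
  bucket = gcs_client.bucket(bucket_name)
  blob = bucket.blob(object_path)
  if blob.exists():
    logging.info("gs://%s/%s already exists; skipping upload.",
                 bucket_name, object_path)
    return
  blob.upload_from_string(xml_string, content_type="application/xml")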

Now teardown_cluster runs, and it fails with:

[2017-12-25 05:52:01,193] {base_task_runner.py:98} INFO - Subtask: [2017-12-25 05:52:01,192] {e2e_tests_dag.py:347} INFO - artifacts_path gs://kubernetes-jenkins/pr-logs/pull/tensorflow_k8s/243/tf-k8s-presubmit/273/artifacts
[2017-12-25 05:52:01,193] {base_task_runner.py:98} INFO - Subtask: [2017-12-25 05:52:01,193] {e2e_tests_dag.py:350} INFO - junit_path gs://kubernetes-jenkins/pr-logs/pull/tensorflow_k8s/243/tf-k8s-presubmit/273/artifacts/junit_teardown.xml
[2017-12-25 05:52:01,232] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2017-12-25 05:52:01,233] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/bin/airflow", line 4, in <module>
[2017-12-25 05:52:01,233] {base_task_runner.py:98} INFO - Subtask:     __import__('pkg_resources').run_script('apache-airflow==1.9.0rc2', 'airflow')
[2017-12-25 05:52:01,233] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 750, in run_script
[2017-12-25 05:52:01,234] {base_task_runner.py:98} INFO - Subtask:     self.require(requires)[0].run_script(script_name, ns)
[2017-12-25 05:52:01,235] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1527, in run_script
[2017-12-25 05:52:01,235] {base_task_runner.py:98} INFO - Subtask:     exec(code, namespace, namespace)
[2017-12-25 05:52:01,235] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/apache_airflow-1.9.0rc2-py2.7.egg/EGG-INFO/scripts/airflow", line 27, in <module>
[2017-12-25 05:52:01,236] {base_task_runner.py:98} INFO - Subtask:     args.func(args)
[2017-12-25 05:52:01,236] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/apache_airflow-1.9.0rc2-py2.7.egg/airflow/bin/cli.py", line 397, in run
[2017-12-25 05:52:01,236] {base_task_runner.py:98} INFO - Subtask:     pool=args.pool,
[2017-12-25 05:52:01,236] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/apache_airflow-1.9.0rc2-py2.7.egg/airflow/utils/db.py", line 50, in wrapper
[2017-12-25 05:52:01,237] {base_task_runner.py:98} INFO - Subtask:     result = func(*args, **kwargs)
[2017-12-25 05:52:01,237] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/apache_airflow-1.9.0rc2-py2.7.egg/airflow/models.py", line 1469, in _run_raw_task
[2017-12-25 05:52:01,237] {base_task_runner.py:98} INFO - Subtask:     result = task_copy.execute(context=context)
[2017-12-25 05:52:01,238] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/apache_airflow-1.9.0rc2-py2.7.egg/airflow/operators/python_operator.py", line 89, in execute
[2017-12-25 05:52:01,238] {base_task_runner.py:98} INFO - Subtask:     return_value = self.execute_callable()
[2017-12-25 05:52:01,238] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/apache_airflow-1.9.0rc2-py2.7.egg/airflow/operators/python_operator.py", line 94, in execute_callable
[2017-12-25 05:52:01,238] {base_task_runner.py:98} INFO - Subtask:     return self.python_callable(*self.op_args, **self.op_kwargs)
[2017-12-25 05:52:01,239] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/airflow/dags/e2e_tests_dag.py", line 354, in teardown_cluster
[2017-12-25 05:52:01,239] {base_task_runner.py:98} INFO - Subtask:     args.append("--cluster=" + cluster)
[2017-12-25 05:52:01,239] {base_task_runner.py:98} INFO - Subtask: TypeError: cannot concatenate 'str' and 'NoneType' objects

So teardown_cluster is failing because the cluster name it pulled was None.
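
Until the root cause is fixed, teardown_cluster could at least fail with an actionable message instead of a TypeError. A sketch assuming Airflow's PythonOperator with provide_context=True; the task id and XCom key are my assumptions:

def teardown_cluster(**kwargs):
  # The cluster name is handed between tasks via XCom; if setup_cluster
  # never pushed one, there is nothing we can reliably delete.
  cluster = kwargs["ti"].xcom_pull(task_ids="setup_cluster", key="cluster")
  if not cluster:
    raise ValueError(
        "setup_cluster did not push a cluster name; a GKE cluster may "
        "have leaked and needs manual cleanup.")
  args = ["--cluster=" + cluster]
  # ... invoke py/deploy.py teardown with args ...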

So there are a couple of problems in setup_cluster:

  • Each setup_cluster attempt uses a different cluster name
  • We push the cluster name only if setup_cluster succeeds

I think there is a relatively easy fix:

  • setup_cluster should first try pulling the cluster name to see whether it was set by a previous attempt
  • setup_cluster should push the cluster name before calling deploy, so the name is always available to teardown_cluster even if the test fails (a sketch follows below)
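
A minimal sketch of that fix in the DAG. The task id, XCom key, and the create_cluster_and_deploy_operator helper are hypothetical, and this assumes the XCom value pushed by a failed attempt is still visible to the retry:

import datetime
import uuid

def setup_cluster(**kwargs):
  ti = kwargs["ti"]
  # Reuse the name pushed by a previous attempt, if any, so a retry
  # doesn't create (and then orphan) a second cluster.
  cluster = ti.xcom_pull(task_ids="setup_cluster", key="cluster")
  if not cluster:
    now = datetime.datetime.now()
    cluster = "e2e-{0}-{1}".format(now.strftime("%m%d-%H%M"),
                                   uuid.uuid4().hex[0:4])
  # Push the name *before* deploying so that teardown_cluster always
  # sees it, even if the deployment below fails.
  ti.xcom_push(key="cluster", value=cluster)
  create_cluster_and_deploy_operator(cluster)  # hypothetical deploy call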

jlewi added a commit to jlewi/k8s that referenced this issue Dec 27, 2017
* setup_cluster needs to push the cluster name before we try to set up
  the cluster, so that the name is available to the teardown step even
  if setup_cluster fails.

* setup_cluster also needs to handle the case where setup_cluster might
  already have been attempted, in which case we should reuse that
  cluster.
jlewi added a commit that referenced this issue Dec 27, 2017
* setup_cluster needs to push the cluster name before we try to set up
  the cluster, so that the name is available to the teardown step even
  if setup_cluster fails.

* setup_cluster also needs to handle the case where setup_cluster might
  already have been attempted, in which case we should reuse that
  cluster.

* Fix #80