
Gaps in test coverage #280

Closed
jlewi opened this issue Jan 10, 2018 · 9 comments
jlewi commented Jan 10, 2018

Our tests aren't catching certain bugs.

For example, #274. In that case the controller was crash looping because the spec was incorrect:

logging before flag.Parse: I0110 08:10:47.282907 1 controller.go:78] Setting up event handlers CustomResourceDefinition.apiextensions.k8s.io "tfjobs.tensorflow.org" is invalid: spec.names.singular: Invalid value: "TfJob": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name', or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?')
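The error above comes from the Kubernetes DNS-1035 label rule quoted in the message. As a minimal sketch (not part of the project's code), candidate CRD names can be checked against that same regex before deploying:

```python
import re

# Regex quoted in the error message: a DNS-1035 label must start with a
# letter, contain only lowercase alphanumerics or '-', and end with an
# alphanumeric character.
DNS1035_RE = re.compile(r"^[a-z]([-a-z0-9]*[a-z0-9])?$")

def is_valid_dns1035(name):
    """Return True if name is a valid DNS-1035 label."""
    return bool(DNS1035_RE.match(name))

print(is_valid_dns1035("TfJob"))  # -> False: the rejected singular name
print(is_valid_dns1035("tfjob"))  # -> True: lowercasing satisfies the rule
```

Lowercasing `spec.names.singular` to `tfjob` would have satisfied the validation that rejected `TfJob`.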
jlewi commented Jan 11, 2018

#278 fixed a bug that should have been caught by our tests.

The previous commit was adc8ab956605e8167b3597dd63d338b3b08a2fe0

Here's the post submit
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/tensorflow_k8s/tf-k8s-postsubmit/98/

Dag Run ID: 2018-01-10T15:10:22

Here are the logs from the run_tests step. It doesn't look like the actual tests ran; I don't see any output indicating commands being run (e.g. helm test), although this could be an issue with stdout/stderr not being properly captured.

Task Instance: run_tests 2018-01-10 15:10:22
*** Reading local log.
[2018-01-10 15:21:38,339] {cli.py:379} INFO - Running on host airflow-2497854871-0dr9k
[2018-01-10 15:21:38,383] {models.py:1173} INFO - Dependencies all met for <TaskInstance: tf_k8s_tests.run_tests 2018-01-10 15:10:22 [queued]>
[2018-01-10 15:21:38,397] {models.py:1173} INFO - Dependencies all met for <TaskInstance: tf_k8s_tests.run_tests 2018-01-10 15:10:22 [queued]>
[2018-01-10 15:21:38,397] {models.py:1383} INFO - 
--------------------------------------------------------------------------------
Starting attempt 1 of 2
--------------------------------------------------------------------------------

[2018-01-10 15:21:38,410] {models.py:1404} INFO - Executing <Task(PythonOperator): run_tests> on 2018-01-10 15:10:22
[2018-01-10 15:21:38,410] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run tf_k8s_tests run_tests 2018-01-10T15:10:22 --job_id 3420 --raw -sd DAGS_FOLDER/e2e_tests_dag.py']
[2018-01-10 15:21:40,358] {base_task_runner.py:98} INFO - Subtask: [2018-01-10 15:21:40,357] {__init__.py:45} INFO - Using executor LocalExecutor
[2018-01-10 15:21:40,430] {base_task_runner.py:98} INFO - Subtask: [2018-01-10 15:21:40,429] {models.py:189} INFO - Filling up the DagBag from /usr/local/airflow/dags/e2e_tests_dag.py
[2018-01-10 15:21:40,433] {base_task_runner.py:98} INFO - Subtask: /usr/local/lib/python2.7/site-packages/apache_airflow-1.9.0rc2-py2.7.egg/airflow/utils/helpers.py:351: DeprecationWarning: Importing PythonOperator directly from <module 'airflow.operators' from '/usr/local/lib/python2.7/site-packages/apache_airflow-1.9.0rc2-py2.7.egg/airflow/operators/__init__.pyc'> has been deprecated. Please import from '<module 'airflow.operators' from '/usr/local/lib/python2.7/site-packages/apache_airflow-1.9.0rc2-py2.7.egg/airflow/operators/__init__.pyc'>.[operator_module]' instead. Support for direct imports will be dropped entirely in Airflow 2.0.
[2018-01-10 15:21:40,434] {base_task_runner.py:98} INFO - Subtask:   DeprecationWarning)
[2018-01-10 15:21:40,731] {base_task_runner.py:98} INFO - Subtask: [2018-01-10 15:21:40,730] {e2e_tests_dag.py:289} INFO - conf={u'PULL_NUMBER': u'', u'PULL_BASE_SHA': u'adc8ab956605e8167b3597dd63d338b3b08a2fe0', u'ARTIFACTS_PATH': u'gs://kubernetes-jenkins/logs/tensorflow_k8s/tf-k8s-postsubmit/98/artifacts', u'PULL_PULL_SHA': u''}
[2018-01-10 15:21:40,732] {base_task_runner.py:98} INFO - Subtask: [2018-01-10 15:21:40,731] {e2e_tests_dag.py:292} INFO - artifacts_path gs://kubernetes-jenkins/logs/tensorflow_k8s/tf-k8s-postsubmit/98/artifacts
[2018-01-10 15:21:40,733] {base_task_runner.py:98} INFO - Subtask: [2018-01-10 15:21:40,731] {e2e_tests_dag.py:294} INFO - junit_path gs://kubernetes-jenkins/logs/tensorflow_k8s/tf-k8s-postsubmit/98/artifacts/junit_e2e.xml
[2018-01-10 15:21:40,754] {base_task_runner.py:98} INFO - Subtask: [2018-01-10 15:21:40,754] {e2e_tests_dag.py:83} INFO - Removing /opt/tensorflow_k8s from PYTHONPATH
[2018-01-10 15:21:40,966] {base_task_runner.py:98} INFO - Subtask: [2018-01-10 15:21:40,966] {e2e_tests_dag.py:104} INFO - Using util located at /opt/tensorflow_k8s/py/util.pyc
[2018-01-10 15:21:40,967] {base_task_runner.py:98} INFO - Subtask: [2018-01-10 15:21:40,967] {util.py:36} INFO - Running: python -m py.deploy test --cluster=e2e-0110-1514-e43d --junit_path=gs://kubernetes-jenkins/logs/tensorflow_k8s/tf-k8s-postsubmit/98/artifacts/junit_e2e.xml --project=mlkube-testing 
[2018-01-10 15:21:40,968] {base_task_runner.py:98} INFO - Subtask: cwd=None
[2018-01-10 15:21:40,968] {base_task_runner.py:98} INFO - Subtask: [2018-01-10 15:21:40,968] {util.py:46} INFO - Running: Environment:
[2018-01-10 15:21:40,969] {base_task_runner.py:98} INFO - Subtask: AIRFLOW_PORT=tcp://10.63.240.90:80
[2018-01-10 15:21:40,969] {base_task_runner.py:98} INFO - Subtask: AIRFLOW_PORT_80_TCP=tcp://10.63.240.90:80
[2018-01-10 15:21:40,970] {base_task_runner.py:98} INFO - Subtask: AIRFLOW_PORT_80_TCP_ADDR=10.63.240.90
[2018-01-10 15:21:40,970] {base_task_runner.py:98} INFO - Subtask: AIRFLOW_PORT_80_TCP_PORT=80
[2018-01-10 15:21:40,970] {base_task_runner.py:98} INFO - Subtask: AIRFLOW_PORT_80_TCP_PROTO=tcp
[2018-01-10 15:21:40,971] {base_task_runner.py:98} INFO - Subtask: AIRFLOW_SERVICE_HOST=10.63.240.90
[2018-01-10 15:21:40,971] {base_task_runner.py:98} INFO - Subtask: AIRFLOW_SERVICE_PORT=80
[2018-01-10 15:21:40,971] {base_task_runner.py:98} INFO - Subtask: CLOUDSDK_CORE_DISABLE_PROMPTS=1
[2018-01-10 15:21:40,972] {base_task_runner.py:98} INFO - Subtask: DEBIAN_FRONTEND=noninteractive
[2018-01-10 15:21:40,972] {base_task_runner.py:98} INFO - Subtask: DECK_PORT=tcp://10.63.247.198:80
[2018-01-10 15:21:40,972] {base_task_runner.py:98} INFO - Subtask: DECK_PORT_80_TCP=tcp://10.63.247.198:80
[2018-01-10 15:21:40,973] {base_task_runner.py:98} INFO - Subtask: DECK_PORT_80_TCP_ADDR=10.63.247.198
[2018-01-10 15:21:40,973] {base_task_runner.py:98} INFO - Subtask: DECK_PORT_80_TCP_PORT=80
[2018-01-10 15:21:40,973] {base_task_runner.py:98} INFO - Subtask: DECK_PORT_80_TCP_PROTO=tcp
[2018-01-10 15:21:40,974] {base_task_runner.py:98} INFO - Subtask: DECK_SERVICE_HOST=10.63.247.198
[2018-01-10 15:21:40,974] {base_task_runner.py:98} INFO - Subtask: DECK_SERVICE_PORT=80
[2018-01-10 15:21:40,974] {base_task_runner.py:98} INFO - Subtask: GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/google/key.json
[2018-01-10 15:21:40,975] {base_task_runner.py:98} INFO - Subtask: GPG_KEY=C01E1CAD5EA2C4F0B8E3571504C367C218ADD4FF
[2018-01-10 15:21:40,975] {base_task_runner.py:98} INFO - Subtask: HOME=/usr/local/airflow
[2018-01-10 15:21:40,975] {base_task_runner.py:98} INFO - Subtask: HOOK_PORT=tcp://10.63.245.64:8888
[2018-01-10 15:21:40,976] {base_task_runner.py:98} INFO - Subtask: HOOK_PORT_8888_TCP=tcp://10.63.245.64:8888
[2018-01-10 15:21:40,976] {base_task_runner.py:98} INFO - Subtask: HOOK_PORT_8888_TCP_ADDR=10.63.245.64
[2018-01-10 15:21:40,976] {base_task_runner.py:98} INFO - Subtask: HOOK_PORT_8888_TCP_PORT=8888
[2018-01-10 15:21:40,977] {base_task_runner.py:98} INFO - Subtask: HOOK_PORT_8888_TCP_PROTO=tcp
[2018-01-10 15:21:40,977] {base_task_runner.py:98} INFO - Subtask: HOOK_SERVICE_HOST=10.63.245.64
[2018-01-10 15:21:40,978] {base_task_runner.py:98} INFO - Subtask: HOOK_SERVICE_PORT=8888
[2018-01-10 15:21:40,978] {base_task_runner.py:98} INFO - Subtask: HOSTNAME=airflow-2497854871-0dr9k
[2018-01-10 15:21:40,978] {base_task_runner.py:98} INFO - Subtask: KUBERNETES_PORT=tcp://10.63.240.1:443
[2018-01-10 15:21:40,979] {base_task_runner.py:98} INFO - Subtask: KUBERNETES_PORT_443_TCP=tcp://10.63.240.1:443
[2018-01-10 15:21:40,979] {base_task_runner.py:98} INFO - Subtask: KUBERNETES_PORT_443_TCP_ADDR=10.63.240.1
[2018-01-10 15:21:40,979] {base_task_runner.py:98} INFO - Subtask: KUBERNETES_PORT_443_TCP_PORT=443
[2018-01-10 15:21:40,980] {base_task_runner.py:98} INFO - Subtask: KUBERNETES_PORT_443_TCP_PROTO=tcp
[2018-01-10 15:21:40,980] {base_task_runner.py:98} INFO - Subtask: KUBERNETES_SERVICE_HOST=10.63.240.1
[2018-01-10 15:21:40,980] {base_task_runner.py:98} INFO - Subtask: KUBERNETES_SERVICE_PORT=443
[2018-01-10 15:21:40,981] {base_task_runner.py:98} INFO - Subtask: KUBERNETES_SERVICE_PORT_HTTPS=443
[2018-01-10 15:21:40,981] {base_task_runner.py:98} INFO - Subtask: LANG=en_US.UTF-8
[2018-01-10 15:21:40,981] {base_task_runner.py:98} INFO - Subtask: LANGUAGE=en_US.UTF-8
[2018-01-10 15:21:40,982] {base_task_runner.py:98} INFO - Subtask: LC_ALL=en_US.UTF-8
[2018-01-10 15:21:40,982] {base_task_runner.py:98} INFO - Subtask: LC_CTYPE=en_US.UTF-8
[2018-01-10 15:21:40,982] {base_task_runner.py:98} INFO - Subtask: LC_MESSAGES=en_US.UTF-8
[2018-01-10 15:21:40,983] {base_task_runner.py:98} INFO - Subtask: PATH=/google-cloud-sdk/bin:/workspace:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/go/bin
[2018-01-10 15:21:40,983] {base_task_runner.py:98} INFO - Subtask: POSTGRES_HOST=localhost
[2018-01-10 15:21:40,983] {base_task_runner.py:98} INFO - Subtask: PWD=/usr/local/airflow
[2018-01-10 15:21:40,984] {base_task_runner.py:98} INFO - Subtask: PYTHONPATH=:/var/lib/data/runs/tf_k8s_tests/2018-01-10T15_10_22/tensorflow_k8s
[2018-01-10 15:21:40,984] {base_task_runner.py:98} INFO - Subtask: PYTHON_PIP_VERSION=9.0.1
[2018-01-10 15:21:40,984] {base_task_runner.py:98} INFO - Subtask: PYTHON_VERSION=2.7.14
[2018-01-10 15:21:40,985] {base_task_runner.py:98} INFO - Subtask: SHLVL=2
[2018-01-10 15:21:40,985] {base_task_runner.py:98} INFO - Subtask: SRC_DIR=/var/lib/data/runs
[2018-01-10 15:21:40,985] {base_task_runner.py:98} INFO - Subtask: TERM=linux
[2018-01-10 15:21:40,986] {base_task_runner.py:98} INFO - Subtask: _=/usr/local/bin/airflow
[2018-01-10 15:21:47,357] {base_task_runner.py:98} INFO - Subtask: [2018-01-10 15:21:47,354] {python_operator.py:90} INFO - Done. Returned value was: None

jlewi commented Jan 11, 2018

Looks like a red herring; I think it's just an issue of not capturing logs from the subprocess.
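If the missing output is indeed the subprocess not being captured, one fix is to stream the child's combined stdout/stderr through the parent's logger so it lands in the Airflow task log. A sketch (the helper name is illustrative, not the project's actual `util.run`):

```python
import logging
import subprocess

def run_and_log(command, cwd=None):
    """Run a command and forward its combined stdout/stderr to the logger,
    so the Airflow task log shows the subprocess output as it is produced."""
    logging.info("Running: %s", " ".join(command))
    proc = subprocess.Popen(
        command, cwd=cwd,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # Stream line by line instead of waiting for the process to finish.
    for line in iter(proc.stdout.readline, b""):
        logging.info(line.decode("utf-8").rstrip())
    proc.wait()
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, command)
```

Merging stderr into stdout keeps the interleaving of the child's output intact in the log.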

jlewi commented Jan 11, 2018

So the junit file does indicate a test failure:

https://storage.googleapis.com/kubernetes-jenkins/logs/tensorflow_k8s/tf-k8s-postsubmit/98/artifacts/junit_e2e.xml

<testsuite failures="1" tests="1" time="3.50871610641"><testcase classname="GKE" failure="helm test failed;&#10;RUNNING: tf-job-tfjob-test-7b7pu2&#10;FAILED: tf-job-tfjob-test-7b7pu2, run `kubectl logs tf-job-tfjob-test-7b7pu2 --namespace default` for more info&#10;Error: 1 test(s) failed&#10;" name="e2e-test" time="3.50871610641" /></testsuite>

But gubernator is reporting all tests as having passed
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/tensorflow_k8s/tf-k8s-postsubmit/98/?log#log

jlewi commented Jan 11, 2018

It looks like we might need to have individual test cases in the XML, e.g.:
https://storage.googleapis.com/kubernetes-jenkins/logs/spark-periodic-default-gke/64/artifacts/junit_runner.xml

<testsuite failures="1" tests="14" time="2788.034276">
<testcase classname="e2e.go" name="Extract" time="45.502834127"/>
<testcase classname="e2e.go" name="TearDown Previous" time="1.089621431"/>
<testcase classname="e2e.go" name="listResources Before" time="7.196779907"/>
<testcase classname="e2e.go" name="Up" time="308.262075084"/>
<testcase classname="e2e.go" name="Check APIReachability" time="1.1594926349999999"/>
<testcase classname="e2e.go" name="list nodes" time="0.242748208"/>
<testcase classname="e2e.go" name="listResources Up" time="8.15109631"/>
<testcase classname="e2e.go" name="spark-integration" time="1410.056494722"/>
<testcase classname="e2e.go" name="DumpClusterLogs" time="552.375162991"/>
<testcase classname="e2e.go" name="listResources Down" time="8.096108514"/>
<testcase classname="e2e.go" name="TearDown" time="404.551712359">
<failure>
error deleting firewall: error during gcloud compute firewall-rules delete -q e2e-ports-f7875a4a --project=k8s-jkns-gci-gke-alpha-1-4: exit status 1
</failure>
</testcase>
<testcase classname="e2e.go" name="listResources After" time="7.421148199"/>
<testcase classname="e2e.go" name="diffResources" time="0.001849105"/>
<testcase classname="e2e.go" name="Deferred TearDown" time="1.097809187"/>
<testcase classname="e2e.go" name="Timeout" time="7200"/>
</testsuite>
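To mirror that layout, the e2e script could emit one testcase element per step instead of a single aggregate case. A sketch using only the standard library (function and step names are illustrative):

```python
import xml.etree.ElementTree as ET

def build_junit(steps):
    """Build a junit XML string from steps: a list of
    (name, time_seconds, failure_message_or_None) tuples."""
    failures = sum(1 for _, _, f in steps if f)
    suite = ET.Element("testsuite", {
        "failures": str(failures),
        "tests": str(len(steps)),
        "time": str(sum(t for _, t, _ in steps)),
    })
    for name, seconds, failure in steps:
        case = ET.SubElement(suite, "testcase", {
            "classname": "e2e.go", "name": name, "time": str(seconds)})
        if failure:
            # A <failure> child marks this individual step as failed.
            ET.SubElement(case, "failure").text = failure
    return ET.tostring(suite).decode("utf-8")

xml = build_junit([("Up", 30.0, None), ("helm-test", 3.5, "1 test(s) failed")])
print(xml)
```

With one testcase per step, consumers of the XML can see which phase failed rather than a single pass/fail for the whole run.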

gaocegege commented Jan 11, 2018

FYI, #278's test still failed: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/tensorflow_k8s/278/tf-k8s-presubmit/340/artifacts/junit_e2e.xml

<testsuite failures="1" tests="1" time="3.81183314323"><testcase classname="GKE" failure="helm test failed;&#10;RUNNING: tf-job-tfjob-test-gtttu5&#10;FAILED: tf-job-tfjob-test-gtttu5, run `kubectl logs tf-job-tfjob-test-gtttu5 --namespace default` for more info&#10;Error: 1 test(s) failed&#10;" name="e2e-test" time="3.81183314323" /></testsuite>

@gaocegege (Member)
There are so many errors on master now; could we cut a release from the prior commit that works?

jlewi commented Jan 11, 2018

@gaocegege The test isn't fixed yet.

That's a good idea. I think the last known good release is:

gs://tf-on-k8s-dogfood-releases/v20171223-37af20d/

jlewi commented Jan 12, 2018

Prow still reports success even if some of the junit files indicate failures. It looks like gubernator's status for the job is based on the exit code of the prow job, so we need to update the E2E tests to set the exit code based on whether any tests failed.
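One way to propagate that status: at the end of the run, scan the junit files the tests wrote and exit non-zero if any of them record failures. A sketch (the glob pattern and integration point are illustrative):

```python
import glob
import xml.etree.ElementTree as ET

def any_failures(junit_glob):
    """Return True if any junit XML file matching the glob records a failure."""
    for path in glob.glob(junit_glob):
        root = ET.parse(path).getroot()
        # Handle both a bare <testsuite> root and a <testsuites> wrapper.
        suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
        for suite in suites:
            if int(suite.get("failures", "0")) > 0:
                return True
            # Some writers only mark failures on the individual testcases.
            if any(case.find("failure") is not None
                   for case in suite.iter("testcase")):
                return True
    return False

# At the end of the test runner's main():
#     sys.exit(1 if any_failures("artifacts/junit_*.xml") else 0)
```

Since gubernator keys off the prow job's exit code, this makes the reported status agree with the junit contents.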

jlewi commented Jan 15, 2018

Opened #315
Closing this in favor of #315

@jlewi jlewi closed this as completed Jan 15, 2018