Modify presubmits to support testing with v1alpha2 #632

jlewi · 2018-06-11T21:29:20Z

Changes to support v1alpha2 testing in presubmits.

* The tests are currently disabled because they aren't passing yet: termination policy isn't handled correctly (TFJob not marked as success when master exits but not workers #634).
* Changed the v1alpha2 test to use the same smoke test as v1alpha1 instead of mnist.
  * mnist was causing problems because of issues downloading the data; see Presubmit failures; Timeout waiting for TFJob v1alpha2 job kubeflow/kubeflow#974.
  * We want a simpler test that allows for more direct testing of the distributed communication pattern.
  * mnist is also expensive in that it tries to download data.
* Add a parameter tfJobVersion to the deploy script so we can control whether we deploy v1alpha1 or v1alpha2.
* Parameterize the E2E test workflow by the TFJob version we want to run.
* Update test-app: we need to pull in a version of the app which has the TFJobVersion flag.
* Create a script to regenerate the test-app for future use.

Related to #589
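
As a rough illustration of selecting the TFJob version at deploy time, here is a minimal Python sketch. It is not the actual deploy script: the `--tfjob_version` flag, the function names, and the example environment name are hypothetical. Only the `ks param set --env=... core <param> <value>` command shape and the `tfJobVersion` parameter name come from this thread.

```python
import argparse

# The two TFJob versions discussed in this PR.
SUPPORTED_TFJOB_VERSIONS = ("v1alpha1", "v1alpha2")


def build_arg_parser():
    """Build a parser for a hypothetical deploy helper."""
    parser = argparse.ArgumentParser(description="Hypothetical deploy helper.")
    parser.add_argument("--env", required=True, help="ksonnet environment name")
    parser.add_argument(
        "--tfjob_version",  # hypothetical flag name
        default="v1alpha1",
        choices=SUPPORTED_TFJOB_VERSIONS,
        help="Which TFJob version to deploy.",
    )
    return parser


def tfjob_version_command(env, tfjob_version):
    """Return the ks command (as an argv list) that sets core.tfJobVersion."""
    if tfjob_version not in SUPPORTED_TFJOB_VERSIONS:
        raise ValueError("Unsupported TFJob version: %r" % (tfjob_version,))
    return ["ks", "param", "set", "--env=" + env, "core", "tfJobVersion",
            tfjob_version]


if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    # Print the command instead of running it; this is only a sketch.
    print(" ".join(tfjob_version_command(args.env, args.tfjob_version)))
```

Returning the command as a list rather than running it keeps the sketch easy to test; a real script would hand the list to its own command runner.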

coveralls · 2018-06-11T21:47:13Z

Coverage decreased (-0.9%) to 55.067% when pulling 3463fcc on jlewi:fix_tfjob into e164ba5 on kubeflow:master.

jlewi · 2018-06-12T05:46:05Z

pylint failures should be fixed by kubeflow/testing#156

The other error looks like a problem with ks being too old in the test image after we upgraded the test app

This should be fixed by kubeflow/testing#155

jlewi · 2018-06-12T05:46:34Z

/test all

jlewi · 2018-06-12T06:20:23Z

/test all

jlewi · 2018-06-12T21:14:51Z

Most recent failure:

 wait_for_deployment
   name, namespace))
kubeflow.testing.util.TimeoutError: Timeout waiting for deployment tf-job-operator in namespace kubeflow
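
For context, an error like this typically comes from a wait loop that polls for readiness and gives up at a deadline. The sketch below shows that generic pattern only. It is not the `kubeflow.testing.util` implementation, and `wait_until`, its parameters, and `WaitTimeoutError` are hypothetical names.

```python
import time


class WaitTimeoutError(Exception):
    """Raised when the condition does not become true before the deadline."""


def wait_until(is_ready, timeout_seconds, polling_interval_seconds=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or timeout_seconds elapse.

    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_ready():
            return
        if clock() >= deadline:
            raise WaitTimeoutError(
                "Timeout after %s seconds waiting for condition" % timeout_seconds)
        sleep(polling_interval_seconds)
```

In the failure here the deployment could never become ready because its image reference was invalid, so a loop of this kind can only end in a timeout.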

jlewi · 2018-06-12T21:21:11Z

Looking at the event logs, I see this error:

{
 insertId:  "1t7pnmeg2uox2aj"  
 jsonPayload: {
  apiVersion:  "v1"   
  involvedObject: {…}   
  kind:  "Event"   
  message:  "Failed to apply default image tag "gcr.io/kubeflow-ci/tf_operator:": couldn't parse image reference "gcr.io/kubeflow-ci/tf_operator:": invalid reference format"   
  metadata: {…}   
  reason:  "InspectFailed"   
  source: {…}   
  type:  "Warning"   
 }
 logName:  "projects/kubeflow-ci/logs/events"  
 receiveTimestamp:  "2018-06-12T20:02:03.841513544Z"  
 resource: {…}  
 severity:  "WARNING"  
 timestamp:  "2018-06-12T20:01:59Z"  
}

jlewi · 2018-06-12T21:24:17Z

I exec'd into the debug worker and checked the ksonnet test app:

ks param list --env=e2e-0612-1959-3cf6
COMPONENT PARAM                   VALUE
========= =====                   =====
core      cloud                   "null"
core      disks                   "null"
core      jupyterHubAuthenticator "null"
core      jupyterHubImage         "gcr.io/kubeflow/jupyterhub-k8s:v20180531-3bb991b1"
core      jupyterHubServiceType   "ClusterIP"
core      jupyterNotebookPVCMount "null"
core      jupyterNotebookRegistry "gcr.io"
core      jupyterNotebookRepoName "kubeflow-images-public"
core      name                    "kubeflow-core"
core      namespace               "kubeflow"
core      reportUsage             "false"
core      tfAmbassadorImage       "quay.io/datawire/ambassador:0.30.1"
core      tfAmbassadorServiceType "ClusterIP"
core      tfDefaultImage          "null"
core      tfJobImage              "gcr.io/kubeflow-ci/tf_operator:"
core      tfJobUiServiceType      "ClusterIP"
core      tfJobVersion            "v1alpha1"
core      tfStatsdImage           "quay.io/datawire/statsd:0.30.1"
core      usageId                 "unknown_cluster"

So it looks like tfJobImage wasn't set correctly: the tag after the colon is empty.

The Argo logs also show the image being set without a tag:

INFO:root:Running: ks param set --env=e2e-0612-1959-3cf6 core tfJobImage gcr.io/kubeflow-ci/tf_operator:
cwd=/mnt/test-data-volume/kubeflow-tf-operator-presubmit-tfjob-e2e-632-54931b6-595-0276/src/kubeflow/tf-operator/test/test-app
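
A cheap guard against this class of failure is to reject an image reference with an empty tag before it is passed to `ks param set`. The following is a minimal sketch under that assumption. `has_explicit_tag` and `checked_image` are hypothetical helpers, and the check is a simple string test, not a full image reference parser.

```python
def has_explicit_tag(image):
    """Return True if the image reference ends in a non-empty tag."""
    # Only inspect the last path segment so that a registry port
    # (for example host:5000/name) is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    name, separator, tag = last_segment.partition(":")
    return bool(name) and separator == ":" and bool(tag)


def checked_image(image):
    """Return image unchanged, or raise if its tag is missing or empty."""
    if not has_explicit_tag(image):
        raise ValueError("Image reference has no tag: %r" % (image,))
    return image
```

With this check, the value seen in the logs, `gcr.io/kubeflow-ci/tf_operator:`, would fail fast in the test harness instead of surfacing later as a deployment timeout.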

jlewi · 2018-06-12T21:31:25Z

My suspicion is that when we pushed a new testing worker image, we also pushed some updates to run_e2e_workflow.py, and that broke things.

jlewi · 2018-06-12T21:37:22Z

If params.versionTag in the workflow isn't set, we should use the name:
https://github.com/kubeflow/tf-operator/blob/e164ba54c1ffdd750f059711beb65c9a5936c684/test/workflows/components/workflows.libsonnet#L62

I changed versionTag from null to "" (the empty string). That's the problem: the fallback to the name also needs to handle the empty string. I made this change because ks 0.11 was having some problems with null.
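
The intended fallback can be sketched as follows. The real logic lives in the linked workflows.libsonnet (jsonnet); this Python version only mirrors the idea, and `resolve_version_tag` and the example names are hypothetical.

```python
def resolve_version_tag(version_tag, name):
    """Return version_tag, falling back to name when the tag is unset.

    Both None (jsonnet null) and the empty string count as unset.
    """
    if version_tag is None or version_tag == "":
        return name
    return version_tag


if __name__ == "__main__":
    # Hypothetical workflow name, used only for illustration.
    tag = resolve_version_tag("", "example-workflow-name")
    print("gcr.io/kubeflow-ci/tf_operator:" + tag)
```

Treating both values as unset means the default can be either null or "" without producing an image reference that ends in a bare colon.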

jlewi · 2018-06-12T23:09:46Z

Tests are passing; this is ready for review.

/assign @ankushagarwal
/assign @gaocegege

ankushagarwal · 2018-06-12T23:26:08Z

/lgtm
/approve

gaocegege · 2018-06-13T02:35:14Z

/approve

k8s-ci-robot · 2018-06-13T02:35:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ankushagarwal, gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [gaocegege]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* Changes to support v1alpha2 testing in presubmits.
* The tests are currently disabled because they aren't passing yet because termination policy isn't handled correctly (kubeflow#634)
* Changed the v1alpha2 test to use the same smoke test as used by v1alpha1 as opposed to using mnist. mnist causing problems because of issues downloading the data see kubeflow/kubeflow#974
* We want a simpler test that allows for more direct testing of the distributed communication pattern
* Also mnist is expensive in that it tries to download data.
* Add a parameter tfJobVersion to the deploy script so we can control whether we deploy v1alpha1 or v1alpha2
* Parameterize the E2E test workflow by the TFJob version we want to run.
* update test-app - We need to pull in a version of the app which has the TFJobVersion flag.
* Create a script to regenerate the test-app for future use.

Related to kubeflow#589

* Fix versionTag logic; we need to allow for case where versionTag is an empty string.

k8s-ci-robot requested review from ankushagarwal and willb June 11, 2018 21:29

k8s-ci-robot added the size/XS label Jun 11, 2018

k8s-ci-robot added size/M and removed size/XS labels Jun 11, 2018

jlewi changed the title ~~Presumbit v1alpha2 test should use simple tf job not mnist.~~ [wip] Presumbit v1alpha2 test should use simple tf job not mnist. Jun 11, 2018

k8s-ci-robot added do-not-merge/work-in-progress size/L size/XXL and removed size/M size/L labels Jun 11, 2018

jlewi force-pushed the fix_tfjob branch from bb6d2b5 to 48fcb88 Compare June 12, 2018 04:08

jlewi changed the title ~~[wip] Presumbit v1alpha2 test should use simple tf job not mnist.~~ [wip] Modify presubmits to support testing with v1alpha2 Jun 12, 2018

jlewi mentioned this pull request Jun 12, 2018

Upgrade to ks 0.11 in our tests. kubeflow/testing#155

Merged

jlewi force-pushed the fix_tfjob branch from 782e3a3 to 54931b6 Compare June 12, 2018 19:47

jlewi mentioned this pull request Jun 12, 2018

v1alpha2 E2E tests for termination policy #646

Merged

Fix versionTag logic; we need to allow for case where versionTag is an empty string. (3463fcc)

jlewi changed the title ~~[wip] Modify presubmits to support testing with v1alpha2~~ Modify presubmits to support testing with v1alpha2 Jun 12, 2018

k8s-ci-robot removed the do-not-merge/work-in-progress label Jun 12, 2018

k8s-ci-robot assigned ankushagarwal and gaocegege Jun 12, 2018

k8s-ci-robot added the lgtm label Jun 12, 2018

k8s-ci-robot added the approved label Jun 13, 2018

k8s-ci-robot merged commit f66047b into kubeflow:master Jun 13, 2018
