Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify presubmits to support testing with v1alpha2 #632

Merged
merged 2 commits into from
Jun 13, 2018

Conversation

jlewi
Copy link
Contributor

@jlewi jlewi commented Jun 11, 2018

Changes to support v1alpha2 testing in presubmits.

  • The tests are currently disabled because they aren't passing yet because
    termination policy isn't handled correctly (TFJob not marked as success when master exits but not workers #634)

  • Changed the v1alpha2 test to use the same smoke test as used by v1alpha1 as
    opposed to using mnist.
    mnist causing problems because of issues downloading the data
    see Presubmit failures; Timeout waiting for TFJob v1alpha2 job kubeflow#974

  • We want a simpler test that allows for more direct testing of the distributed
    communication pattern

  • Also mnist is expensive in that it tries to download data.

  • Add a parameter tfJobVersion to the deploy script so we can control
    whether we deploy v1alpha1 or v1alpha2

  • Parameterize the E2E test workflow by the TFJob version we want to run.

  • update test-app - We need to pull in a version of the app which
    has the TFJobVersion flag.

  • Create a script to regenerate the test-app for future use.

Related to #589


This change is Reviewable

@coveralls
Copy link

coveralls commented Jun 11, 2018

Coverage Status

Coverage decreased (-0.9%) to 55.067% when pulling 3463fcc on jlewi:fix_tfjob into e164ba5 on kubeflow:master.

@jlewi jlewi changed the title Presumbit v1alpha2 test should use simple tf job not mnist. [wip] Presumbit v1alpha2 test should use simple tf job not mnist. Jun 11, 2018
@jlewi jlewi changed the title [wip] Presumbit v1alpha2 test should use simple tf job not mnist. [wip] Modify presubmits to support testing with v1alpha2 Jun 12, 2018
@jlewi
Copy link
Contributor Author

jlewi commented Jun 12, 2018

pylint failures should be fixed by kubeflow/testing#156

The other error looks like a problem with ks being too old in the test image after we upgraded the test app

This should be fixed by kubeflow/testing#155

@jlewi
Copy link
Contributor Author

jlewi commented Jun 12, 2018

/test all

1 similar comment
@jlewi
Copy link
Contributor Author

jlewi commented Jun 12, 2018

/test all

* The tests are currently disabled because they aren't passing yet because
  termination policy isn't handled correctly (kubeflow#634)

* Changed the v1alpha2 test to use the same smoke test as used by v1alpha1 as
  opposed to using mnist.
  mnist causing problems because of issues downloading the data
  see kubeflow/kubeflow#974

* We want a simpler test that allows for more direct testing of the distributed
  communication pattern
* Also mnist is expensive in that it tries to download data.

* Add a parameter tfJobVersion to the deploy script so we can control
  whether we deploy v1alpha1 or v1alpha2

* Parameterize the E2E test workflow by the TFJob version we want to run.

* update test-app - We need to pull in a version of the app which
  has the TFJobVersion flag.

* Create a script to regenerate the test-app for future use.

Related to kubeflow#589
@jlewi
Copy link
Contributor Author

jlewi commented Jun 12, 2018

Most recent failure:

 wait_for_deployment
   name, namespace))
kubeflow.testing.util.TimeoutError: Timeout waiting for deployment tf-job-operator in namespace kubeflow

@jlewi
Copy link
Contributor Author

jlewi commented Jun 12, 2018

Looking at the event logs I see the error

{
 insertId:  "1t7pnmeg2uox2aj"  
 jsonPayload: {
  apiVersion:  "v1"   
  involvedObject: {…}   
  kind:  "Event"   
  message:  "Failed to apply default image tag "gcr.io/kubeflow-ci/tf_operator:": couldn't parse image reference "gcr.io/kubeflow-ci/tf_operator:": invalid reference format"   
  metadata: {…}   
  reason:  "InspectFailed"   
  source: {…}   
  type:  "Warning"   
 }
 logName:  "projects/kubeflow-ci/logs/events"  
 receiveTimestamp:  "2018-06-12T20:02:03.841513544Z"  
 resource: {…}  
 severity:  "WARNING"  
 timestamp:  "2018-06-12T20:01:59Z"  
}

@jlewi
Copy link
Contributor Author

jlewi commented Jun 12, 2018

Exec into debug worker and check the ksonnet test app

ks param list --env=e2e-0612-1959-3cf6
COMPONENT PARAM                   VALUE
========= =====                   =====
core      cloud                   "null"
core      disks                   "null"
core      jupyterHubAuthenticator "null"
core      jupyterHubImage         "gcr.io/kubeflow/jupyterhub-k8s:v20180531-3bb991b1"
core      jupyterHubServiceType   "ClusterIP"
core      jupyterNotebookPVCMount "null"
core      jupyterNotebookRegistry "gcr.io"
core      jupyterNotebookRepoName "kubeflow-images-public"
core      name                    "kubeflow-core"
core      namespace               "kubeflow"
core      reportUsage             "false"
core      tfAmbassadorImage       "quay.io/datawire/ambassador:0.30.1"
core      tfAmbassadorServiceType "ClusterIP"
core      tfDefaultImage          "null"
core      tfJobImage              "gcr.io/kubeflow-ci/tf_operator:"
core      tfJobUiServiceType      "ClusterIP"
core      tfJobVersion            "v1alpha1"
core      tfStatsdImage           "quay.io/datawire/statsd:0.30.1"
core      usageId                 "unknown_cluster"

So looks like tfJobImage wasn't set correctly

Argo logs look like the image isn't set correctly

INFO:root:Running: ks param set --env=e2e-0612-1959-3cf6 core tfJobImage gcr.io/kubeflow-ci/tf_operator:
cwd=/mnt/test-data-volume/kubeflow-tf-operator-presubmit-tfjob-e2e-632-54931b6-595-0276/src/kubeflow/tf-operator/test/test-app

@jlewi
Copy link
Contributor Author

jlewi commented Jun 12, 2018

My suspicion is that when we pushed a new testing worker image; we pushed some updates to the run_e2e_workflow.py and that broke things.

@jlewi
Copy link
Contributor Author

jlewi commented Jun 12, 2018

If params.versionTag in the workflow isn't set we should use the name
https://github.com/kubeflow/tf-operator/blob/e164ba54c1ffdd750f059711beb65c9a5936c684/test/workflows/components/workflows.libsonnet#L62

I changed versionTag from null to "". That's the problem. I made this change because ks 0.11 was having some problems with null.

@jlewi jlewi changed the title [wip] Modify presubmits to support testing with v1alpha2 Modify presubmits to support testing with v1alpha2 Jun 12, 2018
@jlewi
Copy link
Contributor Author

jlewi commented Jun 12, 2018

Tests are passing; this is ready for review.

/assign @ankushagarwal
/assign @gaocegege

@ankushagarwal
Copy link

/lgtm
/approve

@gaocegege
Copy link
Member

/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ankushagarwal, gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit f66047b into kubeflow:master Jun 13, 2018
yph152 pushed a commit to yph152/tf-operator that referenced this pull request Jun 18, 2018
* Changes to support v1alpha2 testing in presubmits.

* The tests are currently disabled because they aren't passing yet because
  termination policy isn't handled correctly (kubeflow#634)

* Changed the v1alpha2 test to use the same smoke test as used by v1alpha1 as
  opposed to using mnist.
  mnist causing problems because of issues downloading the data
  see kubeflow/kubeflow#974

* We want a simpler test that allows for more direct testing of the distributed
  communication pattern
* Also mnist is expensive in that it tries to download data.

* Add a parameter tfJobVersion to the deploy script so we can control
  whether we deploy v1alpha1 or v1alpha2

* Parameterize the E2E test workflow by the TFJob version we want to run.

* update test-app - We need to pull in a version of the app which
  has the TFJobVersion flag.

* Create a script to regenerate the test-app for future use.

Related to kubeflow#589

* Fix versionTag logic; we need to allow for case where versionTag is an
empty string.
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018
* Changes to support v1alpha2 testing in presubmits.

* The tests are currently disabled because they aren't passing yet because
  termination policy isn't handled correctly (kubeflow#634)

* Changed the v1alpha2 test to use the same smoke test as used by v1alpha1 as
  opposed to using mnist.
  mnist causing problems because of issues downloading the data
  see kubeflow/kubeflow#974

* We want a simpler test that allows for more direct testing of the distributed
  communication pattern
* Also mnist is expensive in that it tries to download data.

* Add a parameter tfJobVersion to the deploy script so we can control
  whether we deploy v1alpha1 or v1alpha2

* Parameterize the E2E test workflow by the TFJob version we want to run.

* update test-app - We need to pull in a version of the app which
  has the TFJobVersion flag.

* Create a script to regenerate the test-app for future use.

Related to kubeflow#589

* Fix versionTag logic; we need to allow for case where versionTag is an
empty string.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants