Improve the test run time on PRs #3385

Closed

girishramnani opened this issue Jun 18, 2020 · 39 comments
Labels
area/testing (Issues or PRs related to testing, Quality Assurance or Quality Engineering) · lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) · priority/Low (Nice to have issue. It's not immediately on the project roadmap to get it done.)

Comments

@girishramnani
Contributor

girishramnani commented Jun 18, 2020

  • we are considering running all tests at later stages of the PR
  • we are considering running the Travis CI and latest-version OpenShift tests in the PR and the rest in the periodic job
  • we are considering running the e2e tests (not the integration tests) in the periodic jobs and omitting them from the PR tests altogether - this was done in Run periodic job on OpenShift CI #2931
@girishramnani
Contributor Author

cc @metacosm

@metacosm
Contributor

metacosm commented Jul 1, 2020

This should be worked on as quickly as possible, in my opinion, as it would help speed up the rest of the development.

@amitkrout
Contributor

we are considering running all tests at later stages of the PR

@metacosm This deserves a longer description, so that we can analyze the requirement, and of course we need to check it (if complicated) with the test platform team, as our PR run is controlled by the Prow bot.

we are considering running the Travis CI and latest-version OpenShift tests in the PR and the rest in the periodic job

Just want to understand: what is the motive here? Test time optimization? It won't help much, because CI takes almost the same time to spin up and run tests against each version of the cluster. If we are looking for AWS resource optimization, then yes, the number of AWS resources will be reduced.

we are considering running the e2e tests (not the integration tests) in the periodic jobs and omitting them from the PR tests altogether

+1, but we will keep both test types until we make sure that our tests do not hit flakes.

@metacosm
Contributor

metacosm commented Jul 6, 2020

The motivation is to move PRs faster from submitted to merged, as well as to reduce resource usage.

Right now, lots of time and resources are spent on running tests that are not needed: integration tests should only be run when a PR is approved and considered ready to be merged by reviewers. There is no point in running all the tests while the review process is not finished; only unit tests / smoke tests should be run during that period.

This would have the following benefits:

  • reduce resource usage by not running tests that are not needed
  • free resources for PRs that require these tests to be run (i.e. the ones that we attempt to merge)
  • reduce the noise in the PR review loop by removing test output that no one cares about
  • hopefully keep people engaged in PRs instead of leaving them unattended for too long because some tests are failing

Another aspect of this is obviously the unbelievably high occurrence of flakes, which should also be addressed. If a test cannot run reliably then it is useless: if it fails, you don't know whether that is a genuine signal of an issue with the code, and if it passes, you're not sure it somehow worked when maybe it shouldn't have…

@amitkrout
Contributor

@metacosm So basically we are looking for a flow like:

[screenshot: proposed test flow, taken 2020-07-09]

I need to check with the test platform team whether this is possible with the presubmit test type that we are using; otherwise we can explore other possibilities.

@mohammedzee1000
Contributor

mohammedzee1000 commented Jul 9, 2020

It is highly unlikely that such a workflow will work. Better to confirm with the test platform team.
/cc @openshift/openshift-team-developer-productivity-test-platform

Assuming it does not, we could also look at splitting the minimum tests (unit + integration) into presubmit and having e2e, or maybe even everything, run as postsubmit. Doing so might also allow us to reliably increase the interval of the periodic tests as well.

Also, how about removing all but the latest and latest-1 cluster versions from the PR tests and having the other versions run in the periodic jobs?

@alvaroaleman

No, it is not possible to have tests that only get triggered after approval and are a merge requirement. I personally also don't think it makes sense, because you are wasting your reviewers' time by having them review something for which CI can tell you that it's not going to work.

@mohammedzee1000
Contributor

Well I guess we are going with what I recommended then, splitting the tests across presubmit, postsubmit and periodic.

wdyt @alvaroaleman

@alvaroaleman

Yeah, sounds like a good idea; you just need to make sure that someone looks at them and acts if there are failures, otherwise it's pretty much useless.

@metacosm
Contributor

metacosm commented Jul 9, 2020

@alvaroaleman a review should not be contingent on test results, so I don't see that as a waste of the reviewer's time. The converse is also true: it doesn't make sense to test something that will be changed because of reviews; that's why, IMO, only unit tests should be run before a PR is approved.

@alvaroaleman

Tests have little associated cost, humans have a very high associated cost, so running a couple more tests is always worth it if that saves human time. Happy for you if you have the capacity to do so many reviews; I rarely look at PRs that have failing tests.

It would be conceptually very hard to add something like this to Prow, and it opens up a nice set of failure modes (how do we handle test failures on approved PRs? We currently assume they are flakes and will retest forever).

@metacosm
Contributor

metacosm commented Jul 9, 2020

While I agree with you in principle, that's just not true for odo. The integration tests take several hours to complete and are highly unreliable (lots of flakes) so if you're waiting for tests to pass to perform a review, you will never review anything… :)

I'm not saying that this should be added to Prow, just saying that the odo tests should be set up in such a way that only unit tests (which should cover enough to get a good idea of the soundness of the PR) are run until the PR is approved. If integration tests fail, then the onus is on the author to get them fixed, possibly asking for another review afterwards if needed.

Finally, failing tests pollute the review with lots of useless messages making the review process harder than it needs to be.

@mohammedzee1000
Contributor

mohammedzee1000 commented Jul 10, 2020

Nice little discussion there @metacosm @alvaroaleman. I will need to think about this before putting forward my points :)

However, there is another way of splitting the testing that could improve test time while avoiding invisibility to an extent.

First, we should probably remove all but the latest and latest-1 OpenShift clusters from presubmit, with complete testing relegated to postsubmit and periodic jobs.

Once we have done that, we will free up a bunch of the clusters we use, which could then be used to split tests.

For example, we can do int-1 and int-2, which will each run half of the integration tests, for both versions. So we end up with one job that does unit tests, then two for integration and one for e2e per cluster. We can split further, of course, depending on the size of the tests and resource availability (see the sketch after this comment). WDYT?

I think the cluster provision time is a constant that will happen once across the board.

That aside, the fact remains that the cleanest way to increase test speed here is to eliminate provisioning unless you are testing something that needs it, but that will likely need cluster pooling of some sort, which comes with its own associated costs.

The fact remains that the biggest time-consuming part is the cluster provisioning, which takes at least 40-45 minutes on good days.
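
The half-and-half split described in this comment can be sketched with a few lines of Go. This is a minimal, hypothetical illustration rather than the project's actual setup: the spec names, the idea of two jobs per cluster version (e.g. int-1 and int-2), and the notion of calling the helper from a Ginkgo BeforeEach that skips foreign specs are all assumptions. Each job is given a shard index, and every spec is deterministically assigned to exactly one shard by hashing its name, so the two jobs together always cover the full suite and re-runs of a shard execute the same specs.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// runsOnShard deterministically maps a spec name to one of shardCount
// shards. Two jobs against the same cluster version would each pass a
// different shardIndex and run only their own half of the suite.
func runsOnShard(specName string, shardIndex, shardCount int) bool {
	h := fnv.New32a()
	h.Write([]byte(specName))
	return int(h.Sum32()%uint32(shardCount)) == shardIndex
}

func main() {
	// Illustrative spec names only; a real suite would feed in its own
	// test descriptions and skip the ones that don't belong to its shard.
	specs := []string{
		"odo create component",
		"odo push devfile",
		"odo catalog list components",
		"odo debug port-forward",
	}
	for _, s := range specs {
		fmt.Printf("%-30s shard 0: %-5v shard 1: %v\n",
			s, runsOnShard(s, 0, 2), runsOnShard(s, 1, 2))
	}
}
```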

@kadel kadel added area/testing Issues or PRs related to testing, Quality Assurance or Quality Engineering and removed kind/test labels Jul 10, 2020
@amitkrout
Contributor

amitkrout commented Jul 13, 2020

However, there is another way of splitting the testing that could improve test time while avoiding invisibility to an extent.
First, we should probably remove all but the latest and latest-1 OpenShift clusters from presubmit, with complete testing relegated to postsubmit and periodic jobs.

+1, the same idea has been discussed in one of the weekly standup calls for reducing the number of clusters while keeping the same test types as they are.

@mohammedzee1000 But for PR testing, a postsubmit job is not the best fit IMO, because it will unnecessarily increase the effort. For example, there will always be a fear of broken master code, and it will increase the cost of constantly monitoring master after each PR we merge through the periodic job.

Once we have done that, we will free up a bunch of the clusters we use, which could then be used to split tests.
For example, we can do int-1 and int-2, which will each run half of the integration tests, for both versions. So we end up with one job that does unit tests, then two for integration and one for e2e per cluster. We can split further, of course, depending on the size of the tests and resource availability. WDYT?

Yes, we can go with this. It will reduce the test time by up to 50%.

That aside, the fact remains that the cleanest way to increase test speed here is to eliminate provisioning unless you are testing something that needs it, but that will likely need cluster pooling of some sort, which comes with its own associated costs.

It's a long-term goal; maybe we need to create a separate Jira to track this.

@amitkrout
Contributor

I am creating the task list:

  • Use only two cluster versions (latest release and upcoming release) for PR validation
  • Configure the rest of the supported cluster versions for the periodic run
  • Split the tests for PR validation: run half of the tests on one cluster version and the other half on the other cluster

@kadel
Member

kadel commented Jul 14, 2020

  • Use only two cluster versions (latest release and upcoming release) for PR validation

+1 I would even consider just using the latest release

  • Split the tests for PR validation: run half of the tests on one cluster version and the other half on the other cluster

-1, this makes me a little bit worried. I would rather run the full test suite on each cluster.

There should be one additional point:

  • Clean up integration tests - remove duplicated tests, optimize tests, minimize external dependencies

@kadel kadel added the priority/High Important issue; should be worked on before any other issues (except priority/Critical issue(s)). label Jul 14, 2020
@metacosm
Contributor

I would also add: investigate and fix or remove unstable tests (flakes) :)

@mohammedzee1000
Contributor

mohammedzee1000 commented Jul 15, 2020

  • Use only two cluster versions (latest release and upcoming release) for PR validation

+1 I would even consider just using the latest release

  • Split the tests for PR validation: run half of the tests on one cluster version and the other half on the other cluster

@amitkrout @kadel I did not mean that. I meant running all the tests on every version that we test, but splitting them across instances, e.g. test1-4.5 runs one half and test2-4.5 runs the second half.

-1, this makes me a little bit worried. I would rather run the full test suite on each cluster.

There should be one additional point:

  • Clean up integration tests - remove duplicated tests, optimize tests, minimize external dependencies

@amitkrout
Contributor

test1-4.5 runs one half and test2-4.5 runs the second half, for example

@mohammedzee1000 you mean we need two 4.5 clusters in CI, right?

@mohammedzee1000
Contributor

mohammedzee1000 commented Jul 16, 2020

Yes, one running one half and the other running the second half.

Don't worry, we will get some clusters back by not testing a version or two.

While this may not balance things out, it will still be an overall win.

We could, for example, do 4.5 and 4.4, or even just 4.5, in the regular PR CI.

But split the tests for whatever we do test, thereby achieving a semblance of parallelism. Remember, cluster provisioning will be constant time.

@metacosm
Contributor

I still think that integration tests should only be run after the PR is approved by default, though it should be possible to manually trigger such a run if needed on a case-by-case basis.

@amitkrout
Contributor

amitkrout commented Jul 24, 2020

I don't know the test infrastructure at all so maybe what I'm suggesting is indeed not feasible if, for example, the bot that handles lgtm/approve label is tied to the testing infrastructure. Otherwise, another option would be to have a bot specifically used to trigger tests based on different conditions. I have no idea how complex that would be to put in place, though…

As far as I know, Prow doesn't support that. We would have to write our own plugin to achieve that.

Writing a custom Prow plugin should be pretty simple, you can listen to the events on PRs you care about and comment with /test to trigger things. The PRs will sit around, requiring tests that have not yet run, until that happens, though.

I am lost on what the ask was and where we are heading.

@stevekuznetsov The initial ask was that we don't want to run our CI as soon as the PR is opened; CI should run only once the /approve label is applied. I understand that through a plugin you can control the job run, but does it help us hold the CI run when the PR is triggered for the first time? Does it have side effects, like needing to apply /test <name> after each commit to trigger the job?
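
As a rough illustration of the plugin approach discussed here, below is a minimal, hypothetical Go sketch that reacts to a PR being labeled "approved" by commenting /test, so the heavier presubmits only start at that point. It is not the actual Prow plugin framework: it uses the google/go-github client for brevity, the "approved" label and the "integration" job name are placeholders, and all webhook plumbing (payload validation, authentication, event routing) is left out.

```go
package main

import (
	"context"
	"log"

	"github.com/google/go-github/v32/github"
)

// commentTestOnApproval posts a "/test <job>" comment once a pull request
// receives the "approved" label, so that the expensive jobs are only
// triggered after review. Label and job name are placeholders.
func commentTestOnApproval(ctx context.Context, client *github.Client, event *github.PullRequestEvent) error {
	if event.GetAction() != "labeled" || event.GetLabel().GetName() != "approved" {
		return nil // ignore every event except the label we care about
	}
	owner := event.GetRepo().GetOwner().GetLogin()
	repo := event.GetRepo().GetName()
	number := event.GetPullRequest().GetNumber()

	body := "/test integration" // placeholder job name
	_, _, err := client.Issues.CreateComment(ctx, owner, repo, number,
		&github.IssueComment{Body: github.String(body)})
	return err
}

func main() {
	// Webhook wiring is intentionally omitted; this file only shows the
	// core "react to the approved label, ask CI to run the tests" logic.
	log.Println("sketch only: wire commentTestOnApproval into a webhook handler")
}
```

Note that this does not by itself answer the question above about new commits: whether tests rerun after each push would still depend on how the presubmit jobs themselves are configured.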

@prietyc123
Contributor

@stevekuznetsov @amitkrout @mohammedzee1000 Can you please share the Prow plugin documentation.

@kadel
Member

kadel commented Aug 4, 2020

we are considering running the Travis CI and latest-version OpenShift tests in the PR and the rest in the periodic job
we are considering running the e2e tests (not the integration tests) in the periodic jobs and omitting them from the PR tests altogether

This was done in #2931 👍

  • we are considering running all tests at later stages of the PR

To be honest, I don't think that this part is that important right now.
Our biggest problem is test stability.
We need to invest some time in fixing flaky tests (as sketched below)!
I have a feeling that we are just figuring out new ways to avoid properly fixing our tests.

/priority low
/remove-priority high
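
Picking up on the flaky-test point above: one recurring source of flakes in integration suites like this is fixed sleeps racing against cluster state. Below is a minimal, hypothetical Go sketch of the usual remedy, polling until a condition holds instead of sleeping for a fixed time (Gomega's Eventually plays the same role inside Ginkgo suites); the readiness condition is invented purely for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitFor polls condition until it returns true or timeout elapses.
// The test then proceeds as soon as the state is ready and only fails
// when the state genuinely never arrives, instead of failing whenever
// a fixed sleep happens to be too short.
func waitFor(condition func() bool, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if condition() {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("condition not met before timeout")
}

func main() {
	start := time.Now()
	// Hypothetical condition standing in for "the component pod is Running".
	ready := func() bool { return time.Since(start) > 300*time.Millisecond }

	if err := waitFor(ready, 5*time.Second, 50*time.Millisecond); err != nil {
		fmt.Println("gave up:", err)
		return
	}
	fmt.Printf("ready after %v, no fixed multi-second sleep needed\n",
		time.Since(start).Round(10*time.Millisecond))
}
```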

@openshift-ci-robot openshift-ci-robot added priority/Low Nice to have issue. It's not immediately on the project roadmap to get it done. and removed priority/High Important issue; should be worked on before any other issues (except priority/Critical issue(s)). labels Aug 4, 2020
@prietyc123
Contributor

we are considering running the Travis CI and latest-version OpenShift tests in the PR and the rest in the periodic job - done in #2931

Running only the latest OpenShift version in the PR CI and the rest in the periodic jobs was done as part of this issue itself - see openshift/release#10312 and openshift/release#10346.

@prietyc123
Contributor

Test cleanup is still in progress and being worked on individually. Tests covered so far:

- create test: test-cmd-devfile-create #3618
- debug test: #3636
- catalog test: #3642
- push test: #3645

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 20, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 20, 2020
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Collaborator

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
