Improve the test run time on PRs #3385

Closed

girishramnani opened this issue Jun 18, 2020 · 39 comments
Labels
area/testing (Issues or PRs related to testing, Quality Assurance or Quality Engineering) · lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) · priority/Low (Nice to have issue. It's not immediately on the project roadmap to get it done.)

Comments

@girishramnani
Contributor

girishramnani commented Jun 18, 2020

  • we are considering running all tests at later stages of the PR
  • we are considering running the Travis CI and latest-version OpenShift tests in the PR and the rest in the periodic job
  • we are considering running the e2e tests (not the integration tests) in the periodic jobs and omitting them from the PR tests altogether - this was done in Run periodic job on OpenShift CI #2931
@girishramnani
Contributor Author

cc @metacosm

@metacosm
Contributor

metacosm commented Jul 1, 2020

This should be worked on as quickly as possible, in my opinion, as it would help speed up the rest of the development.

@amitkrout
Contributor

we are considering running all tests at later stages of the PR

@metacosm This deserves a longer description, so that we can analyze the requirement, and of course we need to check it (if complicated) with the test platform team, as our PR run is controlled by the Prow bot.

we are considering running the Travis CI and latest-version OpenShift tests in the PR and the rest in the periodic job

Just want to understand: what is the motive here? Test time optimization? It won't help much, because CI takes almost the same time to spin up and run tests against each version of the cluster. If we are looking for AWS resource optimization, then yes, the number of AWS resources will be reduced.

we are considering running the e2e tests (not the integration tests) in the periodic jobs and omitting them from the PR tests altogether

+1, but we will keep both test types until we make sure that our tests do not hit flakes.

@metacosm
Contributor

metacosm commented Jul 6, 2020

The motivation is to move PRs faster from submitted to merged, as well as to reduce resource usage.

Right now, lots of time and resources are spent on running tests that are not needed: integration tests should only be run when a PR is approved and considered ready to be merged by reviewers. There is no point in running all the tests while the review process is not finished; only unit tests / smoke tests should be run during that period.

This would have the following benefits:

  • reduce resource usage by not running tests that are not needed
  • free resources for PRs that require these tests to be run (i.e. the ones that we attempt to merge)
  • reduce the noise in the PR review loop by removing test output that no one cares about
  • hopefully keep people engaged in PRs instead of leaving them unattended for too long because some tests are failing

Another aspect of this is obviously the unbelievably high occurrence of flakes, which should also be addressed. If a test cannot run reliably then it is useless: if it fails, you don't know whether that is a genuine signal of an issue with the code, and if it passes, you're not sure it somehow worked when maybe it shouldn't have…

@amitkrout
Contributor

@metacosm So basically we are looking for a flow like:

[screenshot: proposed test flow, taken 2020-07-09]

I need to check with the test platform team whether this is possible with the presubmit test type that we are using; otherwise we can explore other possibilities.

@mohammedzee1000
Contributor

mohammedzee1000 commented Jul 9, 2020

It is highly unlikely that such a workflow will work. Better to confirm with the test platform team.
/cc @openshift/openshift-team-developer-productivity-test-platform

Assuming it does not, we could also look at splitting the minimum tests (unit + integration) into presubmit and having e2e, or maybe even everything, run as postsubmit. Doing so might also allow us to reliably increase the interval of the periodic tests as well.

Also, how about removing all but the latest and latest-1 cluster versions from the PR tests and having the other versions run in the periodic jobs?

@alvaroaleman

No, it is not possible to have tests that only get triggered after approval and are a merge requirement. I personally also don't think it makes sense, because you are wasting your reviewers' time by having them review something for which CI can tell you that it's not going to work.

@mohammedzee1000
Contributor

Well I guess we are going with what I recommended then, splitting the tests across presubmit, postsubmit and periodic.

wdyt @alvaroaleman

@alvaroaleman

Yeah, sounds like a good idea; you just need to make sure that someone looks at them and acts if there are failures, otherwise it's pretty much useless.

@metacosm
Contributor

metacosm commented Jul 9, 2020

@alvaroaleman a review should not be contingent on test results, so I don't see that as a waste of the reviewer's time. The converse is also true: it doesn't make sense to test something that will be changed because of reviews; that's why, IMO, only unit tests should be run before a PR is approved.

@alvaroaleman

Tests have little associated cost, humans have a very high associated cost, so running a couple more tests is always worth it if that saves human time. Happy for you if you have the capacity to do so many reviews; I rarely look at PRs that have failing tests.

It would be conceptually very hard to add something like this to Prow, and it opens up a nice set of failure modes (how do we handle test failures on approved PRs? We currently assume they are flakes and will retest forever).

@metacosm
Contributor

metacosm commented Jul 9, 2020

While I agree with you in principle, that's just not true for odo. The integration tests take several hours to complete and are highly unreliable (lots of flakes) so if you're waiting for tests to pass to perform a review, you will never review anything… :)

I'm not saying that this should be added to Prow, just saying that the odo tests should be set up in such a way that only unit tests (which should cover enough to get a good idea of the soundness of the PR) are run until the PR is approved. If integration tests fail, then the onus is on the author to get them fixed, possibly asking for another review afterwards if needed.

Finally, failing tests pollute the review with lots of useless messages making the review process harder than it needs to be.

@mohammedzee1000
Contributor

mohammedzee1000 commented Jul 10, 2020

Nice little discussion there @metacosm @alvaroaleman. I will need to think about this before putting forward my points :)

However, there is another way of splitting the testing that could improve test time while avoiding invisibility to an extent.

First, we should probably remove all but the latest and latest-1 OpenShift clusters from presubmit, with complete testing relegated to postsubmit and periodic jobs.

Once we have done that, we will free up a bunch of the clusters we use, which could then be used to split tests.

For example, we can do int-1 and int-2, which will each run half of the integration tests, for both versions. So we end up with one job that does unit tests, then two for integration and one for e2e per cluster. We can split further, of course, depending on the size of the tests and resource availability (see the sketch after this comment). WDYT?

I think the cluster provision time is a constant that will happen once across the board.

That aside, the fact remains that the cleanest way to increase test speed here is to eliminate provisioning unless you are testing something that needs it, but that will likely need cluster pooling of some sort, which comes with its own associated costs.

The fact remains that the biggest time-consuming part is the cluster provisioning, which takes at least 40-45 minutes on good days.
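
The half-and-half split described in this comment can be sketched with a few lines of Go. This is a minimal, hypothetical illustration rather than the project's actual setup: the spec names, the idea of two jobs per cluster version (e.g. int-1 and int-2), and the notion of calling the helper from a Ginkgo BeforeEach that skips foreign specs are all assumptions. Each job is given a shard index, and every spec is deterministically assigned to exactly one shard by hashing its name, so the two jobs together always cover the full suite and re-runs of a shard execute the same specs.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// runsOnShard deterministically maps a spec name to one of shardCount
// shards. Two jobs against the same cluster version would each pass a
// different shardIndex and run only their own half of the suite.
func runsOnShard(specName string, shardIndex, shardCount int) bool {
	h := fnv.New32a()
	h.Write([]byte(specName))
	return int(h.Sum32()%uint32(shardCount)) == shardIndex
}

func main() {
	// Illustrative spec names only; a real suite would feed in its own
	// test descriptions and skip the ones that don't belong to its shard.
	specs := []string{
		"odo create component",
		"odo push devfile",
		"odo catalog list components",
		"odo debug port-forward",
	}
	for _, s := range specs {
		fmt.Printf("%-30s shard 0: %-5v shard 1: %v\n",
			s, runsOnShard(s, 0, 2), runsOnShard(s, 1, 2))
	}
}
```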

@kadel kadel added area/testing Issues or PRs related to testing, Quality Assurance or Quality Engineering and removed kind/test labels Jul 10, 2020
@amitkrout
Contributor

amitkrout commented Jul 13, 2020

However, there is another way of splitting the testing that could improve test time while avoiding invisibility to an extent.
First, we should probably remove all but the latest and latest-1 OpenShift clusters from presubmit, with complete testing relegated to postsubmit and periodic jobs.

+1, the same idea has been discussed in one of the weekly standup calls for reducing the number of clusters while keeping the same test types as they are.

@mohammedzee1000 But for PR testing, a postsubmit job is not the best fit IMO, because it will unnecessarily increase the effort. For example, there will always be a fear of broken master code, and it will increase the cost of constantly monitoring master after each PR we merge through the periodic job.

Once we have done that, we will free up a bunch of the clusters we use, which could then be used to split tests.
For example, we can do int-1 and int-2, which will each run half of the integration tests, for both versions. So we end up with one job that does unit tests, then two for integration and one for e2e per cluster. We can split further, of course, depending on the size of the tests and resource availability. WDYT?

Yes, we can go with this. It will reduce the test time by up to 50%.

That aside, the fact remains that the cleanest way to increase test speed here is to eliminate provisioning unless you are testing something that needs it, but that will likely need cluster pooling of some sort, which comes with its own associated costs.

It's a long-term goal; maybe we need to create a separate Jira to track this.

@amitkrout
Contributor

I am creating the task list:

  • Use only two cluster versions (latest release and upcoming release) for PR validation
  • Configure the rest of the supported cluster versions for the periodic run
  • Split the tests for PR validation: run half of the tests on one cluster version and the other half on the other cluster

@kadel
Member

kadel commented Jul 14, 2020

  • Use only two cluster versions (latest release and upcoming release) for PR validation

+1 I would even consider just using the latest release

  • Split the tests for PR validation: run half of the tests on one cluster version and the other half on the other cluster

-1, this makes me a little bit worried. I would rather run the full test suite on each cluster.

There should be one additional point:

  • Clean up integration tests - remove duplicated tests, optimize tests, minimize external dependencies

@kadel kadel added the priority/High Important issue; should be worked on before any other issues (except priority/Critical issue(s)). label Jul 14, 2020
@metacosm
Contributor

I would also add: investigate and fix or remove unstable tests (flakes) :)

@mohammedzee1000
Contributor

mohammedzee1000 commented Jul 15, 2020

  • Use only two cluster versions (latest release and upcoming release) for PR validation

+1 I would even consider just using the latest release

  • Split the tests for PR validation: run half of the tests on one cluster version and the other half on the other cluster

@amitkrout @kadel I did not mean that. I meant running all the tests on every version that we test, but splitting them across instances, e.g. test1-4.5 runs one half and test2-4.5 runs the second half.

-1, this makes me a little bit worried. I would rather run the full test suite on each cluster.

There should be one additional point:

  • Clean up integration tests - remove duplicated tests, optimize tests, minimize external dependencies

@amitkrout
Contributor

test1-4.5 runs one half and test2-4.5 runs the second half, for example

@mohammedzee1000 you mean we need two 4.5 clusters in CI, right?

@mohammedzee1000
Contributor

mohammedzee1000 commented Jul 16, 2020

Yes, one running one half and the other running the second half.

Don't worry, we will get some clusters back by not testing a version or two.

While this may not balance things out, it will still be an overall win.

We could, for example, do 4.5 and 4.4, or even just 4.5, in the regular PR CI.

But split the tests for whatever we do test, thereby achieving a semblance of parallelism. Remember, cluster provisioning will be constant time.

@metacosm
Contributor

I still think that integration tests should only be run after the PR is approved by default, though it should be possible to manually trigger such a run if needed on a case-by-case basis.

@amitkrout
Contributor

amitkrout commented Jul 24, 2020

I don't know the test infrastructure at all so maybe what I'm suggesting is indeed not feasible if, for example, the bot that handles lgtm/approve label is tied to the testing infrastructure. Otherwise, another option would be to have a bot specifically used to trigger tests based on different conditions. I have no idea how complex that would be to put in place, though…

As far as I know, Prow doesn't support that. We would have to write our own plugin to achieve that.

Writing a custom Prow plugin should be pretty simple, you can listen to the events on PRs you care about and comment with /test to trigger things. The PRs will sit around, requiring tests that have not yet run, until that happens, though.

I am lost on what the ask was and where we are heading.

@stevekuznetsov The initial ask was that we don't want to run our CI as soon as the PR is opened; CI should run only once the /approve label is applied. I understand that through a plugin you can control the job run, but does it help us hold the CI run when the PR is triggered for the first time? Does it have side effects, like needing to apply /test <name> after each commit to trigger the job?
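
As a rough illustration of the plugin approach discussed here, below is a minimal, hypothetical Go sketch that reacts to a PR being labeled "approved" by commenting /test, so the heavier presubmits only start at that point. It is not the actual Prow plugin framework: it uses the google/go-github client for brevity, the "approved" label and the "integration" job name are placeholders, and all webhook plumbing (payload validation, authentication, event routing) is left out.

```go
package main

import (
	"context"
	"log"

	"github.com/google/go-github/v32/github"
)

// commentTestOnApproval posts a "/test <job>" comment once a pull request
// receives the "approved" label, so that the expensive jobs are only
// triggered after review. Label and job name are placeholders.
func commentTestOnApproval(ctx context.Context, client *github.Client, event *github.PullRequestEvent) error {
	if event.GetAction() != "labeled" || event.GetLabel().GetName() != "approved" {
		return nil // ignore every event except the label we care about
	}
	owner := event.GetRepo().GetOwner().GetLogin()
	repo := event.GetRepo().GetName()
	number := event.GetPullRequest().GetNumber()

	body := "/test integration" // placeholder job name
	_, _, err := client.Issues.CreateComment(ctx, owner, repo, number,
		&github.IssueComment{Body: github.String(body)})
	return err
}

func main() {
	// Webhook wiring is intentionally omitted; this file only shows the
	// core "react to the approved label, ask CI to run the tests" logic.
	log.Println("sketch only: wire commentTestOnApproval into a webhook handler")
}
```

Note that this does not by itself answer the question above about new commits: whether tests rerun after each push would still depend on how the presubmit jobs themselves are configured.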

@prietyc123
Contributor

@stevekuznetsov @amitkrout @mohammedzee1000 Can you please share the Prow plugin documentation.

@kadel
Member

kadel commented Aug 4, 2020

we are considering running the Travis CI and latest-version OpenShift tests in the PR and the rest in the periodic job
we are considering running the e2e tests (not the integration tests) in the periodic jobs and omitting them from the PR tests altogether

This was done in #2931 👍

  • we are considering running all tests at later stages of the PR

To be honest, I don't think that this part is that important right now.
Our biggest problem is test stability.
We need to invest some time in fixing flaky tests (as sketched below)!
I have a feeling that we are just figuring out new ways to avoid properly fixing our tests.

/priority low
/remove-priority high
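
Picking up on the flaky-test point above: one recurring source of flakes in integration suites like this is fixed sleeps racing against cluster state. Below is a minimal, hypothetical Go sketch of the usual remedy, polling until a condition holds instead of sleeping for a fixed time (Gomega's Eventually plays the same role inside Ginkgo suites); the readiness condition is invented purely for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitFor polls condition until it returns true or timeout elapses.
// The test then proceeds as soon as the state is ready and only fails
// when the state genuinely never arrives, instead of failing whenever
// a fixed sleep happens to be too short.
func waitFor(condition func() bool, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if condition() {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("condition not met before timeout")
}

func main() {
	start := time.Now()
	// Hypothetical condition standing in for "the component pod is Running".
	ready := func() bool { return time.Since(start) > 300*time.Millisecond }

	if err := waitFor(ready, 5*time.Second, 50*time.Millisecond); err != nil {
		fmt.Println("gave up:", err)
		return
	}
	fmt.Printf("ready after %v, no fixed multi-second sleep needed\n",
		time.Since(start).Round(10*time.Millisecond))
}
```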

@openshift-ci-robot openshift-ci-robot added priority/Low Nice to have issue. It's not immediately on the project roadmap to get it done. and removed priority/High Important issue; should be worked on before any other issues (except priority/Critical issue(s)). labels Aug 4, 2020
@prietyc123
Contributor

we are considering running the Travis CI and latest-version OpenShift tests in the PR and the rest in the periodic job - done in #2931

Running only the latest OpenShift version in the PR CI and the rest in the periodic jobs was done as part of this issue itself - see openshift/release#10312 and openshift/release#10346.

@prietyc123
Contributor

Test cleanup is still in progress and being worked on individually. Tests covered so far:

- create test: test-cmd-devfile-create #3618
- debug test: #3636
- catalog test: #3642
- push test: #3645

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 20, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 20, 2020
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Collaborator

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
