
Re-triggering failed gradle build should run only failed tests #5010

Open
sachinpkale opened this issue Sep 10, 2024 · 5 comments
Assignees
Labels
enhancement New Enhancement

Comments

@sachinpkale
Member

Is your feature request related to a problem? Please describe

  • Currently, the majority of build failures for the OpenSearch repo are due to flaky tests.
  • Re-triggering the build runs all the tests again (40+ mins), and it is possible that the next run fails due to another flaky test.
  • This impacts contributors' productivity and slows down feature work.
  • Build re-triggers also add extra stress on the build infra, and this is amplified as we get closer to the code freeze date for a release.

Describe the solution you'd like

  • If a build fails due to a test failure, re-triggering the build should run only the failed tests from the previous run.
  • It would be safe to re-run only the failed tests: if the change in the PR genuinely breaks a test, that test will fail again on the re-run.
  • This would reduce the build re-trigger time considerably.
  • The incremental test run should apply only to re-triggers of a failed build. Any new change pushed to the PR should run the entire build.
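The incremental re-run described above needs a way to recover the failed-test list from the previous run. A minimal sketch of that mechanism, assuming Gradle's default JUnit XML report layout (`build/test-results/test/TEST-*.xml`); the function names and the retry command shape are illustrative, not an existing tool:

```python
# Sketch: collect failed tests from Gradle's JUnit XML reports and turn
# them into a --tests filter for a retry invocation.
# Assumption: reports follow Gradle's default TEST-*.xml naming.
import xml.etree.ElementTree as ET
from pathlib import Path


def failed_tests(report_dir: str) -> set[str]:
    """Return fully qualified class names of tests that failed or errored."""
    failed = set()
    for report in Path(report_dir).glob("TEST-*.xml"):
        suite = ET.parse(report).getroot()
        for case in suite.iter("testcase"):
            # A <failure> or <error> child marks an unsuccessful testcase.
            if case.find("failure") is not None or case.find("error") is not None:
                failed.add(case.get("classname"))
    return failed


def retry_command(classes: set[str]) -> str:
    """Build a Gradle invocation that re-runs only the failed test classes."""
    filters = " ".join(f"--tests '{cls}'" for cls in sorted(classes))
    return f"./gradlew test {filters}"
```

Gradle's `--tests` filter is a real command-line option; everything else here is a hypothetical glue layer a CI job would need.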

Describe alternatives you've considered

  • An alternative solution is to bring the number of flaky tests down to zero.
  • Even though we continuously try to reduce flaky tests, each new feature adds new tests, and it is possible that new flaky tests are introduced (flakiness can slip through even multiple runs on a local setup).
  • Tests with random waits (or assertBusy) are more susceptible to flakiness: most of these tests show no symptoms locally and fail only on the GitHub build system (mostly due to overloaded build servers).
  • We will continue our efforts to reduce flaky tests, but getting to zero may take months and does not guarantee that new flaky tests will not be introduced.

Additional context

No response

@sachinpkale sachinpkale added enhancement New Enhancement untriaged Issues that have not yet been triaged labels Sep 10, 2024
@sachinpkale sachinpkale changed the title Re-triggering failed gradle check should run only failed tests Re-triggering failed gradle build should run only failed tests Sep 10, 2024
@rishabh6788
Collaborator

rishabh6788 commented Sep 10, 2024

Thanks for creating this issue @sachinpkale. I was thinking about creating a similar issue in the OpenSearch repo to gather feedback from all the contributors on how to tackle this problem.
While refactoring gradle-check and breaking it into parallel CI runs seems to be the most optimal solution in the long run, we still need some mechanism to reduce the churn caused by flaky test failures.

The solution I was thinking about is as follows:

Approach 1

We catch the failure status in the existing gradle-check yml workflow file and add new steps that retry only the failing tests; if the retry passes, gradle-check shows success. This requires no custom action on the developer's side.
The downsides: it does not help with genuine test failures, and it extends gradle-check execution time.
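A minimal sketch of what Approach 1 could look like in the workflow file. The step names and scripts are hypothetical; `continue-on-error` and `steps.<id>.outcome` are standard GitHub Actions features:

```yaml
# Hypothetical sketch of Approach 1 (not the actual gradle-check workflow).
  - name: Run gradle check
    id: gradle-check
    continue-on-error: true        # record failure without failing the job yet
    run: ./scripts/run-gradle-check.sh   # illustrative script name

  - name: Retry only failed tests
    if: steps.gradle-check.outcome == 'failure'
    run: ./scripts/retry-failed-tests.sh # would re-run tests listed in the JUnit reports
```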

Approach 2

  1. gradle-check fails on the PR.
  2. The developer determines whether it is a genuine failure or a flaky test failure.
  3. If it is a flaky test, the dev adds a comment or label along the lines of "re-run failed tests".
  4. This triggers a workflow (a combination of GHA and Jenkins) which gets the failing-test list from the last failed gradle-check run.
  5. The workflow runs only those tests and publishes the result back on the PR.

There will be other checks as well: for example, if a new commit was pushed that started a new gradle-check run and a comment/label was added afterwards, the re-run would be skipped with a note that a gradle-check is already in progress.
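The guard described above can be sketched as a small predicate. This is purely illustrative; the function name and inputs are assumptions about what the triggering workflow would know (the failed run's commit SHA, the current PR head SHA, and whether a check is already running):

```python
# Hypothetical guard for Approach 2: only honour a "re-run failed tests"
# comment if no newer commit has started a fresh gradle-check run.

def should_rerun_failed_tests(failed_run_sha: str,
                              current_head_sha: str,
                              run_in_progress: bool) -> bool:
    """Return True if a targeted re-run of the failed tests should start."""
    if run_in_progress:
        # A full gradle-check is already running for this PR; skip the re-run.
        return False
    # Only re-run if the PR head is still the commit whose check failed.
    return failed_run_sha == current_head_sha


# Example: a new commit was pushed after the failure, so the re-run is skipped.
print(should_rerun_failed_tests("abc123", "def456", run_in_progress=False))  # False
```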

But before we go ahead and start talking about solutions, I have a question about what happens to the last failed gradle-check CI rule on the PR. For now, a passing gradle-check is a must for the PR to be merged.
In the proposed solution, even if we re-run the failing tests and report on the PR that they passed, the status of gradle-check would still show failed.

Do we then relax this rule? (Not required if we go with Approach 1.)

@reta @getsaurabh02 @prudhvigodithi @gaiksaya @dblock @peterzhuamazon Thoughts?
Both approaches are theoretical for now and need to be tested for feasibility.
I would also like to hear any other ideas for tackling this problem more efficiently.

@shiv0408
Member

+1 for breaking the gradle-check into multiple parallel ci runs

@reta
Contributor

reta commented Sep 10, 2024

Thanks @sachinpkale @rishabh6788. I would also agree with @shiv0408 that breaking the Gradle check into separate tasks / phases (which could be run in parallel and retried individually) looks like a good first step; if I remember correctly, @andrross also brought up this idea some time ago. Unrelated to flaky tests, it would also help with getting better test coverage reports.

If a build fails due to test failure, re-triggering the build should run only the failed tests from the previous run.

This will not work as of today (with the monolithic Gradle check). For example, when there are unit-test failures in any module, the build fails right away (example here [1]):

* What went wrong:
Execution failed for task ':modules:reindex:test'.
> There were failing tests. See the report at: file:///var/jenkins/workspace/gradle-check/search/modules/reindex/build/reports/tests/test/index.html

Retrying such tests would not be useful, since the majority of tests weren't even run. To reliably implement such a feature, we need to make sure that none of the test-related tasks / phases were skipped.

[1] https://build.ci.opensearch.org/job/gradle-check/47638/console
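The fail-fast behaviour described above can be worked around with Gradle's built-in `--continue` flag, which keeps executing as many tasks as possible after a failure, so the test reports end up listing every failed test rather than just the first failing module. A sketch of the invocation a CI job might use (whether this fits gradle-check's constraints is an open question):

```shell
# Keep going past the first failing task so every test task executes and
# the failure list in the JUnit reports is complete.
./gradlew check --continue
```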

@prudhvigodithi
Collaborator

Thanks @rishabh6788 @reta @sachinpkale @shiv0408, here are some issues from the past about optimizing the Gradle check:
opensearch-project/OpenSearch#1975
opensearch-project/OpenSearch#4053
opensearch-project/OpenSearch#12410
opensearch-project/OpenSearch#2496
#4810
#1572

I would vote for breaking the Gradle check into separate tasks / phases, which will improve developer productivity and eventually improve Core contributions.

I'm also ok with other approaches that run incrementally or just re-run the failed Gradle tests, but with Gradle task-graph dependencies we might eventually trigger more tests than just the targeted ones in the retry.

As one quick fix, we can update the gradle-check workflow to run only for the latest head commit and cancel all other running jobs for the same PR. This will reduce some noise on the PR (failing gradle-check comments) and stress on the infra (by cancelling long-running gradle runs for old commits).
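This quick fix maps onto GitHub Actions' built-in `concurrency` setting. A minimal sketch, assuming a pull-request-triggered workflow (the group name is illustrative):

```yaml
# Cancel any in-flight gradle-check run for this PR when a new commit arrives;
# fall back to the ref for non-PR events.
concurrency:
  group: gradle-check-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true
```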

Once we split the gradle check and optimize it, down the line (or in parallel) we can tackle the existing flaky tests: decide whether to mute them in the short term, make fixing them an entry criterion for the upcoming release, and get them fixed. Some more thoughts here: #4810 (comment).

@cwperks You may be interested in this conversation as well.

@rishabh6788
Collaborator

As one quick fix we can update the gradle-check workflow to just run for the latest head commit and cancel all the other running jobs for the same PR, this will reduce some noise on the PR (with failing gradle check comments) and stress on infra (by cancelling the long running old commit gradle runs).

I have created an issue, #5008, to discuss potential solutions for this. @prudhvigodithi @reta, I'd appreciate your feedback and comments.

Projects
Status: Backlog
Development

No branches or pull requests

5 participants