Regression testing updates #2459
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2459 +/- ##
==========================================
- Coverage 48.13% 48.10% -0.03%
==========================================
Files 110 110
Lines 30629 30627 -2
Branches 7989 7989
==========================================
- Hits 14743 14734 -9
- Misses 14355 14364 +9
+ Partials 1531 1529 -2
... and 3 files with indirect coverage changes
(force-pushed from 17cb2ed to 887e287)
I think this is working as intended. A problem is that the CI has not been working properly for a while, and the last successful upload it's finding is from May 18, so the regressions are run against that. Some differences are detected (demonstrating that these tests work?). I'm not 100% sure this PR will fix that uploading issue (it may be that the scheduled runs don't have the expected metadata to count as reference cases), but it's hard to know until we push this (or some test commits) to main.
Before, it would upload the test folder on _any_ failure. Now it's only uploaded if the regression tests fail.
Checking recent successful runs, this takes 20-45 minutes, so 120 minutes should be a generous upper bound. Any more than that and it has hung forever and should be killed.
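(For reference, a minimal sketch of what a job-level timeout looks like; the job name and other details are illustrative, not the exact workflow:)

```yaml
jobs:
  build-and-test:              # illustrative job name
    runs-on: ubuntu-latest
    # Successful runs take 20-45 minutes, so 120 minutes is a generous cap;
    # anything longer has almost certainly hung and gets killed automatically.
    timeout-minutes: 120
    steps:
      - uses: actions/checkout@v3
```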
This changes it so that only the ubuntu runner on the official main branch is the reference case. It also stores that result in an environment variable so we only need to evaluate the logic once. For regression testing we need to determine whether we're the reference case (`env.REFERENCE_JOB == 'true'`) or the comparison case (`env.REFERENCE_JOB == 'false'`). Because it becomes a string, 'false' is read as a non-empty string and evaluates as true, so we have to do a string comparison, not a negation: instead of `${{ !( env.REFERENCE_JOB ) }}` we must do `${{ env.REFERENCE_JOB == 'false' }}`.
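(For context, a sketch of that pattern; the job name, matrix, and step bodies are illustrative rather than the exact workflow:)

```yaml
jobs:
  build-and-test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [macos-latest, ubuntu-latest]
    env:
      # Evaluated once per job; note it is stored as the *string* 'true' or 'false'
      REFERENCE_JOB: ${{ matrix.os == 'ubuntu-latest' && github.ref == 'refs/heads/main' }}
    steps:
      - name: Upload reference results
        if: ${{ env.REFERENCE_JOB == 'true' }}
        run: echo "this job is the reference case"
      - name: Compare against the reference results
        # Must be a string comparison; !(env.REFERENCE_JOB) would treat the
        # non-empty string 'false' as truthy and never run this step.
        if: ${{ env.REFERENCE_JOB == 'false' }}
        run: echo "this job is the comparison case"
```

Evaluating the condition once at the job level keeps each step's `if:` down to a simple string comparison.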
These overwrite each other if you don't give each one a unique name, and if multiple jobs write at once you can get corrupted files.
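(For example, a sketch of how a unique per-job artifact name avoids the collision; the step name and path are illustrative:)

```yaml
      - name: Upload regression results
        uses: actions/upload-artifact@v3
        with:
          # A per-job suffix (here the matrix OS) keeps parallel jobs from
          # writing to the same artifact and corrupting it
          name: regression_results_${{ matrix.os }}
          path: test/regression        # illustrative path
```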
This should let the CI actually notice whether the tests pass or not.
This will allow us to easily see if there's a difference when using the script on the command line (or in the CI workflow).
…sion tests. Now the regression tests compare the core and edge, not just the observables from the simulations.
Could be helpful to know code coverage even in cases when the regression tests are failing - sometimes (hopefully often!) a "regression" is an improvement, and a "fail" is thus a success.
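(One way to get that behaviour - a sketch, not necessarily what this PR does - is to let the coverage upload run regardless of the regression step's outcome; the script name is illustrative:)

```yaml
      - name: Run regression tests
        id: regression
        run: ./run_regression_tests.sh   # illustrative script name
      - name: Upload coverage report to Codecov
        # always() lets this step run even if the regression step failed, so
        # coverage is still reported when a "regression" is really an improvement
        if: always()
        uses: codecov/codecov-action@v3
```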
even though we have git logs
Mostly changing whitespace and blank lines so the logs are easier to read.
As it turns out, this is a known bug in
Otherwise, if it fails (e.g. the Julia step hangs and times out), then that automatically cancels the ubuntu-latest run, which was running along fine.
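(That cancellation is the matrix `fail-fast` behaviour; disabling it looks something like this sketch, with an illustrative matrix:)

```yaml
    strategy:
      # Don't cancel the other matrix jobs (e.g. ubuntu-latest) just because
      # one of them (e.g. the job with the Julia step) hangs and times out
      fail-fast: false
      matrix:
        os: [macos-latest, ubuntu-latest]
```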
We're using a patched version of the action-download-artifact action to try to make sure we get the latest results before running the regression comparison.
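(For context, a sketch of a download step pointing at a fork; the fork owner and ref are placeholders, not the actual values. The input names reported later in this thread match the upstream dawidd6/action-download-artifact action:)

```yaml
      - name: Retrieve last stable regression results
        # Placeholder fork and ref: the real patched reference isn't shown here
        uses: some-fork/action-download-artifact@patched
        with:
          workflow: CI.yml                  # illustrative workflow file name
          workflow_conclusion: success
          name: stable_regression_results
          path: stable_regression_results
          search_artifacts: true
```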
I switched to the patched download-artifact action you suggested and turned it on. I suggest that, once someone approves this, we merge it to main and see if it uploads a new regression result as reference, including the new RMS tests that were just merged?
Ah, the upload results step has a syntax error in the if statement. This PR fixes that.
Agreed.
Thanks for these updates - LGTM.
@@ -181,63 +187,81 @@ jobs:
          name: stable_regression_results
          path: stable_regression_results
          search_artifacts: true # retrieves the last run result, either scheduled daily or on push to main
          ensure_latest: true # ensures that the latest run is retrieved
@JacksonBurns do you know how this is meant to be used? It's reporting `Warning: Unexpected input(s) 'ensure_latest', valid inputs are ['github_token', 'workflow', 'workflow_conclusion', 'repo', 'pr', 'commit', 'branch', 'event', 'run_id', 'run_number', 'name', 'name_is_regexp', 'path', 'check_artifacts', 'search_artifacts', 'skip_unpack', 'dry_run', 'if_no_artifact_found']`
(but anyway, it's using a newer artifact now)
I'm not sure why it's doing that, but it is presumably something we can blame on having used a random version of a forked action. We could try removing this line and see if this action still works?
The overnight scheduled job uploaded a new reference result 🥳, and the CI test on Matt's PR fetched it for comparison 🥳. Unfortunately there seems to be a bug causing a crash when comparing some of the results.
🎉
I can't comment on the science behind this, but I will say that a 'failure' in the old testing scheme was never established either. The output from the regression tests was just left to the user to read. The change you made in 1a424ef makes sense, but I think it's up to us now to decide what constitutes failure.
Motivation or Problem
I suspect regression tests are often not looked at because the results are buried in the logs.
Description of Changes
The biggest change is that the regression tests now actually report a failure if a model changes significantly: either the core or edge changes, or an observable from the simulations changes significantly.
Many changes to RMG are deliberate improvements and ought to change things, so sometimes we want these "regression" tests to pass anyway - but it should probably be a deliberate act of an admin approving the pull request despite the regression, not just everyone ignoring the regression tests because nobody checks the logs. If there are too many failures and we end up with a bottleneck of approvals, or admins overriding tests too often, we can revisit the policy. At least these changes now give us the framework to detect the regressions during testing.
Other changes include
Testing
Tested it in the CI workflow.
Reviewer Tips
I tried to keep each commit self-contained with commit messages, and rebased a few times merging fixups.