Use Cram for tests #787

Merged: sim642 merged 7 commits into master from cram on Sep 27, 2022

Conversation

@sim642 (Member) commented Jul 18, 2022

Here's a wild idea I toyed with; we can either make it complete or just close this PR.

Basically, dune's Cram tests are designed for testing executables and could have some nice benefits for us:

  1. No // PARAM: magic; the parameters go directly on the Cram test command line.
  2. The entire output is checked for verbatim equality with the expected output in the Cram test. Therefore it's not limited to what our annotations can check; everything is checked. This also means that we immediately notice changes that we aren't currently even testing for!
  3. Diffs are printed for any changes.
  4. It would be insane to write the expected outputs by hand, but dune promote automatically takes the current output and records it as the expected one. Of course, this should only be done if the output change has been manually checked to be correct.
  5. Cram tests make it trivial to run multiple analyses on the same program without copying the entire program just to change // PARAM:. Just add multiple goblint commands to the same test (see the sketch after this list)!
  6. As a cherry on top, the combination of multiple goblint commands and whole-output checking means that we get per-parameter expected results for the same program for free!
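
To make this concrete, here is a minimal sketch of what such a test file could look like (the file name, the goblint options, and the recorded output below are illustrative placeholders, not actual tests from this PR). In dune's Cram format, a test is a .t file in which unindented lines are free-form comments, indented lines starting with $ are shell commands, and the indented lines following a command are its expected output, compared verbatim:

```
Analyze the same program twice with different options, without any
// PARAM: comment in the program itself.

  $ goblint --enable ana.int.interval 01-example.c
  (expected analyzer output for the interval run is recorded here)

  $ goblint --disable ana.int.interval 01-example.c
  (expected analyzer output for the non-interval run is recorded here)
```

If the actual output ever differs from what is recorded, dune runtest prints the diff (point 3), and dune promote rewrites the recorded output in place once the change has been reviewed (point 4).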

@sim642 added the testing label on Jul 18, 2022
@michael-schwarz (Member)

I think Cram is a really useful tool for testing in general, but I am not a big fan of using it here, mainly for the following reasons:

  • It means that every minuscule change to the analyzer, its solvers, or how it outputs things amounts to re-checking all the generated diffs for the benchmarks and carefully verifying that it did not become unsound in any way or lose significant precision before promoting the changed files. This is an incredibly manual process that requires inspecting hundreds (thousands?) of programs (which are sometimes obscure and odd) and their outputs, and is thus very error-prone and time-consuming.
  • It makes it harder for us to collaborate with students (or other newcomers to the codebase):
    • Currently they can use the regression tests to see if their changes break anything about the analysis, and if they do, they get a report about the exact verdict that changed. With Cram, probably a lot of tests would fail (e.g. because of some new debug output, ...), and the students would have to manually review all the changed outputs and verify they are correct (or, more realistically, they'd just look at 2-3, assume they're all ok, and promote everything, leading to bugs being discovered very late or not at all).
    • Instead of just seeing that the tests pass and are not modified, we would have to manually check the diff of expected verdicts for every PR and re-verify that the changes are benign.

To sum it up, I think this type of testing is a step in the wrong direction, and I'd suggest we stick with the semantic testing that we are doing currently.

@michael-schwarz (Member)

What are your stances here @vesalvojdani @stilscher @jerhard ?

@sim642 (Member, Author) commented Jul 19, 2022

> It means that every minuscule change to the analyzer, its solvers, or how it outputs things amounts to re-checking all the generated diffs for the benchmarks

Anything that is not an intentional change to the output should not change the output at all, and Cram tests catch exactly such changes, which we currently have no way of testing for. For example, changing the solver shouldn't change the set of warnings we produce. It should only change when some solver unsoundness is fixed (so the change is intentional) or when some solver unsoundness is introduced (so the change is accidental).

> This is an incredibly manual process that requires inspecting hundreds (thousands?) of programs (which are sometimes obscure and odd) and their outputs, and is thus very error-prone and time-consuming.

Given that catastrophic output changes should not happen, not necessarily. Cram tests scale to much bigger projects than Goblint, including dune itself.

If the outputs for our tiny regression test programs are too massive to understand manually, then that clearly highlights a usability problem of our analyzer. This would actually encourage us to consider the actual outputs, which should concisely list all the warnings and no other garbage (e.g. no spurious print_endlines left around).

> With Cram, probably a lot of tests would fail (e.g. because of some new debug output, ...)

And that's precisely the point: there shouldn't be any new debug output if the particular test case doesn't explicitly --enable warn.debug. We rarely want to test debugging information.

> Instead of just seeing that the tests pass and are not modified, we would have to manually check the diff of expected verdicts for every PR and re-verify that the changes are benign.

There is no reason to remove any of the existing testing infrastructure, though, so all the existing semantic checks would still be done (and are still being done on this branch). Moreover, we have an unknown number of "regression" tests that actually check absolutely nothing and whose output is supposed to be inspected manually; the latest such test was added in #785. Nobody ever re-runs those manually, and for many of them it's not even clear what the expected output should be, which makes it impossible to draw any conclusions from a manual run either.

This is actually a good point that I initially didn't consider myself: instead of porting all existing tests to Cram, we leave them as-is but introduce Cram tests for things that we otherwise cannot test automatically.
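
For context, the dune wiring for this is minimal; the sketch below assumes a tests/cram directory and a public goblint executable, which is an assumption about the layout rather than this PR's actual setup:

```
; dune-project: Cram tests must be enabled explicitly before dune 3.0
(cram enable)

; tests/cram/dune: make the goblint binary and the C test inputs
; available to every *.t file in this directory
(cram
 (deps %{bin:goblint} (glob_files *.c)))
```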

@sim642 changed the title from "Use Cram for regression tests" to "Use Cram for tests" on Aug 24, 2022
@sim642 marked this pull request as ready for review on August 24, 2022 14:04
@sim642 requested a review from vesalvojdani on August 24, 2022 14:09
@sim642 (Member, Author) commented Aug 25, 2022

Yet another thing Cram tests would be really useful for is witness generation, both YAML and GraphML, which is currently completely untested. It would just need an option for slightly filtered output that omits timestamps and version numbers, so that the output is stable.
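
As a rough sketch of how that could look even before such an option exists (the option name witness.yaml.enabled, the output file name witness.yml, and the filtered field names below are assumptions about goblint's interface, not something this PR adds):

```
Generate a YAML witness and record only the parts that are stable across
runs, stripping timestamp- and version-like fields before comparison.

  $ goblint --enable witness.yaml.enabled 01-example.c > /dev/null
  $ grep -v -e 'creation_time' -e 'producer' witness.yml
  (filtered witness contents are recorded here as the expected output)
```

A dedicated filtering option in goblint itself would of course be cleaner than post-processing inside every test.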

@michael-schwarz (Member) left a comment

I think that in the scope where they have now been introduced, it makes sense to have Cram tests!

@sim642 merged commit a8e476e into master on Sep 27, 2022
@sim642 deleted the cram branch on September 27, 2022 08:00
@sim642 mentioned this pull request on Oct 6, 2022 (3 tasks)
@sim642 added this to the v2.1.0 milestone on Nov 21, 2022