Use Cram for tests #787

Merged: sim642 merged 7 commits into master from cram on Sep 27, 2022

Conversation

@sim642 (Member) commented Jul 18, 2022

Here's a wild idea I toyed with; we can either make it complete or just close this PR.

Basically, dune's Cram tests are designed for testing executables and could have some nice benefits for us:

  1. No // PARAM: magic; the parameters go directly on the Cram test command line.
  2. The entire output is checked for verbatim equality with the expected output in the Cram test. Therefore it's not limited to what our annotations can check; everything is checked. This also means that we immediately notice changes that we aren't currently even testing for!
  3. Diffs are printed for any changes.
  4. It would be insane to write the expected outputs by hand, but dune promote automatically takes the current output and records it as the expected one. Of course, this should only be done if the output change has been manually checked to be correct.
  5. Cram tests make it trivial to run multiple analyses on the same program without copying the entire program just to change // PARAM:. Just add multiple goblint commands to the same test (see the sketch after this list)!
  6. As a cherry on top, the combination of multiple goblint commands and whole-output checking means that we get per-parameter expected results for the same program for free!
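
To make this concrete, here is a minimal sketch of what such a test file could look like (the file name, the goblint options, and the recorded output below are illustrative placeholders, not actual tests from this PR). In dune's Cram format, a test is a .t file in which unindented lines are free-form comments, indented lines starting with $ are shell commands, and the indented lines following a command are its expected output, compared verbatim:

```
Analyze the same program twice with different options, without any
// PARAM: comment in the program itself.

  $ goblint --enable ana.int.interval 01-example.c
  (expected analyzer output for the interval run is recorded here)

  $ goblint --disable ana.int.interval 01-example.c
  (expected analyzer output for the non-interval run is recorded here)
```

If the actual output ever differs from what is recorded, dune runtest prints the diff (point 3), and dune promote rewrites the recorded output in place once the change has been reviewed (point 4).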

@sim642 added the testing label on Jul 18, 2022
@michael-schwarz (Member)

I think Cram is a really useful tool for testing in general, but I am not a big fan of using it here, mainly for the following reasons:

  • It means that every minuscule change to the analyzer, its solvers, or how it outputs things amounts to re-checking all the generated diffs for the benchmarks and carefully verifying that it did not become unsound in any way or lose significant precision before promoting the changed files. This is an incredibly manual process that requires inspecting hundreds (thousands?) of programs (which are sometimes obscure and odd) and their outputs, and is thus very error-prone and time-consuming.
  • It makes it harder for us to collaborate with students (or other newcomers to the codebase):
    • Currently they can use the regression tests to see if their changes break anything about the analysis, and if they do, they get a report about the exact verdict that changed. With Cram, probably a lot of tests would fail (e.g. because of some new debug output, ...), and the students would have to manually review all the changed outputs and verify they are correct (or, more realistically, they'd just look at 2-3, assume they're all ok, and promote everything, leading to bugs being discovered very late or not at all).
    • Instead of just seeing that the tests pass and are not modified, we would have to manually check the diff of expected verdicts for every PR and re-verify that the changes are benign.

To sum it up, I think this type of testing is a step in the wrong direction, and I'd suggest we stick with the semantic testing that we are doing currently.

@michael-schwarz (Member)

What are your stances here @vesalvojdani @stilscher @jerhard ?

@sim642 (Member, Author) commented Jul 19, 2022

> It means that every minuscule change to the analyzer, its solvers, or how it outputs things amounts to re-checking all the generated diffs for the benchmarks

Anything that is not an intentional change to the output should not change the output at all, and Cram tests catch exactly such changes, which we currently have no way of testing for. For example, changing the solver shouldn't change the set of warnings we produce. It should only change when some solver unsoundness is fixed (so the change is intentional) or when some solver unsoundness is introduced (so the change is accidental).

> This is an incredibly manual process that requires inspecting hundreds (thousands?) of programs (which are sometimes obscure and odd) and their outputs, and is thus very error-prone and time-consuming.

Given that catastrophic output changes should not happen, not necessarily. Cram tests scale to much bigger projects than Goblint, including dune itself.

If the outputs for our tiny regression test programs are too massive to understand manually, then that clearly highlights a usability problem of our analyzer. This would actually encourage us to consider the actual outputs, which should concisely list all the warnings and no other garbage (e.g. no spurious print_endlines left around).

> With Cram, probably a lot of tests would fail (e.g. because of some new debug output, ...)

And that's precisely the point: there shouldn't be any new debug output if the particular test case doesn't explicitly --enable warn.debug. We rarely want to test debugging information.

> Instead of just seeing that the tests pass and are not modified, we would have to manually check the diff of expected verdicts for every PR and re-verify that the changes are benign.

There is no reason to remove any of the existing testing infrastructure, though, so all the existing semantic checks would still be done (and are still being done on this branch). Moreover, we have an unknown number of "regression" tests that actually check absolutely nothing and whose output is supposed to be inspected manually; the latest such test was added in #785. Nobody ever re-runs those manually, and for many of them it's not even clear what the expected output should be, which makes it impossible to draw any conclusions from a manual run either.

This is actually a good point that I initially didn't consider myself: instead of porting all existing tests to Cram, we leave them as-is but introduce Cram tests for things that we otherwise cannot test automatically.
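
For context, the dune wiring for this is minimal; the sketch below assumes a tests/cram directory and a public goblint executable, which is an assumption about the layout rather than this PR's actual setup:

```
; dune-project: Cram tests must be enabled explicitly before dune 3.0
(cram enable)

; tests/cram/dune: make the goblint binary and the C test inputs
; available to every *.t file in this directory
(cram
 (deps %{bin:goblint} (glob_files *.c)))
```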

@sim642 changed the title from "Use Cram for regression tests" to "Use Cram for tests" on Aug 24, 2022
@sim642 marked this pull request as ready for review on August 24, 2022 14:04
@sim642 requested a review from vesalvojdani on August 24, 2022 14:09
@sim642 (Member, Author) commented Aug 25, 2022

Yet another thing Cram tests would be really useful for is witness generation, both YAML and GraphML, which is currently completely untested. It would just need an option for slightly filtered output that omits timestamps and version numbers, so that the output is stable.
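
As a rough sketch of how that could look even before such an option exists (the option name witness.yaml.enabled, the output file name witness.yml, and the filtered field names below are assumptions about goblint's interface, not something this PR adds):

```
Generate a YAML witness and record only the parts that are stable across
runs, stripping timestamp- and version-like fields before comparison.

  $ goblint --enable witness.yaml.enabled 01-example.c > /dev/null
  $ grep -v -e 'creation_time' -e 'producer' witness.yml
  (filtered witness contents are recorded here as the expected output)
```

A dedicated filtering option in goblint itself would of course be cleaner than post-processing inside every test.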

@michael-schwarz (Member) left a comment

I think that in the scope where they have now been introduced, it makes sense to have Cram tests!

@sim642 merged commit a8e476e into master on Sep 27, 2022
@sim642 deleted the cram branch on September 27, 2022 08:00
@sim642 mentioned this pull request on Oct 6, 2022 (3 tasks)
@sim642 added this to the v2.1.0 milestone on Nov 21, 2022