Introduce measure-complexity command #3061
Conversation
Top level comment: we've never collectively been happy with the term "stability", and I've got a mostly-written proposal for what I feel is much better terminology. I'd prefer to block before we lock down that term in the actual toolchain. Happy to prioritize resolving the proposal ASAP so that's not a big delay.

That's fine. Consider this PR a way to make that upcoming decision more concrete :-D
I'm very excited to have this feature available in the new CLI! Most of my comments are just wording changes, but I'm also wondering whether this should be a set of additional flags to `verify` instead of a separate command.
public static readonly FormatOption Instance = new();
public override object DefaultValue => Enumerable.Empty<string>();

public override string LongName => "format";
Maybe name this option `--log-format` instead of just `--format`?
What makes you think of the word `log`? This command is expected to return data on stability, and this option changes how that data is shown.
@@ -45,6 +45,7 @@ static class CommandRegistry {
AddCommand(new RunCommand());
AddCommand(new BuildCommand());
AddCommand(new TranslateCommand());
AddCommand(new StabilityCommand());
I think I'd prefer this to be a set of options to the `verify` command rather than a separate command, though I'm open to arguments to the contrary. At the very least, though, it will be valuable to be able to analyze stability (or whatever we call it) in the context of any other set of verification options, so this command (if it's a separate command) should accept all (or almost all) of the same options as the `verify` command.
> this command should accept all (or almost all) of the same options as the verify command.

Agreed

> I think I'd prefer this to be a set of options to the verify command rather than a separate command, though I'm open to arguments to the contrary

Why do you prefer that?
Having them be separate commands means that the options that only relate to stability, of which I think there can be quite a few, can be shown at the top of the help for the stability command, which makes it easier to use.
The stability command also has different output than the verify command since it also includes a statistics report, which I would print to stdout before any diagnostics, and when printing diagnostics it might be good to include the random seed required to get that diagnostic. Having it be a separate command prepares the user for a difference in output.
}.Concat(CommandRegistry.CommonOptions);

public Command Create() {
var result = new Command("check-stability", "(experimental) Check the stability of the program proofs. Be aware that the name, options and output of this command will change in the future.");
I wonder whether it would make sense in the new CLI framework to allow certain commands or options to be programmatically marked as experimental. This would make it possible to disallow experimental features in settings where change is particularly disruptive.
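For illustration, a minimal sketch of how that could look, assuming a hypothetical `IsExperimental` marker and registry (these names are invented, not from the actual Dafny codebase):

```csharp
using System.Collections.Generic;
using System.CommandLine;

// Hypothetical sketch: each command spec carries an IsExperimental marker that the
// CLI framework consults, so experimental commands can be disallowed in settings
// where change is particularly disruptive.
interface ICommandSpec {
  Command Create();
  bool IsExperimental { get; }
}

static class ExperimentalAwareRegistry {
  private static readonly List<Command> Commands = new();

  public static void AddCommand(ICommandSpec spec, bool allowExperimental) {
    if (spec.IsExperimental && !allowExperimental) {
      return; // skip registration: experimental features are locked out here
    }
    Commands.Add(spec.Create());
  }
}
```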
public override string ArgumentName => "seed";

public override string Description =>
$"Turn on randomization of the input that Boogie passes to the SMT solver and turn on randomization in the SMT solver itself. Certain Boogie inputs are unstable in the sense that changes to the input that preserve its meaning may cause the output to change. This option simulates meaning-preserving changes to the input without requiring the user to actually make those changes. The input changes are renaming variables and reordering declarations in the input, and setting solver options that have similar effects.";
Should this mention Dafny instead of Boogie? I think replacing each instance of "Boogie" with "Dafny" would be similarly useful and wouldn't require knowledge of Boogie.
public override string ArgumentName => "n";

public override string Description =>
$"Attempt to verify each proof n times with n random seeds. If {RandomSeedOption.Instance.LongName} is used, each proof attempt will use a new random seed derived from this one. If not, it will use a different seed between iterations.";
The text "a different seed between iterations" seems a little under-specified to me. We could mention that the base random seed is 0 when not otherwise specified.
Co-authored-by: Aaron Tomb <aarotomb@amazon.com>
Besides the terminology comment, I have one other observation about moving the `dafny-reportgenerator` functionality into this command: we should ensure we maintain support somehow for distributing the verification work to multiple independent invocations of `dafny`.
Why? I think we can let a single

That's one option but it's definitely more limiting, because now
I see, thanks! How about a shard X/Y option?
A shard X/Y option seems like a great idea. The Dafny CI is already relying on that functionality from

Getting a single unified report across shards would still be better IMO, so you can easily see the most expensive batches in one place rather than reading across multiple reports. At the same time I do like the fact that we have the

Sounds good to me
DafnyOptions.RegisterLegacyBinding(Format, (o, v) => o.VerificationLoggerConfigs = v);
}
private static readonly Option<uint> Iterations = new("iterations", () => 10U,
$"Attempt to verify each proof n times with n random seeds, each seed derived from the previous one. {RandomSeed.Name} can be used to specify the first seed, which will otherwise be 0.") {
The NullReferenceException in CI is because you're referencing `RandomSeed` before it's defined below. :) Swapping the ordering of the two options fixed it for me.
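For context, this is the classic C# static-initialization-order pitfall: static field initializers run in textual order, so an earlier field can observe a later one as null. A minimal repro, using the same `Option` shape as this PR (the surrounding names are illustrative):

```csharp
using System.CommandLine;

static class OptionOrderingRepro {
  // Initialized first; reads RandomSeed while it is still null, so evaluating the
  // interpolated string throws a NullReferenceException (surfaced as a
  // TypeInitializationException when the class is first used).
  private static readonly Option<uint> Iterations = new("--iterations", () => 10U,
    $"Attempt to verify each proof n times. {RandomSeed.Name} specifies the first seed.");

  // Declared below, so it is initialized after Iterations; declaring it above fixes the crash.
  private static readonly Option<uint> RandomSeed = new("--random-seed", () => 0U,
    "Turn on randomization.");
}
```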
Still not yet convinced we want this as a separate command. Since it's not actually looking at the logged verification data, I don't feel like it's actually "check"ing anything. That makes it feel closer to just "verifying with extra parameters".
DafnyOptions.RegisterLegacyBinding(IsolateAssertions, (o, v) => o.VcsSplitOnEveryAssert = v);
DafnyOptions.RegisterLegacyBinding(Format, (o, v) => o.VerificationLoggerConfigs = v);
}
private static readonly Option<uint> Iterations = new("iterations", () => 10U,
When I try `dafny check-proof-command --help` I see these options as `iterations` instead of `--iterations`, so I think you need to include the dashes here.
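That is, assuming the same constructor shape used elsewhere in this PR, the alias string itself must carry the prefix:

```csharp
// With System.CommandLine, the leading dashes are part of the alias; without them
// the option is rendered (and parsed) as a bare word in --help.
private static readonly Option<uint> Iterations = new("--iterations", () => 10U,
  "Attempt to verify each proof n times with n random seeds.");
```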
Made sure all the options are used in tests.
};

public Command Create() {
var result = new Command("check-proof-durability", "(experimental) Check the durability of the program proofs. Be aware that the name, options and output of this command will change in the future.");
This is fine for now, but should definitely explain what "durability" means here somehow before we take it out of "experimental" mode. I'd suggest we address #3174 and then link to the reference manual.
private static readonly Option<bool> IsolateAssertions = new("isolate-assertions", @"Verify each assertion in isolation.");
private static readonly Option<List<string>> Format = new("format", $@"
Logs verification results using the given test result format. The currently supported formats are `trx`, `csv`, and `text`. These are: the XML-based format commonly used for test results for .NET languages, a custom CSV schema, and a textual format meant for human consumption. You can provide configuration using the same string format as when using the --logger option for dotnet test, such as: --format ""trx;LogFileName=<...>.z""
Suggested change:
-Logs verification results using the given test result format. The currently supported formats are `trx`, `csv`, and `text`. These are: the XML-based format commonly used for test results for .NET languages, a custom CSV schema, and a textual format meant for human consumption. You can provide configuration using the same string format as when using the --logger option for dotnet test, such as: --format ""trx;LogFileName=<...>.z""
+Logs verification results using the given test result format. The currently supported formats are `trx`, `csv`, and `text`. These are: the XML-based format commonly used for test results for .NET languages, a custom CSV schema, and a textual format meant for human consumption. You can provide configuration using the same string format as when using the --logger option for dotnet test, such as: --format ""trx;LogFileName=<...>""
ArgumentHelpName = "n"
};

private static readonly Option<uint> RandomSeed = new("random-seed", () => 0U,
Even if we do want the separate command, shouldn't these be in the shared `VerificationOptions` bucket instead?
You mean `RandomSeed` and `IsolateAssertions`? What's the use-case?

Do you think a Dafny project could be developed with `--isolate-assertions` on at all times? If not, how would you use it?

If the durability/complexity checker complains that a proof is too complex (saying a proof is not durable doesn't sound right), then you could think a user might want to do `dafny verify --random-seed=<problematic-seed>` to improve the proof. However, in that case I'd suggest they use `dafny check-proof-durability --random-seed=<problematic-seed>` instead, because then they'll be able to see whether they introduce other problematic seeds.
> Do you think a Dafny project could be developed with --isolate-assertions on at all times? If not, how would you use it?

It certainly could be, even though it's unlikely to scale well enough for typical projects. But I could even imagine degenerate cases of a small project with just a small number of really large methods where that would work well.

> (saying a proof is not durable doesn't sound right)

Agreed, and see my top-level comment.

> then you could think a user might want to do dafny verify --random-seed= to improve the proof

Yeah, I don't think we'd want to encourage that except as a short-term hack, since it's likely not addressing the root cause of proofs that are too complex. But it would be useful in those cases, especially if you really just want to know if your code is valid and aren't yet concerned with durability.
Do you want me to add the checking to it? I was planning on keeping that out of scope for this PR.

One final thought: I've actually wondered if verifying with multiple random seeds should be the default verification behavior, since it will be obviously slower but encourage more durable verification. We could offer the single pass as a "fast mode" you switch on for development, but with the understanding that it's like running code without enough tests: something you do as part of development but then catch up on before you cut a PR. Or perhaps we just ensure that when you use multiple iterations by default you still get the results of the first pass early, especially when verification succeeds.

Perhaps the ideal UX for verification should be a gradual ramp up of throwing more and more resources at the problem. You could get a quick "Success!" message or a "Hang on, still trying..." message, and in the background failing assertion batches could be quickly retried with different random seeds, or automatically split into smaller batches, or even farmed out to more powerful hosts. All of this suggests to me that perhaps we really do just want to think of this as extra configuration on `verify`.
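To make that ramp-up idea concrete, a rough sketch (entirely hypothetical, nothing like this exists in the PR): report the first pass early, then keep retrying failures with fresh seeds:

```csharp
using System;
using System.Linq;

static class RampUpSketch {
  // Hypothetical: verifyWithSeed stands in for one verification pass under a given
  // random seed. Succeed fast when possible; otherwise escalate with more seeds
  // (further escalation could split batches or farm work out to bigger hosts).
  public static bool Verify(Func<int, bool> verifyWithSeed, int maxAttempts) {
    if (verifyWithSeed(0)) {
      Console.WriteLine("Success!");
      return true;
    }
    Console.WriteLine("Hang on, still trying...");
    return Enumerable.Range(1, maxAttempts - 1).Any(verifyWithSeed);
  }
}
```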
I think I agree with you on all of the terminology above. (Other thoughts in another comment below!)

Yeah, I agree here. I was leaning toward this making more sense as an option to `verify`.
Works for me.
I don't think we want this. There should be a command to verify Dafny programs during development that tries to provide immediate feedback. Any unnecessary slowdown for it is undesired. Right now,

Also, I don't think every Dafny developer will want to measure their proof complexity. It's an advanced concept that is only relevant when you're interested in developing software over time, not when you're playing around with Dafny. Any Dafny developer that is not in a development phase where they want to measure complexity, or that doesn't want to measure complexity at all, shouldn't be confronted with options related to it.

Also, I don't like it if there are options that don't make sense unless you also use other options. For example, with

The last two options don't make sense unless you use a non-default value for the first one. Putting all of these options in a single command with

I would design

Another thing that concerns me is that for

You need to decide when to return a success value, and once you do that you can't go back and say "wait, no, it wasn't a success", because by that time the user is already doing something else.

Okay, I think I'm convinced, and the primary reason is that, as you point out,
I think there are good applications of iteration for
That's fair as long as we make it very clear that every actual Dafny project should use it, so I wouldn't label it as "advanced" only. If and when we have a
If we're talking about

I think maybe I'm convinced by the idea that

One thing worth pointing out is that it's sometimes worthwhile to have complexity measurement succeed even if some proof attempts fail, as a way of using the mechanism to gate on things becoming worse. If you allow proofs to occasionally fail but still have a limit on variability of resource use then you can prevent drastic increases in complexity without requiring that all issues be resolved immediately.

Currently, Dafny fails if any of the iterations fail, but it probably makes sense to change that behavior so that it succeeds if any of them succeed (in "verify" mode). So far, we've been delegating the decision of whether the verification is too complex by having the verification always succeed (in "complexity measurement" mode, sometimes by hacking it up with
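A small sketch of the two policies being contrasted here (hypothetical code, not the current behavior): "verify" mode succeeds if any seed succeeds, while "measure-complexity" mode could instead gate on the variability of resource use:

```csharp
using System.Collections.Generic;
using System.Linq;

// One verification attempt under a particular random seed.
record Attempt(bool Succeeded, long ResourceCount);

static class OutcomePolicySketch {
  // "verify" mode: the program is correct if any iteration proves it.
  public static bool VerifyMode(IEnumerable<Attempt> attempts) =>
    attempts.Any(a => a.Succeeded);

  // "measure-complexity" mode: tolerate occasional failures, but fail if resource
  // use varies too much across seeds, to gate on things becoming worse.
  public static bool MeasureMode(IReadOnlyList<Attempt> attempts, double maxRatio) {
    var counts = attempts.Select(a => a.ResourceCount).ToList();
    return counts.Max() <= maxRatio * counts.Min();
  }
}
```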
@atomb good point about the exit code, I'd say it's reasonable to have

We can also then later have

So that would essentially mean moving more of the

Some of the discussion has been about future work. Are there any remaining changes required from this PR right now?
It might be nice to add an `--allow-different-outcomes` flag, as we discussed, but that could wait until a future PR.
> Some of the discussion has been about future work. Are there any remaining changes required from this PR right now?

Nope, I think this is a good first step. I'm not yet sure exactly what the UX will look like for the use case of just doing the `reportgenerator` aggregation and testing based on existing log files, but I'm comfortable figuring that out in the future.

Thanks for enduring the naming bikeshedding! :)
ArgumentHelpName = "n"
};

private static readonly Option<bool> IsolateAssertions = new("--isolate-assertions", @"Verify each assertion in isolation.");
I'd like a bit more detail here personally ("Verify each assertion in its own assertion batch. See (refman section) for details.") but not enough to block on.
Arriving a bit late in the debate: thanks everyone for explaining all the reasoning, that was helpful. On the other side, in "measure-complexity", "measure" gives the impression that the outcome would be a number between 0 and 100 about whether the proofs are straightforward (0) or actually too complex (100). If we are not returning a score but a boolean, then I believe "check" is more appropriate than "measure". We can always output more information with specific flags about how to check and what to display.

Now, there is another mental model problem that the discussion went into without discussing it deeply. What this PR is doing is comparing the complexity across runs. If this complexity

What adjectives would you use to describe a process whose outcome has a low standard deviation when executed under slightly different environments? I asked this question to ChatGPT, which is a very good language model. Here are its answers:

We can't use consistency because it has another meaning when something is axiomatized. Uniformity seems strange to me. Predictability assumes that we guess the outcome in advance. So now, I understand better why "stability" was much more appropriate for this measure.
@MikaelMayer No worries about being late; we intentionally labelled the command experimental and even warned the name might change for this reason. :) Here are my thoughts. I still strongly reject "stability" because the lack of it implies something that may be unreliable, non-deterministic, fragile, etc. None of these things apply to something that is successfully verified but may not be durable under future changes. Labelling programs or verification as unstable plants the seed that they should not be trusted. I will take this argument up with ChatGPT if necessary. :D

This is completely fair, and we are intentionally under-specifying what "complexity" means here. My interpretation is that we're trying to measure the underlying complexity of the problem, because our hypothesis is that it correlates with low durability. If it involves a search through a huge space, indeed one seed might succeed quickly and another only after a very long time. But we want to say that if we get this kind of wide variance, it's a clue that the underlying complexity is high.

As for "check" vs. "measure", you do have a point. My concern was that, as is, this command only outputs data and not any kind of boolean decision. We would probably like to move the
The question is, do I reject "complexity" as much as you reject "stability"? :-) I explained my reasons why "complexity" is just not appropriate in this context. Synonyms of that would have been "measure-resources", "measure-time"... is this what this command does? If its main purpose is to just provide the results without aggregating them (like a summary), then "complexity" is fine.

It's only measuring the dimensions you name and producing the data. It could easily measure other metrics that approximate "complexity" in the future.
I don't like stability because I think the Dafny code itself is not unstable. I think the SMT solver is unstable. We want this command to tell the user something about their Dafny code, not about the SMT solver. So I think the question you should ask ChatGPT is: what do you call an input that causes a process to become unstable? For fun:

I prefer "check" as well, for the reason that you mention: that this command will return a boolean (once we update it).

We need to avoid this statement too. It makes it sound like the solver is in alpha mode and shouldn't be used in production. And if we use "unstable" to refer to a fundamental limitation of solvers, we imply they should never be used in production.

Indeed there's the risk of someone interpreting this as the solver switching between sat and unsat. Maybe (un)decisiveness is a clearer description of the solver's problem.
But you have a point: I think that solvers won't be relied on if ever called from online code in production, i.e. code that is supposed to run every millisecond. Solvers are great for solving offline challenges, such as verifying code, but their unpredictable performance makes them poor candidates for online challenges when other specialized solutions exist (and they almost always exist for online code).

If you worry about the marketing effect, I would say that not saying anything about the solver's core instability to perturbations in large inputs is even worse. The only verification engines that don't suffer from this instability are proof assistants, because proofs can be verified in a polynomial and predictable amount of time, but at the expense of providing all the details and tactics.

If proof assistants were rails on which trains can't deviate, Dafny is more like a self-driving car. It can lead you much more safely to any place than human drivers ever could, and more easily than trains, but if you ask for a minimum speed (i.e. your code is too complex, proofs are not straightforward), then Dafny will sometimes move correctly because it can find a path that satisfies your desires, but sometimes it's too complex to find this path and the Dafny car will just not move (it says it timed out when trying to verify your code). Safety first, even if the only difference was a butterfly on a stop sign. If I had a car with such instability, I would still enjoy riding it because it provides guarantees around this instability (i.e. safety).

Or take a GPS app, for example. Users know that the path is unstable (if you don't go in the provided direction, you might end up on a completely different path), but all paths lead to the goal. So the instability is not the problem. Solvers are unstable with respect to their input when it is large, but their output is always trusted as it is consistent. At worst, a solver will just say "don't know" and that's ok. Solver outputs have a partial order (don't know < unsat) and (don't know < sat), and the solver tries to maximize the output, as a constraint solver would try to maximize the number of constraints solved. And it's easy to get into NP-hard problems.

I'm not suggesting
Changes

Introduces the `measure-complexity` command and related options, see below for its help output. This command is described as experimental, since its name, options and output are expected to change in the future. We expect to merge https://github.com/dafny-lang/dafny-reportgenerator into this repository and include its features in this command.

Testing

Data
By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.