Construct fuzzer input from known result (AntiGenerator?) #91
Comments
To expand motivation here a bit: our typed value is WebAssembly, and we use arbitrary+wasmsmith combo to generate that, where this basically means that we take unstructured With the current infrastructure, bolero wants to persist this seed as a failure test case. It would be cool if we could save the actual wasm instead:
Just a bit more brain-storming about the potential benefits of this feature:
Another thought on the potential benefits of this feature: if you're running a 24/7 fuzzing server, you want to be able to pause fuzzing, update to a new commit, and resume. But with the way Arbitrary/Generators tend to work, updating is "fragile" and likely to cause the existing corpus of inputs to fail to parse, or to parse differently (losing value). With this feature, you could pause fuzzing, "reverse-generate" the corpus to a "stable corpus", update to the new commit of the software under test, "forward-generate" the stable corpus back into a fuzzing corpus, and resume fuzzing without losing the progress previously made.
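To make the pause/update/resume idea concrete, here is a minimal sketch of a corpus migration. Everything in it is an assumption made up for illustration: the Request type, the two byte layouts, and all function names are hypothetical and are not bolero APIs. Old corpus bytes are turned into values with the old generator, and those values are turned back into bytes that the new generator accepts.

```rust
/// Hypothetical structured input; stands in for whatever the harness consumes.
#[derive(Debug, Clone, PartialEq)]
struct Request {
    opcode: u8,
    length: u16,
}

/// "Old commit" generator: opcode first, then a little-endian length.
fn generate_old(bytes: &[u8]) -> Option<Request> {
    match bytes {
        [op, lo, hi, ..] => Some(Request {
            opcode: *op,
            length: u16::from_le_bytes([*lo, *hi]),
        }),
        _ => None,
    }
}

/// "New commit" generator: the layout changed to a big-endian length, then the opcode.
fn generate_new(bytes: &[u8]) -> Option<Request> {
    match bytes {
        [hi, lo, op, ..] => Some(Request {
            opcode: *op,
            length: u16::from_be_bytes([*hi, *lo]),
        }),
        _ => None,
    }
}

/// Reverse of `generate_new`: bytes that the new generator maps back to `req`.
fn antigenerate_new(req: &Request) -> Vec<u8> {
    let [hi, lo] = req.length.to_be_bytes();
    vec![hi, lo, req.opcode]
}

/// Old bytes -> stable values -> new bytes. Entries the old generator rejects are dropped.
fn migrate_corpus(old_corpus: &[Vec<u8>]) -> Vec<Vec<u8>> {
    old_corpus
        .iter()
        .filter_map(|bytes| generate_old(bytes))
        .map(|req| antigenerate_new(&req))
        .collect()
}
```

In this sketch the values themselves are the "stable corpus"; a real setup would presumably serialize them between the two commits, since the old and new generators would not live in the same binary.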
I agree; I think this would be a really powerful feature to have. If someone wants to write up a proposal of how it would fit with everything, that's probably the next step here.
The more I think about it, the harder I think this is. For instance, with an Rng driver, this feature is essentially impossible to implement. Basically, the best idea I could think of would be to compile a fuzzer that runs
But well… especially for migrating whole corpora, this would be pretty bad, as it'd mean running the fuzzer for a long time just to find which inputs were generated before.
The only case where I think this could make sense is preserving crashes rather than the whole corpus: that way regression tests stay alive. But it's probably not worth the effort of writing all the infrastructure to make that work. For this specific scenario it'd probably be better to have a cargo-bolero command that automatically generates a regression test from the crash's Debug output or similar. (Yes, such a solution would often fail, but actually running the fuzzer to generate data that produces the same input as before an update would hit the same issues.)
I don't quite follow. Certainly you have a point that it could be difficult to reverse a generator, if you think of them as black-boxes, but I think the point here is that the generators aren't black boxes. In fact, many of them are automatically derived with For example, if a generator says (in vague rust-y pseudo-code):
Then we know how to write a reverse:
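The pseudo-code from the comment above did not survive on this page, so the following is only a sketch of the kind of pair it describes, not the original snippet. The Shape type and both function names are hypothetical. The forward function reads a tag byte to pick a variant and then reads the fields, roughly the way a derived generator might; the reverse function writes the same bytes back out in the same order.

```rust
/// Hypothetical example type; not taken from bolero or from the issue.
#[derive(Debug, Clone, PartialEq)]
enum Shape {
    Circle { radius: u8 },
    Rect { width: u8, height: u8 },
}

/// Forward direction: consume bytes, produce a value (None if the input is too short).
fn generate_shape(bytes: &[u8]) -> Option<Shape> {
    let (&tag, rest) = bytes.split_first()?;
    // The tag byte selects the variant; the remaining bytes fill the fields.
    match tag % 2 {
        0 => Some(Shape::Circle {
            radius: *rest.first()?,
        }),
        _ => Some(Shape::Rect {
            width: *rest.first()?,
            height: *rest.get(1)?,
        }),
    }
}

/// Reverse direction: emit bytes that `generate_shape` maps back to `shape`.
fn antigenerate_shape(shape: &Shape) -> Vec<u8> {
    match shape {
        Shape::Circle { radius } => vec![0, *radius],
        Shape::Rect { width, height } => vec![1, *width, *height],
    }
}
```

The reverse is not unique (any even tag byte decodes to Circle here), so an anti-generator only has to pick one valid preimage, not recover the exact bytes the fuzzer originally used.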
I wanted to write up what I think this feature might look like from the perspective of the use case I have in mind. A normal bolero harness looks something like this (from the readme):
I’m imagining this feature adding something where you can do:
(Or, to use the property test generators, instead of concrete values, a
The semantics would be to take each example and backwards-generate bytes, and to emit each of those examples as files in the corpus as a step before invoking the fuzzing engine. (For
The problem this would solve is one faced by structural fuzz harnesses (i.e. ones accepting an input of a generator type, not just flat bytes). Because the "parsing" step is not stable, we can't rely on having a corpus that remains useful. If anything about the type changes (a new field, a new enum variant, or any other code change), then the “old” bytes become useless: they no longer parse, or they don't parse as the thing you wanted (to get the coverage you wanted from the fuzzer).
With this feature, you can have the stable corpus written as examples in code, so it's maintained normally, and the “reverse” feature emits it as files in the corpus before starting the fuzzer. If the “parsing” changes, the reversed bytes change in tandem, and things stay stable. (Or maybe you don't even need to maintain a corpus, because the property test generator can generate a seed one for you!)
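As a rough illustration of the "emit examples into the corpus before fuzzing" step, here is a sketch that uses only the standard library. The seed_corpus helper, the antigenerate_pair function, and the file naming are all assumptions for this example; nothing here is an existing bolero or cargo-bolero feature.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical reverse function for a `(u8, bool)` input: one byte per field.
fn antigenerate_pair(value: &(u8, bool)) -> Vec<u8> {
    vec![value.0, value.1 as u8]
}

/// Write one corpus file per example and return how many were written.
/// File names here are arbitrary; real fuzzing engines have their own conventions.
fn seed_corpus<T>(
    corpus_dir: &Path,
    examples: &[T],
    antigenerate: impl Fn(&T) -> Vec<u8>,
) -> io::Result<usize> {
    fs::create_dir_all(corpus_dir)?;
    for (index, example) in examples.iter().enumerate() {
        let path = corpus_dir.join(format!("example-{index}"));
        fs::write(path, antigenerate(example))?;
    }
    Ok(examples.len())
}
```

Because the files are regenerated from the in-code examples on every run, a change to the byte layout only requires updating the reverse function, not the examples themselves.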
I'm completely with you about the advantages of having an anti-generator; it's actually the reason why I opened this feature idea 😅 But there are three fundamental issues with the idea of an anti-generator.
First, it just cannot work with the rng-based drivers (used by the proptest generator). (It can't work with kani either, but I guess that would be expected.) Maybe not too big a deal, as the proptest generator could first run the examples and then start actually generating values, like it does with the fuzzer corpus.
Second, and deeper: even for the byte slice driver, used by fuzzers, it is very possible that the value for which you're trying to generate a byte slice just cannot be generated by the driver. For instance, if there's a vec that gets generated with length
And then there is the third and maybe most important issue: what should happen for the things that are just hard to anti-generate? Anything that is currently derived without attributes can probably be anti-generated, with enough work put into the macros (right now it would seem relatively OK for DriverMode::Direct ByteSliceDrivers, but the code is missing depth handling, which breaks generating recursive types). But not everything can be derived; we already have stuff that we need to generate from
It's a huge amount of work, even ignoring the issues of anything that relies on a non-bolero generator; hence my previous message. Now if you feel ready to try tackling this problem, it'd be awesome! (Also, FWIW, in my experience adding a variant to an enum does not usually seem to break the corpus, though I have not tried actually verifying it.)
Ah we were maybe talking past each other, because I think we agree on this. I don't think of this as a problem because anti-generation seems like a fuzzing-only (or at least, "dealing with corpuses"-only) feature to me.
Agreed, failure is an option. I think failure can be usefully and effectively "contained" though. For instance, if you have to use e.g. For constraints like the length limit, I think that'll generally not be a problem. You typically wouldn't get a value that violates those constraints in the first place. (e.g. the fuzzer wouldn't ever generate one) If the user manually writes a value that violates constraints, I'd suggest the most useful (and safest initial) semantics is actually that it'd error out eagerly. (e.g. (I do wonder if the failure error type might be able to return a
All good questions. I agree we'd want to see an anti-generator be just as adaptable as generators are at dealing with ad-hoc additions. It might be unpleasant (and perhaps unstable) to figure out how to write an anti-generator for a type from another crate, but if that capability doesn't presently exist, then someone's got to do it... Or perhaps the capability will appear in Arbitrary someday as well: rust-fuzz/arbitrary#44
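For the "error out eagerly" semantics suggested above when a hand-written value violates a generator constraint, a sketch could look like the following. The length limit, the error type, and the function names are all assumptions chosen for the example; they do not correspond to anything in bolero.

```rust
/// Hypothetical error type for values a generator could never have produced.
#[derive(Debug, PartialEq)]
enum AntiGenError {
    LengthOutOfRange { len: usize, max: usize },
}

/// Assumed constraint for this sketch: the generator only produces 0..=MAX_LEN items.
const MAX_LEN: usize = 3;

/// Forward: the first byte picks a length in 0..=MAX_LEN, the following bytes are the items.
fn generate_items(bytes: &[u8]) -> Option<Vec<u8>> {
    let (&len_byte, rest) = bytes.split_first()?;
    let len = len_byte as usize % (MAX_LEN + 1);
    rest.get(..len).map(|items| items.to_vec())
}

/// Reverse: fail eagerly instead of silently emitting bytes that decode to something else.
fn antigenerate_items(items: &[u8]) -> Result<Vec<u8>, AntiGenError> {
    if items.len() > MAX_LEN {
        return Err(AntiGenError::LengthOutOfRange {
            len: items.len(),
            max: MAX_LEN,
        });
    }
    let mut bytes = vec![items.len() as u8];
    bytes.extend_from_slice(items);
    Ok(bytes)
}
```

Values that came out of the fuzzer always satisfy the constraint, so the error path should only be reachable for values written by hand.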
That'd be cool too! FWIW the issues listed in rust-fuzz/arbitrary#94 could be helped by #90; such a generator might be much easier to write anti-generators for than TLV-based ones, because the API encodes much more information :) But #90 also has its own API questions (feel free to comment there too!), so it's not a silver bullet (yet) 😅 (That, plus implementing this feature even after #90 is done, will still be quite a complex project.)
Overall, for anti-generation, I think that as soon as the call graph of
Actually this makes me think maybe the whole
Oh and another problem that just came to mind: if a generator can
This may be a far-fetched idea, but I'm just dropping it here in case it'd make sense: what would you think about the idea of making an "anti-generator" function in the generator traits, that'd look like:
Then, one could use this generate_test_case function to build a crash fuzzer test from a known-bad value of that type.
TBH I'm not sure it's actually worth the hassle (as theoretically just running the fuzzer should find the input automagically), but I wanted to drop this idea so we could discuss it :)
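The signature proposed in this post was lost from the page, so here is only one guess at what a trait-level generate_test_case could look like. The trait name, the byte-slice input, and the Option return type are assumptions; bolero's real generator traits are shaped differently (they draw from a driver, not directly from a byte slice).

```rust
/// Hypothetical trait; bolero's real generator traits have a different shape.
trait ReversibleGenerator {
    type Output;

    /// Forward direction: build a value from raw fuzzer bytes.
    fn generate(&self, bytes: &[u8]) -> Option<Self::Output>;

    /// Reverse direction: bytes that `generate` would turn into `value`, if any exist.
    fn generate_test_case(&self, value: &Self::Output) -> Option<Vec<u8>>;
}

/// Generates a `u8` in `lo..=hi` from a single input byte (assumes `lo <= hi`).
struct InRange {
    lo: u8,
    hi: u8,
}

impl ReversibleGenerator for InRange {
    type Output = u8;

    fn generate(&self, bytes: &[u8]) -> Option<u8> {
        // Widen to u16 so the full range 0..=255 does not overflow the span.
        let span = (self.hi - self.lo) as u16 + 1;
        let offset = (*bytes.first()? as u16 % span) as u8;
        Some(self.lo + offset)
    }

    fn generate_test_case(&self, value: &u8) -> Option<Vec<u8>> {
        if (self.lo..=self.hi).contains(value) {
            Some(vec![*value - self.lo])
        } else {
            // The value lies outside what this generator can ever produce.
            None
        }
    }
}
```

A regression test could then call generate_test_case on a known-bad value and store the resulting bytes as a crash input, with None signalling that the value is not reachable through this generator.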