Tracking issue for the Cranelift CLIF-level differential fuzzer #3050
We've been having performance issues with fuzzgen, so I decided to implement a statistics framework and measure 🔬 a bunch of stuff! Be warned, this comment is really long; these are the notes that I took while doing these improvements.

TL;DR: We have issues with input formats, and we waste about 60% of our inputs. I made some improvements that doubled fuzzgen's performance.

All of these measurements were done on AArch64 (Ampere Altra) running on a single core, mostly because it was easier for me to run them there.
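The framework itself isn't shown in this thread, but roughly, the shape I mean is a few shared counters bumped from the fuzz target and dumped periodically. A minimal sketch (invented names, not the actual cranelift-fuzzgen code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counters shared across fuzz iterations (libFuzzer runs iterations in one
/// process by default, so plain atomics are enough).
#[derive(Default)]
pub struct Statistics {
    pub total_inputs: AtomicU64,
    pub valid_inputs: AtomicU64,
}

impl Statistics {
    /// Print a summary every `period` inputs so progress shows up in the
    /// fuzzer's log without drowning it.
    pub fn maybe_print(&self, period: u64) {
        let total = self.total_inputs.load(Ordering::Relaxed);
        if total == 0 || total % period != 0 {
            return;
        }
        let valid = self.valid_inputs.load(Ordering::Relaxed);
        println!(
            "Total inputs: {total}, valid: {valid} ({:.1}%)",
            valid as f64 / total as f64 * 100.0
        );
    }
}
```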
I am so excited about all this work you did over the weekend! I need to dig into it all. But until then, I'll reply to #4824 (comment) here:
Any instructions with at least one operand which is not a "value". For example, stack load/store, with their need for a stack slot operand. Or function calls, which need a function reference, and also have a type signature that depends on the selected function reference. My WIP patch handles stack slots in 0c9689b#diff-226e0f113ddbbcedaa5265109a951e52d30aa35307537684f56a5e6bedc9ba0eR369-R390 but I didn't get to handling calls yet.
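For illustration, here's a minimal sketch of what handling a non-value operand can look like (invented names and types, not the WIP patch itself): only choose among stack slots big enough for the access, and signal the caller to pick a different instruction when none fit.

```rust
use arbitrary::{Result, Unstructured};

/// Stand-in for a Cranelift stack-slot handle plus its declared size.
#[derive(Clone, Copy)]
struct SlotInfo {
    slot: u32,
    size: u32,
}

/// Choose a slot and an in-bounds offset for an `access_bytes`-wide access.
fn pick_stack_operand(
    u: &mut Unstructured<'_>,
    slots: &[SlotInfo],
    access_bytes: u32,
) -> Result<Option<(u32, u32)>> {
    // Only slots large enough for this access are valid choices.
    let candidates: Vec<SlotInfo> = slots
        .iter()
        .copied()
        .filter(|s| s.size >= access_bytes)
        .collect();
    if candidates.is_empty() {
        // Caller should select a different instruction instead of failing.
        return Ok(None);
    }
    let s = *u.choose(&candidates)?;
    // Any offset with offset + access_bytes <= size stays in bounds.
    let offset = u.int_in_range(0..=(s.size - access_bytes))?;
    Ok(Some((s.slot, offset)))
}
```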
I hope #4863 by itself didn't affect any of these numbers. It may have changed which stack slot is selected in some tests, but otherwise, if it changed how things are generated, then I probably introduced a bug. 😅 But it looks like you measured #4862 and #4863 together and saw a 1% increase in well-formed inputs. I don't think that's surprising from either #4862 or random noise, so that's reassuring.

As for "Better Instruction Selection": by my count, going from 38k well-formed inputs to 41k is an 8% increase (40927 / 37680 is 108%). Going from 38% to 41% is a difference of 3 percentage points, but I'm not sure that's the best way to measure that change.

The "Mostly Forward Branching" work has very promising-looking results!
How about continuing on to the remaining test inputs for the given function? If any of them don't trap then at least we've successfully tested something. As a bigger project, maybe we should try doing something like wasm-smith's new no-trapping mode.
I'm not sure yet; I'm thinking about it. Some observations that might help:
So I did this last one: I recorded backtraces for the rejected inputs. See this branch for the backtrace printing. With 150k execs we got the following results (reordered for better readability): …

I also tried grouping by the first two stack frames, and about 30% come from …
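For reference, the general shape of that backtrace tallying (a sketch with invented names, not the linked branch): capture a backtrace at the rejection site and key a counter on its leading frames.

```rust
use std::backtrace::Backtrace;
use std::collections::HashMap;

/// Tally where invalid inputs get rejected, keyed by the leading stack frames
/// so that minor differences deeper in the call path group together.
fn record_rejection(counts: &mut HashMap<String, u64>, frames_to_keep: usize) {
    let bt = Backtrace::force_capture().to_string();
    // The Display output has roughly two lines per frame (symbol + location),
    // so take twice as many lines as frames we want to keep.
    let key = bt
        .lines()
        .take(frames_to_keep * 2)
        .collect::<Vec<_>>()
        .join("\n");
    *counts.entry(key).or_insert(0) += 1;
}
```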
This is so cool! The numbers in #4894 are confusing me though. The number of valid inputs doesn't agree with the number of stack traces. For example, the baseline reported in that PR had 31,622 invalid inputs according to the stats, but only 20,139 stack traces. Were those from different runs, or are there invalid inputs that are not being measured?

Also: of failures during the baseline run, roughly 27% were due to choosing an impossible terminator, and the other 73% were from instruction selection. After the PR, you captured about 71% as many stack traces, which is close enough to 73% to seem plausibly like variation due to random chance. So I'd assumed the PR would have 71-73% as many invalid inputs. Instead it had about 92% as many invalid inputs, so the effect of the PR on the number of valid inputs was rather smaller than I thought it would be. I think this is consistent with a bunch of invalid inputs not capturing stack traces.

I've written a throwaway patch for the …. You should get similar results as your current branch without having to wrap your …. This still misses stack traces for errors which originated from ….

By the way, do you need to limit the number of stack frames at all? I'd expect comparing the entire stack trace to work fine in this case because there aren't many control flow paths in cranelift-fuzzgen. So on Linux/Cygwin/WSL I'd do something like …
It may be easy when you're doing the rest of the instruction selection work, but I agree, at 0.5% there's no rush. That said, in #4894 you measured double-digit percentages of failures in …
That all sounds great. I don't have a strong attachment to the style I used. The main thing I liked about it was capturing borrows of the appropriate slices in each option. I liked being able to avoid looking up resources when generating an instruction, by remembering what resources I already looked up when generating the list of choices. If there's a lookup in both places, I worry that they could get out of sync. So if you find that it's easy to do that while filtering options, I'd like that; but if it's not easy, I don't care that much.
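To make the style concrete, here's roughly what "capturing borrows of the appropriate slices in each option" looks like — invented names, a sketch rather than the actual code:

```rust
/// Each option captures a borrow of the resources that were already validated
/// when the list of choices was built, so generation can't disagree with
/// selection.
enum Choice<'a> {
    StackAccess { slots: &'a [u32] }, // slots already known to fit the access
    Call { funcs: &'a [u32] },        // functions already known to be callable
}

fn generate(choice: &Choice<'_>) {
    match choice {
        // No second lookup here: reuse the slice captured at selection time.
        Choice::StackAccess { slots } => {
            let _slot = slots[0];
        }
        Choice::Call { funcs } => {
            let _func = funcs[0];
        }
    }
}
```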
I don't understand: how do we catch interpreter bugs if we silently throw out cases where the interpreter traps? Are these bugs where the interpreter should have trapped but it didn't, so we only saw the trap during native-code execution?

To make sure we're talking about the same thing: "no-trapping mode" means generating extra code to ensure that the inputs are always valid for instructions that may trap. It doesn't mean disabling instructions that could trap. For example, you could bitwise-or 1 into the denominator of a division, so that it's never 0 regardless of how the value was generated.
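In plain Rust rather than CLIF, just to illustrate the trick:

```rust
/// Force a non-zero divisor by or-ing in the low bit. In CLIF this would be a
/// `bor` of the divisor with an `iconst 1`, emitted just before the division.
fn safe_div(n: i64, d: i64) -> i64 {
    let d = d | 1; // can never be 0 now, whatever value the fuzzer generated
    // Note: i64::MIN / -1 still overflows, so signed division needs one more
    // guard than this sketch shows.
    n / d
}
```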
I think this was an issue with #4894. I've re-run the new benchmarks with your patch, and it looks like we are now getting the expected results. See: #4894 (comment)
No issues; it all seemed to work fine. It was a bit slow with 70k inputs, but no big deal.
I have no explanation for this. The 150k run above did start from a slightly larger corpus that I kept using while developing, but the one in #4894 is the original OSS-Fuzz corpus; maybe that makes a difference? That one is also only 50k runs, and maybe that figures into it too? To be honest, I'm kind of at a loss here.
I'll try to work something out where we can preserve those constraints. I agree, I don't like the lookup in both places.
Yes, that's exactly it. So far it has happened 2 or 3 times (not sure) and it was in ….

Another reason is that eventually I want to work on correct trapping semantics for the interpreter and test those via the fuzzer, and building the possible test cases now is easier than trying to do it all later.
Yeah, and that is pretty much how I'm planning to fix …
Here's what I played around with regarding generating the Opcodes: I tried to use the constraints encoded in the opcodes to generate valid signatures. IIRC they were too underspecified, and at some point I stopped looking into it, but I don't remember a lot. It does work for some opcodes, I do remember that. It also needs a custom branch of Cranelift with some additional interfaces on some types. Repository: Custom Branch
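To illustrate the idea (heavily simplified, with invented names — the real constraint language in cranelift-meta is much richer): encode each opcode's allowed controlling types, then sample a signature only from what the opcode permits.

```rust
use arbitrary::{Result, Unstructured};

#[derive(Clone, Copy)]
enum Ty {
    I8,
    I16,
    I32,
    I64,
}

/// A stand-in for the real, richer per-opcode constraints in cranelift-meta.
struct OpcodeConstraints {
    name: &'static str,
    /// Controlling types this opcode accepts.
    allowed: &'static [Ty],
}

/// Sample an opcode, then a controlling type that opcode actually permits,
/// so the resulting signature is valid by construction.
fn pick_signature<'a>(
    u: &mut Unstructured<'_>,
    ops: &'a [OpcodeConstraints],
) -> Result<(&'a str, Ty)> {
    let op = u.choose(ops)?;
    let ty = *u.choose(op.allowed)?;
    Ok((op.name, ty))
}
```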
In #3038 we introduced the initial version of the Cranelift CLIF-level differential fuzzer.
This fuzzer generates CLIF modules that are run on the interpreter and subsequently on the host machine (assuming no traps / invalid memory accesses), comparing the outputs of each run.
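In outline, the comparison loop looks something like this — a simplified sketch of the oracle described above, not the actual fuzz target:

```rust
/// Simplified shape of the differential check: the interpreter result is the
/// ground truth, trapping inputs are skipped, and everything else must match.
fn run_differential(
    inputs: &[Vec<i64>],
    interp: impl Fn(&[i64]) -> Option<Vec<i64>>, // None = interpreter trapped
    native: impl Fn(&[i64]) -> Vec<i64>,
) {
    for args in inputs {
        // Skip inputs where the interpreter traps / hits an invalid access.
        let Some(expected) = interp(args) else { continue };
        let actual = native(args);
        assert_eq!(expected, actual, "interpreter and host machine disagree");
    }
}
```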
Roadmap:

- `br_table`'s and other jump table jumps (Cranelift CLIF Fuzzer add jump tables and `br_table` #3299)
- `table_addr` in interpreter (#4433)
- `notrap` `MemFlags`
- `aligned` `MemFlags`
- `readonly` `MemFlags`
- `cranelift-meta` crate to provide a table of acceptable opcodes and types
- `SourceLoc` to instructions