Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for the Cranelift CLIF-level differential fuzzer #3050

Open
10 of 32 tasks
afonso360 opened this issue Jul 1, 2021 · 7 comments
Open
10 of 32 tasks

Tracking issue for the Cranelift CLIF-level differential fuzzer #3050

afonso360 opened this issue Jul 1, 2021 · 7 comments
Labels
cranelift Issues related to the Cranelift code generator fuzzing Issues related to our fuzzing infrastructure

Comments

@afonso360
Copy link
Contributor

afonso360 commented Jul 1, 2021

In #3038 we introduced the initial version of the Cranelift CLIF-level differential fuzzer.

This fuzzer generates CLIF modules that are run on the interpreter and subsequently on the host machine (assuming no traps / invalid memory accesses) comparing the outputs of each run.

Roadmap:

@github-actions
Copy link

Subscribe to Label Action

cc @fitzgen

This issue or pull request has been labeled: "fuzzing"

Thus the following users have been cc'd because of the following labels:

  • fitzgen: fuzzing

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@afonso360
Copy link
Contributor Author

afonso360 commented Sep 4, 2022

We've been having performance issues with fuzzgen. So I decided to implement a statistics framework and measure 🔬 a bunch of stuff!

Be warned, this comment is really long. These are the notes that I took while doing these improvements.

TL;DR: We have issues with input formats, we waste about 60% of our inputs. Did some improvements that doubled fuzzgen's performance.

Setup

All of these measurements are done on AArch64 (Ampere Altra) running on a single core. Mostly because it was easier for me to run them there.

All the measurements are started from the OSS-Fuzz corpus which is 10744 files (50MB). I also don't pass any -max_len flags since the fuzzer is selecting a default of about 1MB which IIRC was what Alex Crichton mentioned OSS-Fuzz uses.

#4868 introduces a statistics framework for fuzzgen to allow us to measure what is going on and where we are losing performance.

All of these steps cause a fuzzer input format change, but so does the baseline due to #4867 so hopefully they are all somewhat on equal standing? I'm not sure how to avoid / compensate for this but I don't think it turned out to be too relevant.

For all of these I ran them with the following command:

cargo fuzz run cranelift-fuzzgen --no-default-features -- -runs=100010 -seed=1588797004

Baseline

Baseline is main with the following PR's on top:

The aarch64 ones were necessary for stable execution.

#99818  REDUCE cov: 30635 ft: 178125 corp: 5452/5451Kb lim: 48702 exec/s: 18 rss: 1510Mb L: 58/45498 MS: 2 EraseBytes-PersAutoDict- DE: "\x01\x00\x00\x00\x00\x00\x00\x0d"-
== FuzzGen Statistics  ====================
Total Inputs: 100000
Well Formed Inputs: 48458 (48.5%)
Total Runs: 76648
Successful Runs: 38549 (50.3% of Total Runs)
Timed out Runs: 32679 (42.6% of Total Runs)
Traps:
        stk_ovf: 0 (0.0% of Total Runs)
        heap_oob: 0 (0.0% of Total Runs)
        icall_null: 0 (0.0% of Total Runs)
        int_ovf: 0 (0.0% of Total Runs)
        unreachable: 0 (0.0% of Total Runs)
        interrupt: 0 (0.0% of Total Runs)
        heap_misaligned: 0 (0.0% of Total Runs)
        bad_sig: 0 (0.0% of Total Runs)
        int_divz: 5420 (7.1% of Total Runs)
        bad_toint: 0 (0.0% of Total Runs)
        table_oob: 0 (0.0% of Total Runs)

Note
Each "Well Formed Input" can have multiple "Runs", each run is an input to a fuzzgen generated function by the interpreter and if successful also a host execution with that same input.

This is surprising to me in a bunch of ways!

  • We are indeed dropping a lot of inputs as malformed (52.5%)

  • 42.6% of runs are timeouts!

    • I did not expect this!
  • Traps are about 7.1% of runs

    • Not great, not terrible, lets look at the 50%'s first
    • Also important to note that when a run trap's the entire input is dropped, so this also roughly translates into 7% of inputs being dropped.
  • 0% Invalid memory accesses! Awesome!

Applying @jameysharp recent patches

The previous PR's plus:

#99897  NEW    cov: 30628 ft: 177683 corp: 5448/5433Kb lim: 48702 exec/s: 19 rss: 1735Mb L: 6707/32810 MS: 1 ChangeBinInt-
== FuzzGen Statistics  ====================
Total Inputs: 100000
Well Formed Inputs: 49089 (49.1%)
Total Runs: 155175
Successful Runs: 116304 (75.0% of Total Runs)
Timed out Runs: 33061 (21.3% of Total Runs)
Traps:
        heap_oob: 0 (0.0% of Total Runs)
        heap_misaligned: 0 (0.0% of Total Runs)
        bad_sig: 0 (0.0% of Total Runs)
        bad_toint: 0 (0.0% of Total Runs)
        icall_null: 0 (0.0% of Total Runs)
        int_ovf: 5 (0.0% of Total Runs)
        unreachable: 0 (0.0% of Total Runs)
        int_divz: 5805 (3.7% of Total Runs)
        stk_ovf: 0 (0.0% of Total Runs)
        table_oob: 0 (0.0% of Total Runs)
        interrupt: 0 (0.0% of Total Runs)

This increases the amount of runs that we do by a lot! It also seems neutral on execs/s and coverage.

The percentage of div by zero traps probably goes down because we stop running the input function as soon as one trap is found, and don't execute the rest of the inputs. Timed out runs have the same issue.

Also we now hit a int_ovf trap, I suspect we could also reach that without these changes but it makes it more likeley due to more inputs, so that a nice win!

Mostly Forward Branching

I haven't yet submitted a PR for this since I want to clean it up a little bit before submitting (WIP commit here).

The idea is to try to always generate forward branches with a small amount of backwards branches. This should reduce the amount of infinite loops we produce.

#99983  REDUCE cov: 29993 ft: 166091 corp: 4265/5351Kb lim: 393617 exec/s: 44 rss: 1316Mb L: 850/393581 MS: 1 EraseBytes-
== FuzzGen Statistics  ====================
Total Inputs: 100000
Well Formed Inputs: 37680 (37.7%)
Total Runs: 792534
Successful Runs: 785981 (99.2% of Total Runs)
Timed out Runs: 893 (0.1% of Total Runs)
Traps:
        heap_oob: 0 (0.0% of Total Runs)
        heap_misaligned: 0 (0.0% of Total Runs)
        bad_sig: 0 (0.0% of Total Runs)
        bad_toint: 0 (0.0% of Total Runs)
        int_divz: 5660 (0.7% of Total Runs)
        unreachable: 0 (0.0% of Total Runs)
        stk_ovf: 0 (0.0% of Total Runs)
        table_oob: 0 (0.0% of Total Runs)
        icall_null: 0 (0.0% of Total Runs)
        interrupt: 0 (0.0% of Total Runs)
        int_ovf: 0 (0.0% of Total Runs)

A 130% improvement in execs/s (19 -> 44)! We were indeed wasting a lot of time in timed out runs.

We now timeout on about 2.3% of well formed inputs, which is a lot better. The 0.1% number is misleading because as soon as a timeout is found we instantly drop the entire input, so we find much fewer of these cases when comparing to runs.

The number of runs has exploded (785k)! Eventually we may want to tune this if we find we are spending too much time in runs vs inputs.

Better Instruction Selection

This is @jameysharp's WIP commit that tries to make the input format more resilient by never trying to insert opcodes unless we have the input data types for it.

See this comment for a longer explanation

#99998  REDUCE cov: 30503 ft: 168926 corp: 4522/5062Kb lim: 1048576 exec/s: 29 rss: 1548Mb L: 367/394877 MS: 5 ChangeBit-ChangeBit-ChangeByte-CMP-EraseBytes- DE: "\x01\x00\x00q"-
== FuzzGen Statistics  ====================
Total Inputs: 100000
Well Formed Inputs: 40927 (40.9%)
Total Runs: 739299
Successful Runs: 726769 (98.3% of Total Runs)
Timed out Runs: 416 (0.1% of Total Runs)
Traps:
        int_ovf: 9 (0.0% of Total Runs)
        bad_sig: 0 (0.0% of Total Runs)
        icall_null: 0 (0.0% of Total Runs)
        table_oob: 0 (0.0% of Total Runs)
        int_divz: 12105 (1.6% of Total Runs)
        heap_oob: 0 (0.0% of Total Runs)
        heap_misaligned: 0 (0.0% of Total Runs)
        stk_ovf: 0 (0.0% of Total Runs)
        unreachable: 0 (0.0% of Total Runs)
        bad_toint: 0 (0.0% of Total Runs)
        interrupt: 0 (0.0% of Total Runs)

A 3% improvement in Well Formed Inputs (I also ran this a few more times but it was always within 3-4%)

I don't understand why our int_divz numbers exploded here (Aprox 29% of "Well Formed Inputs"!), but its something we definitely have to improve.

Please don't derive any sort of performance measurements from this, I implemented it in probably the worst way possible. It was just to check the "Well Formed Inputs" difference.

End notes

Here's the branch that I used for all of these, In the mean time I've iterated on some of those and separated them into individual PR's.

  • Generating "Well Formed Inputs" remains a big issue and I think its probably going to take a few iterations improving various parts of our input format, like @jameysharp did for instruction selection (Thanks!)

    • We lose about 60% of our inputs to this!
    • Is there a good way to collect stats about where we are trying to select something that is not available? Or why arbitrary decided our input wasn't valid?
  • With some of the other issues solved int_divz is now also a big issue.

    • We get 12105 on a Well Formed Input size of 40927 (29%)!
    • This means that 29% of input programs were dropped due to a div by zero, and I would guess that it was on the first execution for a lot of them. Which means that we potentially never even run them in the host.

@jameysharp
Copy link
Contributor

I am so excited about all this work you did over the weekend! I need to dig into it all. But until then, I'll reply to #4824 (comment) here:

You could still use something like the current opcodes table, as long as you have a way of filtering it to ensure that all the resources needed are available for each opcode. I don't think just input/output type signatures are sufficient for that.

Is there any opcode that you think could be problematic? I'm not seeing why we couldn't do that based on signature.

Any instructions with at least one operand which is not a "value". For example, stack load/store, with their need for a stack slot operand. Or function calls, which need a function reference, and also have a type signature that depends on the selected function reference. My WIP patch handles stack slots in 0c9689b#diff-226e0f113ddbbcedaa5265109a951e52d30aa35307537684f56a5e6bedc9ba0eR369-R390 but I didn't get to handling calls yet.

I think this is important because I suspect that we're currently discarding a lot of inputs due to, say, not generating any I128 variables up front but then trying to insert an I128 "add", or things like that. I'm betting that we'll find more bugs with the same number of test executions if we fix this. But I haven't done any measurements to check this guess.
Either way, I think it's nice to apply the patches that allow just borrowing slices everywhere. So I've opened #4863 with those. The remaining work-in-progress part is no longer on that branch but is still accessible as 0c9689b.

I did some measurements on these! With #4863 and the WIP commit we get a 3-4% improvement in the number of valid inputs that we generate (vs total inputs). While these are not earth shattering numbers, I think we definitely should either commit this or something like it.

I hope #4863 by itself didn't affect any of these numbers. It may have changed which stack slot is selected in some tests but otherwise if it changed how things are generated then I probably introduced a bug. 😅

But it looks like you measured #4862 and #4863 together and saw a 1% increase in well-formed inputs. I don't think that's surprising from either #4862 or random noise, so that's reassuring.

As for "Better Instruction Selection": by my count, going from 38k well-formed inputs to 41k is an 8% increase (40927 / 37680 is 108%). Going from 38% to 41% is a difference of 3 percentage points but I'm not sure that's the best way to measure that change.

The "Mostly Forward Branching" work has very promising-looking results!

Also important to note that when a run trap's the entire input is dropped, so this also roughly translates into 7% of inputs being dropped.

How about continuing on to the remaining test inputs for the given function? If any of them don't trap then at least we've successfully tested something.

As a bigger project, maybe we should try doing something like wasm-smith's new no-trapping mode.

Is there a good way to collect stats about where we are trying to select something that is not available? Or why arbitrary decided our input wasn't valid?

I'm not sure yet; I'm thinking about it. Some observations that might help:

  • There's a general goal in arbitrary that it shouldn't fail (rust-fuzz/arbitrary@1d2f653). Unstructured::choose may be the only method that ever actually fails, though I haven't checked. If so, we primarily need to audit for calls to choose where we might pass in an empty slice, and avoid doing that. We can also try replacing ? operators with .unwrap() and see if any fail. Or maybe print a stack trace before returning the error, with std::backtrace::Backtrace.
  • I've observed that we discard an input if CLIF validation fails after the NaN canonicalization pass. I think we should defer canonicalization until we're about to run the function, instead of while we're generating it. And probably panic instead of ignoring the error.
  • In general, let's try to minimize the number of ways that test-case generation can fail, so that cargo fuzz fmt is reliable.

@afonso360
Copy link
Contributor Author

afonso360 commented Sep 11, 2022

There's a general goal in arbitrary that it shouldn't fail (rust-fuzz/arbitrary@1d2f653). Unstructured::choose may be the only method that ever actually fails, though I haven't checked. If so, we primarily need to audit for calls to choose where we might pass in an empty slice, and avoid doing that. We can also try replacing ? operators with .unwrap() and see if any fail. Or maybe print a stack trace before returning the error, with std::backtrace::Backtrace.

So I did this last one, I recorded backtraces for Unstructured::choose where we fail due to a empty choice. And from the numbers it appears to be most of the cases where fail to generate an input. It tracks with previous measurements of being about 50% of the inputs.

See this branch for the backtrace printing.
I used this script to group and print the backtraces.

With 150k execs we got the following results: (Reordered for better readability)

Loaded 71331 stacks  
                     
First Line duplicates
55.8 - Count: 39787 - Line:    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
15.0 - Count: 10731 - Line:    0: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_target_block
13.4 - Count:  9554 - Line:    0: cranelift_fuzzgen::function_generator::FunctionGenerator::stack_slot_with_size
 9.8 - Count:  6974 - Line:    0: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_jumptables
 3.3 - Count:  2381 - Line:    0: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_switch    
 2.2 - Count:  1561 - Line:    0: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_br_table  
 0.5 - Count:   343 - Line:    0: cranelift_fuzzgen::function_generator::insert_call
I also tried grouping by the first two stack frames, and about 30% come from insert_opcode which is not unexpected.
Two Line duplicates
29.6 - Count: 21108 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::insert_opcode
12.8 - Count: 9163 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_target_block
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_bricmp
9.8 - Count: 6974 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_jumptables
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate
7.9 - Count: 5615 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::stack_slot_with_size
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_load_store_address
7.3 - Count: 5209 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_bricmp
4.8 - Count: 3401 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::stack_slot_with_size
    1: cranelift_fuzzgen::function_generator::insert_stack_store
4.0 - Count: 2833 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::insert_stack_store
3.7 - Count: 2642 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
3.3 - Count: 2381 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_switch
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::finalize_block
2.2 - Count: 1603 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::insert_load_store
2.2 - Count: 1561 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_br_table
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::finalize_block
1.9 - Count: 1389 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: alloc::vec::Vec<T,A>::extend_desugared
1.7 - Count: 1236 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_switch
1.6 - Count: 1144 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_br_table
1.3 - Count: 940 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_target_block
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_br
1.1 - Count: 769 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_br
0.9 - Count: 628 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_target_block
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_jump
0.8 - Count: 564 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::insert_const
0.8 - Count: 550 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::insert_cmp
0.8 - Count: 538 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::stack_slot_with_size
    1: cranelift_fuzzgen::function_generator::insert_stack_load
0.7 - Count: 526 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate
0.5 - Count: 343 - Line:
    0: cranelift_fuzzgen::function_generator::insert_call
    1: cranelift_fuzzgen::function_generator::FunctionGenerator::generate_instructions
0.3 - Count: 214 - Line:
    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
    1: cranelift_fuzzgen::function_generator::insert_stack_load

Fixing get_variable_of_type

Most of the issues with get_variable_of_type should go away when we start doing proper instruction selection.

If there are tail uses elsewhere I think we can change get_variable_of_type to generate a variable on demand with a random default value and making sure this variable is not used anywhere else (by not including it in the variable pool).

This should pretty much cover all scenarios.

Fixing issues with invalid blocks / jumptables

I've updated the "Mostly forward branching" PR to entirely fix these issues by being smarter about which terminators we can insert.

See: #4894

insert_call

I think we can easily fix this by not generating a call if we don't have a func ref, but its only about 0.5% so I'm going to leave this for later.

Any instructions with at least one operand which is not a "value". For example, stack load/store, with their need for a stack slot operand. Or function calls, which need a function reference, and also have a type signature that depends on the selected function reference. My WIP patch handles stack slots in 0c9689b#diff-226e0f113ddbbcedaa5265109a951e52d30aa35307537684f56a5e6bedc9ba0eR369-R390 but I didn't get to handling calls yet.

What do you think about doing something similar to what you did in the WIP patch but with a base on the opcode signature table.

After generating the resources for the function we filter the opcodes based on signature. Most opcodes are fairly easy, we just need to ensure that we have the right variable types.

For call and call_indirect, we can also filter out function references where we don't have the correct types. If we end up with 0 valid references we can also filter out these opcodes from ever being selected.

stack_load/stack_stores are also similar in that we can filter them out based on if we have stack slots with enough space for the type and if we have variables to store or load those types.

I think this is pretty much what you were doing in the WIP patch, but instead of generating the options we filter the options. This is also closer to how I imagine it working when we move to generating the opcodes from cranelift-meta.

I'll try to implement something along these lines during the week.

I've observed that we discard an input if CLIF validation fails after the NaN canonicalization pass. I think we should defer canonicalization until we're about to run the function, instead of while we're generating it. And probably panic instead of ignoring the error.

Panicking sounds like a good idea because it should never fail there. (I've opened #4896 for this)

I'd still like to keep canonicalization on the fuzzgen crate. If anyone else wants to run random clif it would be nice to be able to depend on only that crate. Inside fuzzgen we can run those passes later, but not by much we only run them after fully generating the function.

Edit: Reading #4896 (review) made me realize why this would be important.

As a bigger project, maybe we should try doing something like wasm-smith's new no-trapping mode.

I think we should still allow some degree of trapping, we've caught a bunch of issues with the interpreter due to this. However I also think we should be able to configure how many traps we allow generating. I did something similar in #4894 where we can configure how many Timeouts we allow by proxy of backwards branches. And I'm thinking of doing something similar to the integer division by zero cases.

@jameysharp
Copy link
Contributor

So I did this last one, I recorded backtraces for Unstructured::choose where we fail due to a empty choice. [...] With 150k execs we got the following results:

Loaded 71331 stacks  
                     
First Line duplicates
55.8 - Count: 39787 - Line:    0: cranelift_fuzzgen::function_generator::FunctionGenerator::get_variable_of_type
...

I also tried grouping by the first two stack frames, and about 30% come from insert_opcode which is not unexpected.

This is so cool!

The numbers in #4894 are confusing me though. The number of valid inputs doesn't agree with the number of stack traces. For example, the baseline reported in that PR had 31,622 invalid inputs according to the stats, but only 20,139 stack traces. Were those from different runs, or are there invalid inputs that are not being measured?

Also: of failures during the baseline run, roughly 27% were due to choosing an impossible terminator, and the other 73% were from instruction selection. After the PR, you captured about 71% as many stack traces, which is close enough to 73% to seem plausibly like variation due to random chance.

So I'd assumed the PR would have 71-73% as many invalid inputs. Instead it had about 92% as many invalid inputs. So the effect of the PR on the number of valid inputs was rather smaller than I thought it would be. I think this is consistent with a bunch of invalid inputs not capturing stack traces.

I've written a throwaway patch for the arbitrary crate that applies the same backtrace logic you used, but does it in the few places where that crate can actually construct a new error. Apply trace-arbitrary-errors.txt to a local clone of rust-fuzz/arbitrary, then tell Cargo to override dependencies with your copy.

You should get similar results as your current branch without having to wrap your choose! macro around every single thing, except for a couple changes that I think might be informative. For one, this means the stack trace includes calls inside arbitrary, so you'll want to compare more stack frames. I also changed it to report which error occurred (EmptyChoose, NotEnoughData, or IncorrectFormat), although you can also tell from the stack trace, so feel free to remove that.

This still misses stack traces for errors which originated from DataValue::from_integer (which I think shouldn't fail) or other calls that don't invoke arbitrary (but with #4896 merged, I can't find any others). But I hope it will be very close to getting a stack trace for every invalid input.

By the way, do you need to limit the number of stack frames at all? I'd expect comparing the entire stack trace to work fine in this case because there aren't many control flow paths in cranelift-fuzzgen. So on Linux/Cygwin/WSL I'd do something like sha1sum stack/* | sort | uniq -cw40 | sort -rn to get a list of representative stack traces sorted by how often they occurred.

I think we can easily fix [insert_call] by not generating a call if we don't have a func ref, but its only about 0.5% so I'm going to leave this for later.

It may be easy when you're doing the rest of the instruction selection work, but I agree, at 0.5% there's no rush. That said, in #4894 you measured double-digit percentages of failures in insert_call. Maybe in the run above you happened to get really lucky?

What do you think about doing something similar to what you did in the WIP patch but with a base on the opcode signature table. [...] I think this is pretty much what you were doing in the WIP patch, but instead of generating the options we filter the options.

That all sounds great. I don't have a strong attachment to the style I used. The main thing I liked about it was capturing borrows of the appropriate slices in each option. I liked being able to avoid looking up resources when generating an instruction, by remembering what resources I already looked up when generating the list of choices. If there's a lookup in both places, I worry that they could get out of sync. So if you find that it's easy to do that while filtering options, I'd like that; but if it's not easy, I don't care that much.

As a bigger project, maybe we should try doing something like wasm-smith's new no-trapping mode.

I think we should still allow some degree of trapping, we've caught a bunch of issues with the interpreter due to this.

I don't understand: how do we catch interpreter bugs if we silently throw out cases where the interpreter traps? Are these bugs where the interpreter should have trapped but it didn't, so we only saw the trap during native-code execution?

To make sure we're talking about the same thing: "no-trapping mode" means generating extra code to ensure that the inputs are always valid for instructions that may trap. It doesn't mean disabling instructions that could trap. For example, you could bitwise-or 1 into the denominator of a division, so that it's never 0 regardless of how the value was generated.

@afonso360
Copy link
Contributor Author

The numbers in #4894 are confusing me though. The number of valid inputs doesn't agree with the number of stack traces. For example, the baseline reported in that PR had 31,622 invalid inputs according to the stats, but only 20,139 stack traces. Were those from different runs, or are there invalid inputs that are not being measured?

I think this was an issue with #4894. I've re run the new benchmarks with your patch and it looks like we are now getting the expected results. See: #4894 (comment)

By the way, do you need to limit the number of stack frames at all? I'd expect comparing the entire stack trace to work fine in this case because there aren't many control flow paths in cranelift-fuzzgen. So on Linux/Cygwin/WSL I'd do something like sha1sum stack/* | sort | uniq -cw40 | sort -rn to get a list of representative stack traces sorted by how often they occurred.

No issues, it all seemed to work fine, although it was a bit slow with 70k inputs, but no big deal.

It may be easy when you're doing the rest of the instruction selection work, but I agree, at 0.5% there's no rush. That said, in #4894 you measured double-digit percentages of failures in insert_call. Maybe in the run above you happened to get really lucky?

I have no explanation for this. The 150k run above did start from a slightly larger corpus that I kept using while developing. But the one on #4894 is the OSS-Fuzz original corpus, maybe that makes a difference? That one is also only 50k runs and maybe that also figures into it? To be honest I'm kind of at a loss here.

That all sounds great. I don't have a strong attachment to the style I used. The main thing I liked about it was capturing borrows of the appropriate slices in each option. I liked being able to avoid looking up resources when generating an instruction, by remembering what resources I already looked up when generating the list of choices. If there's a lookup in both places, I worry that they could get out of sync. So if you find that it's easy to do that while filtering options, I'd like that; but if it's not easy, I don't care that much.

I'll try to work something out where we can preserver those constraints, I agree, I don't like the lookup in both places.

I think we should still allow some degree of trapping, we've caught a bunch of issues with the interpreter due to this.

I don't understand: how do we catch interpreter bugs if we silently throw out cases where the interpreter traps? Are these bugs where the interpreter should have trapped but it didn't, so we only saw the trap during native-code execution?

Yes, that's exactly it. So far it has happened 2 or 3 times (not sure) and it was in div, where we didn't properly catch div by zero and in rem where the remainder of -1 should also trap. In these cases the interpreter actually crashed, but even if it didn't hopefully we would have caught those when comparing the results with the native execution.

Another reason is that eventually I want to work on correct trapping semantics for the interpreter and testing those via the fuzzer and building the possible test cases now is easier than trying to do it all later.

To make sure we're talking about the same thing: "no-trapping mode" means generating extra code to ensure that the inputs are always valid for instructions that may trap. It doesn't mean disabling instructions that could trap. For example, you could bitwise-or 1 into the denominator of a division, so that it's never 0 regardless of how the value was generated.

Yeah, and that is pretty much how I'm planing to fix int_divz issues, however I want to add a config where sometimes we don't insert those. Or insert them only for some instructions.

@afonso360
Copy link
Contributor Author

@jameysharp @elliottt 👋

Here's what I played around with regarding generating the Opcodes, I tried to use the constraints encoded in the opcodes to generate valid signatures. IIRC I think they were too underspecified and at some point I stopped looking into it, but I don't remember a lot.

It does work for some opcodes, I do remember that.

It also needs a custom branch of cranelift with some additional interfaces on some types.

Repository:
https://github.com/afonso360/cranelift-opcode-gen

Custom Branch
https://github.com/afonso360/wasmtime/commits/fuzzgen-gen-inst

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift Issues related to the Cranelift code generator fuzzing Issues related to our fuzzing infrastructure
Projects
None yet
Development

No branches or pull requests

3 participants