Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzzing: ClusterFuzz integration #7079

Merged
merged 87 commits into from
Nov 19, 2024
Merged

Conversation

kripken
Copy link
Member

@kripken kripken commented Nov 14, 2024

The main addition here is a bundle_clusterfuzz.py script which will package up
the exact files that should be uploaded to ClusterFuzz. It also documents the
process and bundling and testing. You can do

bundle.py OUTPUT_FILE.tgz

That bundles wasm-opt from ./bin., which is enough for local testing. For
actually uploading to ClusterFuzz, we need a portable build, and @dschuff
had the idea to reuse the emsdk build, which works nicely. Doing

bundle.py OUTPUT_FILE.tgz --build-dir=/path/to/emsdk/upstream/

will bundle wasm-opt (+libs) from the emsdk. I verified that those builds
work on ClusterFuzz.

I added several forms of testing here. First, our main fuzzer fuzz_opt.py now
has a ClusterFuzz testcase handler, which simulates a ClusterFuzz environment.
Second, there are smoke tests that run in the unit test suite, and can also be
run separately:

python -m unittest test/unit/test_cluster_fuzz.py

Those unit tests can also run on a given bundle, e.g. one created from an
emsdk build, for testing right before upload:

BINARYEN_CLUSTER_FUZZ_BUNDLE=/path/to/bundle.tgz python -m unittest test/unit/test_cluster_fuzz.py

A third piece of testing is to add a --fuzz-passes test. That is a mode for
-ttf (translate random data into a valid wasm fuzz testcase) that uses random
data to pick and run a set of passes, to further shape the wasm. (--fuzz-passes
had no previous testing, and this PR fixes it and tidies it up a little, adding some
newer passes too).

Otherwise this PR includes the key run.py script that is bundled and then
executed by ClusterFuzz, basically a python script that runs wasm-opt -ttf [..]
to generate testcases, sets up their JS, and emits them.

fuzz_shell.js, which is the JS to execute testcases, will now check if it is
provided binary data of a wasm file. If so, it does not read a wasm file from
argv[1]. (This is needed because ClusterFuzz expects a single file for the
testcase, so we make a JS file with bundled wasm inside it.)

@@ -85,7 +85,7 @@
# Delete the argument, as importing |shared| scans it.
sys.argv.pop()

from test import shared
from test import shared # noqa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refactor the shared argument parsing to use less global state so we don't have to dodge the linter like this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be a very large refactoring. shared.py depends on parsing the arguments synchonously (it uses their results immediately), so putting it all in a function to call later wouldn't be enough. And I'm not sure how to add a "plugin" interface to add more things for that argparse code to handle.

I do agree that it is weird that this script has its own argument parsing in addition to the core parsing, but we do need that core parsing (for the flags to set the bin dir). We'd need to either duplicate that code, or do some kind of big refactoring that I don't have a good idea for.


./emsdk install tot

after which ./upstream/ (from the emsdk dir) will contain portable builds of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "portable" mean in this context?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the sense of being able to run on as many targets as possible. For Linux, that means not depending on specific versions of system libc etc. The emsdk makes such builds, for example. Is there a better word for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "hermetic," but that's a stronger property than what we mean here. How about instead of just saying "portable," we mention that the emsdk builds don't depend on system libc, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusts glasses and pocket protector...
well actually, the emsdk builds do depend on system libc. Just not on libc++ (and the libc is fairly old, so it doesn't depend on very new libc symbol versions).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, ok, I adjusted the comment to mention that. I think now it's general and accurate enough.


2. Run the unit tests, which include smoke tests for our ClusterFuzz support:

python -m unittest test/unit/test_cluster_fuzz.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this script should run these smoke tests automatically?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly, but given that the tests have some logging output that the user should review manually, it seems best to me to separate the two tasks in a clean way. In particular, the user might want to run those tests multiple times on a single bundle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user is supposed to inspect logged output, I think that makes it even better to have the bundler script run them. We can still allow the tests to be run separately as well, and could even print instructions for that in the bundler output.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that still feels a little less simple/unixey to me. The script would no longer be a bundler, but a "bundle-and-test" script, that does more than one thing. How about just printing the instructions after bundling?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds fine to me 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

scripts/clusterfuzz/run.py Outdated Show resolved Hide resolved
scripts/clusterfuzz/run.py Show resolved Hide resolved
test/unit/test_cluster_fuzz.py Show resolved Hide resolved
test/unit/test_cluster_fuzz.py Outdated Show resolved Hide resolved
test/unit/test_cluster_fuzz.py Show resolved Hide resolved
test/unit/test_cluster_fuzz.py Outdated Show resolved Hide resolved
test/unit/test_cluster_fuzz.py Outdated Show resolved Hide resolved
@kripken kripken merged commit b0e999a into WebAssembly:main Nov 19, 2024
13 checks passed
@kripken kripken deleted the clusterfuzz branch November 19, 2024 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants