Fuzzing: ClusterFuzz integration #7079

Merged on Nov 19, 2024 (87 commits)

Commits
9926504
start
kripken Nov 8, 2024
8d201ca
work
kripken Nov 11, 2024
fa633e9
work
kripken Nov 12, 2024
d29bb70
prep
kripken Nov 12, 2024
17e6e94
work
kripken Nov 12, 2024
bc9a1d1
work
kripken Nov 12, 2024
eb91fd3
work
kripken Nov 12, 2024
e1d5be0
work
kripken Nov 12, 2024
c9b057c
work
kripken Nov 12, 2024
fe8b47a
work
kripken Nov 12, 2024
cc22c7a
work
kripken Nov 12, 2024
1b97501
work
kripken Nov 12, 2024
6fb3e45
work
kripken Nov 12, 2024
ae2f663
work
kripken Nov 12, 2024
b940d34
work
kripken Nov 12, 2024
794980c
work
kripken Nov 12, 2024
823f146
work
kripken Nov 12, 2024
1ed21d5
work
kripken Nov 12, 2024
1657555
work
kripken Nov 12, 2024
586bad8
work
kripken Nov 12, 2024
ad6f5ee
work
kripken Nov 12, 2024
66e56db
work
kripken Nov 12, 2024
02a89b7
work
kripken Nov 12, 2024
156f6b6
fix
kripken Nov 12, 2024
07e1033
text
kripken Nov 13, 2024
f0cab01
oops
kripken Nov 13, 2024
a694dd7
restore
kripken Nov 13, 2024
af7b2d5
finish
kripken Nov 13, 2024
a0da68b
moar
kripken Nov 13, 2024
faf380c
oops.in.advance
kripken Nov 13, 2024
c9546a2
fix
kripken Nov 13, 2024
1d69074
prep
kripken Nov 13, 2024
a1e8257
test
kripken Nov 13, 2024
7769825
test
kripken Nov 13, 2024
69ce873
test
kripken Nov 13, 2024
b107a8b
test
kripken Nov 13, 2024
12b6324
test
kripken Nov 13, 2024
855d882
test
kripken Nov 13, 2024
e90bfbc
test
kripken Nov 13, 2024
aa4134b
test
kripken Nov 13, 2024
d93c615
dynamic
kripken Nov 13, 2024
1519588
dynamic
kripken Nov 13, 2024
076aa57
dynamic
kripken Nov 13, 2024
7852327
dynamic
kripken Nov 13, 2024
3d183d4
dynamic
kripken Nov 13, 2024
10ee7c4
work
kripken Nov 13, 2024
41c3e32
work
kripken Nov 13, 2024
23d0006
work
kripken Nov 13, 2024
a3f1b39
work
kripken Nov 14, 2024
fb6e8a8
work
kripken Nov 14, 2024
b6c0543
work
kripken Nov 14, 2024
0f998a8
test
kripken Nov 14, 2024
c30122c
fixes
kripken Nov 14, 2024
693f56c
fix
kripken Nov 14, 2024
838983a
fix
kripken Nov 14, 2024
a9c5a2e
test
kripken Nov 14, 2024
5525b36
work
kripken Nov 14, 2024
c423d35
fix
kripken Nov 14, 2024
e24ee9c
more
kripken Nov 14, 2024
8568cf8
test
kripken Nov 14, 2024
8fb0b69
fix
kripken Nov 14, 2024
5a87183
work
kripken Nov 14, 2024
d8aa63e
works
kripken Nov 14, 2024
46bca52
Merge remote-tracking branch 'origin/main' into clusterfuzz
kripken Nov 14, 2024
b440b65
notes
kripken Nov 14, 2024
53cec85
fix
kripken Nov 14, 2024
e0fb922
format
kripken Nov 14, 2024
23ae5a4
text
kripken Nov 14, 2024
d0b254d
note
kripken Nov 14, 2024
ccf4683
note
kripken Nov 14, 2024
6487be1
lint
kripken Nov 14, 2024
5fcf347
lint
kripken Nov 14, 2024
b3859df
lint
kripken Nov 14, 2024
2b3e0f7
lint
kripken Nov 14, 2024
e17046b
update
kripken Nov 14, 2024
51cff4d
try to fix macos
kripken Nov 15, 2024
9b08a40
Make the test use the right build dir, which varies on CI
kripken Nov 15, 2024
e3c9915
find build dir properly
kripken Nov 15, 2024
e3b905e
Update scripts/clusterfuzz/run.py
kripken Nov 18, 2024
99ba1ee
use unittest asserts
kripken Nov 18, 2024
aa9bb5c
Avoid regex-capturing stuff we don't need
kripken Nov 18, 2024
f4d79b1
assert on having one line per regex
kripken Nov 18, 2024
8de3f10
Update test/unit/test_cluster_fuzz.py
kripken Nov 18, 2024
8977b39
comment
kripken Nov 18, 2024
60e2f97
get build dir in all tests in the same, correct, manner
kripken Nov 19, 2024
310e161
Skip on windows
kripken Nov 19, 2024
d713d6e
comments
kripken Nov 19, 2024
135 changes: 135 additions & 0 deletions scripts/bundle_clusterfuzz.py
@@ -0,0 +1,135 @@
#!/usr/bin/python3

'''
Bundle files for uploading to ClusterFuzz.

Usage:

bundle_clusterfuzz.py OUTPUT_FILE.tgz [--build-dir=BUILD_DIR]

The output file will be a .tgz file.

If a build directory is provided, we will look under there to find bin/wasm-opt
and lib/libbinaryen.so. A useful place to get builds from is the Emscripten SDK,
as you can do

./emsdk install tot

after which ./upstream/ (from the emsdk dir) will contain builds of wasm-opt and
libbinaryen.so (that are designed to run on as many systems as possible, by not
depending on newer libc symbols, etc., as opposed to a normal local build).
Thus, the full workflow could be

cd emsdk
./emsdk install tot
cd ../binaryen
python3 scripts/bundle_clusterfuzz.py binaryen_wasm_fuzzer.tgz --build-dir=../emsdk/upstream

When using --build-dir in this way, you are responsible for ensuring that the
wasm-opt in the build dir is compatible with the scripts in the current dir
(e.g., if run.py here passes a flag that is only in a new/older version of
wasm-opt, a problem can happen).

Before uploading to ClusterFuzz, it is worth doing the following:

1. Run the local fuzzer (scripts/fuzz_opt.py). That includes a ClusterFuzz
testcase handler, which simulates what ClusterFuzz does.

2. Run the unit tests, which include smoke tests for our ClusterFuzz support:

python -m unittest test/unit/test_cluster_fuzz.py

Member

Maybe this script should run these smoke tests automatically?

Member Author

I don't feel strongly, but given that the tests have some logging output that the user should review manually, it seems best to me to separate the two tasks in a clean way. In particular, the user might want to run those tests multiple times on a single bundle.

Member

If the user is supposed to inspect logged output, I think that makes it even better to have the bundler script run them. We can still allow the tests to be run separately as well, and could even print instructions for that in the bundler output.

Member Author

Hmm, that still feels a little less simple/unixey to me. The script would no longer be a bundler, but a "bundle-and-test" script, that does more than one thing. How about just printing the instructions after bundling?

Member

That sounds fine to me 👍

Member Author

Done.


Look at the logs, which will contain statistics on the wasm files the
fuzzer emits, and see that they look reasonable.

You should run the unit tests on the bundle you are about to upload, by
setting the proper env var like this (using the same filename as above):

BINARYEN_CLUSTER_FUZZ_BUNDLE=`pwd`/binaryen_wasm_fuzzer.tgz python -m unittest test/unit/test_cluster_fuzz.py

Note that you must pass an absolute filename (e.g. using pwd as shown).

The unittest logs should reflect that that bundle is being used at the
very start ("Using existing bundle: ..." rather than "Making a new
bundle"). Note that some of the unittests also create their own bundles, to
test the bundling script itself, so later down you will see logging of
bundle creation even if you provide a bundle.

After uploading to ClusterFuzz, you can wait a while for it to run, and then:

1. Inspect the log to see that we generate all the testcases properly, and
their sizes look reasonably random, etc.

2. Inspect the sample testcase and run it locally, to see that

d8 --wasm-staging testcase.js

properly runs the testcase, emitting logging etc.

3. Check the stats and crashes page (known crashes should at least be showing
up). Note that these may take longer to show up than 1 and 2.
'''

import os
import sys
import tarfile

# Read the filenames first, as importing |shared| changes the directory.
output_file = os.path.abspath(sys.argv[1])
print(f'Bundling to: {output_file}')
assert output_file.endswith('.tgz'), 'Can only generate a .tgz'

build_dir = None
if len(sys.argv) >= 3:
    assert sys.argv[2].startswith('--build-dir=')
    build_dir = sys.argv[2].split('=')[1]
    build_dir = os.path.abspath(build_dir)
    # Delete the argument, as importing |shared| scans it.
    sys.argv.pop()

from test import shared # noqa

Member

Can we refactor the shared argument parsing to use less global state so we don't have to dodge the linter like this?

Member Author

That might be a very large refactoring. shared.py depends on parsing the arguments synchronously (it uses their results immediately), so putting it all in a function to call later wouldn't be enough. And I'm not sure how to add a "plugin" interface to add more things for that argparse code to handle.

I do agree that it is weird that this script has its own argument parsing in addition to the core parsing, but we do need that core parsing (for the flags to set the bin dir). We'd need to either duplicate that code, or do some kind of big refactoring that I don't have a good idea for.
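
For readers following this thread, one general pattern for letting a script consume its own flags while leaving the rest for another parser is `argparse`'s `parse_known_args`. The sketch below is only an illustration of that pattern, not what this PR does; the helper name `parse_bundle_args` and the flag `--some-shared-flag` are hypothetical, and it does not address the import-time parsing in shared.py that the comment above describes.

```python
import argparse
import os


def parse_bundle_args(argv):
    """Hypothetical helper: split off this script's own arguments.

    Returns (options, remaining), where `remaining` holds whatever this
    parser did not recognize, so another parser could handle it later.
    """
    # add_help=False so that -h/--help is also left for the other parser.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('output_file')
    parser.add_argument('--build-dir', default=None)
    options, remaining = parser.parse_known_args(argv)

    # Resolve paths early, before anything changes the working directory.
    options.output_file = os.path.abspath(options.output_file)
    if options.build_dir is not None:
        options.build_dir = os.path.abspath(options.build_dir)
    return options, remaining


# '--some-shared-flag' is a made-up flag standing in for one that a
# different parser would understand.
options, remaining = parse_bundle_args(
    ['out.tgz', '--build-dir=../emsdk/upstream', '--some-shared-flag'])
print(options.build_dir)
print(remaining)
```

In a setup like the one discussed here, `remaining` could be written back to `sys.argv[1:]` before importing a module that parses arguments at import time. Whether that would be acceptable for shared.py is a separate question that this sketch does not settle.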


# Pick where to get the builds
if build_dir:
    binaryen_bin = os.path.join(build_dir, 'bin')
    binaryen_lib = os.path.join(build_dir, 'lib')
else:
    binaryen_bin = shared.options.binaryen_bin
    binaryen_lib = shared.options.binaryen_lib

with tarfile.open(output_file, "w:gz") as tar:
    # run.py
    run = os.path.join(shared.options.binaryen_root, 'scripts', 'clusterfuzz', 'run.py')
    print(f' .. run: {run}')
    tar.add(run, arcname='run.py')

    # fuzz_shell.js
    fuzz_shell = os.path.join(shared.options.binaryen_root, 'scripts', 'fuzz_shell.js')
    print(f' .. fuzz_shell: {fuzz_shell}')
    tar.add(fuzz_shell, arcname='scripts/fuzz_shell.js')

    # wasm-opt binary
    wasm_opt = os.path.join(binaryen_bin, 'wasm-opt')
    print(f' .. wasm-opt: {wasm_opt}')
    tar.add(wasm_opt, arcname='bin/wasm-opt')

    # For a dynamic build we also need libbinaryen.so and possibly other files.
    # Try both .so and .dylib suffixes for more OS coverage.
    for suffix in ['.so', '.dylib']:
        libbinaryen = os.path.join(binaryen_lib, f'libbinaryen{suffix}')
        if os.path.exists(libbinaryen):
            print(f' .. libbinaryen: {libbinaryen}')
            tar.add(libbinaryen, arcname=f'lib/libbinaryen{suffix}')

            # The emsdk build also includes some more necessary files.
            for name in [f'libc++{suffix}', f'libc++{suffix}.2', f'libc++{suffix}.2.0']:
                path = os.path.join(binaryen_lib, name)
                if os.path.exists(path):
                    print(f' ......... : {path}')
                    tar.add(path, arcname=f'lib/{name}')

print('Done.')
print('To run the tests on this bundle, do:')
print()
print(f'BINARYEN_CLUSTER_FUZZ_BUNDLE={output_file} python -m unittest test/unit/test_cluster_fuzz.py')
print()
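
As a quick sanity check before uploading, one could list the bundle and confirm that the three files the bundler always adds are present. The following is a minimal sketch under that assumption; `REQUIRED_MEMBERS` and `missing_members` are hypothetical names that are not part of this PR, and the demo builds a stand-in archive with placeholder contents rather than a real build.

```python
import io
import os
import tarfile
import tempfile

# The arcnames the bundler above always adds (shared libraries are optional).
REQUIRED_MEMBERS = {'run.py', 'scripts/fuzz_shell.js', 'bin/wasm-opt'}


def missing_members(bundle_path):
    """Return the required arcnames that are absent from the .tgz bundle."""
    with tarfile.open(bundle_path, 'r:gz') as tar:
        return REQUIRED_MEMBERS - set(tar.getnames())


# Demo on a stand-in bundle that deliberately omits scripts/fuzz_shell.js.
demo_dir = tempfile.mkdtemp()
demo_bundle = os.path.join(demo_dir, 'demo.tgz')
with tarfile.open(demo_bundle, 'w:gz') as tar:
    for name in ['run.py', 'bin/wasm-opt']:
        data = b'placeholder'
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

print(missing_members(demo_bundle))
```

This only checks that the members exist. It says nothing about whether the bundled wasm-opt is compatible with run.py, which is what the unit tests mentioned in the docstring are for.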
163 changes: 163 additions & 0 deletions scripts/clusterfuzz/run.py
@@ -0,0 +1,163 @@
#
# Copyright 2024 WebAssembly Community Group participants
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

'''
ClusterFuzz run.py script: when run by ClusterFuzz, it uses wasm-opt to generate
a fixed number of testcases. This is a "blackbox fuzzer", see

https://google.github.io/clusterfuzz/setting-up-fuzzing/blackbox-fuzzing/

This file should be bundled up together with the other files it needs, see
bundle_clusterfuzz.py.
'''

import os
import getopt
import random
import subprocess
import sys

# The V8 flags we put in the "fuzzer flags" files, which tell ClusterFuzz how to
# run V8. By default we apply all staging flags.
FUZZER_FLAGS_FILE_CONTENTS = '--wasm-staging'

# Maximum size of the random data that we feed into wasm-opt -ttf. This is
# smaller than fuzz_opt.py's INPUT_SIZE_MAX because that script is tuned for
# fuzzing large wasm files (to reduce the overhead we have of launching many
# processes per file), which is less of an issue on ClusterFuzz.
MAX_RANDOM_SIZE = 15 * 1024

# The prefix for fuzz files.
FUZZ_FILENAME_PREFIX = 'fuzz-'

# The prefix for flags files.
FLAGS_FILENAME_PREFIX = 'flags-'

# The name of the fuzzer (appears after FUZZ_FILENAME_PREFIX /
# FLAGS_FILENAME_PREFIX).
FUZZER_NAME_PREFIX = 'binaryen-'

# The root directory of the bundle this will be in, which is the directory of
# this very file.
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))

# The path to the wasm-opt binary that we run to generate testcases.
FUZZER_BINARY_PATH = os.path.join(ROOT_DIR, 'bin', 'wasm-opt')

# The path to the fuzz_shell.js script that will execute the wasm in each
# testcase.
JS_SHELL_PATH = os.path.join(ROOT_DIR, 'scripts', 'fuzz_shell.js')

# The arguments we provide to wasm-opt to generate wasm files.
FUZZER_ARGS = [
    # Generate a wasm from random data.
    '--translate-to-fuzz',
    # Run some random passes, to further shape the random wasm we emit.
    '--fuzz-passes',
    # Enable all features but disable ones not yet ready for fuzzing. This may
    # be a smaller set than fuzz_opt.py, as that enables a few experimental
    # flags, while here we just fuzz with d8's --wasm-staging.
    '-all',
    '--disable-shared-everything',
    '--disable-fp16',
]


# Returns the file name for fuzz or flags files.
def get_file_name(prefix, index):
    return f'{prefix}{FUZZER_NAME_PREFIX}{index}.js'


# Returns the contents of a .js fuzz file, given particular wasm contents that
# we want to be executed.
def get_js_file_contents(wasm_contents):
    # Start with the standard JS shell.
    with open(JS_SHELL_PATH) as file:
        js = file.read()

    # Prepend the wasm contents, so they are used (rather than the normal
    # mechanism where the wasm file's name is provided in argv).
    wasm_contents = ','.join([str(c) for c in wasm_contents])
    js = f'var binary = new Uint8Array([{wasm_contents}]);\n\n' + js
    return js


def main(argv):
    # Parse the options. See
    # https://google.github.io/clusterfuzz/setting-up-fuzzing/blackbox-fuzzing/#uploading-a-fuzzer
    output_dir = '.'
    num = 100
    expected_flags = ['input_dir=', 'output_dir=', 'no_of_files=']
    optlist, _ = getopt.getopt(argv[1:], '', expected_flags)
    for option, value in optlist:
        if option == '--output_dir':
            output_dir = value
        elif option == '--no_of_files':
            num = int(value)

    for i in range(1, num + 1):
        input_data_file_path = os.path.join(output_dir, f'{i}.input')
        wasm_file_path = os.path.join(output_dir, f'{i}.wasm')

        # wasm-opt may fail to run in rare cases (when the fuzzer emits code it
        # detects as invalid). Just try again in such a case.
        for attempt in range(0, 100):
            # Generate random data.
            random_size = random.SystemRandom().randint(1, MAX_RANDOM_SIZE)
            with open(input_data_file_path, 'wb') as file:
                file.write(os.urandom(random_size))
Comment on lines +119 to +120
Member

My reading of the ClusterFuzz documentation was that ClusterFuzz supplies input files. Should we be using those instead of generating new input files?

Member Author

I believe it can provide an input corpus, yeah. My understanding is that each fuzz target provides its own, so I imagine V8's contains a bunch of JS testcases that a blackbox fuzzer could mutate.

For us, in theory we'd want a corpus of wasm files, and we could use those as initial content. I do plan to look into that, either adding it at the V8 level, or adding our existing corpus in Binaryen (which fuzz_opt.py uses) into the bundle. I'm not sure which is right yet, but I'd like to leave it as a followup (I have a list of such ideas already).
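
To make the followup idea in this thread a bit more concrete, here is a rough sketch of how a script might select a wasm file from a corpus directory such as the one passed via `--input_dir`. It is purely illustrative: `pick_corpus_file` is a hypothetical helper, the PR itself always generates random input, and how a selected file would then be fed to wasm-opt is left open here just as it is in the thread.

```python
import os
import random
import tempfile


def pick_corpus_file(input_dir, rng=random):
    """Hypothetical helper: pick one .wasm file from input_dir, or None."""
    if not input_dir or not os.path.isdir(input_dir):
        return None
    candidates = sorted(
        os.path.join(input_dir, name)
        for name in os.listdir(input_dir)
        if name.endswith('.wasm')
    )
    return rng.choice(candidates) if candidates else None


# Demo with a throwaway directory standing in for a corpus.
corpus_dir = tempfile.mkdtemp()
for name in ['a.wasm', 'b.wasm', 'notes.txt']:
    with open(os.path.join(corpus_dir, name), 'wb') as f:
        f.write(b'\0asm')

picked = pick_corpus_file(corpus_dir)
print(os.path.basename(picked))
print(pick_corpus_file(os.path.join(corpus_dir, 'missing')))
```

Returning `None` when no corpus is available lets a caller fall back to the random-data path that the PR already implements.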


            # Generate wasm from the random data.
            cmd = [FUZZER_BINARY_PATH] + FUZZER_ARGS
            cmd += ['-o', wasm_file_path, input_data_file_path]
            try:
                subprocess.check_call(cmd)
            except subprocess.CalledProcessError:
                # Try again.
                print('(oops, retrying wasm-opt)')
                attempt += 1
                if attempt == 99:
                    # Something is very wrong!
                    raise
                continue
            # Success, leave the loop.
            break

        # Generate a testcase from the wasm
        with open(wasm_file_path, 'rb') as file:
            wasm_contents = file.read()
        testcase_file_path = os.path.join(output_dir,
                                          get_file_name(FUZZ_FILENAME_PREFIX, i))
        js_file_contents = get_js_file_contents(wasm_contents)
        with open(testcase_file_path, 'w') as file:
            file.write(js_file_contents)

        # Emit a corresponding flags file.
        flags_file_path = os.path.join(output_dir,
                                       get_file_name(FLAGS_FILENAME_PREFIX, i))
        with open(flags_file_path, 'w') as file:
            file.write(FUZZER_FLAGS_FILE_CONTENTS)

        print(f'Created testcase: {testcase_file_path}, {len(wasm_contents)} bytes')

        # Remove temporary files.
        os.remove(input_data_file_path)
        os.remove(wasm_file_path)

    print(f'Created {num} testcases.')


if __name__ == '__main__':
    main(sys.argv)
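
To illustrate what run.py emits for each testcase, the sketch below reproduces the naming scheme and the way the wasm bytes are embedded, using a placeholder string in place of the real fuzz_shell.js. The function `embed_wasm` is a hypothetical stand-in for `get_js_file_contents` that takes the shell source as a parameter so the example runs without the bundle.

```python
FUZZ_FILENAME_PREFIX = 'fuzz-'
FLAGS_FILENAME_PREFIX = 'flags-'
FUZZER_NAME_PREFIX = 'binaryen-'


def get_file_name(prefix, index):
    # Same naming scheme as in run.py above.
    return f'{prefix}{FUZZER_NAME_PREFIX}{index}.js'


def embed_wasm(wasm_bytes, shell_js):
    """Prepend the wasm bytes to the shell source as a Uint8Array literal."""
    byte_list = ','.join(str(b) for b in wasm_bytes)
    return f'var binary = new Uint8Array([{byte_list}]);\n\n' + shell_js


# The 8-byte header of an empty wasm module, and a placeholder shell.
wasm = b'\x00asm\x01\x00\x00\x00'
testcase = embed_wasm(wasm, '// shell code would follow here\n')

print(get_file_name(FUZZ_FILENAME_PREFIX, 1))   # fuzz-binaryen-1.js
print(get_file_name(FLAGS_FILENAME_PREFIX, 1))  # flags-binaryen-1.js
print(testcase.splitlines()[0])
```

Each fuzz file therefore carries its own binary inline, and the matching flags file with the same index holds the V8 flags to use when running it.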