Add Bytecode Instrumentation to Atheris #17
Removed sys.settrace stuff
Removed assertion of stacksize
README.md
The `atheris` module provides three key functions: `Instrument()`, `Setup()` and `Fuzz()`.
In your source file, define a fuzzer entry point function, and pass it to `atheris.Setup()`, along with the fuzzer's arguments (typically `sys.argv`). Finally, call `atheris.Fuzz()` to start fuzzing. Here's an example:
In your source file, when you import your target library make sure that this happens inside a `with atheris.Instrument():`-block.
This is unusual, so provide an example. Also, replace "when you import your target library make sure that this happens inside" with "Import all libraries you wish to fuzz inside".
TheShiftedBit
left a comment
This is excellent! Thank you much for this. I left a few local comments, but overall this looks good.
I'm going to confirm that dropping support for Python 2.7 is something we can do. While almost all of Google's code runs on Python 3 now, most of our library code supports both because occasional programs are still written in 2.7. If we can't drop 2.7, we might be able to do something on our end to use "old Atheris" with Python 2.7, or it may be necessary to include both instrumentation techniques in Atheris. I'm hoping we can just drop 2.7 support and call it a day though.
atheris/__init__.py
from .atheris import *
from .atheris import _loc, _reg, _cmp
from .import_hook import instrument as Instrument
Don't change the name like this - Instrument looks like a class. Just leave it lowercase.
atheris/version_dependent.py
PYTHON_VERSION = sys.version_info[:2]

if PYTHON_VERSION < (3,6) or PYTHON_VERSION > (3,9):
    raise RuntimeError(f"You are fuzzing on an unsupported python version: {PYTHON_VERSION[0]}.{PYTHON_VERSION[1]}. Only 3.6 - 3.9 are supported.")
Since this is a breaking change, I'll make it version 2.0. Maybe mention that atheris 1.0 can be used with Python 2.7.
# Here atheris.Instrument() is not necessary
# because ujson is just an extension.
# Only python code can be instrumented.
"Only python code is instrumented with atheris.instrument; extensions are instrumented at compile-time."
libfuzzer.cc
bool setup_called = false;

unsigned long long num_counters = 0;
unsigned char* counters = NULL;
nullptr
libfuzzer.cc
}

NO_SANITIZE
void _reg(unsigned long long num) {
Can we have better names for these functions? _cmp is kinda obvious to me, but _loc and _reg aren't.
atheris/import_hook.py
"""
This function temporarily installs an import hook which instruments
all imported modules.
The arguments to this function are names of modules or packages.
Doesn't actually say what those module names do.
I'd actually recommend taking two arguments: include and exclude. Specifying both wouldn't make sense, unless we end up supporting globs or some sort of hierarchical specification someday (e.g. include all of foo.bar except foo.bar.baz).
all imported modules.
The arguments to this function are names of modules or packages.
If it is a fully qualified module name, the name of its package will be used.
"""
Document what trace_dataflow does.
atheris/instrument_bytecode.py
old_reference = self.reference
old_size = self.get_size()

if changed_offset == old_offset + 0.5:
Floating-point equality? Make sure this is safe.
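Whether this comparison is safe depends on how the two offsets are computed, which the excerpt does not show. As a hedged illustration, assume `old_offset` is an integer and `changed_offset` is either an integer or an integer plus exactly 0.5. Since 0.5 is a power of two, `n + 0.5` is exactly representable as a double for every non-negative integer `n` below 2**52, so `==` is reliable in that range. The helper name `is_exact_half_step` is hypothetical and only used for this check.

```python
from fractions import Fraction


def is_exact_half_step(n):
    """True if the float n + 0.5 equals the rational number n + 1/2 exactly."""
    return Fraction(n + 0.5) == Fraction(n) + Fraction(1, 2)


# Realistic bytecode offsets are far below 2**52, so the comparison is exact.
small_offsets_exact = all(is_exact_half_step(n) for n in range(100_000))
largest_safe_exact = is_exact_half_step(2**52 - 1)

# Beyond 2**53 the half step is lost to rounding.
too_large_exact = is_exact_half_step(2**53)

# For contrast: decimal fractions such as 0.1 are not exactly representable.
decimal_sum_exact = (0.1 + 0.2 == 0.3)
```

If the offsets could ever come from arithmetic on non-dyadic fractions, storing them as integers in half-offset units would avoid the question entirely.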
Builds the bytecode that calls atheris._cmp().
Only call this if one of the objects being compared is a constant
coming from co_consts.
If `switch` is true the constant is the second argument and needs
Ah. This explains the "only left is ever const" in my other comment. Yeah, please document over there too.
libfuzzer.cc
int args_size = args_global.size();

if (num_counters) {
  counters = new unsigned char[num_counters];
This implementation means that you couldn't instrument lazy imports that happen after fuzzing is started, right?
Is there a way around that? Perhaps by telling libFuzzer there are multiple "modules" and adding a new module when full?
If this can't be resolved, please make errors like this extremely obvious. If a module is imported and attempted to be instrumented after fuzzing is started, the program should exit with an error.
Alright, after discussion, we are willing to drop support for Python 2.7! So, no need to try to maintain both implementations or anything.
…ent to atheris.Setup() to get rid of atheris_no_libfuzzer as a separate package
Hi, when testing this, I get the following errors on build: That's with clang; gcc produces similar but slightly different errors. The exact command invoked by setup.py was:
Should be fixed now.
Thanks! Still one error that can be fixed with a
Sorry, but I could not reproduce the latest error. Did the commit fix it?
It did! Everything seems to be working fine. Merging! There are a couple of small things, mostly to do with Google style, that I want to change; but I'll avoid the back-and-forth and just make those changes myself. I'll make those today and then release the new version.
Hey, I came across a few questions while working on this. Tell me about the reason you replace the 3 instrumentation functions, rather than just change their behavior, after fuzzing has started. This is leading to a few problems. The first is that inside of Google we link all extensions together, so this fails with our build system. That's something I can easily fix by giving the before-fuzz and after-fuzz functions different names. But the second problem is that since coverage isn't being collected until Fuzz() is called, libFuzzer will get mad if the first couple fuzz attempts don't produce coverage. Do you think it would be reasonable to (a) only change the behavior of
Unfortunately, some additional issues came up during integration into Google that require review. Should hopefully be done tomorrow.
My reasoning was that if we want to support instrumentation of lazy imports some day then
This is absolutely possible for
I'm not quite sure what you mean. How can it make fuzz attempts before calling
You can't, but you can still collect coverage. When libFuzzer starts, it runs a couple fuzz attempts with inputs like the empty string, and if those don't produce coverage, libFuzzer exits. This can happen if the fuzzer driver is a bit more complicated and doesn't call into the instrumented libraries on the first couple inputs. If coverage was collected before fuzzing started, users could easily resolve this by e.g. calling
Could you provide an example fuzzing target where libfuzzer complains about missing coverage? Also, to be completely honest, this merge was a little bit premature.
I had plans to fix 1., 2. and 3. before you merged this (I was not aware of 4. though). Do you plan to fix these now yourself? I would be perfectly fine with that.
Hmmm, it looks like I can't reproduce this in regular Python, only in Google's weird Python setup. So never mind :)
Sure, I'll take on these. I don't know if #2 will be finished before we cut a release, but I'll get #1 and #3 done. Regarding the package structure: I moved all of the Python up to the root directory to maximize compatibility with existing fuzzers inside of Google, but I'll change it to a much more sensible structure soon. I'm also writing tests. One huge advantage of this instrumentation: testing is way easier! Now, it's practical to test the tracing code. I also made a couple changes to improve compatibility and clarity.
Add Bytecode Instrumentation to Atheris
This pull request adds functionality to instrument modules at runtime
for coverage collection and dataflow tracing.
It enables atheris to get rid of `sys.settrace` and its huge runtime overhead
to double the execution speed of the fuzzer (see issue #16).
Changes from a user perspective
A new function called `atheris.Instrument()` was added to atheris. It installs
a temporary import hook. This hook instruments the underlying code
objects of modules at import-time. The function has to be used as follows:
This will cause the `target_library` to get instrumented. Every other library not imported inside
the `with`-block will not get instrumented.
It is possible to filter which modules get instrumented by supplying a whitelist of
module names to `atheris.Instrument()` like this:
This may be necessary to stop instrumentation of modules in the python standard library
(e.g. if `target_library_a` imports `struct`, then `struct` would also get instrumented).
For the sake of simplicity this filter is optional.
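The original usage snippets did not survive on this page. As a rough illustration of the mechanism only, a temporary import hook with an optional whitelist can be sketched with the standard library alone. This is not the PR's implementation: `hook_imports`, `RecordingFinder` and `RecordingLoader` are hypothetical names, the hook merely records which modules it would instrument, and the two demo modules are created on the fly.

```python
import contextlib
import importlib
import importlib.abc
import importlib.machinery
import os
import sys
import tempfile


class RecordingLoader(importlib.abc.Loader):
    """Wraps a source loader; a real hook would patch the bytecode here."""

    def __init__(self, inner, recorded):
        self._inner = inner
        self._recorded = recorded

    def create_module(self, spec):
        return self._inner.create_module(spec)

    def exec_module(self, module):
        self._recorded.append(module.__name__)
        self._inner.exec_module(module)


class RecordingFinder(importlib.abc.MetaPathFinder):
    """Claims pure-Python modules whose top-level package is whitelisted."""

    def __init__(self, include, recorded):
        self._include = set(include)  # an empty set means "everything"
        self._recorded = recorded

    def find_spec(self, fullname, path, target=None):
        if self._include and fullname.split(".")[0] not in self._include:
            return None
        spec = importlib.machinery.PathFinder.find_spec(fullname, path, target)
        if spec is None or not isinstance(
            spec.loader, importlib.machinery.SourceFileLoader
        ):
            return None  # extensions and built-ins are left to other finders
        spec.loader = RecordingLoader(spec.loader, self._recorded)
        return spec


@contextlib.contextmanager
def hook_imports(*include):
    """Install the finder for the duration of the with-block only."""
    recorded = []
    finder = RecordingFinder(include, recorded)
    sys.meta_path.insert(0, finder)
    try:
        yield recorded
    finally:
        sys.meta_path.remove(finder)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("hook_demo_target", "hook_demo_other"):
        with open(os.path.join(tmp, name + ".py"), "w") as f:
            f.write("def answer():\n    return 42\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        with hook_imports("hook_demo_target") as recorded:
            target = importlib.import_module("hook_demo_target")
        with hook_imports("not_this_one") as skipped:
            other = importlib.import_module("hook_demo_other")
    finally:
        sys.path.remove(tmp)

print(recorded, skipped)
```

Matching on the top-level package name mirrors the docstring quoted in the review ("the name of its package will be used"); everything else here is an assumption made for the sketch.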
Changes in the code
`atheris` changed from being a single extension to being a package of the following structure:
The same applies to `atheris_no_libfuzzer`.
`atheris/import_hook.py`: `atheris.Instrument()` and the import hook are defined here.
`atheris/instrument_bytecode.py` provides the function `patch_code(code_object)` and the class `Instrumentor`, which does the heavy lifting of the instrumentation procedure.
`atheris/version_dependent.py` contains version-specific behaviour and data.
Supported python versions
Support for version 3.5 was dropped.
About the bytecode instrumentation
Each python module consists of a hierarchy of code objects. At the top level is the code object
for the module itself and below are code objects for classes, functions, lambdas, etc.
`atheris` goes through each code object and builds a CFG of the bytecode.
If a basic block has two outgoing edges, a function invocation of `atheris._loc()` gets
inserted at both branch ends. The argument of `atheris._loc()` is an id of the branch that was taken.
After all code objects have been processed the number of overall instrumented branches in the
module is known and a call to `atheris._reg(num_instrumented)` gets inserted at the very beginning of
the module.
While all target modules get imported, `atheris` collects all calls to `atheris._reg()` and stores the overall number
of counters needed for all modules.
In `atheris.Fuzz()` a memory-region of an appropriate size is allocated and used as a region for the counters.
`atheris._loc()` tells `atheris` at which index to increment a counter in the counter region.
In order to trace the dataflow each `COMPARE_OP` gets replaced by a call to `atheris._cmp()`.
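The runtime side of this scheme can be modelled in a few lines of plain Python. This is a sketch under stated assumptions rather than the extension's C++ code: the class `CounterModel` and its method signatures are hypothetical, returning a base index from `reg()` is an assumption about how branch ids map to counters, counters saturate at 255 by choice, and raising on late registration follows the reviewer's request above, not confirmed behaviour.

```python
import operator


class CounterModel:
    """Toy model of the _reg / _loc / _cmp hooks."""

    def __init__(self):
        self.num_counters = 0
        self.counters = None  # allocated once fuzzing starts
        self.comparisons = []

    def reg(self, num):
        """Reserve `num` counters for one module and return the first index."""
        if self.counters is not None:
            raise RuntimeError("module instrumented after fuzzing started")
        first = self.num_counters
        self.num_counters += num
        return first

    def start_fuzzing(self):
        """Allocate one 8-bit counter per registered branch."""
        self.counters = bytearray(self.num_counters)

    def loc(self, index):
        """Record that the branch with this index was taken."""
        if self.counters is not None and self.counters[index] < 255:
            self.counters[index] += 1

    def cmp(self, left, right, op):
        """Log both operands, then return the result of the real comparison."""
        self.comparisons.append((left, right, op.__name__))
        return op(left, right)


model = CounterModel()
base_a = model.reg(3)  # first module has 3 instrumented branches
base_b = model.reg(2)  # second module has 2
model.start_fuzzing()

model.loc(base_a + 1)
model.loc(base_b)
model.loc(base_b)
result = model.cmp(len(b"FUZZ"), 4, operator.eq)

try:
    model.reg(1)
    late_registration_rejected = False
except RuntimeError:
    late_registration_rejected = True

print(list(model.counters), model.comparisons, late_registration_rejected)
```

The fixed-size allocation in `start_fuzzing()` is what makes lazy imports awkward: any module registered afterwards would need counters that were never allocated.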