Merge pull request #3366 from trailofbits/matching-refactor
Matching Refactor
ESultanik authored Feb 11, 2022
2 parents 7be7409 + 6342553 commit 9c2d20b
Showing 28 changed files with 4,156 additions and 3,416 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -15,7 +15,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.6, 3.7, 3.8, 3.9]
python-version: [3.7, 3.8, 3.9, "3.10"]

steps:
- uses: actions/checkout@v2
139 changes: 93 additions & 46 deletions README.md
@@ -34,34 +34,45 @@ This will automatically install the `polyfile` and `polymerge` executables in yo

```
usage: polyfile [-h] [--filetype FILETYPE] [--list] [--html HTML]
[--try-all-offsets] [--only-match] [--debug] [--quiet]
[--version] [-dumpversion]
[--only-match-mime] [--only-match] [--require-match]
[--max-matches MAX_MATCHES] [--debug] [--trace] [--debugger]
[--no-debug-python] [--quiet] [--version] [-dumpversion]
[FILE]
A utility to recursively map the structure of a file.
positional arguments:
FILE The file to analyze; pass '-' or omit to read from
FILE the file to analyze; pass '-' or omit to read from
STDIN
optional arguments:
-h, --help show this help message and exit
--filetype FILETYPE, -f FILETYPE
Explicitly match against the given filetype (default
is to match against all filetypes)
explicitly match against the given filetype or
filetype wildcard (default is to match against all
filetypes)
--list, -l list the supported filetypes (for the `--filetype`
argument) and exit
--html HTML, -t HTML Path to write an interactive HTML file for exploring
--html HTML, -t HTML path to write an interactive HTML file for exploring
the PDF
--try-all-offsets, -a
Search for a file match at every possible offset; this
can be very slow for larger files
--only-match, -m Do not attempt to parse known filetypes; only match
--only-match-mime, -I
just print out the matching MIME types for the file,
one on each line
--only-match, -m do not attempt to parse known filetypes; only match
against file magic
--debug, -d Print debug information
--quiet, -q Suppress all log output (overrides --debug)
--version, -v Print PolyFile's version information to STDERR
-dumpversion Print PolyFile's raw version information to STDOUT and
--require-match if no matches are found, exit with code 127
--max-matches MAX_MATCHES
stop scanning after having found this many matches
--debug, -d print debug information
--trace, -dd print extra verbose debug information
--debugger, -db drop into an interactive debugger for libmagic file
definition matching and PolyFile parsing
--no-debug-python by default, the `--debugger` option will break on
custom matchers and prompt to debug using PDB. This
option will suppress those prompts.
--quiet, -q suppress all log output (overrides --debug)
--version, -v print PolyFile's version information to STDERR
-dumpversion print PolyFile's raw version information to STDOUT and
exit
```
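
The `--only-match-mime` and `--require-match` options in the help text above lend themselves to scripting. The following is a minimal sketch, not part of PolyFile itself: the helper names `interpret_mime_result` and `match_mime_types` are hypothetical, and it assumes that standard output contains only MIME types (one per line) and that any exit code other than 0 or 127 indicates an error.

```python
import subprocess
from typing import List, Optional

# Exit code documented for `--require-match` when nothing matches.
NO_MATCH_EXIT_CODE = 127


def interpret_mime_result(returncode: int, stdout: str) -> Optional[List[str]]:
    """Turn an exit code and captured stdout into a list of MIME types.

    Returns None when the exit code signals "no matches found".
    Treating every other non-zero code as an error is an assumption.
    """
    if returncode == NO_MATCH_EXIT_CODE:
        return None
    if returncode != 0:
        raise RuntimeError(f"polyfile exited with unexpected code {returncode}")
    # One MIME type per line; skip blank lines.
    return [line.strip() for line in stdout.splitlines() if line.strip()]


def match_mime_types(path: str, executable: str = "polyfile") -> Optional[List[str]]:
    """Run polyfile on `path` and return the matching MIME types, if any."""
    proc = subprocess.run(
        [executable, "--only-match-mime", "--require-match", path],
        capture_output=True,
        text=True,
        check=False,  # exit code 127 is an expected outcome, not an exception
    )
    return interpret_mime_result(proc.returncode, proc.stdout)
```

Keeping the parsing in `interpret_mime_result` separate from the subprocess call makes the exit-code handling testable without PolyFile installed.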

@@ -76,6 +87,12 @@ You can optionally have PolyFile output an interactive HTML page containing a la
polyfile INPUT_FILE --html output.html > output.json
```

### Interactive Debugger

PolyFile has an interactive debugger for both its file matching and its parsing. It can be used to debug a libmagic pattern
definition, determine why a specific file fails to be classified as the expected MIME type, or step through a parser.
You can run PolyFile with the debugger enabled using the `-db` option.

### File Support

PolyFile has a cleanroom, [pure Python implementation of the libmagic file classifier](#libmagic-implementation), and
@@ -102,6 +119,12 @@ TrID matching code is still shipped with PolyFile and can be invoked programmati

PolyFile outputs its mapping in an extension of the [SBuD](https://github.com/corkami/sbud) JSON format described [in the documentation](docs/json_format.md).

PolyFile can also emit a standalone HTML document that contains an interactive hex viewer as well as syntax trees for
the discovered file formats. Simply pass the `--html` argument to PolyFile with an output path:
```console
$ polyfile input_file --html output.html
```

### libMagic Implementation

PolyFile has a cleanroom implementation of [libmagic (used in the `file` command)](https://github.com/file/file).
@@ -125,6 +148,32 @@ with open("file_to_test", "rb") as f:
...
```

### Debugging the libmagic DSL
`libmagic` has an esoteric, poorly documented domain-specific language (DSL) for specifying its matching signatures.
You can read the minimal and (as we have discovered in our cleanroom implementation) _incomplete_ documentation by running
`man 5 magic`. PolyFile implements an interactive debugger for stepping through the DSL specifications, modeled after
GDB. You can enter this debugger by passing the `--debugger` or `-db` argument to PolyFile. It is useful both for
implementing new `libmagic` DSLs and for figuring out why an existing DSL fails to match against a given file.
```console
$ polyfile -db input_file
PolyFile 0.3.5
Copyright ©2021 Trail of Bits
Apache License Version 2.0 https://www.apache.org/licenses/

For help, type "help".
(polyfile) help
help ....... print this message
continue ... continue execution until the next breakpoint is hit
step ....... step through a single magic test
next ....... continue execution until the next test that matches
where ...... print the context of the current magic test (aliases: info stack and backtrace)
test ....... test the following libmagic DSL test at the current position
print ...... print the computed absolute offset of the following libmagic DSL offset
breakpoint . list the current breakpoints or add a new one
delete ..... delete a breakpoint
quit ....... exit the debugger
```

## Merging Output From PolyTracker

[PolyTracker](https://github.com/trailofbits/polytracker) is PolyFile’s sister utility for automatically instrumenting
@@ -138,42 +187,41 @@ A separate utility called `polymerge` is installed with PolyFile specifically de
tools.

```
usage: polyfile [-h] [--filetype FILETYPE] [--list] [--html HTML]
[--only-match-mime] [--only-match] [--require-match]
[--max-matches MAX_MATCHES] [--debug] [--trace] [--quiet]
[--version] [-dumpversion]
[FILE]
usage: polymerge [-h] [--cfg CFG] [--cfg-pdf CFG_PDF]
[--dataflow [DATAFLOW ...]] [--no-intermediate-functions]
[--demangle] [--type-hierarchy TYPE_HIERARCHY]
[--type-hierarchy-pdf TYPE_HIERARCHY_PDF] [--diff [DIFF ...]]
[--debug] [--quiet] [--version] [-dumpversion]
FILES [FILES ...]
A utility to recursively map the structure of a file.
A utility to merge the JSON output of `polyfile`
with a polytracker.json file from PolyTracker.
https://github.com/trailofbits/polyfile/
https://github.com/trailofbits/polytracker/
positional arguments:
FILE the file to analyze; pass '-' or omit to read from
STDIN
FILES Path to the PolyFile JSON output and/or the PolyTracker JSON output. Merging will only occur if both files are provided. The `--cfg` and `--type-hierarchy` options can be used if only a single file is provided, but no merging will occur.
optional arguments:
-h, --help show this help message and exit
--filetype FILETYPE, -f FILETYPE
explicitly match against the given filetype or
filetype wildcard (default is to match against all
filetypes)
--list, -l list the supported filetypes (for the `--filetype`
argument) and exit
--html HTML, -t HTML path to write an interactive HTML file for exploring
the PDF
--only-match-mime, -I
just print out the matching MIME types for the file,
one on each line
--only-match, -m do not attempt to parse known filetypes; only match
against file magic
--require-match if no matches are found, exit with code 127
--max-matches MAX_MATCHES
stop scanning after having found this many matches
--debug, -d print debug information
--trace, -dd print extra verbose debug information
--quiet, -q suppress all log output (overrides --debug)
--version, -v print PolyFile's version information to STDERR
-dumpversion print PolyFile's raw version information to STDOUT and
exit
--cfg CFG, -c CFG Optional path to output a Graphviz .dot file representing the control flow graph of the program trace
--cfg-pdf CFG_PDF, -p CFG_PDF
Similar to --cfg, but renders the graph to a PDF instead of outputting the .dot source
--dataflow [DATAFLOW ...]
For the CFG generation options, only render functions that participated in dataflow. `--dataflow 10` means that only functions in the dataflow related to byte 10 should be included. `--dataflow 10:30` means that only functions operating on bytes 10 through 29 should be included. The beginning or end of a range can be omitted and will default to the beginning and end of the file, respectively. Multiple `--dataflow` ranges can be specified. `--dataflow :` will render the CFG only with functions that operated on tainted bytes. Omitting `--dataflow` will produce a CFG containing all functions.
--no-intermediate-functions
To be used in conjunction with `--dataflow`. If enabled, only functions in the dataflow graph if they operated on the tainted bytes. This can result in a disjoint dataflow graph.
--demangle Demangle C++ function names in the CFG (requires that PolyFile was installed with the `demangle` option, or that the `cxxfilt` Python module is installed.)
--type-hierarchy TYPE_HIERARCHY, -t TYPE_HIERARCHY
Optional path to output a Graphviz .dot file representing the type hierarchy extracted from PolyFile
--type-hierarchy-pdf TYPE_HIERARCHY_PDF, -y TYPE_HIERARCHY_PDF
Similar to --type-hierarchy, but renders the graph to a PDF instead of outputting the .dot source
--diff [DIFF ...] Diff an arbitrary number of input polytracker.json files, all treated as the same class, against one or more polytracker.json provided after `--diff` arguments
--debug, -d Print debug information
--quiet, -q Suppress all log output (overrides --debug)
--version, -v Print PolyMerge's version information and exit
-dumpversion Print PolyMerge's raw version information and exit
```
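
The `--dataflow` range syntax described in the help text above can be illustrated with a short sketch. This is not `polymerge`'s actual parser; the function name `parse_dataflow_range` and the `file_size` parameter are hypothetical, and the half-open `(start, end)` return convention is an assumption chosen so that `10:30` covers bytes 10 through 29.

```python
from typing import Tuple


def parse_dataflow_range(spec: str, file_size: int) -> Tuple[int, int]:
    """Convert a `--dataflow` spec into a half-open (start, end) byte range.

    "10"    -> only byte 10
    "10:30" -> bytes 10 through 29
    "10:"   -> byte 10 through the end of the file
    ":30"   -> the beginning of the file through byte 29
    ":"     -> the whole file
    """
    if ":" not in spec:
        # A bare offset selects a single byte.
        offset = int(spec)
        return offset, offset + 1
    begin, _, end = spec.partition(":")
    # Omitted bounds default to the beginning and end of the file.
    start = int(begin) if begin else 0
    stop = int(end) if end else file_size
    if start > stop:
        raise ValueError(f"invalid range {spec!r}: start is after end")
    return start, stop
```

For example, `parse_dataflow_range("10:30", file_size=100)` yields the pair `(10, 30)`, which excludes byte 30 as the help text specifies.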

The output of `polymerge` is the same as [PolyFile’s output format](docs/json_format.md), augmented with the following:
@@ -202,5 +250,4 @@ This research was developed by [Trail of
Bits](https://www.trailofbits.com/) with funding from the Defense
Advanced Research Projects Agency (DARPA) under the SafeDocs program
as a subcontractor to [Galois](https://galois.com). It is licensed under the [Apache 2.0 license](LICENSE).
The [PDF parser](polyfile/pdfparser.py) is modified from the parser developed by Didier Stevens and released into the
public domain. © 2019, Trail of Bits.
© 2019, Trail of Bits.
6 changes: 6 additions & 0 deletions hooks/README.md
@@ -0,0 +1,6 @@
# Default Git Hooks for PolyFile Development

To enable these hooks, developers must run this after cloning the repo:
```bash
$ git config core.hooksPath ./hooks
```
49 changes: 49 additions & 0 deletions hooks/pre-commit
@@ -0,0 +1,49 @@
#!/bin/sh
#
# An example hook script to verify what is about to be committed.
# Called by "git commit" with no arguments. The hook should
# exit with non-zero status after issuing an appropriate message if
# it wants to stop the commit.
#
# To enable this hook, rename this file to "pre-commit".

if git rev-parse --verify HEAD >/dev/null 2>&1
then
against=HEAD
else
# Initial commit: diff against an empty tree object
against=$(git hash-object -t tree /dev/null)
fi

# If you want to allow non-ASCII filenames set this variable to true.
allownonascii=$(git config --bool hooks.allownonascii)

# Redirect output to stderr.
exec 1>&2

# Cross platform projects tend to avoid non-ASCII filenames; prevent
# them from being added to the repository. We exploit the fact that the
# printable range starts at the space character and ends with tilde.
if [ "$allownonascii" != "true" ] &&
# Note that the use of brackets around a tr range is ok here, (it's
# even required, for portability to Solaris 10's /usr/bin/tr), since
# the square bracket bytes happen to fall in the designated range.
test $(git diff --cached --name-only --diff-filter=A -z $against |
LC_ALL=C tr -d '[ -~]\0' | wc -c) != 0
then
cat <<\EOF
Error: Attempt to add a non-ASCII file name.
This can cause problems if you want to work with people on other platforms.
To be portable it is advisable to rename the file.
If you know what you are doing you can disable this check using:
git config hooks.allownonascii true
EOF
exit 1
fi

# If there are whitespace errors, print the offending file names and fail.
exec git diff-index --check --cached $against --
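
The filename check in the hook above deletes every byte in the printable range (space through tilde) along with the NUL separators, then fails if any bytes remain. A rough Python equivalent of that test is sketched below; the function names are hypothetical, and the input is assumed to be the NUL-separated output of `git diff --cached --name-only --diff-filter=A -z`.

```python
from typing import Iterable, List


def split_nul_separated(raw: bytes) -> List[bytes]:
    """Split NUL-separated output (as produced with `-z`) into filenames."""
    return [part for part in raw.split(b"\0") if part]


def non_ascii_filenames(names: Iterable[bytes]) -> List[bytes]:
    """Return the filenames containing any byte outside 0x20 (space) to 0x7E (tilde).

    This mirrors the hook: bytes inside that range are ignored and
    anything left over marks the filename as non-portable.
    """
    return [
        name
        for name in names
        if any(byte < 0x20 or byte > 0x7E for byte in name)
    ]
```

A commit would be rejected under this logic whenever `non_ascii_filenames(split_nul_separated(raw))` is non-empty.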
82 changes: 82 additions & 0 deletions hooks/pre-push
@@ -0,0 +1,82 @@
#!/bin/sh

# An example hook script to verify what is about to be pushed. Called by "git
# push" after it has checked the remote status, but before anything has been
# pushed. If this script exits with a non-zero status nothing will be pushed.
#
# This hook is called with the following parameters:
#
# $1 -- Name of the remote to which the push is being done
# $2 -- URL to which the push is being done
#
# If pushing without using a named remote those arguments will be equal.
#
# Information about the commits which are being pushed is supplied as lines to
# the standard input in the form:
#
# <local ref> <local sha1> <remote ref> <remote sha1>
#
# This sample shows how to prevent push of commits where the log message starts
# with "WIP" (work in progress).

#remote="$1"
#url="$2"
#
#z40=0000000000000000000000000000000000000000
#
#while read local_ref local_sha remote_ref remote_sha
#do
# if [ "$local_sha" = $z40 ]
# then
# # Handle delete
# :
# else
# if [ "$remote_sha" = $z40 ]
# then
# # New branch, examine all commits
# range="$local_sha"
# else
# # Update to existing branch, examine new commits
# range="$remote_sha..$local_sha"
# fi
#
# # Check for WIP commit
# commit=`git rev-list -n 1 --grep '^WIP' "$range"`
# if [ -n "$commit" ]
# then
# echo >&2 "Found WIP commit in $local_ref, not pushing"
# exit 1
# fi
# fi
#done

# We could do the following as a `pre-commit` hook, but it's expensive, so only do it pre-push:
echo Linting Python code...
flake8 polyfile tests --exclude polyfile/kaitai/parsers/ --select=E9,F63,F7,F82 1>/dev/null 2>/dev/null
RESULT=$?
if [ $RESULT -ne 0 ]; then
cat <<\EOF
Failed Python lint:
flake8 polyfile tests --exclude polyfile/kaitai/parsers/ --count --select=E9,F63,F7,F82 --show-source --statistics
EOF
exit 1
fi

#flake8 polyfile tests --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics 1>/dev/null 2>/dev/null
#RESULT=$?
#if [ $RESULT -ne 0 ]; then
# cat <<\EOF
#Failed Python lint:
#
# flake8 polyfile tests --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
#EOF
# exit 1
#fi

echo Running Tests...
pytest tests

# echo Type-checking Python code...
# mypy --ignore-missing-imports polyfile tests
#exit $?
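
The commented-out sample at the top of the pre-push hook above chooses which commits to examine based on the two hashes read from standard input. The sketch below restates that selection in Python for clarity; `revision_range` is a hypothetical name and the code is not part of the hook.

```python
from typing import Optional

# Git reports an all-zero hash for a ref that does not exist on that side.
Z40 = "0" * 40


def revision_range(local_sha: str, remote_sha: str) -> Optional[str]:
    """Pick the revision range to inspect for one pushed ref.

    Returns None when the push deletes the ref, since there is nothing to check.
    """
    if local_sha == Z40:
        # The ref is being deleted.
        return None
    if remote_sha == Z40:
        # New branch: examine all commits reachable from the local tip.
        return local_sha
    # Existing branch: examine only the commits being added.
    return f"{remote_sha}..{local_sha}"
```

The returned string has the form the sample passes to `git rev-list` as its `$range` argument.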
2 changes: 1 addition & 1 deletion polyfile/__init__.py
@@ -1,3 +1,3 @@
from . import nes, pdf, zipmatcher, trid, kaitaimatcher, polyfile
from . import nes, pdf, jpeg, zipmatcher, kaitaimatcher, languagematcher, polyfile
from .__main__ import main
from .polyfile import __version__