[Question] Tainting Program Input #1094

Closed
mauricesvp opened this issue Apr 26, 2020 · 14 comments

@mauricesvp

Hello,

I want to taint the input of a program (64bit ELF on Linux to be precise). For example, running cat on one or more files, I'd like to know at which instructions (specifically, at which addresses) the input changes or influences the program execution, up until one or more sinks are reached (in this case, a sink would be cat writing to stdout).

So far, I have tried to use primus-taint, however I am not quite sure how I could use the IR-results to get the info I need. Is there a simple way of achieving the aforementioned goal?

On a side note I might add that the links under Running Primus Taint Analysis point to deleted posts.

All help is appreciated, thanks in advance.

@ivg
Member

ivg commented Apr 27, 2020

> I want to taint the input of a program (64bit ELF on Linux to be precise).

Do you mean the command-line input? It is surprisingly hard :) For simple binaries it is totally doable; just taint the argv array on entry to main:

;; file taint-main.lisp -- taints the inputs of the main function
(require posix)

(defmethod call (name argc argv)
  (when (= name 'main)
    (+= argv (sizeof ptr_t)) ; skip the binary name
    (while (read-word ptr_t argv)
      (let ((arg (read-word ptr_t argv)))
        (taint-introduce-indirectly
         'command-line-arguments arg (strlen arg)))
      (+= argv (sizeof ptr_t)))))

So far so good, and if you take a simple example from our testsuite, it will work like a charm:

bap x86_64-linux-gnu-echo --run --primus-lisp-load=taint-main --primus-print-obs=pc-changed,taint-attached --primus-print-output=incidents --run-argv=hello,cruel,world

Instead of propagating the taint event to the static representation of the program, we just record all observations of the pc-changed event (which is generated every time we enter a new instruction) and of the taint-attached event (which is generated every time we attach or propagate taint to a newly generated value). So, a simple hack would be to grep for all taint-attached events and look for the pc-changed event before each of them, e.g.,

grep -B1 taint-attached incidents | grep pc-changed

You can further cut the address from the event using awk or cut and even turn the output into a set:

grep -B1 taint-attached incidents | grep pc-changed | cut -d' ' -f2 | cut -d: -f1 | sort -u
0x40070F
0x400712
0x400715
0x400720
0x400723
0x40083C
0x400853

These are the instructions that are tainted by the input. If your binary is compiled with debugging information, you can even pipe them through addr2line to get the corresponding source locations.
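
The grep -B1 pipeline only attributes a taint-attached event to a pc-changed event on the line immediately before it. A minimal sketch of a slightly more tolerant variant, which remembers the most recent pc-changed address instead, could look like this. It assumes (the thread does not confirm this) that the log holds one observation per line and that the address is the second space-separated field, terminated by `:` or `)`. The function name `tainted_addresses` is made up for illustration.

```shell
# Sketch: list the unique addresses at which taint was attached.
# Assumed line shapes: "(pc-changed 0x40070F:...)" and "(taint-attached ...)".
tainted_addresses() {
    awk '
        # Remember the address of the most recently entered instruction.
        $1 ~ /pc-changed/ { pc = $2; sub(/[:)].*$/, "", pc) }
        # Attribute every taint event to that instruction.
        $1 ~ /taint-attached/ && pc != "" { print pc }
    ' "$1" | sort -u
}
```

It would be called as `tainted_addresses incidents`. If the binary has debugging information, the result can be piped to `addr2line -e <binary>`, which reads addresses from standard input when none are given on the command line.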

But here comes the real world, and in it things are not always that easy. In the real world (e.g., in cat), people use GNU getopt or even getopt_long interleaved with various i18n functions, which we unfortunately do not model yet. Not that it is that hard (it is not printf), but someone still has to do it. So, if you try the same approach with cat, you won't find anything. Most likely, you will not even find a reference to main, especially in an ASLR binary compiled with a modern compiler. So such a naive approach works only for educational purposes.

To analyze real binaries en masse, we do not focus on command-line arguments only but generalize to all inputs that come into the binary. So, we track the known user-input functions and taint the data that they define. In some of our projects, we even employed a special module that taints all upper exposed variables, so that we can be sure that argv-pointed data and environment variables are also included in the set.

I hope that it gives enough starting points :)

> On a side note I might add that the links under Running Primus Taint Analysis point to deleted posts.

Indeed, thanks for noticing this. We had to delete the old documentation because we ran out of space on github.io. I think that I have fixed all of them, tell me if you find any non-working reference.

@mauricesvp
Author

> I hope that it gives enough starting points :)

Indeed, thank you very much already :)

> But here comes the real world, and in it things are not always that easy. In the real world (e.g., in cat), people use GNU getopt or even getopt_long interleaved with various i18n functions, which we unfortunately do not model yet. Not that it is that hard (it is not printf), but someone still has to do it. So, if you try the same approach with cat, you won't find anything.

Luckily, I have already compiled all the coreutils myself, disabling ASLR as well as stack canaries, and also not stripping debug information. (I have done that to speed up CFG generation.)

Using bap <custom_cat> --run --primus-lisp-load=taint-main --primus-print-obs=pc-changed,taint-attached --primus-print-output=incidents --run-argv=<file1>,<file2>, I am able to generate tainted addresses, however I am not sure if there aren't some addresses missing. To be precise, I get two addresses from within main, at which open and strcmp are called respectively, as well as three addresses from within _getopt_internal_r (which maybe has something to do with the lacking support of getopt, as mentioned by you). What I would have sort of expected is to also have addresses at (any) write function.

Using taint-sources instead of taint-main, I do not get any (taint-attached) results, do I perhaps need to further change the arguments in some way?

> I think that I have fixed all of them, tell me if you find any non-working reference.

At the time of writing this, the two links under here are still invalid.

@ivg
Member

ivg commented Apr 27, 2020

> What I would have sort of expected is to also have addresses at (any) write function.

There is no taint propagation between the filename and its contents by default, as those are indeed two different pieces of data that do not intersect.

But you can, of course, enable it with the following simple Primus Lisp script (sorry didn't have time to check if it works)

(defmethod call-return (name path fd)
  (when (and (= name 'open)
             (taint-get-direct 'command-line-arguments path))
    (dict-add 'tainted-file-descriptors fd path)))

(defmethod call-return (name fd buf _ bytes-read)
  (when (and (= name 'read)
             (dict-has 'tainted-file-descriptors fd))
    (taint-introduce-indirectly 'user-input buf bytes-read)))

Just add it to the taint-main.lisp file. The idea is simple: whenever we see an open whose pathname is tainted with our 'command-line-arguments taint, we record its fd in our dictionary (we keep the filename so that later we can create nice incident reports, but for now it is not really necessary).

Now, when we see a call to read that returns bytes-read (the number of bytes read; yeah, there should be a check that the number is positive), we check whether the fd is our victim fd, and if it is, we introduce a new taint, now called 'user-input, to the data pointed to by buf, covering the number of bytes that read returned.

> Using taint-sources instead of taint-main, I do not get any (taint-attached) results, do I perhaps need to further change the arguments in some way?

yeah, add also posix, e.g., --primus-lisp-load=taint-sources,posix
I've cheated by adding the posix feature in the taint-main. (Note, posix enables emulation of various POSIX functions).

@mauricesvp
Author

> But you can, of course, enable it with the following simple Primus Lisp script (sorry didn't have time to check if it works)

Hm, it doesn't seem to work quite yet, I get the same addresses.

> yeah, add also posix, e.g., --primus-lisp-load=taint-sources,posix
> I've cheated by adding the posix feature in the taint-main. (Note, posix enables emulation of various POSIX functions).

Also here I still get no taint-attached results. :/

@ivg
Member

ivg commented Apr 27, 2020

Can you drop the binary here (in a zip to make GitHub happy), so that we can be on the same page?
Also, how do you run the analysis? If you have real input files (and want them to be available for microexecution) you can use --primus-lisp-channel-redirect, e.g.,

--primus-lisp-channel-redirect=foo:./data/foo

then open("foo") with a subsequent read will read the data from ./data/foo. I think that what is happening is that you're running bap in the standard interpretation mode (not in the promiscuous mode) and open("foo") just returns -1 as there is no such file, so read is not even called. Besides, you can use (msg "bla-bla-bla") for debugging Primus Lisp scripts.
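
The redirect option takes `name:path` pairs, as in the `foo:./data/foo` example, and the invocation later in this thread separates several pairs with commas. A small, hypothetical helper (the name `redirect_arg` is not part of bap) can assemble that option from a list of host files, assuming the analyzed program opens each file by its base name:

```shell
# Sketch: build --primus-lisp-channel-redirect=NAME:PATH,... from host paths.
# Assumption: the emulated program opens each file by its base name.
redirect_arg() {
    pairs=""
    for host_file in "$@"; do
        # Join pairs with commas, skipping the comma before the first one.
        pairs="${pairs:+$pairs,}$(basename "$host_file"):$host_file"
    done
    printf '%s\n' "--primus-lisp-channel-redirect=$pairs"
}
```

For example, `redirect_arg ./data/foo ./data/bar` prints a single option that maps foo and bar to the files under ./data.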

@mauricesvp
Author

> Can you drop the binary here (in a zip to make GitHub happy), so that we can be on the same page?

Here it is: cat.zip

> Also, how do you run the analysis?

Like this: bap cat-cfi --run --primus-lisp-load=taint-main --primus-print-obs=pc-changed,taint-attached --primus-print-output=incidents --run-argv=foo,bar, where foo and bar are files with a couple of lines of text, placed in the same directory as cat-cfi.
Trying the --primus-lisp-channel-redirect flag I get the same results (also for taint-sources).

@ivg
Member

ivg commented Apr 28, 2020

I see, we forgot that there is a flags parameter in open(2), so we were tainting fd 0 (the value of the flags parameter) instead of the actual fd returned from the call. Ah, and one more error: instead of looking for taint in the value pointed to by the path parameter, we were looking for taint attached to the path itself (in other words, we forgot to dereference it). With those two bugs fixed, i.e., with the following code

(require posix)
(require pointers)

(defmethod call (name argc argv)
  (when (= name 'main)
    (+= argv (sizeof ptr_t)) ; skip the binary name
    (while (read-word ptr_t argv)
      (let ((arg (read-word ptr_t argv)))
        (taint-introduce-indirectly
         'command-line-arguments arg (strlen arg)))
      (+= argv (sizeof ptr_t)))))


(defmethod call-return (name path _ fd)
  (when (and (= name 'open)
             (taint-get-indirect 'command-line-arguments path))
    (dict-add 'tainted-file-descriptors fd path)))

(defmethod call-return (name fd buf _ bytes-read)
  (when (and (= name 'read)
             (/= bytes-read 0)
             (/= bytes-read -1)
             (dict-has 'tainted-file-descriptors fd))
    (taint-introduce-indirectly 'user-input buf bytes-read)))

we can run bap as

bap cat-cfi --run --primus-lisp-load=taint-main --primus-print-obs=pc-changed,taint-attached --primus-print-output=incidents --run-argv=cat,foo,bar --primus-lisp-channel-redirect=foo:./foo,bar:./bar -dbir:cat.bir -dasm:cat.asm --report-progress

This will add only one address to the set of tainted addresses, the long-awaited 0x40308E, which is the call to write. So the total number of instructions that depend on the input with this invocation of cat is not impressive. But that is expected, first of all because cat is a pipe program that doesn't really depend on the contents of the files that it pipes. The second reason is that we investigated only one path, which corresponds to our input specification (and covers less than 10% of the whole program).

We can enable the promiscuous mode to cover more paths. In this mode, Primus, instead of acting as a standard interpreter, will use forced execution to visit both sides of each branch. To enable this mode we need to add a couple of options. We already have quite a few options, so it becomes unwieldy to use the command line for that. We will use recipes instead: just create a file named taint-inputs.recipe with the following contents

(option pass run)
(option primus-lisp-load
        taint-main)
(option primus-print-observations
        pc-changed
        taint-attached
        lisp-message)
(option primus-print-output
        incidents)
(option run-argv cat foo bar)
(option primus-lisp-channel-redirect
        foo:$prefix/foo
        bar:$prefix/bar
        <stdin>:/dev/null
        <stdout>:stdout
        <stderr>:stderr)

(option primus-promiscuous-mode)
(option primus-greedy-scheduler)
(option primus-limit-max-length 32768)
(option report-progress)

(option log-dir .)

You can then even zip it (together with the taint-main.lisp, foo, and bar files) and share it, so that other people can reuse it. Now we can run it

bap cat-cfi --recipe=taint-inputs

On my machine it will investigate 765 more paths and discover about 30 more instructions that potentially depend on the user input. It will now cover nearly half of the program. This is the code that we were able to reach from main. It looks like the binary is static and includes a lot of unreachable code, as far as I can see.

@mauricesvp
Author

First of all, thanks a lot for your efforts!

Running the updated taint-main I run into some weird problems though.
On the one hand I noticed that the contents of foo and bar get wiped, is this intended?
Also, other than one more address (which is a call to strcmp), I don't get any new addresses compared to the previous version.
Furthermore, is the stdout file supposed to only contain the expected output (in this case the contents of foo and bar)? Because for me there are only some sort of random bytes (14K, with some repeated sequences) in there.
Also another small detail, you have added cat in the argv, does this make a difference (in that it perhaps should be set to cat-cfi instead? Testing with taint-main I have not found any difference though).

@ivg
Member

ivg commented Apr 28, 2020

> On the one hand I noticed that the contents of foo and bar get wiped, is this intended?

Nope, it's a bug, fixed in #1049

> Also, other than one more address (which is a call to strcmp), I don't get any new addresses compared to the previous version.

That's strange. Did you restore the contents of the files? In the standard interpretation mode, an empty file will not induce any data dependency on write. Also, take a look in the log file (it is either in ~/.local/state/bap/log or in the current working directory when you use the provided recipe). Look for any suspicious things, like exceptions or something like this. I am running a development version of bap, which is many thousands of lines of code different from the master version, so your mileage may vary. It might be that you will need to add some header files to move forward. It will be indicated in the log (something like no matching definition for isdigit). You can use the --api-add option to add more headers (or, as a hack, just modify the posix.h in your bap installation and add the following to its end):

// ctype.h
int isalnum(int c);
int isalpha(int c);
int iscntrl(int c);
int isdigit(int c);
int isgraph(int c);
int islower(int c);
int isprint(int c);
int ispunct(int c);
int isspace(int c);
int isupper(int c);
int isxdigit(int c);

int isascii(int c);
int isblank(int c);
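
To see which declarations are actually missing before editing any headers, the log can be searched for the message mentioned above. This is only a sketch: the wording `no matching definition for <name>` is taken from the paraphrase in this comment and may differ in a real log, and `missing_definitions` is a made-up helper name.

```shell
# Sketch: list function names reported as lacking a definition in a bap log.
# Assumed message shape: "... no matching definition for isdigit ...".
missing_definitions() {
    # Keep only the matching phrase, then its last word (the function name).
    grep -o 'no matching definition for [A-Za-z_][A-Za-z0-9_]*' "$1" |
        awk '{ print $NF }' |
        sort -u
}
```

It takes the path of the log file as its only argument and prints one function name per line.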

> Furthermore, is the stdout file supposed to only contain the expected output (in this case the contents of foo and bar)? Because for me there are only some sort of random bytes (14K, with some repeated sequences) in there.

In the standard interpretation mode, yes, it will contain only the contents of those files. In promiscuous mode it may contain random values as the interpreter will randomly explore various paths with random inputs.

> Also another small detail, you have added cat in the argv, does this make a difference (in that it perhaps should be set to cat-cfi instead?

Yep, run-argv is basically the argv array as passed to main or to execve, i.e., its first element should be the name of the program. It could be anything, of course, though some programs, e.g., busybox, use the passed name to determine what they should do.

@mauricesvp
Author

Running it again (making 100% sure foo and bar are not empty), I don't get anything with "error", "fail", "missing" etc in the log.

The only remotely interesting lines I found are

x86.info> can't detect ABI, falling back to default

as well as

disassemble.info> importing knowledge from cache
bap.warning> region is not mapped to memory
bap.warning> region is not mapped to memory
bap.warning> region is not mapped to memory
bap.warning> region is not mapped to memory
bap.warning> region is not mapped to memory
bap.warning> region is not mapped to memory
bap.warning> region is not mapped to memory
bap.warning> region is not mapped to memory
bap.info> providing rooter and symbolizer from image

which probably are nothing out of the ordinary.

This might be far-fetched, but does this look alright? (I can only suspect that the execution potentially stops prematurely.)

run.info> The Lisp Machine program is well-typed
run.info> Fork 1: starting from the %0003454c entry point
run.info> execution from %0003454c terminated with: Halt 
run.info> Ok, we've terminated normally

@ivg
Member

ivg commented Apr 29, 2020

Yep, everything looks alright. But do you see the file contents in the stdout file? It looks like you need to update. I don't remember the details, but without #1049 it won't work, as the file contents are deleted before they are read.
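
Since an emptied input file silently removes the data dependency on write, it may be worth verifying the inputs before each run. A minimal sketch, with a made-up helper name `check_inputs`:

```shell
# Sketch: fail early if any input file is missing or empty.
check_inputs() {
    for input in "$@"; do
        # -s is true only for an existing file with a size greater than zero.
        if [ ! -s "$input" ]; then
            echo "input file is missing or empty: $input" >&2
            return 1
        fi
    done
}
```

With the recipe from this thread it could be used as `check_inputs foo bar && bap cat-cfi --recipe=taint-inputs`.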

@mauricesvp
Author

Using the current master version (instead of 2.0.0), I now got it working :) Thanks a lot again.

@ivg
Member

ivg commented Apr 29, 2020

You may also find this discussion in our Gitter channel interesting.

ivg closed this as completed Apr 30, 2020

@mauricesvp
Author

@ivg One question about the observation log: is there a way of adding or recovering information about the order in which the different addresses have been hit? (Like a control flow recovery, in order to determine the "last" address tainted.)
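
One possible way to recover the order, assuming observations are written to the log in the order they occur (which this thread does not confirm, and which may not hold across forks in promiscuous mode), is to number the pc-changed events and print that number next to each tainted address. The helper name `tainted_in_order` is hypothetical, and the same assumptions about the line shape apply as for the grep pipeline earlier in the thread.

```shell
# Sketch: print "STEP ADDRESS" for each instruction that attached taint,
# in the order the observations appear in the log.
tainted_in_order() {
    awk '
        # Count entered instructions and remember the current address.
        $1 ~ /pc-changed/ { step++; pc = $2; sub(/[:)].*$/, "", pc) }
        # Report each tainting instruction once per visit.
        $1 ~ /taint-attached/ && pc != "" && step != last {
            print step, pc
            last = step
        }
    ' "$1"
}
```

Under those assumptions, `tainted_in_order incidents | tail -n 1` would give the last tainted address of the run.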
