
BAP 2.0 #960

Merged
merged 390 commits into from Aug 14, 2019

Conversation


@ivg ivg commented Jul 29, 2019

BAP 2.0 Release

The BAP 2.0 Release is inevitable, and starting from this PR we are officially switching to BAP 2.0. The stable release is planned for October, so there is plenty of time to fix bugs and update interfaces before they are frozen. Therefore, comments, reviews, and questions are very welcome. Please keep in mind that this is still work in progress and a lot remains to be done.

What's new?

The BAP 2.0 release brings three main features:

  • new IR
  • knowledge representation framework
  • scalable runtime

Without any doubt, the main contribution of BAP 2.0 is the new representation of program semantics, which is extensible and abstract. This means that BAP no longer depends on BIL or any other (intermediate) language and allows an analyst to pick the representation that best suits their needs. For example, there is a plugin that adds support for the BIL language, and another plugin that seamlessly integrates the old (BAP 0.8) BIL into the BAP framework. There are also plans to add support for Ghidra's Sleigh and other languages in the (near) future. The bottom line is that BIL is no longer the representation, but just one option that can be chosen. Moreover, BAP supports multiple representations at the same time, as well as translation between them.

While the new representation is extensible (i.e., the set of semantic features is not fixed), it comes with an initial theory, called the Core Theory, which covers regular bitvector machines with memories, speculative execution, and floating-point arithmetic.

Under the hood, we use a knowledge representation system that provides a unified representation for knowledge on all scales and levels of program representation, i.e., we use the same mechanisms to represent instructions, subroutines, programs, and even operating systems. The global knowledge representation also enables scalability and extends analysis in BAP from the level of small binaries to the level of operating systems and even higher.

There are also lots of other contributions, like a completely new speculative disassembler, which
is both more conservative and more precise, supports arbitrary delay slots (and other modes of speculative execution), is independent of BIL, and is much more efficient. The full list of changes with detailed descriptions is provided below.

Breaking changes

None.

The old Bap.Std interface, as well as all libraries that constitute the BAP 1.x distribution, are reimplemented and use the new representation under the hood. So why are we bumping the version number? The main reason is that from now on Bap.Std et al. are deprecated and should be used for compatibility reasons only. We will gradually migrate our analyses to the new framework and encourage everyone to use the new representation in newly written analyses.

There is, however, one change that to some extent could be considered a breaking change: the Type.t data type of the BIL representation is extended with one more constructor, Unk, which denotes an expression of unknown type. This small change allowed us to implement the projection of the new (and rich) semantics into the old BAP without breaking any interfaces. A little bit of a cheat, but unless you explicitly model features that are not representable in BIL you will not notice this change, and your old code will work seamlessly.
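
For reference, a schematic sketch of the extended type (the Imm and Mem constructors are shown in a simplified form with stand-in size types; only the Unk constructor is the addition described above):

    (* simplified stand-ins for the actual size types in Bap.Std *)
    type addr_size = [ `r32 | `r64 ]
    type size = [ `r8 | `r16 | `r32 | `r64 ]

    type typ =
      | Imm of int               (* an immediate value of the given width *)
      | Mem of addr_size * size  (* a storage indexed by addresses        *)
      | Unk                      (* an expression with an unknown type    *)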

Besides, we even have a plugin that models floating-point arithmetic in BIL in a soft-float style, so that any analysis that relies on BIL can be applied to floating-point code.

The Detailed Description of New Features

The Knowledge Representation Framework

In BAP 1.x we used the Project.t data structure to eagerly represent all pieces of information that we were able to accumulate about the program, and employed a simple mechanism of passes to build and analyse this structure. This approach had a few problems. First of all, the representation wasn't uniform or really extensible. While some fields were blessed with an explicit representation, like the control flow graph and instructions, others were second-class citizens and were represented as attributes, which were obscure and rather inefficient. Another problem is that in program analysis the properties of a program are usually mutually dependent and in fact form a highly connected dependency graph, which is hard to resolve using a pass-based approach, even with dependencies and appropriate pass management. And finally, this didn't scale, since for large programs eagerly building the Project.t data structure is just infeasible and usually not necessary, as many questions can be answered without requiring a full CFG or semantics for all possible instructions in the binary.

Therefore, it became obvious to us that we need a common representation for different properties of a program, with mutual dependencies and a built-in fixed point. Our representation is a classical frame language, with objects having properties (accessed via slots). We ensure termination and independence from the order of execution of different analyses by requiring that all properties are represented as domains (partially ordered sets) and by preventing non-monotonic updates. We fully embed the knowledge representation runtime in our host language, OCaml, with all the rules properly reflected in the OCaml type system, e.g., properties are fully typed, safe to use, and accessed via a monadic interface (which is now much more convenient after the introduction of the monadic let-syntax into OCaml). If a property is recursive (either directly or via other properties) then a fixed point is computed. To ensure correctness of the computed fixed point (as well as its availability) we restrict all updates to be monotone (w.r.t. the provided partial ordering). Since the same property in program analysis can be computed to different values by different analyses, we also provide a consensus resolution engine, which ensures that a consensus will eventually be reached (and that its result doesn't depend on the order of evaluation).
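
As a rough illustration (a toy sketch with hypothetical names, not the actual Knowledge API), the two rules that make the fixed point well defined can be pictured as follows: every property value lives in a domain with a least element, and a property cell only accepts updates that refine the current value:

    type 'a domain = {
      empty : 'a;                (* the least element: nothing is known yet *)
      leq   : 'a -> 'a -> bool;  (* the partial order on property values    *)
      name  : string;
    }

    type 'a cell = { dom : 'a domain; mutable value : 'a }

    let create dom = { dom; value = dom.empty }

    (* an update is accepted only if it refines the current knowledge, so
       analyses may run in any order and still converge to the same answer *)
    let provide cell x =
      if cell.dom.leq cell.value x then cell.value <- x
      else invalid_arg ("non-monotonic update of " ^ cell.dom.name)

    (* example: a flat domain over an optional instruction length *)
    let length : int option domain = {
      empty = None;
      leq = (fun x y -> x = None || x = y);
      name = "insn-length";
    }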

While our Knowledge Framework is currently implemented in OCaml and can be operated only from OCaml, it is designed to be retargetable, and we're currently investigating other backends that could potentially provide better interoperability and scalability, such as various Datalog implementations and Souffle- or Semmle-like systems.

Further Reading

The Core Theory

The Core Theory uses the new Knowledge Representation Framework to define a set of objects and their properties which are commonly used to express knowledge in program analysis. In other words, it basically defines the language which is used for stating properties in program analysis. But not in the usual way, i.e., not as an AST or a set of classes/constructors, but rather as a set of abstractions, allowing any implementation of those abstractions to fill the role of the representation.

The word theory here should be interpreted as a mathematical theory, i.e., a set of sentences, and in general we base our design largely on the SMT-LIB specification. From this perspective, our Core Theory is a big umbrella theory for smaller and more manageable subtheories, namely:

  • booleans (SMT's Core theory)
  • bitvectors (SMT's BV)
  • memories (SMT's A)
  • floating points (SMT's FP)
  • effects (no SMT's counterpart)

All expressions and statements in the Core Theory are sorted, and their sorts are reflected in the host language type system through a tagless-final embedding. The latter means that the OCaml type checker will ensure that all formulae are well-formed and well-sorted. For example, it is impossible to add a value of bitvector sort to a value of memory sort, and even bitvector and memory sorts are indexed by their sizes, e.g., it isn't possible to apply an arithmetic operation to two bitvectors of different sizes.
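
A hedged sketch of what such an embedding looks like (simplified, not the actual Core Theory signatures): sorts are phantom type parameters, so the OCaml type checker rejects ill-sorted formulae such as adding a memory to a bitvector or mixing bitvector widths:

    type 'w bitv            (* the sort of bitvectors of width 'w            *)
    type ('k, 'v) mem       (* the sort of memories from 'k keys to 'v values *)

    module type CORE = sig
      type 'a term          (* a term of sort 'a *)
      val add   : 'w bitv term -> 'w bitv term -> 'w bitv term
      val load  : ('k, 'v) mem term -> 'k bitv term -> 'v bitv term
      val store : ('k, 'v) mem term -> 'k bitv term -> 'v bitv term -> ('k, 'v) mem term
    end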

The Core Theory is not fixed; it just defines a sane initial theory, and it is possible to extend it with new definitions and even define new sorts. The Core Theory doesn't define the semantics of the language, leaving it to the corresponding analyses. For example, the BIL plugin denotes the semantics of the Core Theory in terms of the BIL language.

In general, any language which provides an implementation of the Core Theory interface can be used interchangeably as a representation. In the Core Theory the translation from an abstract term to a concrete representation is called reification. For example, we can reify a program term into its BIL representation, or into a graph represented as BIR, or into the set of values or destinations that are computed in this term. From the programmer's perspective, implementing the Core Theory interfaces just means providing a suitable representation, in terms of the defined language, for each construct in the Core Theory, e.g.,

let add x y =
  x >>-> fun x ->    (* reify the first operand into its BIL expression *)
  y >>-> fun y ->    (* reify the second operand                        *)
  ret Bil.(x + y)    (* and return their sum as a BIL expression        *)

A translation from a concrete representation (e.g., BIL) to an abstract term is called reflection, but sometimes we stick to the old and traditional name, lifting. It is usually a bit more convoluted, since this direction involves moving from an untyped representation (which potentially includes ill-formed sentences) to a typed one, so one has to satisfy the OCaml typechecker. However, special support is provided for BIL-like languages in the form of a generic parser, so expressing the semantics of a concrete language in terms of the Core Theory usually boils down to simple syntactic rules, e.g.,

  (* map each BIL binary operation to the corresponding Core Theory term *)
  let bitv : type t r. (t,exp,r) bitv_parser =
    fun (module S) -> function
      | BinOp(PLUS,x,y) -> S.add x y
      | BinOp(MINUS,x,y) -> S.sub x y
      | BinOp(TIMES,x,y) -> S.mul x y
      | BinOp(DIVIDE,x,y) -> S.div x y
      ...

Of course, a full bi-directional specification of BIL/BIR is provided as part of this PR. We plan to add the semantic specification of the legacy BIL later.

Further Reading

New Speculative Disassembling Driver

This PR also provides a completely new disassembling driver, written from scratch (and all existing disassemblers are rewritten to use it). The driver uses the new Knowledge Framework and builds the whole-program CFG incrementally, storing only the minimum of information in process memory. It can be used to implement disassemblers of different flavors, e.g., linear, recursive descent, probabilistic, and superset disassemblers.

It is implemented as a recursive descent parser with arbitrary backtracking and is capable of automatically refuting instruction chains which are not terminated by jumps, thus seriously reducing the generated CFG in speculative and superset modes.

More importantly, the disassembler is designed to work incrementally and build the CFG on the fly, so that we can analyze huge binaries, processing each subroutine independently. This also relies on the new CFG reconstruction and function boundary identification algorithm described below.

Further reading

New CFG reconstruction algorithm

The task of partitioning the whole-program CFG into a quotient set of subroutine control flow graphs is now well-defined via the definition of the control flow graph of a subroutine, which requires each node of the graph to be reachable from, and dominated by, the entry node. Therefore, we build a dominator tree (forest), where nodes which are not dominated by any other node (beyond the pseudo entry node) are function starts, and the children of a function start node denote the body of the function.

We use our fixpoint function to compute the fixed point of the domination relationship, which also allows us to compute the set of functions incrementally: any newly discovered edges can be added to the fixed-point equation, and we will continue stabilization from the point where we stopped.
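
A minimal self-contained sketch of the idea (an ad hoc worklist over integer node ids, not the actual BAP implementation, which runs the same monotone computation through the fixpoint function over the knowledge base): dominator sets are refined by intersection until stabilization, and a node whose only strict dominator is the pseudo entry is reported as a function start.

    module Ints = Set.Make (Int)

    (* Dom(entry) = {entry}; Dom(n) = {n} ∪ ⋂ Dom(p) over the predecessors of n *)
    let dominators ~entry ~nodes ~preds =
      let all = Ints.of_list nodes in
      let dom = Hashtbl.create 16 in
      List.iter (fun n ->
          Hashtbl.replace dom n
            (if n = entry then Ints.singleton entry else all))
        nodes;
      let changed = ref true in
      while !changed do
        changed := false;
        List.iter (fun n ->
            if n <> entry then
              let meet = match preds n with
                | [] -> Ints.empty
                | p :: ps ->
                  List.fold_left
                    (fun acc q -> Ints.inter acc (Hashtbl.find dom q))
                    (Hashtbl.find dom p) ps in
              let dom' = Ints.add n meet in
              if not (Ints.equal dom' (Hashtbl.find dom n)) then begin
                Hashtbl.replace dom n dom';
                changed := true
              end)
          nodes
      done;
      dom

    (* a node starts a function if it is dominated only by itself and the
       pseudo entry node *)
    let function_starts ~entry ~nodes ~preds =
      let dom = dominators ~entry ~nodes ~preds in
      List.filter (fun n ->
          n <> entry &&
          Ints.equal (Hashtbl.find dom n) (Ints.of_list [entry; n]))
        nodes

Since adding an edge can only shrink the intersected dominator sets, stabilization can resume from the current state after new edges are discovered, which is what makes the incremental mode possible.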

Further Reading

New Bitvectors Library

Since all terms in the Core Theory are well-typed, it is no longer required to store the types of bitvectors in their bodies as tags. Therefore, we switched to a tagless representation of bitvectors, which is much more memory- and CPU-efficient. We also changed the semantics of bitvector operations to fully satisfy the SMT-LIB behavior, especially with respect to division by zero. Now all operations on bitvectors are total and well-defined.
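
For illustration, a hedged sketch (not the actual Bitvec interface) of what "total and well-defined" means here, with the modulus m = 2^w - 1 acting as the all-ones mask of a w-bit machine and the division cases following SMT-LIB:

    let add  ~m x y = (x + y) land m
    let sub  ~m x y = (x - y) land m
    let udiv ~m x y = if y = 0 then m else x / y           (* SMT-LIB: x div 0 = all ones *)
    let urem ~m x y = if y = 0 then x land m else x mod y  (* SMT-LIB: x rem 0 = x        *)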

We provide the old Bap_bitvector library as a shim over the new implementation for backward compatibility.

Further reading

Primus 2.0 and BAP Lisp

While we still retain the old Primus (which fully relies on BIR and the Project data structure), we also provide Primus 2.0, which is much more flexible and easier to use, but not yet as complete as Primus 1.x.

In the new Primus we are moving away from higher-order polymorphism (which actually proved unnecessary, as all analyses were still parametrized by the identity monad). In other words, components are no longer functors, but plain old functions.

The observation interface is tweaked so that we no longer use tuples to pass multiple arguments; instead, a curried notation is used, which greatly improves performance (no more allocations when an observation is made).
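
To illustrate the allocation argument (hypothetical names, not the Primus 2.0 interface): with a tupled observation the provider must allocate a pair on every event, while with the curried form the arguments are passed directly:

    type addr = int
    type word = int

    (* tupled: a fresh pair is allocated for every observation *)
    let notify_tupled  (observer : addr * word -> unit) a w = observer (a, w)

    (* curried: the arguments are passed directly, nothing is allocated *)
    let notify_curried (observer : addr -> word -> unit) a w = observer a w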

The Project data structure is no longer used (in fact, Primus 2.0 doesn't depend on Bap.Std at all); instead, the knowledge monad is transfused into the Primus monad.

Finally, we're developing a new BAP Lisp language which will be independent of Primus and will be able to store any semantics, and a new Primus Interpreter which will be extensible and capable of interpreting BAP Lisp.

Further Reading

Acknowledgments

I would like to thank all the contributors, especially Rafael Turner and Oleg Kamenkov, for their impressive work on floating-point modelling.

Questions

Please do not hesitate to ask questions, leave your comments and suggestions, and share your fears here. We will be happy to address them all. Keep in mind that nothing is released yet, so now is the best time to ask for changes :)

P.S. There are many more new features in this PR; I just didn't have enough time to go through all of them. I will update this message as necessary. There are also lots of bug fixes, which we should probably backport into 1.7.

Missing Features

  • Project serialization - it is currently disabled, so there is no caching. This is the number-one priority and will be provided very soonish.
  • Primus 2.0 with the BAP Lisp Interpreter
  • New frontend - will operate in both 2.x and 1.x modes.

gitoleg and others added 30 commits January 22, 2019 10:10
Based on an approximation of sqrt(x + 1) with Remez coefficients
the interface design is inspired (or, more correctly, enforced) by the
current optimization capabilities of OCaml; the end goal is that if
a bitvector operation can be computed without memory allocations,
then it should be, even if the width is not known in advance.

this is still a wip
didn't provide the specializations, probably will add them later.
Since we're going to use Bitvec as our next new Bitvector
implementation and since we don't want to alias it to Word or
Addr anymore (because it would introduce confusion with the Bap.Std.Word
and Bap.Std.Addr modules as well as significantly slow down
compilation), we need a shorter name. However, Bitv is already used
in the opam universe, and Bv is probably too short. So a middle
ground was chosen, which actually matches the SMT-LIB name for
bitvectors.

We also added two sublibraries along with submodules which enable
support for bitvectors in the Janestreet suite of libraries.
- Bitvec_order provides support for the interfaces that depend on
  ordering, like maps, sets, comparables, etc.
- Bitvec_binprot provides support for bin_io

Together they provide functions that make Bitvec.t a first-class
citizen of the JS Core libraries. We're moving away from the old JS style
of providing a full Comparable or Regular interface for each module,
since it bloats the binaries and, more importantly, significantly
slows down the compilation. We are now relying on small interfaces
such as comparators, so now instead of `Bitvec.Set.t` we will
use `(Bitvec.t, bitvec_order) Set.t` and instead of `Bitvec.Set.Empty`
we will use `Set.empty bitvec_order`

See the module documentation for more information.
so BAP can depend on the core theory now.
also made core theory less heavy.
Since many use cases of the Bitvec interface will use the generic
operations with a modulus created on the fly from a width represented
as an integer, having it boxed would have severe performance
ramifications.

So, instead of representing the modulus as a pair of the width and the
modulus itself (i.e., 2^w-1 and w), we have to decide on one or the other.

We decided to use 2^w-1 as the modulus representation as this is the
value that is used in normalization, so computing it every time will
add two extra operations which are not [@@noalloc]ed.

The ramification of this choice is that we still need to go back from
2^w-1 to w. The direct solution is to count bits, which is rather
efficient already. For the [msb] operation we don't really need to
know the width to test whether the msb is set, since

     msb x m ::= (m = m lsr 1 lor x) mod m

(though I'm still not sure which of the two will be more efficient,
[test_bit (num_bits m)] or the implementation above, but both are
rather efficient)
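
One possible reading of that test (a hedged sketch, assuming m = 2^w - 1 and x already normalized to the range 0..m): the most significant bit of x is set exactly when OR-ing x with the lower bits of the mask fills the whole mask.

    let msb x m = (x lor (m lsr 1)) = m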

We also reimplemented the shift operations. They no longer depend
on the signed extract and are implemented relying purely on bitwise
operations. We use `num_bits` to get the width there.
many of our packages were dependent on bap only indirectly
the subpackages won't work with separate installation, since they have
to share the same META file, which will break Dune and is in general
fragile.
…rs. This new casting operation is used to extract xmm registers from ymm registers in order to support monomorphic functions in sse. The SysV ABI is updated to recognize functions that read and write to xmm0.
… sin function and evaluate user-defined polynomials using Horner's method. Current example only computes a geometric series.
ivg added 5 commits August 8, 2019 11:36
and loads it in the bil_main, so that later we can add a command line
option which will select an appropriate semantics.
although the BIL semantics is now self-optimizing, it is still missing
some global optimizations (like elimination of useless assignments,
a.k.a. value propagation) which prevents our tests from working. We
will later disable the optimization passes, once our Bil denotational
optimizer handles such cases.
it needs an architecture; if none is present in the context, then it
will just translate jumps to specials.

ivg commented Aug 8, 2019

finally, all tests pass!

1) The Value.pp_slots function didn't work at all;
   it was assert-faulting because the sexp_of_f function
   of Dict wasn't really implemented.

2) The IR printers weren't handling the empty filter as
   "print everything", as they should according to the documentation.

Also, this adds the instruction destinations inspector.

ivg commented Aug 8, 2019

OK, that should be all; it is good to merge, unless something else is found.
I will now be working on the new frontend, which will go in a separate PR.

@@ -0,0 +1,6 @@

looks like this file is dead and should be removed

ivg added 4 commits August 8, 2019 15:36
btw, it was a good idea to add useful small primitives as a syntax
module, but we can do this later.
it was still using the binary notation in sexp
we will later add much better tools for knowledge introspection and
visualization, but for now we are blocked by the bap frontend work,
since for that we need a better command-line interface than bap 1.x
provides. But so far it is better than nothing.
@XVilka mentioned this pull request Aug 12, 2019
ivg and others added 6 commits August 13, 2019 11:32
As of Primus 1.x and before this commit, the interprocedural control
flow wasn't following the return stack, but instead ignored the return
statements and always followed the control flow graph.

This had several ramifications:
1) it was fragile, as not all jumps can be properly classified as
returns statically;
2) it didn't work with tail calls;
3) it wasn't fair, because we were misinterpreting a malicious program
as benign.

The new implementation follows the true control flow, i.e., we do not
ignore any jumps and we call the computed destinations. When the
dynamic control flow diverges from the static control flow, we raise
the Cfi_violation exception, which is properly trapped in the same
manner as Division_by_zero and Segmentation_fault.

Implementation
--------------

We use a shadow-stack approach: each interprocedural call which we
expect to return pushes the return value (which we call a prompt) onto
that stack. The continuation linked to the call destination is then
executed, bounded by the pushed prompt. It will stop when it reaches
the pushed prompt (i.e., when control is literally returned to us) or
when the continuation returns normally (e.g., when a Primus Lisp
function or a primitive is called). In either case, we pop the prompt
(if it was pushed), compare the prompt value with the expected one,
and raise a (trappable) exception if they diverge.
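
A toy sketch of the shadow-stack idea (hypothetical names, not the Primus implementation): every call that is expected to return pushes a prompt; a return whose destination does not match the pushed prompt raises a trappable violation.

    exception Cfi_violation of int           (* the offending destination *)

    let shadow : int Stack.t = Stack.create ()

    let on_call ~return_to = Stack.push return_to shadow

    let on_return ~dest =
      match Stack.pop_opt shadow with
      | Some prompt when prompt = dest -> ()          (* control really returned *)
      | Some _ | None -> raise (Cfi_violation dest)   (* dynamic flow diverged   *)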