
BAP 2.0 #960

Merged
merged 390 commits into from Aug 14, 2019

Conversation


@ivg ivg commented Jul 29, 2019

BAP 2.0 Release

The BAP 2.0 Release is inevitable, and starting from this PR we are officially switching to BAP 2.0. The stable release is planned for October, so there is plenty of time to fix bugs and update interfaces before they are frozen. Therefore, comments, reviews, and questions are very welcome. Please keep in mind that this is still work in progress and a lot remains to be done.

What's new?

The BAP 2.0 release brings three main features:

  • new IR
  • knowledge representation framework
  • scalable runtime

Without any doubt, the main contribution of BAP 2.0 is the new representation of program semantics, which is extensible and abstract. This means that BAP no longer depends on BIL or any other (intermediate) language and allows an analyst to pick the representation that best suits their needs. For example, there is a plugin that adds support for the BIL language, and another plugin that seamlessly integrates the old (BAP 0.8) BIL into the BAP framework. There are also plans to add support for Ghidra's Sleigh and other languages in the (near) future. The bottom line is that BIL is no longer the representation, but just one option that can be chosen. Moreover, BAP supports multiple representations at the same time, as well as translation between them.

While the new representation is extensible (i.e., the set of semantic features is not fixed), it comes with an initial theory, called the Core Theory, which covers regular bitvector machines with memories, speculative execution, and floating-point arithmetic.

Under the hood, we use a knowledge representation system that provides a unified representation for knowledge on all scales and levels of program representation, i.e., we use the same mechanisms to represent instructions, subroutines, programs, and even operating systems. The global knowledge representation also enables scalability and extends analysis in BAP from the level of small binaries to the level of operating systems and even higher.

There are also lots of other contributions, like a completely new speculative disassembler, which
is both more conservative and more precise, supports arbitrary delay slots (and other modes of speculative execution), is independent of BIL, and is much more efficient. The full list of changes with detailed descriptions is provided below.

Breaking changes

None.

The old Bap.Std interface, as well as all libraries that constitute the BAP 1.x distribution, are reimplemented and use the new representation under the hood. So why are we bumping the version number? The main reason is that from now on Bap.Std et al. are deprecated and should be used for compatibility reasons only. We will gradually migrate our analyses to the new framework and encourage everyone to use the new representation in newly written analyses.

There is, however, one change that to some extent could be considered a breaking change: the Type.t data type of the BIL representation is extended with one more constructor, Unk, which denotes an expression of unknown type. This small change allowed us to implement the projection of the new (and rich) semantics into the old BAP without breaking any interfaces. A little bit of a cheat, but unless you explicitly model features that are not representable in BIL you will not notice this change, and your old code will work seamlessly.
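
For reference, a schematic sketch of the extended type (the Imm and Mem constructors are shown in a simplified form with stand-in size types; only the Unk constructor is the addition described above):

    (* simplified stand-ins for the actual size types in Bap.Std *)
    type addr_size = [ `r32 | `r64 ]
    type size = [ `r8 | `r16 | `r32 | `r64 ]

    type typ =
      | Imm of int               (* an immediate value of the given width *)
      | Mem of addr_size * size  (* a storage indexed by addresses        *)
      | Unk                      (* an expression with an unknown type    *)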

Besides, we even have a plugin that models floating-point arithmetic in BIL in a soft-float style, so that any analysis that relies on BIL can be applied to floating-point code.

The Detailed Description of New Features

The Knowledge Representation Framework

In BAP 1.x we used the Project.t data structure to eagerly represent all pieces of information that we were able to accumulate about the program, and employed a simple mechanism of passes to build and analyse this structure. This approach had a few problems. First of all, the representation wasn't uniform or really extensible. While some fields were blessed with an explicit representation, like the control flow graph and instructions, others were second-class citizens and were represented as attributes, which were obscure and rather inefficient. Another problem is that in program analysis the properties of a program are usually mutually dependent and in fact form a highly connected dependency graph, which is hard to resolve using a pass-based approach, even with dependencies and appropriate pass management. And finally, this didn't scale, since for large programs eagerly building the Project.t data structure is just infeasible and usually not necessary, as many questions can be answered without requiring a full CFG or semantics for all possible instructions in the binary.

Therefore, it became obvious to us that we need a common representation for different properties of a program, with mutual dependencies and a built-in fixed point. Our representation is a classical frame language, with objects having properties (accessed via slots). We ensure termination and independence from the order of execution of different analyses by requiring that all properties are represented as domains (partially ordered sets) and by preventing non-monotonic updates. We fully embed the knowledge representation runtime in our host language, OCaml, with all the rules properly reflected in the OCaml type system, e.g., properties are fully typed, safe to use, and accessed via a monadic interface (which is now much more convenient after the introduction of the monadic let-syntax into OCaml). If a property is recursive (either directly or via other properties) then a fixed point is computed. To ensure correctness of the computed fixed point (as well as its availability) we restrict all updates to be monotone (w.r.t. the provided partial ordering). Since the same property in program analysis can be computed to different values by different analyses, we also provide a consensus resolution engine, which ensures that a consensus will eventually be reached (and that its result doesn't depend on the order of evaluation).
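
As a rough illustration (a toy sketch with hypothetical names, not the actual Knowledge API), the two rules that make the fixed point well defined can be pictured as follows: every property value lives in a domain with a least element, and a property cell only accepts updates that refine the current value:

    type 'a domain = {
      empty : 'a;                (* the least element: nothing is known yet *)
      leq   : 'a -> 'a -> bool;  (* the partial order on property values    *)
      name  : string;
    }

    type 'a cell = { dom : 'a domain; mutable value : 'a }

    let create dom = { dom; value = dom.empty }

    (* an update is accepted only if it refines the current knowledge, so
       analyses may run in any order and still converge to the same answer *)
    let provide cell x =
      if cell.dom.leq cell.value x then cell.value <- x
      else invalid_arg ("non-monotonic update of " ^ cell.dom.name)

    (* example: a flat domain over an optional instruction length *)
    let length : int option domain = {
      empty = None;
      leq = (fun x y -> x = None || x = y);
      name = "insn-length";
    }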

While our Knowledge Framework is currently implemented in OCaml and can be operated only from OCaml, it is designed to be retargetable, and we're currently investigating other backends that could potentially provide better interoperability and scalability, such as various Datalog implementations and Souffle- or Semmle-like systems.

Further Reading

The Core Theory

The Core Theory uses the new Knowledge Representation Framework to define a set of objects and their properties which are commonly used to express knowledge in program analysis. In other words, it basically defines the language which is used for stating properties in program analysis. But not in the usual way, i.e., not as an AST or a set of classes/constructors, but rather as a set of abstractions, allowing any implementation of those abstractions to fill the role of the representation.

The word theory here should be interpreted as a mathematical theory, i.e., a set of sentences, and in general we base our design largely on the SMT-LIB specification. From this perspective, our Core Theory is a big umbrella theory for smaller and more manageable subtheories, namely:

  • booleans (SMT's Core theory)
  • bitvectors (SMT's BV)
  • memories (SMT's A)
  • floating points (SMT's FP)
  • effects (no SMT's counterpart)

All expressions and statements in the Core Theory are sorted, and their sorts are reflected in the host language type system through a tagless-final embedding. The latter means that the OCaml type checker will ensure that all formulae are well-formed and well-sorted. For example, it is impossible to add a value of bitvector sort to a value of memory sort, and even bitvector and memory sorts are indexed by their sizes, e.g., it isn't possible to apply an arithmetic operation to two bitvectors of different sizes.
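
A hedged sketch of what such an embedding looks like (simplified, not the actual Core Theory signatures): sorts are phantom type parameters, so the OCaml type checker rejects ill-sorted formulae such as adding a memory to a bitvector or mixing bitvector widths:

    type 'w bitv            (* the sort of bitvectors of width 'w            *)
    type ('k, 'v) mem       (* the sort of memories from 'k keys to 'v values *)

    module type CORE = sig
      type 'a term          (* a term of sort 'a *)
      val add   : 'w bitv term -> 'w bitv term -> 'w bitv term
      val load  : ('k, 'v) mem term -> 'k bitv term -> 'v bitv term
      val store : ('k, 'v) mem term -> 'k bitv term -> 'v bitv term -> ('k, 'v) mem term
    end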

The Core Theory is not fixed; it just defines a sane initial theory, and it is possible to extend it with new definitions and even define new sorts. The Core Theory doesn't define the semantics of the language, leaving it to the corresponding analyses. For example, the BIL plugin denotes the semantics of the Core Theory in terms of the BIL language.

In general, any language which provides an implementation of the Core Theory interface can be used interchangeably as a representation. In the Core Theory the translation from an abstract term to a concrete representation is called reification. For example, we can reify a program term into its BIL representation, or into a graph represented as BIR, or into the set of values or destinations that are computed in this term. From the programmer's perspective, implementing the Core Theory interfaces just means providing a suitable representation, in terms of the defined language, for each construct in the Core Theory, e.g.,

let add x y =
  x >>-> fun x ->    (* reify the first operand into its BIL expression *)
  y >>-> fun y ->    (* reify the second operand                        *)
  ret Bil.(x + y)    (* and return their sum as a BIL expression        *)

A translation from a concrete representation (e.g., BIL) to an abstract term is called reflection, but sometimes we stick to the old and traditional name, lifting. It is usually a bit more convoluted, since this direction involves moving from an untyped representation (which potentially includes ill-formed sentences) to a typed one, so one has to satisfy the OCaml typechecker. However, special support is provided for BIL-like languages in the form of a generic parser, so expressing the semantics of a concrete language in terms of the Core Theory usually boils down to simple syntactic rules, e.g.,

  (* map each BIL binary operation to the corresponding Core Theory term *)
  let bitv : type t r. (t,exp,r) bitv_parser =
    fun (module S) -> function
      | BinOp(PLUS,x,y) -> S.add x y
      | BinOp(MINUS,x,y) -> S.sub x y
      | BinOp(TIMES,x,y) -> S.mul x y
      | BinOp(DIVIDE,x,y) -> S.div x y
      ...

Of course, a full bi-directional specification of BIL/BIR is provided as part of this PR. We plan to add the semantic specification of the legacy BIL later.

Further Reading

New Speculative Disassembling Driver

This PR also provides a completely new disassembling driver, written from scratch (and all existing disassemblers are rewritten to use it). The driver uses the new Knowledge Framework and builds the whole-program CFG incrementally, storing only the minimum of information in process memory. It can be used to implement disassemblers of different flavors, e.g., linear, recursive descent, probabilistic, and superset disassemblers.

It is implemented as a recursive descent parser with arbitrary backtracking and is capable of automatically refuting instruction chains which are not terminated by jumps, thus seriously reducing the generated CFG in speculative and superset modes.

More importantly, the disassembler is designed to work incrementally and build the CFG on the fly, so that we can analyze huge binaries, processing each subroutine independently. This also relies on the new CFG reconstruction and function boundary identification algorithm described below.

Further reading

New CFG reconstruction algorithm

The task of partitioning the whole-program CFG into a quotient set of subroutine control flow graphs is now well-defined via the definition of the control flow graph of a subroutine, which requires each node of the graph to be reachable from, and dominated by, the entry node. Therefore, we build a dominator tree (forest), where nodes which are not dominated by any other node (beyond the pseudo entry node) are function starts, and the children of a function start node denote the body of the function.

We use our fixpoint function to compute the fixed point of the domination relationship, which also allows us to compute the set of functions incrementally: any newly discovered edges can be added to the fixed-point equation, and we will continue stabilization from the point where we stopped.
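
A minimal self-contained sketch of the idea (an ad hoc worklist over integer node ids, not the actual BAP implementation, which runs the same monotone computation through the fixpoint function over the knowledge base): dominator sets are refined by intersection until stabilization, and a node whose only strict dominator is the pseudo entry is reported as a function start.

    module Ints = Set.Make (Int)

    (* Dom(entry) = {entry}; Dom(n) = {n} ∪ ⋂ Dom(p) over the predecessors of n *)
    let dominators ~entry ~nodes ~preds =
      let all = Ints.of_list nodes in
      let dom = Hashtbl.create 16 in
      List.iter (fun n ->
          Hashtbl.replace dom n
            (if n = entry then Ints.singleton entry else all))
        nodes;
      let changed = ref true in
      while !changed do
        changed := false;
        List.iter (fun n ->
            if n <> entry then
              let meet = match preds n with
                | [] -> Ints.empty
                | p :: ps ->
                  List.fold_left
                    (fun acc q -> Ints.inter acc (Hashtbl.find dom q))
                    (Hashtbl.find dom p) ps in
              let dom' = Ints.add n meet in
              if not (Ints.equal dom' (Hashtbl.find dom n)) then begin
                Hashtbl.replace dom n dom';
                changed := true
              end)
          nodes
      done;
      dom

    (* a node starts a function if it is dominated only by itself and the
       pseudo entry node *)
    let function_starts ~entry ~nodes ~preds =
      let dom = dominators ~entry ~nodes ~preds in
      List.filter (fun n ->
          n <> entry &&
          Ints.equal (Hashtbl.find dom n) (Ints.of_list [entry; n]))
        nodes

Since adding an edge can only shrink the intersected dominator sets, stabilization can resume from the current state after new edges are discovered, which is what makes the incremental mode possible.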

Further Reading

New Bitvectors Library

Since all terms in the Core Theory are well-typed, it is no longer required to store the types of bitvectors in their bodies as tags. Therefore, we switched to a tagless representation of bitvectors, which is much more memory- and CPU-efficient. We also changed the semantics of bitvector operations to fully satisfy the SMT-LIB behavior, especially with respect to division by zero. Now all operations on bitvectors are total and well-defined.
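
For illustration, a hedged sketch (not the actual Bitvec interface) of what "total and well-defined" means here, with the modulus m = 2^w - 1 acting as the all-ones mask of a w-bit machine and the division cases following SMT-LIB:

    let add  ~m x y = (x + y) land m
    let sub  ~m x y = (x - y) land m
    let udiv ~m x y = if y = 0 then m else x / y           (* SMT-LIB: x div 0 = all ones *)
    let urem ~m x y = if y = 0 then x land m else x mod y  (* SMT-LIB: x rem 0 = x        *)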

We provide the old Bap_bitvector library as a shim over the new implementation for backward compatibility.

Further reading

Primus 2.0 and BAP Lisp

While we still retain the old Primus (which fully relies on BIR and the Project data structure), we also provide Primus 2.0, which is much more flexible and easier to use, but not yet as complete as Primus 1.x.

In the new Primus we are moving away from higher-order polymorphism (which actually proved unnecessary, as all analyses were still parametrized by the identity monad). In other words, components are no longer functors, but plain old functions.

The observation interface is tweaked so that we no longer use tuples to pass multiple arguments; instead, a curried notation is used, which greatly improves performance (no more allocations when an observation is made).
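
To illustrate the allocation argument (hypothetical names, not the Primus 2.0 interface): with a tupled observation the provider must allocate a pair on every event, while with the curried form the arguments are passed directly:

    type addr = int
    type word = int

    (* tupled: a fresh pair is allocated for every observation *)
    let notify_tupled  (observer : addr * word -> unit) a w = observer (a, w)

    (* curried: the arguments are passed directly, nothing is allocated *)
    let notify_curried (observer : addr -> word -> unit) a w = observer a w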

The Project data structure is no longer used (in fact, Primus 2.0 doesn't depend on Bap.Std at all); instead, the knowledge monad is transfused into the Primus monad.

Finally, we're developing a new BAP Lisp language which will be independent of Primus and will be able to store any semantics, and a new Primus Interpreter which will be extensible and capable of interpreting BAP Lisp.

Further Reading

Acknowledgments

I would like to thank all the contributors, especially Rafael Turner and Oleg Kamenkov, for their impressive work on floating-point modelling.

Questions

Please do not hesitate to ask questions, leave your comments and suggestions, and share your fears here. We will be happy to address them all. Keep in mind that nothing is released yet, so now is the best time to ask for changes :)

P.S. There are many more new features in this PR; I just didn't have enough time to go through all of them. I will update this message as necessary. There are also lots of bug fixes, which we should probably backport into 1.7.

Missing Features

  • Project serialization - it is currently disabled, so there is no caching. This is the number-one priority and will be provided very soonish.
  • Primus 2.0 with the BAP Lisp Interpreter
  • New frontend - will operate in both 2.x and 1.x modes.

gitoleg and others added 30 commits January 22, 2019 10:10
Based on an approximation of sqrt(x + 1) with Remez coefficients
the interface design is inspired (or, more correctly, enforced) by the
current optimization capabilities of OCaml; the end goal is that if
a bitvector operation can be computed without memory allocations,
then it should be, even if the width is not known in advance.

this is still a wip
didn't provide the specializations, probably will add them later.
Since we're going to use Bitvec as our next new Bitvector
implementation and since we don't want to alias it to Word or
Addr anymore (because it would introduce confusion with the Bap.Std.Word
and Bap.Std.Addr modules as well as significantly slow down
compilation), we need a shorter name. However, Bitv is already used
in the opam universe, and Bv is probably too short. So a middle
ground was chosen, which actually matches the SMT-LIB name for
bitvectors.

We also added two sublibraries along with submodules which enable
support for bitvectors in the Janestreet suite of libraries.
- Bitvec_order provides support for the interfaces that depend on
  ordering, like maps, sets, comparables, etc.
- Bitvec_binprot provides support for bin_io

Together they provide functions that make Bitvec.t a first-class
citizen of the JS Core libraries. We're moving away from the old JS style
of providing a full Comparable or Regular interface for each module,
since it bloats the binaries and, more importantly, significantly
slows down the compilation. We are now relying on small interfaces
such as comparators, so now instead of `Bitvec.Set.t` we will
use `(Bitvec.t, bitvec_order) Set.t` and instead of `Bitvec.Set.Empty`
we will use `Set.empty bitvec_order`

See the module documentation for more information.
so BAP can depend on the core theory now.
also made core theory less heavy.
Since many use cases of the Bitvec interface will use the generic
operations with a modulus created on the fly from a width represented
as an integer, having it boxed would have severe performance
ramifications.

So, instead of representing the modulus as a pair of the width and the
modulus itself (i.e., 2^w-1 and w), we have to decide on one or the other.

We decided to use 2^w-1 as the modulus representation as this is the
value that is used in normalization, so computing it every time will
add two extra operations which are not [@@noalloc]ed.

The ramification of this choice is that we still need to go back from
2^w-1 to w. The direct solution is to count bits, which is rather
efficient already. For the [msb] operation we don't really need to
know the width to test whether the msb is set, since

     msb x m ::= (m = m lsr 1 lor x) mod m

(though I'm still not sure which of the two will be more efficient,
[test_bit (num_bits m)] or the implementation above, but both are
rather efficient)
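
One possible reading of that test (a hedged sketch, assuming m = 2^w - 1 and x already normalized to the range 0..m): the most significant bit of x is set exactly when OR-ing x with the lower bits of the mask fills the whole mask.

    let msb x m = (x lor (m lsr 1)) = m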

We also reimplemented the shift operations. They no longer depend
on the signed extract and are implemented relying purely on bitwise
operations. We use `num_bits` to get the width there.
many of our packages were dependent on bap only indirectly
the subpackages won't work with separate installation, since they have
to share the same META file, which will break Dune and is in general
fragile.
…rs. This new casting operation is used to extract xmm registers from ymm registers in order to support monomorphic functions in sse. The SysV ABI is updated to recognize functions that read and write to xmm0.
… sin function and evaluate user-defined polynomials using Horner's method. Current example only computes a geometric series.
ivg added 5 commits August 8, 2019 11:36
and loads it in the bil_main, so that later we can add a command line
option which will select an appropriate semantics.
although the BIL semantics is now self-optimizing, it is still missing
some global optimizations (like elimination of useless assignments,
a.k.a. value propagation) which prevents our tests from working. We
will later disable the optimization passes, once our Bil denotational
optimizer handles such cases.
it needs an architecture; if none is present in the context, then it
will just translate jumps to specials.

ivg commented Aug 8, 2019

finally, all tests pass!

1) The Value.pp_slots function didn't work at all;
   it was assert-faulting because the sexp_of_f function
   of Dict wasn't really implemented.

2) The IR printers weren't handling the empty filter as
   "print everything", as they should according to the documentation.

Also, this adds the instruction destinations inspector.

ivg commented Aug 8, 2019

OK, that should be all; it is good to merge, unless something else is found.
I will now be working on the new frontend, which will go in a separate PR.

@@ -0,0 +1,6 @@

looks like this file is dead and should be removed

ivg added 4 commits August 8, 2019 15:36
btw, it was a good idea to add useful small primitives as a syntax
module, but we can do this later.
it was still using the binary notation in sexp
we will later add much better tools for knowledge introspection and
visualization, but for now we are blocked by the bap frontend work,
since for that we need a better command-line interface than bap 1.x
provides. But so far it is better than nothing.
@XVilka mentioned this pull request Aug 12, 2019
ivg and others added 6 commits August 13, 2019 11:32
As of Primus 1.x and before this commit, the interprocedural control
flow wasn't following the return stack, but instead ignored the return
statements and always followed the control flow graph.

This had several ramifications:
1) it was fragile, as not all jumps can be properly classified as
returns statically;
2) it didn't work with tail calls;
3) it wasn't fair, because we were misinterpreting a malicious program
as benign.

The new implementation follows the true control flow, i.e., we do not
ignore any jumps and we call the computed destinations. When the
dynamic control flow diverges from the static control flow, we raise
the Cfi_violation exception, which is properly trapped in the same
manner as Division_by_zero and Segmentation_fault.

Implementation
--------------

We use a shadow-stack approach: each interprocedural call which we
expect to return pushes the return value (which we call a prompt) onto
that stack. The continuation linked to the call destination is then
executed, bounded by the pushed prompt. It will stop when it reaches
the pushed prompt (i.e., when control is literally returned to us) or
when the continuation returns normally (e.g., when a Primus Lisp
function or a primitive is called). In either case, we pop the prompt
(if it was pushed), compare the prompt value with the expected one,
and raise a (trappable) exception if they diverge.
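
A toy sketch of the shadow-stack idea (hypothetical names, not the Primus implementation): every call that is expected to return pushes a prompt; a return whose destination does not match the pushed prompt raises a trappable violation.

    exception Cfi_violation of int           (* the offending destination *)

    let shadow : int Stack.t = Stack.create ()

    let on_call ~return_to = Stack.push return_to shadow

    let on_return ~dest =
      match Stack.pop_opt shadow with
      | Some prompt when prompt = dest -> ()          (* control really returned *)
      | Some _ | None -> raise (Cfi_violation dest)   (* dynamic flow diverged   *)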