BAP 2.0 #960
Merged
Conversation
Based on an approximation of sqrt(x + 1) with Remez coefficients
The interface design is inspired (or, more correctly, enforced) by the current optimization capabilities of OCaml. The end goal is that if a bitvector operation can be computed without memory allocations, then it should be, even if the width is not known in advance. This is still a WIP.
Didn't provide the specializations; will probably add them later.
Since we're going to use Bitvec as our next Bitvector implementation, and since we don't want to alias it to Word or Addr anymore (because that would introduce confusion with the Bap.Std.Word and Bap.Std.Addr modules, as well as significantly slow down compilation), we need a shorter name. However, Bitv is already used in the opam universe, and Bv is probably too short. So a middle ground was chosen, which also matches the SMT-LIB name for bitvectors. We also added two sublibraries, along with submodules, which enable support for bitvectors in the Janestreet suite of libraries:
- Bitvec_order provides support for the interfaces that depend on ordering, like maps, sets, comparables, etc.
- Bitvec_binprot provides support for bin_io
Together they provide the functions that make Bitvec.t a first-class citizen of the JS Core libraries. We're moving away from the old JS style of providing a full Comparable or Regular interface for each module, since it bloats the binaries and, more importantly, significantly slows down compilation. We now rely on small interfaces such as comparators, so instead of `Bitvec.Set.t` we will use `(Bitvec.t, bitvec_order) Set.t`, and instead of `Bitvec.Set.empty` we will use `Set.empty bitvec_order` (see the sketch below). See the module documentation for more information.
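A minimal sketch of that style, assuming `bitvec_order` is the first-class comparator exposed by the Bitvec_order library as described above (the exact module that has to be opened to bring it into scope is an assumption here):

```ocaml
open Core_kernel

(* an empty set of bitvectors, built from the lightweight comparator
   instead of a dedicated Bitvec.Set module *)
let roots : (Bitvec.t, bitvec_order) Set.t = Set.empty bitvec_order

(* adding elements works through the ordinary Core set interface *)
let add_root (r : Bitvec.t) roots = Set.add roots r
```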
at least an attempt
So BAP can depend on the Core Theory now. Also made the Core Theory less heavy.
Since many use cases of the Bitvec interface will use the generic operations with a modulus created on the fly from the integer-represented width, having it boxed would have severe performance ramifications. So, instead of representing the modulus as a pair of the width and the modulus proper (i.e., 2^w-1 and w), we have to decide on one or the other. We decided to use 2^w-1 as the modulus representation, since this is the value used in normalization, so computing it every time would add two extra operations which are not [@@noalloc]ed. The ramification of this choice is that we still need a way to go back from 2^w-1 to w. The direct solution is to count bits, which is rather efficient already. For the [msb] operation we don't really need to know the width to test whether the msb is set, since msb x m ::= (m = m lsr 1 lor x) mod m (though I'm still not sure which will be more efficient, the [test_bit (num_bits m)] approach or the implementation above, but both are rather efficient). We also reimplemented the shift operations. They no longer depend on the signed extract and are implemented purely in terms of bitwise operations. We use `num_bits` to get the width there.
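For illustration, here is a plain-integer sketch of the width-recovery idea described above; `num_bits` and `msb` are illustrative helpers written over OCaml's native int, not the actual Bitvec implementation:

```ocaml
(* the modulus is stored as m = 2^w - 1, so the width w is the number of
   significant bits in m *)
let num_bits m =
  let rec loop n m = if m = 0 then n else loop (n + 1) (m lsr 1) in
  loop 0 m

(* the msb of x (within the width encoded by m) is set iff bit (w - 1) is set *)
let msb x m = x land (1 lsl (num_bits m - 1)) <> 0
```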
many of our packages were dependent on bap only indirectly
The subpackages won't work with separate installation, since they have to share the same META file; it would break Dune and is in general fragile.
Bap elementary
…rs. This new casting operation is used to extract xmm registers from ymm registers in order to support monomorphic functions in sse. The SysV ABI is updated to recognize functions that read and write to xmm0.
… sin function and evaluate user-defined polynomials using Horner's method. Current example only computes a geometric series.
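For reference, a generic sketch of Horner's rule as mentioned above, written over plain OCaml floats (the actual BAP code evaluates the polynomial over Core Theory floating-point terms; the helper names here are illustrative):

```ocaml
(* evaluate c0 + x*(c1 + x*(c2 + ...)) given coefficients [c0; c1; c2; ...] *)
let horner coeffs x =
  List.fold_right (fun c acc -> c +. x *. acc) coeffs 0.

(* a geometric series 1 + x + x^2 + x^3 is the polynomial whose
   coefficients are all 1 *)
let geometric x = horner [1.; 1.; 1.; 1.] x
```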
and loads it in the bil_main, so that later we can add a command line option which will select an appropriate semantics.
Although the BIL semantics is now self-optimizing, it is still missing some global optimizations (like elimination of useless assignments, aka value propagation), which prevents our tests from working. We will later disable the optimization passes, once our BIL denotational optimizer handles such cases.
It needs an architecture; if none is present in the context, then it will just translate jumps to specials.
Finally, all tests pass!
1) The Value.pp_slots function didn't work at all; it was assert-faulting because the sexp_of_f function of Dict wasn't really implemented. 2) The IR printers weren't handling the empty filter as "print everything", as they should according to the documentation. Also, adds the instruction destinations inspector.
OK, that should be all; it is good to merge, unless something else is found.
ivg
commented
Aug 8, 2019
looks like this file is dead and should be removed
btw, it was a good idea to add useful small primitives as a syntax module, but we can do this later.
it was still using the binary notation in sexp
We will later add much better tools for knowledge introspection and visualization, but so far we are blocked by the bap frontend work, since for that we need a better command-line interface than bap 1.x provides. But for now it is better than nothing.
As of Primus 1.x and before this commit, the interprocedural control flow wasn't following the return stack; instead it ignored the return statements and always followed the control flow graph. This had several ramifications: 1) it was fragile, since not all jumps can be properly classified as returns statically; 2) it didn't work with tail calls; 3) it wasn't fair, because we were misinterpreting malicious programs as benign. The new implementation follows the true control flow, i.e., we no longer ignore any jumps and we call the computed destinations. In case the dynamic control flow diverges from the static control flow, we raise the Cfi_violation exception, which is properly trapped in the same manner as Division_by_zero and Segmentation_fault.

Implementation
--------------
We use a shadow stack approach: each interprocedural call which we expect to return pushes the return value (which we call a prompt) onto that stack. The continuation linked to the call destination is then executed, bounded by the pushed prompt. It will stop when it reaches the pushed prompt (i.e., when control is literally returned to us) or when the continuation returns normally (e.g., when a Primus Lisp function or primitive is called). In either case, we pop the prompt (if it was pushed), compare its value with the expected one, and raise a (trappable) exception if they diverge.
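A schematic sketch of that shadow-stack check in plain OCaml (an illustration of the idea only, not the actual Primus machine code; all names are made up for the example):

```ocaml
exception Cfi_violation

(* push the expected return address (the prompt), run the callee, and on
   return check that the address we actually came back to matches the
   pushed prompt, raising Cfi_violation otherwise *)
let call shadow_stack ~return ~run =
  shadow_stack := return :: !shadow_stack;
  let returned_to = run () in
  match !shadow_stack with
  | prompt :: rest ->
    shadow_stack := rest;
    if returned_to <> prompt then raise Cfi_violation
  | [] -> raise Cfi_violation
```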
gitoleg
approved these changes
Aug 14, 2019
BAP 2.0 Release
The BAP 2.0 release is inevitable, and starting from this PR we are officially switching to BAP 2.0. The stable release is planned for October, so there is plenty of time to fix bugs and update interfaces before they are frozen. Therefore, comments, reviews, and questions are very welcome. Please keep in mind that this is still work in progress, and a lot remains to be done.
What's new?
The BAP 2.0 release brings three main features:
Without any doubt, the main contribution of BAP 2.0 is the new representation of program semantics, which is made extensible and abstract. This means that BAP no longer depends on BIL or any other (intermediate) language and allows an analyst to pick the representation that suits their needs best. For example, there is a plugin that adds support for the BIL language. And we have another plugin that seamlessly integrates old (BAP 0.8) BIL into the BAP framework. There are also plans to add support for Ghidra's Sleigh and other languages in the (near) future. The bottom line is that BIL is no longer the representation, but just one option that can be chosen. Moreover, BAP supports multiple representations at the same time, as well as translation between them.
While the new representation is extensible (i.e., the set of semantic features is not fixed), it comes with an initial theory, called the Core Theory, which covers regular bitvector machines with memories, speculative execution, and floating-point arithmetic.
Under the hood, we use a knowledge representation system that provides a unified representation for knowledge at all scales and levels of program representation, i.e., we use the same mechanisms to represent instructions, subroutines, programs, and even operating systems. The global knowledge representation also enables scalability and extends analysis in BAP from the level of small binaries to the level of operating systems and even higher.
There are also lots of other contributions, like a completely new speculative disassembler, which is both more conservative and more precise, supports arbitrary delay slots (and other modes of speculative execution), is independent of BIL, and is much more efficient. The full list of changes with detailed descriptions will be provided below.
Breaking changes
None.
The old `Bap.Std` interface, as well as all libraries that constitute the BAP 1.x distribution, are reimplemented and use the new representation under the hood. So why are we bumping the version number? The main reason is that `Bap.Std` et al. are now deprecated and should be used for compatibility reasons only. We will gradually migrate our analyses to the new framework and encourage everyone to use the new representation in newly written analyses.

There is, however, one change that to some extent could be considered a breaking change: the `Type.t` data type of the BIL representation is extended with one more constructor, `Unk`, which denotes an expression with unknown type. This small change allowed us to implement the projection of the new (and rich) semantics into old BAP without breaking any interfaces. A little bit of a cheat, but unless you explicitly model features that are not representable in BIL, you will not notice this change and your old code will work seamlessly.

Besides, we even have a plugin that models floating-point arithmetic in BIL in a soft-float style, so that any analysis that relies on BIL can be applied to floating-point code.
The Detailed Description of New Features
The Knowledge Representation Framework
In BAP 1.x we were using the `Project.t` data structure to eagerly represent all pieces of information that we were able to accumulate about the program, and employed a simple mechanism of passes to build and analyse this structure. This approach had a few problems. First of all, the representation wasn't uniform or really extensible. While some fields were blessed with an explicit representation, like the control flow graph and instructions, others were second-class citizens and were represented as attributes, which were obscure and rather inefficient. Another problem is that in program analysis the properties of a program are usually mutually dependent and in fact form a highly connected dependency graph, which is hard to resolve using a pass-based approach, even with dependencies and appropriate pass management. And finally, this didn't scale, since for large programs eagerly building the `Project.t` data structure is just infeasible and usually not necessary, as many questions can be answered without requiring a full CFG or semantics for all possible instructions in the binary.

Therefore, it became obvious to us that we need a common representation for different properties of a program with mutual dependencies and a built-in fixed point. Our representation is a classical frame language, with objects having properties (accessed via slots). We ensure termination and independence from the order of execution of different analyses by requiring that all properties are represented as domains (partially ordered sets) and by preventing non-monotonic updates. We fully embed the knowledge representation runtime in our host language, OCaml, with all the rules properly reflected in the OCaml type system; e.g., properties are fully typed, safe to use, and can be accessed via a monadic interface (which is now much more convenient, after the introduction of the monadic let-syntax into OCaml). If a property is recursive (either directly or via other properties) then a fixed point is computed. To ensure correctness of the computed fixed point (as well as its availability), we restrict all updates to be monotone (with respect to the provided partial ordering). Since the same property in program analysis can be computed to different values by different analyses, we also provide a consensus resolution engine, which ensures that a consensus will eventually be reached (and that its result doesn't depend on the order of evaluation).
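A rough sketch of what this looks like from OCaml, written against the bap-knowledge library; treat the exact names and signatures as assumptions, this is an illustration of the frame-style interface rather than documentation of the final API:

```ocaml
open Bap_knowledge
module KB = Knowledge

(* a class of objects and a property (slot) attached to it; the property's
   values form a domain, here the flat domain of optional strings *)
type entity
let entity : (entity, unit) KB.cls = KB.Class.declare "entity" ()

let name : (entity, string option) KB.slot =
  KB.Class.property entity "name" @@
  KB.Domain.optional "name" ~equal:String.equal

(* knowledge is provided and collected through the knowledge monad *)
let example : string option KB.t =
  let open KB.Syntax in
  KB.Object.create entity >>= fun obj ->
  KB.provide name obj (Some "main") >>= fun () ->
  KB.collect name obj
```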
While our Knowledge Framework is currently implemented in OCaml and can only be used from OCaml, it is designed to be retargetable, and we're currently investigating other backends which could potentially provide better interoperability and scalability, such as different Datalog implementations, Souffle, or Semmle-like systems.
Further Reading
The Core Theory
The Core Theory uses the new Knowledge Representation Framework to define a set of objects and their properties which are commonly used to express knowledge in program analysis. In other words, it basically defines the language which is used for stating properties in program analysis. But not in the usual way, i.e., not as an AST or a set of classes/constructors, but rather as a set of abstractions, allowing any implementation of those abstractions to fill the role of the representation.
The word theory here should be interpreted as a mathematical theory, i.e., a set of sentences, and in general we base our design heavily on the SMT-LIB specification. From this perspective, our Core Theory is a big umbrella theory for smaller and more manageable subtheories.
All expressions and statements in the Core Theory are sorted, and their sorts are reflected in the host language type system through a tagless-final embedding. The latter means that the OCaml type checker will ensure that all formulae are well-formed and well-sorted. For example, it is impossible to add a value of the bitvector sort to a value of the memory sort, and the bitvector and memory sorts are even indexed with their sizes, so it isn't possible to apply an arithmetic operation to two bitvectors of different sizes.
The Core Theory is not fixed; it just defines a sane initial theory, and it is possible to extend it with new definitions and even define new sorts. The Core Theory doesn't define the semantics of the language, leaving that to the corresponding analyses. For example, the BIL plugin denotes the semantics of the Core Theory in terms of the BIL language.
In general, any language which provides an implementation of the Core Theory interface can be used interchangeably as a representation. In the Core Theory, the translation from an abstract term to a concrete representation is called reification. For example, we can reify a program term into its BIL representation, or into a graph represented as BIR, or into a set of values or destinations that are computed in this term. From the programmer's perspective, implementing the Core Theory interface just means providing a suitable representation, in terms of the defined language, for each construct in the Core Theory, for example:
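A hedged, self-contained sketch of the tagless-final idea (module and function names here are simplified stand-ins, not the actual BAP Theory interface): sorts are phantom types, and a concrete implementation, here a pretty-printer, reifies abstract terms into its own representation.

```ocaml
(* two sorts, used only as phantom type indices *)
type bitv
type bool_

(* a simplified Core-Theory-like signature: terms are indexed by their sort *)
module type CORE = sig
  type 'sort exp
  val int : int -> bitv exp
  val add : bitv exp -> bitv exp -> bitv exp
  val eq  : bitv exp -> bitv exp -> bool_ exp
end

(* reification into a concrete representation: here, plain strings *)
module Pretty : CORE with type 'a exp = string = struct
  type 'a exp = string
  let int = string_of_int
  let add x y = Printf.sprintf "(%s + %s)" x y
  let eq  x y = Printf.sprintf "(%s = %s)" x y
end

(* a well-sorted formula; an ill-sorted one, e.g. [add (int 1) (eq ...)],
   would be rejected by the OCaml type checker *)
module Example (C : CORE) = struct
  let formula = C.(eq (add (int 1) (int 2)) (int 3))
end

let () = print_endline (let module E = Example (Pretty) in E.formula)
```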
A translation from a concrete representation (e.g., BIL) to an abstract term is called reflection, though sometimes we stick to the old and traditional name, lifting. It is usually a bit more convoluted, since this direction involves moving from an untyped representation (which potentially includes ill-formed sentences) to a typed one, so one has to satisfy the OCaml typechecker. However, special support is provided for BIL-like languages in terms of a generic parser, so expressing the semantics of a concrete language in terms of the Core Theory usually distills to simple syntactic rules, for example:
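A hedged sketch of such a syntactic rule, again over made-up names (the real lifter goes through the generic parser mentioned above): a toy untyped expression type is mapped constructor-by-constructor onto an abstract Core-like structure.

```ocaml
(* a toy untyped, BIL-like concrete syntax *)
type exp = Int of int | Add of exp * exp

(* lifting: each syntactic form maps to the corresponding abstract
   constructor of a (simplified) Core-like structure *)
module Lift (C : sig
    type 'sort t
    type bitv
    val int : int -> bitv t
    val add : bitv t -> bitv t -> bitv t
  end) = struct
  let rec exp : exp -> C.bitv C.t = function
    | Int x -> C.int x
    | Add (x, y) -> C.add (exp x) (exp y)
end
```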
Of course, a full bi-directional specification of BIL/BIR is provided as a part of this PR. We are planning later to add the semantic specification of the legacy BIL.
Further Reading
New Speculative Disassembling Driver
This PR also provides a completely new, written-from-scratch driver for disassembling (and all existing disassemblers are rewritten to use this driver). The driver uses the new Knowledge Framework and builds the whole-program CFG incrementally, storing only the minimum of information in process memory. It can be used to implement disassemblers of different flavors, e.g., linear, recursive descent, probabilistic, and superset disassemblers.
It is implemented as a recursive descent parser with arbitrary backtracking and is capable of automatically refuting instruction chains that are not terminated by jumps, thus seriously reducing the generated CFG in speculative and superset modes.
More importantly, the disassembler is designed to work incrementally and build the CFG on the fly, so that we can analyze huge binaries, handling each subroutine independently. This also relies on the new CFG reconstruction and function boundaries algorithm described below.
Further reading
New CFG reconstruction algorithm
The task of partitioning the whole-program CFG into a quotient set of subroutine control flow graphs is now well-defined by the definition of the control flow graph of a subroutine, which requires each node of the graph to be reachable from, and dominated by, the entry node. Therefore, we build a dominator tree (forest) in which nodes that are not dominated by any other node (beyond the pseudo-entry node) are function starts, and the children of a function start node constitute the body of the function.
We use our fixpoint function to compute the fixed point of the domination relationship, which also allows us to compute the set of functions incrementally: any newly discovered edges can be added to the fixed point equation, and we continue stabilization from the point where we stopped.
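A small, self-contained sketch of the partitioning idea (not BAP's actual implementation, which works incrementally inside the knowledge base, and assuming every node is reachable from the pseudo entry): dominators are computed with the classic iterative data-flow equations, and a node is taken as a function start when its only dominators are itself and the pseudo entry.

```ocaml
module Ints = Set.Make (Int)

(* iterative dominator computation over a graph given by its node list and
   a predecessors function; [entry] is the pseudo-entry node *)
let dominators ~entry ~nodes ~preds =
  let all = Ints.of_list nodes in
  let dom = Hashtbl.create 16 in
  List.iter (fun n ->
      Hashtbl.replace dom n
        (if n = entry then Ints.singleton entry else all))
    nodes;
  let changed = ref true in
  while !changed do
    changed := false;
    List.iter (fun n ->
        if n <> entry then
          let inter =
            List.fold_left (fun acc p -> Ints.inter acc (Hashtbl.find dom p))
              all (preds n) in
          let dom' = Ints.add n inter in
          if not (Ints.equal dom' (Hashtbl.find dom n)) then begin
            Hashtbl.replace dom n dom';
            changed := true
          end)
      nodes
  done;
  dom

(* a function start is dominated only by itself and the pseudo entry *)
let function_starts ~entry ~nodes ~preds =
  let dom = dominators ~entry ~nodes ~preds in
  List.filter (fun n ->
      n <> entry && Ints.cardinal (Hashtbl.find dom n) <= 2)
    nodes
```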
Further Reading
New Bitvectors Library
Since all terms in the Core Theory are well-typed, it is no longer required to store the types of bitvectors in their bodies as tags. Therefore, we switched to a tagless representation of bitvectors, which is much more memory- and CPU-efficient. We also changed the semantics of bitvector operations to fully match the SMT-LIB behavior, especially with respect to division by zero. Now all operations on bitvectors are total and well-defined.
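As an illustration of that totality, here is what the SMT-LIB convention for unsigned division by zero looks like, sketched over plain integers (a sketch of the convention only, not the Bitvec API):

```ocaml
(* under SMT-LIB semantics, unsigned division by zero yields the all-ones
   value of the given width, so division is total *)
let udiv ~width x y =
  if y = 0 then (1 lsl width) - 1 else x / y

let () = assert (udiv ~width:8 42 0 = 255)
```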
We provide the old Bap_bitvector library as a shim over the new implementation for backward compatibility.
Further reading
Primus 2.0 and BAP Lisp
While we still retain the old Primus (which fully relies on BIR and the Project data structure), we also provide Primus 2.0, which is much more flexible and easier to use, but not yet as complete as Primus 1.x.
In the new Primus we are moving away from higher-order polymorphism (which actually proved unnecessary, as all analyses were still parametrized by the identity monad). In other words, components are no longer functors, but plain old functions.
The observation interface is tweaked so that we no longer use tuples to pass multiple arguments; a curried notation is used instead, which greatly improves performance (no more allocations when an observation is made).
The project data structure is no longer used (in fact, Primus 2.0 doesn't depend on Bap.Std at all); instead, the knowledge monad is transfused into the Primus monad.
Finally, we're developing a new BAP Lisp language which will be independent of Primus and will be able to store any semantics, and a new Primus Interpreter which will be extensible and capable of interpreting BAP Lisp.
Further Reading
Acknowledgments
I would like to thank all the contributors, especially Rafael Turner and Oleg Kamenkov for their impressive work on floating point modelling.
Questions
Please do not hesitate to ask questions, leave your comments and suggestions, and share your fears here. We will be happy to address them all. Keep in mind that nothing is released yet, and therefore now is the best time to ask for changes :)
P.S. There are many more new features in this PR; I just didn't have enough time to go through all of them. I will update this message as necessary. There are also lots of bug fixes, which we should probably backport into 1.7.
Missing Features