enables multiple projects in the same knowledge base #1119

ivg · 2020-06-11T18:43:22Z

This PR paves the way to a global knowledge base that will store information about several binaries (executables, libraries, object files, etc.). It also brings a few new functional and convenience features. A quick summary is below:

enables multiple projects in the same knowledge base
adds scoping of the procedures stored in the knowledge base
introduces a new knowledge base class for code units
extracts more information about binaries from the llvm loaders
introduces unit biasing, fixes #1126 (the base address is not respected by many components)
scopes information sources to the corresponding paths
renames the bap.std package to bap
adds the `bap compare` command together with the new collators extension point
introduces rules for introspecting procedures stored in the knowledge base
joins symbolizers via the new possible-name property to collect all the aliases
moves logging and event facilities to the bap-main library
adds more parsers to the Bitvec module
updates all information sources to enable their cooperation with more than one file
rewrites the objdump plugin, which is now also able to provide an address from a name
provides extensible abstractions for language, architecture, abi, etc
publishes and extends the Toplevel module
refines the project documentation

New features

The `compare` command and `callgraph-collator`

This new command showcases the new ability to store multiple files, identical or different, in the knowledge base. It comes with a new extension point called collators (sorry, the compare and comparator names are already too busy in OCaml). A collator is an object that takes two projects and compares them to its taste. Its interface is designed in such a way that, when we compare several files, we never need to keep more than one project in resident memory or to store information about more than two projects. As a demonstration, we implemented a collator that compares N binaries by comparing their callgraphs.
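The streaming shape of a collator can be sketched without any BAP dependency. The snippet below is a toy model and not the interface added by this PR: the `collator` record, its `prepare`, `collate`, and `summary` fields, the `run` driver, and the representation of a project as a list of call edges are all assumptions made for illustration.

```ocaml
(* A toy model of a collator; none of these names are BAP's API.
   A "project" is modeled as the list of call edges of its callgraph. *)
module Edges = Set.Make (struct
    type t = string * string
    let compare = compare
  end)

type project = (string * string) list

(* ['s] is the state that the collator keeps between projects. *)
type 's collator = {
  prepare : project -> 's;               (* called once, on the base project *)
  collate : int -> 's -> project -> 's;  (* called on each following project *)
  summary : 's -> string;
}

(* Feeds lazily produced projects to the collator one at a time, so
   only the state and the current project are alive at any moment. *)
let run c (projects : project Seq.t) =
  match projects () with
  | Seq.Nil -> None
  | Seq.Cons (base, rest) ->
    let _, state =
      Seq.fold_left (fun (i, s) p -> (i + 1, c.collate i s p))
        (1, c.prepare base) rest in
    Some (c.summary state)

(* Compares callgraphs: remembers the base edges and the indices of
   the projects whose edge sets differ from the base. *)
let callgraph = {
  prepare = (fun p -> (Edges.of_list p, []));
  collate = (fun i (base, diffs) p ->
      if Edges.equal base (Edges.of_list p) then (base, diffs)
      else (base, i :: diffs));
  summary = (fun (_, diffs) ->
      match List.rev diffs with
      | [] -> "all callgraphs match the base"
      | ds -> "differ from the base: " ^
              String.concat ", " (List.map string_of_int ds));
}
```

Because `run` consumes a `Seq.t`, each project can be built on demand and dropped once it has been collated; the state only ever holds what was extracted from the base project and the running result.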

Knowledge Base Rules and their introspection

Procedures stored in the knowledge base are totally opaque, so it is hard to guess which property is provided by which procedure and what its dependencies are. To enable transparency, we now explicitly and statically describe all rules using the new rule description eDSL. The description doesn't affect the semantics and is purely informational. To get all the rules stored in the knowledge base, just type `bap list rules`. Here, for example, is how symbolizers are described in terms of the knowledge base rules:

-- [Symbolizer.provide s] reflects [s] to KB.
bap:reflect-symbolizers(symbolizer) ::=
  core-theory:unit-bias
  core-theory:unit-path
  core-theory:label-unit
  core-theory:label-addr
  bap:arch
  -------------------------
  core-theory:possible-name
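
To illustrate why such a description can stay purely informational, here is a minimal sketch of a rule description as plain data plus a printer that produces the layout above. It is a toy model, not the eDSL introduced by this PR: the `rule` record and the `declare`, `dynamic`, `require`, `provide`, `comment`, and `render` functions are hypothetical names.

```ocaml
(* A toy model of a rule description; the types and names here are
   illustrative and are not the eDSL introduced by this PR. *)
type rule = {
  package  : string;
  name     : string;
  params   : string list;   (* dynamic parameters of the rule *)
  requires : string list;   (* properties the rule reads *)
  provides : string list;   (* properties the rule computes *)
  comment  : string;
}

let declare ~package name =
  {package; name; params = []; requires = []; provides = []; comment = ""}
let dynamic params r = {r with params}
let require p r = {r with requires = r.requires @ [p]}
let provide p r = {r with provides = r.provides @ [p]}
let comment comment r = {r with comment}

(* Renders a rule in the premises-over-conclusion layout. *)
let render r =
  let indent = List.map (fun p -> "  " ^ p) in
  String.concat "\n" (List.concat [
      [Printf.sprintf "-- %s" r.comment];
      [Printf.sprintf "%s:%s(%s) ::=" r.package r.name
         (String.concat ", " r.params)];
      indent r.requires;
      ["  " ^ String.make 25 '-'];
      indent r.provides;
    ])

(* The description is data only: nothing here is ever executed as
   part of computing a property. *)
let reflect_symbolizers =
  declare ~package:"bap" "reflect-symbolizers"
  |> dynamic ["symbolizer"]
  |> require "core-theory:unit-bias"
  |> require "core-theory:unit-path"
  |> require "core-theory:label-unit"
  |> require "core-theory:label-addr"
  |> require "bap:arch"
  |> provide "core-theory:possible-name"
  |> comment "[Symbolizer.provide s] reflects [s] to KB."
```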

Implementation Details

The problem

When a project is created using the Project.create function, it uses the internal knowledge base and stores all the information about the disassembled binary in it. When Project.create is called a second time, it will most likely end up in a conflicting state, since we index objects by their virtual addresses and it is quite probable that two binaries have different instructions at the same addresses. In addition, our old information sources, such as branchers, rooters, and symbolizers, albeit deprecated, still play an important role in our infrastructure. They reflected their information directly into the knowledge base every time a new file was opened, so if several binaries were opened in a row, the sources kept computing roots and names for all of them at the same time, which eventually also led to conflicting information.

Solution

We represent program labels as knowledge base symbols, which were interned in the core-theory package. Instead, we can intern them in the current package and set the current-package variable to a different value for each file or project. That enables the same addresses to be distinguished if they come from different files.
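
The effect of interning labels per package can be sketched with a small symbol table. This is a toy model rather than the Knowledge interface: `intern`, `in_package`, `label_for_addr`, and the integer object identifiers are hypothetical, and the only point is that the lookup key is the pair of package and name.

```ocaml
(* A toy symbol table; the names are illustrative, not the Knowledge API.
   A symbol is identified by its package and its name together. *)
type obj = int

let table : (string * string, obj) Hashtbl.t = Hashtbl.create 16
let current_package = ref "core-theory"

(* Returns the object for [name] in the current package, creating it
   on first use. *)
let intern name =
  let key = (!current_package, name) in
  match Hashtbl.find_opt table key with
  | Some obj -> obj
  | None ->
    let obj = Hashtbl.length table in
    Hashtbl.add table key obj;
    obj

(* Runs [f] with the current package temporarily set to [package]. *)
let in_package package f =
  let saved = !current_package in
  current_package := package;
  Fun.protect ~finally:(fun () -> current_package := saved) f

(* A label for an address is a symbol named after that address. *)
let label_for_addr addr = intern (Printf.sprintf "0x%Lx" addr)
```

With one package per file, the same address interned while processing two different files yields two distinct objects, while asking again within the same package returns the object that already exists.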

Scoping information sources is a bit harder. First of all, we added the promising and providing operations to the Knowledge interface, which store procedures in the knowledge base temporarily. It is not, however, always possible to delimit the right scope. Moreover, sometimes it is very convenient for a procedure to stay in the knowledge base indefinitely and to provide information even after the project is reconstructed, and maybe even after the knowledge base is persisted and loaded again.

To achieve these goals, we introduced the notion of code units. A code unit denotes a set of instructions that share common properties, and each instruction is attributed to a unit. From the unit, we can obtain information about the file name, target architecture, and even the programming language from which the instruction was compiled. A well-behaved information provider, when it computes some instruction property, can now look into the instruction's unit, figure out its origin, and check whether it matches the provider's own source of information. E.g., when the radare2 symbolizer is reading its symbols from a file named foo and the address comes from a unit that belongs to a file named bar, it should not provide any symbolic information about that address. For the old information sources, such as rooter, symbolizer, and brancher, we enabled this behavior automatically by adding the path property to the sources. When the source is provided to the knowledge base, the stored procedure checks whether the path in the source matches the origin path of the address.
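
The path check described above can be sketched as follows. This is a toy model, not BAP's API: the `code_unit`, `label`, and `symbolizer` types and the `possible_name` function are hypothetical, and the choice to stay silent when the label has no unit is an assumption of this sketch.

```ocaml
(* A toy model of path scoping; these types are illustrative only. *)
type code_unit = { path : string }

type label = {
  addr : int64;
  code_unit : code_unit option;   (* the unit the address belongs to *)
}

(* An old-style symbolizer: a lookup function plus the path of the
   file its symbols were read from. *)
type symbolizer = {
  source : string option;         (* [None] means "not tied to a file" *)
  lookup : int64 -> string option;
}

(* Provides a name only when the symbolizer is not tied to a file or
   when the label's unit originates from the same file. *)
let possible_name sym label =
  match sym.source, label.code_unit with
  | None, _ -> sym.lookup label.addr
  | Some src, Some u when String.equal src u.path -> sym.lookup label.addr
  | Some _, _ -> None
```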

The objdump plugin no longer uses the old interface and is fully rewritten. The idea is to devise an interface for providing information that will be used instead of the old symbolizers and rooters. It is still a work in progress that we have cherry-picked from the branch that delivers Ghidra support. The new implementation of the objdump plugin is a proof of concept that operates fully from the knowledge base (it does not even depend on the Bap.Std interface). For each address, it obtains its unit, and if it doesn't have any information about that unit yet, it opens and parses the corresponding file and then provides the obtained information as usual. It also provides a new service that is dual to the service that our symbolizers provide: it is able to resolve names to addresses. This latter service uncovered a long-standing bug in BAP that had gone unnoticed, namely that the --llvm-base option wasn't properly handled by all our information providers. When this option was used, our llvm loader rebased the binary and, as a result, all addresses were different from what the other information sources, which rely on the original data, were seeing. In the end, objdump, radare, and ida were providing information for the real addresses, not for the shifted ones. That also led to multiple conflicts.

The real culprit was the original design: we shouldn't have this option at the llvm level at all, but should instead handle it globally. But it is too late to change that, so we provided a more general solution. We introduced the bias property to our unit class, which denotes that all addresses in this unit are biased with respect to the real addresses. Then we updated our information providers to respect that bias, both when we need to go from biased to real addresses and back from real to biased. To minimize the impact, we added automated handling of biases to our old information sources: rooter, symbolizer, and brancher. We assume that they all provide unbiased addresses and automatically subtract the bias before passing addresses to them. Only the information sources that were obtained from the image are considered biased, so they do not need this extra correction. Right now we do not allow users to create explicitly biased sources, as we don't see a real need for that, but we may publish this interface later.
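
The address arithmetic behind biasing is small enough to sketch directly. The code below is a toy model with hypothetical names (`code_unit`, `source`, `to_real`, `to_biased`, `name`, `dests`), not the interface of this PR; it assumes that a biased address is the real address plus the unit's bias and that a boolean flag marks image-derived sources.

```ocaml
(* A toy model of unit biasing; names are illustrative, not BAP's API.
   [bias] is the difference between the address at which the loader
   placed the code and the address recorded in the file. *)
type code_unit = { bias : int64 }

let to_real u biased = Int64.sub biased u.bias
let to_biased u real = Int64.add real u.bias

(* An old-style source is assumed to speak in real addresses, unless
   it was derived from the image and is marked as biased. *)
type source = {
  biased : bool;
  name_of : int64 -> string option;
  dests_of : int64 -> int64 list;
}

(* Asks the source for the name at a biased address. *)
let name u src addr =
  src.name_of (if src.biased then addr else to_real u addr)

(* Asks for branch destinations; the answers are moved back to the
   biased address space. *)
let dests u src addr =
  if src.biased then src.dests_of addr
  else List.map (to_biased u) (src.dests_of (to_real u addr))
```

For example, with a bias of 0x10000, a query at 0x10400 reaches an unbiased source as 0x400, and a destination that the source reports as 0x420 comes back as 0x10420.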

ivg · 2020-06-11T18:44:15Z

@gitoleg, take a closer look at the stub-resolver, as I had to rewrite it substantially.

We assumed by default that all our information sources (rooters, symbolizers, and branchers) are unbiased, but that wasn't true for the sources that we created from images (using the corresponding of_image functions), which were already operating on the biased information from the loader. To fix this issue, we added a hidden parameter that marks information sources as biased or unbiased, and we perform bias subtraction (and also addition, when destinations are provided) based on this parameter. We may later make it public, but so far it is only set for information sources that are derived from images. Also, the common code between information sources was factored out into the Bap_disasm_source module.

The [for_file] function now also sets the path (the same way as the corresponding [for_addr] function sets the address). The [for_region] function now interns the boundaries in the current package and builds the final symbol from their concatenation, which enables intersecting regions from different files in the same knowledge base.

Implements support for various relocations and improves the existing support, which enables us to pass all tests without relying on external symbols or tools such as objdump or radare2. This branch supports PLT-like relocations, as well as direct calls with GLOB_DAT relocations (fixes BinaryAnalysisPlatform#1135). The PLT entries are constant-folded and memory references are then analyzed. We also extended the analysis that detects stub functions to support various ABIs and file formats. For PowerPC MachO, which stores stubs directly in the text section, we implemented a signature matching procedure to reliably detect the stubs. We also significantly improved support for mips, which was suffering from missing function starts that correspond to the stubbed functions, as byteweight is unable to detect these stubs.

In addition, this PR brings a new library called Bap_relation, a bidirectional mapping that is useful for storing addr <-> name mappings and ensuring their bijection. This library is now used explicitly or implicitly (via the old symbolizer interface) by all our providers of symbolic information. This change prevents symbolizers from providing conflicting information, which could later lead to knowledge base conflicts.

We also removed, for now, the name-to-address translation service that we recently introduced in BinaryAnalysisPlatform#1119. We are not ready for this service yet (our knowledge base does not have enough rules stored in it), and without this rule we can disassemble 25% faster.

There are also a couple of minor fixes and quality-of-life improvements:

- fixes Insn.dests domain functions
- a better default for the KB.Domain.Powerset inspect parameter
- makes glibc-runtime heuristic more aggressive
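
One simple way to keep an addr <-> name mapping bijective is to maintain both directions and refuse pairs that contradict an existing entry. The sketch below shows that policy only; it is not the Bap_relation interface, the names `create`, `add`, `name_of`, and `addr_of` are hypothetical, and the real library may resolve conflicts differently.

```ocaml
(* A toy bidirectional map; illustrative only, not Bap_relation's API. *)
type t = {
  names : (int64, string) Hashtbl.t;   (* address to name *)
  addrs : (string, int64) Hashtbl.t;   (* name to address *)
}

let create () = { names = Hashtbl.create 16; addrs = Hashtbl.create 16 }

(* Adds the pair unless it contradicts an existing one, in which case
   the relation is left unchanged and [false] is returned. *)
let add r addr name =
  match Hashtbl.find_opt r.names addr, Hashtbl.find_opt r.addrs name with
  | None, None ->
    Hashtbl.replace r.names addr name;
    Hashtbl.replace r.addrs name addr;
    true
  | Some n, Some a -> String.equal n name && Int64.equal a addr
  | _ -> false

let name_of r addr = Hashtbl.find_opt r.names addr
let addr_of r name = Hashtbl.find_opt r.addrs name
```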

Introduces Theory.Target.t, which supersedes Bap.Std.Arch.t. The old representation suffered from a few problems that we inherited from LLVM. The main issue is that Arch.t is not extensible: in order to add a new architecture, the Bap.Std code has to be changed in a backward-compatibility-breaking manner. Arch.t is also unable to represent the whole variety of computing devices, which is especially relevant to micro-controllers (AVR, PIC) and IoT devices, on which we are currently focusing. Finally, Arch.t is not precise enough to capture the information that is necessary for code generation, a new avenue that we are currently exploring.

As a first attempt, which didn't really work, we introduced arch, sub, and other properties to the `core-theory:unit` class in BinaryAnalysisPlatform#1119. The problem with that approach was the stringly typed interface, as `arch` was represented as a simple string. In addition, the proposed properties weren't able to describe uncommon architectures. Finally, it was very awkward to use: all fields were optional with no good defaults.

This is the second attempt, and it will be split into several pull requests. The first PR, this one, introduces Theory.Target.t but still keeps Arch.t alive, i.e., it is still used by all internal and external components of BAP. This is to ensure that switching to Target.t doesn't break any existing code. The subsequent pull requests will gradually deprecate functions that use Arch.t and switch to Target.t everywhere. The most important switch will affect the disassembler/decoder framework, which is currently still stuck on Arch.t. Just to be clear, after this work is finished, until BAP 3.0 and maybe even thereafter, Arch.t will still work as it used to and no code will break or require updates. However, newly added architectures, such as AVR or PIC, i.e., those that could not be represented with Arch.t, will not be available to the code that still relies on it.

In addition to Theory.Target.t, we add a few more abstractions and convenience functions, e.g., `Project.empty` and a completely new interface for Project.Input.t generation, which makes it easier to create projects from strings or other custom data, e.g., `Project.Input.from_string`. We also add Source and Compiler abstractions to the knowledge base Core Theory. These abstractions, together with Target, describe the full cycle of the program transformation from source to the target binary using the specified compiler (and the other way around).

The Target abstraction itself comes with a few more data types that describe various aspects of the target system, including file formats, ABI, floating-point ABI (FABI), endianness (which is no longer limited to the binary choice of little and big endianness), and an extensible data type for storing target-specific options. Finally, all targets are formed into hierarchies and families, which helps in controlling the vast zoo of computer architectures and devices. Target.t is an abstract, self-describing data type that includes enough information to describe all the details of the architecture. We also provide four library modules, for arm, mips, powerpc, and x86, that expose the currently declared targets (there are about 80 targets currently and more will be added soonish, see `bap list targets` for the up-to-date list).

Our LLVM backend is not yet precise enough to recognize many of the supported targets, and we don't have analyses right now that infer the target from the binary, but we will add the `--target` option everywhere in the next PRs (when we switch to Target.t). As usual, comments, questions, and reviews are very welcome.
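
The difference between a closed enumeration and an extensible, hierarchical set of targets can be sketched with a small registry. This is a toy model and not Theory.Target: the `target` record, `declare`, `belongs`, the target names, and the default of 32 bits are all assumptions made for illustration.

```ocaml
(* A toy registry of targets; illustrative only, not Theory.Target. *)
type target = {
  name : string;
  parent : target option;
  bits : int;
}

let registry : (string, target) Hashtbl.t = Hashtbl.create 16

(* Declares a new target; an unspecified property is inherited from
   the parent, or falls back to an arbitrary default of 32 bits. *)
let declare ?parent ?bits name =
  if Hashtbl.mem registry name then
    invalid_arg ("target already declared: " ^ name);
  let bits = match bits, parent with
    | Some b, _ -> b
    | None, Some p -> p.bits
    | None, None -> 32 in
  let t = { name; parent; bits } in
  Hashtbl.add registry name t;
  t

(* [belongs family t] is true if [t] is [family] or descends from it. *)
let rec belongs family t =
  String.equal family.name t.name ||
  match t.parent with
  | None -> false
  | Some p -> belongs family p

(* New targets are added by a call, not by changing a variant type. *)
let arm = declare "arm"
let armv7 = declare ~parent:arm "armv7"
let avr = declare ~bits:8 "avr"
```

Since a new target is a value rather than a constructor, code that already matches on families with `belongs` keeps compiling when more targets are declared.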

ivg requested a review from gitoleg June 11, 2020 18:43

ivg added the KB label Jun 11, 2020

ivg self-assigned this Jun 12, 2020

ivg mentioned this pull request Jun 17, 2020

implement support for R_X86_64_GLOB_DAT relocations #1135

Closed

ivg force-pushed the compartmentalize-projects branch 3 times, most recently from 00439d6 to 7866818 Compare June 29, 2020 20:28

ivg force-pushed the compartmentalize-projects branch 6 times, most recently from 49c69b7 to 6b7865c Compare July 13, 2020 17:40

ivg mentioned this pull request Jul 14, 2020

Core Theory based ARM lifter #1174

Closed

ivg force-pushed the compartmentalize-projects branch from 6b7865c to caec1cf Compare July 14, 2020 14:13

ivg added 14 commits July 15, 2020 13:22

creates program modules in the current package

dc94bb4

or in the user package if one is specified. Also, demystifies the program objects and documents explicitly how they are formed.

adds package and set_package functions to the knowledge interface

a423663

adds ?package to corresponding function in the Tid module

b8be703

adds the package parameter to the Project.create function

c21de9f

fixes stub-resolver, it was adding bogus slots to the program class

f45fa44

properly qualifies the start and exit nodes of the Tid graph

6d74685

uses the new package field in the disassemble command

f7e31ab

interns contexts in the current package

4af460b

adds the collator extension points

36e2ddd

adds the collate command

cb897e4

adds the callgraph collators

a655f87

adds scoped promises and proposals to knowledge

fb13a31

switches to scoped promises for arches

df168f0

enables lazy processing of the projects in the collate command

9baad7e

ivg added 14 commits July 15, 2020 13:22

improves signature mismatch error in the OGRE parser

5e2c541

fixes abi printer

9b867d6

adds the newly added fields to the image specification

9a852b5

enables propagation of spec into KB and arch into spec

76ff3c6

initializes the unit for the low-level disassembler interface

69eb917

fixes the hardcoded modulus in the objdump plugin

50434ef

We are now able to query the bitness without having to pull in the Bap.Std interface so we can implement everything neatly.

scopes the information sources obtained from the image

6619f60

enables scoped promises for information sources

f70bc24

and uses it for the rec, as far as I can grep, I don't see any information sources that are not properly scoped, either by limiting them to the path or to a function.

updates the project documentation, documents the toplevel interface

6b6cf53

also exports the full toplevel interface

flushes the formatter in the callgraph collator.

40b8624

documents the Bap_main_event module

7c71724

also establishes equalities between it and its re-export in Bap.Std and adds a convenience alias in the Bap_main module for the loggers.

adds the compare command tests

0f8b7f2

ivg force-pushed the compartmentalize-projects branch from 2310da7 to 0f8b7f2 Compare July 15, 2020 17:22

ivg merged commit 53da1ca into BinaryAnalysisPlatform:master Jul 15, 2020

ivg mentioned this pull request Aug 17, 2020

improves symbolization facilities #1209

Merged

ivg mentioned this pull request Sep 24, 2020

overhauls the target/architecture abstraction (1/n) #1225

Merged

ivg deleted the compartmentalize-projects branch December 1, 2021 19:13
