enables multiple projects in the same knowledge base #1119

Merged
merged 67 commits into from
Jul 15, 2020

@ivg ivg commented Jun 11, 2020

This PR paves the road to the global knowledge base which will be storing information about several binaries (executables, libraries, object files, etc). It also brings a few new functional and convenience features. The quick summary is below:

  1. enables multiple projects in the same knowledge base;
  2. adds scoping of the procedures stored in the knowledge base;
  3. introduces a new knowledge base class for code units;
  4. extracts more information about binaries from the llvm loaders;
  5. introduces unit biasing, fixes "the base address is not respected by many components" (#1126)
  6. scopes information sources to the corresponding paths
  7. renames the bap.std package to bap
  8. adds bap compare command together with the new collators extension point
  9. introduces rules for introspecting procedures stored in the knowledge base
  10. joins symbolizers via the new possible-name property to collect all the aliases
  11. moves logging and event facilities to the bap-main library
  12. adds more parsers to the Bitvec module
  13. updates all information sources to enable their cooperation with more than one file
  14. rewrites the objdump plugin, which is now also able to provide an address from a name
  15. provides extensible abstractions for language, architecture, abi, etc
  16. publishes and extends the Toplevel module
  17. refines the project documentation

New features

The compare command and callgraph-collator

This is a new command that showcases the new ability to store several files, whether identical or different, in the same knowledge base. It comes with a new extension point called collators (sorry, the compare and comparator names are already too busy in OCaml). A collator is an object that takes two projects and compares them as it sees fit. Its interface is designed in such a way that, when we compare several files, we never need to keep more than one project in resident memory or to store information about more than two projects. As a demonstration, we implemented a collator that compares N binaries by comparing their callgraphs.
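To make the memory argument concrete, here is a minimal sketch of a collator-style interface. It is a Python model, not BAP's OCaml API: the `prepare`/`collate`/`summary` method names, the `compare` driver, and the representation of a project as a name plus a set of callgraph edges are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical stand-in for a project: a name plus its callgraph edges.
Edge = Tuple[str, str]
Project = Tuple[str, Set[Edge]]


class CallgraphCollator:
    """Compares every project against the base project by callgraph edges."""

    def prepare(self, base: Project) -> None:
        # Keep only a summary of the base project, not the project itself.
        self.base_edges = set(base[1])
        self.report: List[Tuple[str, int, int]] = []

    def collate(self, index: int, project: Project) -> None:
        name, edges = project
        missing = len(self.base_edges - edges)  # edges only in the base
        extra = len(edges - self.base_edges)    # edges only in this project
        self.report.append((name, missing, extra))

    def summary(self) -> List[Tuple[str, int, int]]:
        return self.report


def compare(collator: CallgraphCollator,
            load: Callable[[str], Project],
            paths: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Loads projects one at a time, so only one is resident at any moment."""
    it = iter(paths)
    collator.prepare(load(next(it)))
    for index, path in enumerate(it, start=1):
        collator.collate(index, load(path))
    return collator.summary()
```

Because the collator folds over the projects, the driver can drop each project as soon as it has been collated; only the base summary and the running report survive between steps.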

Knowledge Base Rules and their introspection

Procedures stored in the knowledge base are totally opaque, so it is hard to guess which property is provided by which procedure and what its dependencies are. To enable transparency, we now explicitly and statically describe all rules using the new rule description eDSL. The description doesn't affect the semantics and serves purely informational purposes. To get all the rules stored in the knowledge base, just type bap list rules. This is, for example, how symbolizers are described in terms of the knowledge base rules:

-- [Symbolizer.provide s] reflects [s] to KB.
bap:reflect-symbolizers(symbolizer) ::=
  core-theory:unit-bias
  core-theory:unit-path
  core-theory:label-unit
  core-theory:label-addr
  bap:arch
  -------------------------
  core-theory:possible-name
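
The listing above can be read as an inference rule: the properties above the bar are required and the property below it is provided. The following sketch models such a description in Python and shows the kind of query that introspection enables. The `Rule` class, its fields, and the `providers` helper are hypothetical and are not the rule eDSL itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    """A purely descriptive record; it does not affect any computation."""
    name: str
    comment: str
    parameters: List[str] = field(default_factory=list)
    requires: List[str] = field(default_factory=list)
    provides: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Premises go above the bar, conclusions below it.
        lines = [f"-- {self.comment}",
                 f"{self.name}({', '.join(self.parameters)}) ::="]
        lines += [f"  {prop}" for prop in self.requires]
        lines.append("  " + "-" * 25)
        lines += [f"  {prop}" for prop in self.provides]
        return "\n".join(lines)


def providers(rules: List[Rule], prop: str) -> List[str]:
    """Answers the question: which rules provide the given property?"""
    return [rule.name for rule in rules if prop in rule.provides]
```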

Implementation Details

The problem

When a project is created using the Project.create function, it uses the internal knowledge base and stores all the information about the disassembled binary in it. When Project.create is called a second time, it will most likely end up in a conflicting state, since we index objects by their virtual addresses and it is quite probable that two binaries will have different instructions at the same addresses. In addition, our old information sources, such as branchers, rooters, and symbolizers, although deprecated, still play an important role in our infrastructure. They reflected their information directly into the knowledge base every time a new file was opened, so if several binaries were opened in a row, the sources kept computing roots and names for all of them at the same time, which would also eventually lead to conflicting information.

Solution

We represent program labels as knowledge base symbols, which were previously interned in the core-theory package. Instead, we can intern them in the current package and set the current-package variable to a different value for each file (project). That enables the same addresses to be distinguished if they came from different files.
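
The effect of per-file packages can be shown with a toy model. The `KnowledgeBase` class below, its method names, and the `file:` package naming scheme are assumptions for illustration only; the real knowledge base is far richer. The sketch shows that interning the same address in one shared package yields one object (and hence a conflict), while interning it in per-file packages yields two distinct objects.

```python
class KnowledgeBase:
    """A toy knowledge base with package-qualified symbols."""

    def __init__(self):
        self.current_package = "user"
        self._symbols = {}  # (package, name) -> object id
        self._slots = {}    # (object id, property) -> value

    def intern(self, name, package=None):
        # The same name in different packages denotes different objects.
        key = (package or self.current_package, name)
        return self._symbols.setdefault(key, len(self._symbols))

    def provide(self, obj, prop, value):
        # Providing a different value for the same slot is a conflict.
        old = self._slots.setdefault((obj, prop), value)
        if old != value:
            raise ValueError(f"conflict on {prop}: {old!r} vs {value!r}")

    def collect(self, obj, prop):
        return self._slots.get((obj, prop))


kb = KnowledgeBase()

# Two files, one address, each interned in its own package.
kb.current_package = "file:/bin/foo"  # hypothetical package name
foo_label = kb.intern("0x400000")
kb.current_package = "file:/bin/bar"  # hypothetical package name
bar_label = kb.intern("0x400000")

kb.provide(foo_label, "insn", "push rbp")
kb.provide(bar_label, "insn", "ret")  # no conflict: different objects
```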

Scoping information sources is a bit harder. First of all, we added the promising and providing operations to the Knowledge interface, which temporarily store procedures in the knowledge base. It is not always possible, however, to establish the right scope. Moreover, it is sometimes very convenient for a procedure to be stored in the knowledge base indefinitely, so that it provides information even after the project is reconstructed and maybe even after the knowledge base is persisted and loaded again.

To achieve these goals, we introduced the notion of code units. A code unit denotes a set of instructions that share common properties, and each instruction is attributed to a unit. From the unit, we can obtain the file name, the target architecture, and even the programming language from which the instruction was compiled. A well-behaved information provider, when it computes some instruction property, can now look into the instruction's unit, figure out its origin, and check whether it matches the provider's own source of information. For example, when the radare2 symbolizer is reading its symbols from a file named foo and the address comes from a unit that belongs to a file named bar, it should not provide any symbolic information about that address. For the old information sources, such as rooter, symbolizer, and brancher, we enabled this behavior automatically by adding the path property to the sources. When a source is provided to the knowledge base, the stored procedure checks whether the path in the source matches the origin path of the address.
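
The path check described above can be sketched as follows. This is a Python model under stated assumptions: `Unit`, `Label`, and `PathScopedSymbolizer` are hypothetical names, and a unit is reduced to just its path.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Unit:
    path: str  # the file this unit of code originates from


@dataclass(frozen=True)
class Label:
    addr: int
    unit: Unit


class PathScopedSymbolizer:
    """Answers only for labels whose unit comes from its own file."""

    def __init__(self, path: str, names: Dict[int, str]):
        self.path = path
        self.names = names

    def name_of(self, label: Label) -> Optional[str]:
        # Stay silent about addresses that belong to other files.
        if label.unit.path != self.path:
            return None
        return self.names.get(label.addr)


symbolizer = PathScopedSymbolizer("foo", {0x1000: "main"})
in_foo = Label(0x1000, Unit("foo"))
in_bar = Label(0x1000, Unit("bar"))
```

Here `symbolizer.name_of(in_foo)` yields the name, while `symbolizer.name_of(in_bar)` yields nothing even though the address is the same.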

The objdump plugin no longer uses the old interface and is fully rewritten. The idea is to devise an interface for providing information that will be used instead of the old symbolizers and rooters. It is still a work in progress that we cherry-picked from the branch that delivers Ghidra support. The new implementation of the objdump plugin is a proof of concept that operates fully from the knowledge base (it does not even depend on the Bap.Std interface). For each address, it obtains its unit and, if it doesn't yet have any information about that unit, opens and parses the corresponding file and then provides the obtained information as usual. It also provides a new service that is dual to the service that our symbolizers provide: it is able to resolve names to addresses. This latter service uncovered a long-standing bug in BAP that had gone unnoticed, namely, that the --llvm-base option wasn't properly respected by all our information providers. When this option was used, our llvm loader rebased the binary and, as a result, all addresses were different from what other information sources, which rely on the original data, were seeing. In the end, objdump, radare, and ida were providing information for the real addresses, not for the shifted ones. That also led to multiple conflicts.

The real culprit was the original design, as we shouldn't have had this option at the llvm level at all, but should have handled it globally instead. It is too late to change that, however, so we provided a more general solution. We introduced the bias property to our unit class, which denotes that all addresses in this unit are biased with respect to the real addresses. Then we updated our information providers to respect that bias, both when going from biased to real addresses and when going back from real to biased. To minimize the impact, we added automated handling of biases to our old information sources: rooter, symbolizer, and brancher. We assume that they all provide unbiased addresses and automatically subtract the bias before passing addresses to them. Only the information sources that were obtained from the image are considered biased, so for them this extra correction is not needed. Right now we do not allow users to explicitly create biased sources, as we don't see a real need for that, but we may publish this interface later.
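
The bias arithmetic can be sketched in a few lines. This is a Python model, not the BAP implementation: addresses are plain integers (no bit-width wrapping), and the function names and the 0x400000 bias are assumptions. Queries arrive in biased addresses, so the wrapper subtracts the bias before consulting an unbiased source and adds it back to any addresses the source returns.

```python
from typing import Callable, Dict, List, Optional


def unbias(addr: int, bias: int) -> int:
    """Goes from a biased (rebased) address to the real one."""
    return addr - bias


def rebias(addr: int, bias: int) -> int:
    """Goes from a real address back to the biased one."""
    return addr + bias


def biased_symbolizer(names: Dict[int, str],
                      bias: int) -> Callable[[int], Optional[str]]:
    # The source knows real addresses; queries arrive biased.
    return lambda addr: names.get(unbias(addr, bias))


def biased_brancher(dests: Dict[int, List[int]],
                    bias: int) -> Callable[[int], List[int]]:
    # Destinations are addresses too, so they are shifted back.
    return lambda addr: [rebias(d, bias)
                         for d in dests.get(unbias(addr, bias), [])]


BIAS = 0x400000  # hypothetical value, as if set via a rebasing option
name_of = biased_symbolizer({0x1000: "main"}, BIAS)
dests_of = biased_brancher({0x1000: [0x1010, 0x1020]}, BIAS)
```

A source that is already biased, such as one derived from the loaded image, would be used without these wrappers.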

@ivg ivg requested a review from gitoleg June 11, 2020 18:43
ivg commented Jun 11, 2020

@gitoleg, take a closer look at the stub-resolver, as I had to rewrite it substantially.

@ivg ivg added the KB label Jun 11, 2020
@ivg ivg self-assigned this Jun 12, 2020
@ivg ivg force-pushed the compartmentalize-projects branch 3 times, most recently from 00439d6 to 7866818 Compare June 29, 2020 20:28
@ivg ivg force-pushed the compartmentalize-projects branch 6 times, most recently from 49c69b7 to 6b7865c Compare July 13, 2020 17:40
@ivg ivg force-pushed the compartmentalize-projects branch from 6b7865c to caec1cf Compare July 14, 2020 14:13
ivg added 14 commits July 15, 2020 13:22
We assumed by default that all our information sources (rooters,
symbolizers, and branchers) are unbiased but it wasn't true for
the sources that we created from images (using the corresponding
of_image) that were already operating using the biased information
from the loader.

To fix this issue we added a hidden parameter to mark information
sources as biased or unbiased and perform bias subtraction (and also
addition in case when the destinations are provided) based on this
parameter. We may later make it public, but so far it is only set for
information sources that are derived from images.

Also the common code between information sources was factored out
into the Bap_disasm_source module.

We are now able to query the bitness without having to pull in the
Bap.Std interface so we can implement everything neatly.

and uses it for the rec, as far as I can grep, I don't see any
information sources that are not properly scoped, either by limiting
them to the path or to a function.

The [for_file] function now also sets the path (the same as the
corresponding [for_addr] function sets the address).

The [for_region] now interns the boundaries in the current package
and builds the final symbol from their concatenation to enable
intersecting regions from different files in the same knowledge base.

also establishes equalities between it and its re-export in Bap.Std
and adds a convenience alias in the Bap_main module for the loggers.
@ivg ivg force-pushed the compartmentalize-projects branch from 2310da7 to 0f8b7f2 Compare July 15, 2020 17:22
@ivg ivg merged commit 53da1ca into BinaryAnalysisPlatform:master Jul 15, 2020
ivg added a commit to ivg/bap that referenced this pull request Aug 17, 2020
Implements support for various relocations and improves the existing
ones, which enables us to pass all tests without relying on external
symbols or tools such as objdump or radare2.

This branch supports PLT-like relocations, as well as direct calls with
GLOB_DAT relocations (fixes BinaryAnalysisPlatform#1135). The PLT entries are constant
folded and memory references are then analyzed. We also extended the
analysis that detects stub functions to support various ABIs and file
formats. For PowerPC MachO, which stores stubs directly in the text
section, we implemented a signature matching procedure to reliably
detect the stubs. We also significantly improved support of mips,
which was suffering from missing function starts that correspond to
the stubbed functions as byteweight is unable to detect these stubs.

In addition, this PR brings a new library called Bap_relation that is
a bidirectional mapping useful for storing addr <-> name mappings and
ensuring their bijection. This library is now used explicitly or
implicitly (via the old symbolizer interface) by all our providers of
symbolic information. This change prevents symbolizers from providing
conflicting information, which may later lead to knowledge base
conflicts.
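
A minimal sketch of the idea, assuming a much simpler interface than the actual Bap_relation library: the `Relation` class and its `add` and `bijection` methods are hypothetical. Pairs are recorded in both directions, and only pairs that are unique both ways are reported, so ambiguous names never reach the knowledge base.

```python
from collections import defaultdict
from typing import Dict


class Relation:
    """Stores addr <-> name pairs and extracts the bijective subset."""

    def __init__(self):
        self._names = defaultdict(set)  # addr -> names seen for it
        self._addrs = defaultdict(set)  # name -> addrs seen for it

    def add(self, addr: int, name: str) -> None:
        self._names[addr].add(name)
        self._addrs[name].add(addr)

    def bijection(self) -> Dict[int, str]:
        """Keeps a pair only if addr has one name and name has one addr."""
        result = {}
        for addr, names in self._names.items():
            if len(names) != 1:
                continue  # one address with several names
            (name,) = names
            if len(self._addrs[name]) == 1:
                result[addr] = name
        return result


rel = Relation()
rel.add(0x10, "f")
rel.add(0x30, "f")  # the name f now maps to two addresses
rel.add(0x20, "g")
rel.add(0x20, "h")  # the address 0x20 now has two names
rel.add(0x40, "k")  # the only unambiguous pair
```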

For now, we also removed the name-to-address translation service that
we recently introduced in BinaryAnalysisPlatform#1119. We are not ready for this service yet
(our knowledge base does not have enough rules stored in it) and
without this rule we can disassemble 25% faster.

There are also a couple of minor fixes and quality of life
improvements:
- fixes Insn.dests domain functions
- a better default for the KB.Domain.Powerset inspect parameter
- makes glibc-runtime heuristic more aggressive
ivg added a commit to ivg/bap that referenced this pull request Aug 18, 2020
ivg added a commit to ivg/bap that referenced this pull request Aug 19, 2020
ivg added a commit that referenced this pull request Aug 21, 2020
ivg added a commit to ivg/bap that referenced this pull request Sep 24, 2020
Introduces Theory.Target.t that supersedes Bap.Std.Arch.t. The old
representation suffered from a few problems that we inherited from
LLVM. The main issue is that Arch.t is not extensible and in order
to add a new architecture the Bap.Std code shall be changed in a
backward-compatibility-breaking manner. Arch.t is also unable to
represent the whole variety of computing devices, which is especially
relevant to micro-controllers (AVR, PIC) and IoT devices on which we
are currently focusing. Finally, Arch.t is not precise enough to
capture information that is necessary for code generation, the new
avenue that we are currently exploring.

As the first attempt that didn't really work we introduced arch, sub,
and other properties to the `core-theory:unit` class in BinaryAnalysisPlatform#1119. The problem
with that approach was the stringly typed interface as `arch` was
represented as a simple string. In addition, the proposed properties
weren't able to describe uncommon architectures. Finally, it was very
awkward to use, all fields were optional with no good
defaults.

This is the second attempt and it will be split into several pull
requests. The first PR, this one, introduces Theory.Target.t but
still keeps Arch.t alive, i.e., it is used by all internal and
external components of BAP. This is to ensure that switching to
Target.t doesn't break any existing code. The subsequent pull requests
will gradually deprecate functions that use Arch.t and switch to
Target.t everywhere. The most important switch will affect the
disassembler/decoder framework, which is currently still stuck on
Arch.t. Just to be clear, after this work is finished and until BAP
3.0 and maybe even thereafter Arch.t will still work as it used to
work and no code will break or require updates. However, newly added
architectures, such as AVR or PIC, i.e., those that could not be
represented with Arch.t will not be available for the code that still
relies on it.

In addition to Theory.Target.t we add a few more abstractions and
convenience functions, e.g., `Project.empty` and a completely new
interface for Project.Input.t generation, which makes it easier to
create projects from strings or other custom data, e.g.,
`Project.Input.from_string` .

We also add Source and Compiler abstractions to the knowledge base
Core Theory. These abstractions, together with Target, describe the
full cycle of the program transformation from source to the target
binary using the specified compiler (and the other way around). The
Target abstraction itself comes with a few more data types that
describe various aspects of the target system, including file formats,
ABI, floating-point ABI (FABI), endianness, which is no longer limited
to the binary choice of little and big endianness, and an extensible
data type for storing target-specific options.

Finally, all targets are formed into hierarchies and families, which
helps in controlling the vast zoo of computer architectures and
devices.
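
As a rough illustration of what a hierarchy of targets buys, here is a Python model in which each target points to its parent and membership in a family is decided by walking up the chain. The `Target` class, the `belongs` method as written here, and the target names are assumptions for the sketch and do not reproduce the Theory.Target interface.

```python
from typing import Iterator, Optional


class Target:
    """A target that may refine a more general parent target."""

    def __init__(self, name: str, parent: Optional["Target"] = None):
        self.name = name
        self.parent = parent

    def ancestors(self) -> Iterator["Target"]:
        # Walks from the direct parent up to the root of the hierarchy.
        target = self.parent
        while target is not None:
            yield target
            target = target.parent

    def belongs(self, family: "Target") -> bool:
        # A target belongs to itself and to every one of its ancestors.
        return self is family or any(t is family for t in self.ancestors())


# Hypothetical names, chosen only to show the shape of a family.
arm = Target("arm")
armv7 = Target("armv7", parent=arm)
armv7_le = Target("armv7+le", parent=armv7)
mips = Target("mips")
```

With such a structure, code written for the whole family (for example, anything for which `belongs(arm)` holds) keeps working when a more specific member is added later.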

The Target.t is an abstract data type and is self-describing and
includes enough information that describes all the details of the
architecture. We also provide four library modules, for arm, mips,
powerpc, and x86 that expose the currently declared targets (there are
about 80 targets currently and more will be added soonish, see
`bap list targets` for the up-to-date list).

Our LLVM backend is not yet precise enough to recognize many of the
supported targets and we don't have analyses right now that will infer
the target from the binary, but we will add the `--target` option in
the next PRs (when we will switch to Target.t) everywhere.

As usual, comments, questions, reviews are very welcome.
ivg added a commit to ivg/bap that referenced this pull request Sep 24, 2020
We also add Source, Language, and Compiler abstractions to the
knowledge base Core Theory. These abstractions, together with Target,
describe the full cycle of the program transformation using the
compiler from source code in the given language to the program for the
specified target (and the other way around). The Target abstraction
itself comes with a few more data types that describe various aspects
of the target system, including file formats, ABI, floating-point
ABI (FABI), endianness, which is no longer limited to the binary
choice of little and big endianness, and an extensible data type for
storing target-specific options.

ivg added a commit to ivg/bap that referenced this pull request Sep 24, 2020
Introduces Theory.Target.t that supersedes Bap.Std.Arch.t. The old
representation suffered from a few problems that we inherited from
LLVM. The main issue is that Arch.t is not extensible and in order
to add a new architecture the Bap.Std code shall be changed in a
backward-compatibility-breaking manner. Arch.t is also unable to
represent the whole variety of computing devices, which is especially
relevant to micro-controllers (AVR, PIC) and IoT devices on which we
are currently focusing. Finally, Arch.t is not precise enough to
capture information that is necessary for code generation, the new
avenue that we are currently exploring.

Our first attempt, which didn't really work, introduced arch, sub,
and other properties to the `core-theory:unit` class in BinaryAnalysisPlatform#1119. The problem
with that approach was its stringly typed interface, as `arch` was
represented as a simple string. In addition, the proposed properties
weren't able to describe uncommon architectures. Finally, it was very
awkward to use, since all fields were optional with no good
defaults.
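
A rough sketch of why optional string properties are awkward, compared
with an abstract type that has a designated default. The record
`unit_props` and the toy `Target` module below are hypothetical and
only model the idea; they are not the actual knowledge base classes.

```ocaml
(* First attempt, sketched: every property is an optional string. *)
type unit_props = {
  arch : string option;
  sub  : string option;
}

(* "ARM" and "arm" are different strings, so a misspelling or a
   different capitalization is silently treated as "not arm". *)
let is_arm_stringly p = p.arch = Some "arm"

(* Second attempt, sketched: an abstract type with an explicit
   default value instead of a missing field. *)
module Target : sig
  type t
  val unknown : t
  val arm : t
  val equal : t -> t -> bool
  val is_unknown : t -> bool
end = struct
  type t = int
  let unknown = 0
  let arm = 1
  let equal = Int.equal
  let is_unknown t = t = unknown
end

let is_arm t = Target.equal t Target.arm
```

In the typed version a consumer cannot pass an arbitrary string where a
target is expected, and the absence of information is a value of the
same type rather than a `None` that every caller must handle.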

This is the second attempt, and it will be split into several pull
requests. The first PR, this one, introduces Theory.Target.t but
still keeps Arch.t alive, i.e., it is still used by all internal and
external components of BAP. This is to ensure that switching to
Target.t doesn't break any existing code. The subsequent pull requests
will gradually deprecate functions that use Arch.t and switch to
Target.t everywhere. The most important switch will affect the
disassembler/decoder framework, which is currently still stuck on
Arch.t. Just to be clear, after this work is finished and until BAP
3.0, and maybe even thereafter, Arch.t will still work as it used to
and no code will break or require updates. However, newly added
architectures, such as AVR or PIC, i.e., those that could not be
represented with Arch.t, will not be available to code that still
relies on it.

In addition to Theory.Target.t, we add a few more abstractions and
convenience functions, e.g., `Project.empty`, and a completely new
interface for Project.Input.t generation, which makes it easier to
create projects from strings or other custom data, e.g.,
`Project.Input.from_string`.

We also add Source, Language, and Compiler abstractions to the
knowledge base Core Theory. These abstractions, together with Target,
describe the full cycle of program transformation by a compiler, from
source code in the given language to the program for the
specified target (and the other way around). The Target abstraction
itself comes with a few more data types that describe various aspects
of the target system, including file formats, ABI, floating-point
ABI (FABI), endianness, which is no longer limited to the binary
choice of little and big endianness, and an extensible data type for
storing target-specific options.

Finally, all targets are organized into hierarchies and families, which
helps in taming the vast zoo of computer architectures and
devices.
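
One way to picture such a hierarchy is a parent link on every target
plus a membership test that walks the chain of ancestors. The sketch
below is self-contained and does not use BAP; the `target` record and
the `declare` and `belongs` functions are hypothetical names used only
for illustration.

```ocaml
(* Hypothetical model of a target hierarchy: each target optionally
   refines a parent target. *)
type target = { name : string; parent : target option }

let declare ?parent name = { name; parent }

(* [belongs family t] is true if [family] is [t] itself or one of
   its ancestors. *)
let rec belongs family t =
  t.name = family.name ||
  (match t.parent with
   | None -> false
   | Some p -> belongs family p)

let arm    = declare "arm"
let armv7  = declare ~parent:arm "armv7"
let armv7a = declare ~parent:armv7 "armv7-a"
let mips   = declare "mips"
```

An analysis written for the whole `arm` family would accept `armv7a`
through `belongs arm armv7a`, while code that needs a specific
refinement can test for `armv7` instead.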

Target.t is an abstract, self-describing data type that carries enough
information to describe all the details of the architecture. We also
provide four library modules, for arm, mips, powerpc, and x86, that
expose the currently declared targets.

Our LLVM backend is not yet precise enough to recognize many of the
supported targets, and we don't yet have analyses that infer the target
from the binary, but we will add the `--target` option everywhere in
the next PRs (once we switch to Target.t).

As usual, comments, questions, and reviews are very welcome.
ivg added a commit to ivg/bap that referenced this pull request Sep 25, 2020
ivg added a commit that referenced this pull request Sep 25, 2020
@ivg ivg deleted the compartmentalize-projects branch December 1, 2021 19:13
Successfully merging this pull request may close these issues.

the base address is not respect by many components