enables multiple projects in the same knowledge base #1119
Merged
ivg
merged 67 commits into
BinaryAnalysisPlatform:master
from
ivg:compartmentalize-projects
Jul 15, 2020
@gitoleg, take a closer look at the stub-resolver, as I had to rewrite it substantially.
or in the user package if one is specified. Also, demystifies the program objects and documents explicitly how they are formed.
We assumed by default that all our information sources (rooters, symbolizers, and branchers) were unbiased, but this was not true for the sources that we create from images (using the corresponding of_image), which were already operating on biased information from the loader. To fix this issue, we added a hidden parameter that marks an information source as biased or unbiased, and we perform bias subtraction (and also addition when the destinations are provided) based on this parameter. We may later make it public, but so far it is only set for information sources that are derived from images. Also, the code common to the information sources was factored out into the Bap_disasm_source module.
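The bias handling described above can be illustrated with a minimal, self-contained sketch. It is not BAP's implementation: the `source` record, its `biased` flag, and the `unbiased_roots` and `rebias` helpers are hypothetical names, addresses are plain `int64` values, and the direction of the correction is an assumption made for the illustration.

```ocaml
(* Hypothetical model of an information source. It does not use the BAP API. *)
type source = {
  name : string;
  biased : bool;        (* true if the addresses already include the load bias *)
  roots : int64 list;   (* function starts reported by the source *)
}

(* Bring the addresses of a source into the unbiased address space.
   Only sources marked as biased are corrected. *)
let unbiased_roots ~bias src =
  if src.biased
  then List.map (fun addr -> Int64.sub addr bias) src.roots
  else src.roots

(* Move an unbiased address (e.g., a destination) back by adding the bias. *)
let rebias ~bias addr = Int64.add addr bias

let () =
  let bias = 0x1000L in
  let sources = [
    {name = "image"; biased = true; roots = [0x401000L; 0x401040L]};
    {name = "user"; biased = false; roots = [0x400000L]};
  ] in
  List.iter (fun src ->
      List.iter (fun addr ->
          Printf.printf "%s: 0x%Lx\n" src.name addr)
        (unbiased_roots ~bias src))
    sources;
  Printf.printf "rebiased: 0x%Lx\n" (rebias ~bias 0x400000L)
```

Keeping the flag on the source, rather than asking each caller to remember which sources are image-derived, is what lets the correction be applied uniformly in one place.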
We are now able to query the bitness without having to pull in the Bap.Std interface so we can implement everything neatly.
and uses it for the rec, as far as I can grep, I don't see any information sources that are not properly scoped, either by limiting them to the path or to a function.
also exports the full toplevel interface
The [for_file] function now also sets the path (the same as the corresponding [for_addr] function sets the address). The [for_region] function now interns the boundaries in the current package and builds the final symbol from their concatenation to enable intersecting regions from different files in the same knowledge base.
also establishes equalities between it and its re-export in Bap.Std and adds a convenience alias in the Bap_main module for the loggers.
ivg
added a commit
to ivg/bap
that referenced
this pull request
Aug 17, 2020
Implements support for various relocations and improves the existing support, which enables us to pass all tests without relying on external symbols or tools such as objdump or radare2.

This branch supports PLT-like relocations, as well as direct calls with GLOB_DAT relocations (fixes BinaryAnalysisPlatform#1135). The PLT entries are constant folded and memory references are then analyzed. We also extended the analysis that detects stub functions to support various ABIs and file formats. For PowerPC MachO, which stores stubs directly in the text section, we implemented a signature matching procedure to reliably detect the stubs. We also significantly improved support for MIPS, which was suffering from missing function starts that correspond to the stubbed functions, as byteweight is unable to detect these stubs.

In addition, this PR brings a new library called Bap_relation, a bidirectional mapping that is useful for storing the addr <-> name mapping and ensuring that it is a bijection. This library is now used explicitly or implicitly (via the old symbolizer interface) by all our providers of symbolic information. This change prevents symbolizers from providing conflicting information, which may later lead to knowledge base conflicts.

We also removed, for now, the name-to-address translation service that we recently introduced in BinaryAnalysisPlatform#1119. We are not ready for this service yet (our knowledge base does not have enough rules stored in it), and without this rule we can disassemble 25% faster.

There are also a couple of minor fixes and quality of life improvements:
- fixes Insn.dests domain functions
- a better default for the KB.Domain.Powerset inspect parameter
- makes glibc-runtime heuristic more aggressive
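To make the idea of an addr <-> name mapping that stays a bijection concrete, here is a minimal sketch built on the OCaml standard library only. It does not reproduce the Bap_relation interface: the record type, `add`, `name_of`, and `addr_of` are hypothetical names, and rejecting a conflicting pair with `Error` is just one possible policy chosen for the illustration.

```ocaml
module Addr_map = Map.Make (Int64)
module Name_map = Map.Make (String)

(* Two maps kept in sync, one per direction. *)
type t = {
  names : string Addr_map.t;   (* addr -> name *)
  addrs : int64 Name_map.t;    (* name -> addr *)
}

let empty = { names = Addr_map.empty; addrs = Name_map.empty }

(* Adds a pair only if the mapping stays one-to-one in both directions.
   Re-adding an existing pair is a no-op; anything else is a conflict. *)
let add addr name rel =
  match Addr_map.find_opt addr rel.names,
        Name_map.find_opt name rel.addrs with
  | None, None ->
    Ok { names = Addr_map.add addr name rel.names;
         addrs = Name_map.add name addr rel.addrs }
  | Some n, Some a when String.equal n name && Int64.equal a addr -> Ok rel
  | _ -> Error (addr, name)

let name_of addr rel = Addr_map.find_opt addr rel.names
let addr_of name rel = Name_map.find_opt name rel.addrs

let () =
  let rel = match add 0x400000L "main" empty with
    | Ok rel -> rel
    | Error _ -> empty in
  (* A second provider tries to give the same address a different name. *)
  match add 0x400000L "start" rel with
  | Ok _ -> print_endline "added"
  | Error (addr, name) ->
    Printf.printf "conflict: 0x%Lx <-> %s\n" addr name
```

Detecting the conflict at insertion time is what keeps two providers from later storing contradictory facts about the same address or the same name.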
ivg
added a commit
to ivg/bap
that referenced
this pull request
Sep 24, 2020
Introduces Theory.Target.t that supersedes Bap.Std.Arch.t. The old representation suffered from a few problems that we inherited from LLVM. The main issue is that Arch.t is not extensible, and in order to add a new architecture the Bap.Std code has to be changed in a backward-compatibility-breaking manner. Arch.t is also unable to represent the whole variety of computing devices, which is especially relevant to micro-controllers (AVR, PIC) and IoT devices, on which we are currently focusing. Finally, Arch.t is not precise enough to capture the information that is necessary for code generation, a new avenue that we are currently exploring.

As the first attempt, which didn't really work, we introduced arch, sub, and other properties to the `core-theory:unit` class in BinaryAnalysisPlatform#1119. The problem with that approach was the stringly typed interface, as `arch` was represented as a simple string. In addition, the proposed properties weren't able to describe uncommon architectures. Finally, it was very awkward to use: all fields were optional with no good defaults.

This is the second attempt and it will be split into several pull requests. The first PR, this one, introduces Theory.Target.t but still keeps Arch.t alive, i.e., it is used by all internal and external components of BAP. This is to ensure that switching to Target.t doesn't break any existing code. The subsequent pull requests will gradually deprecate functions that use Arch.t and switch to Target.t everywhere. The most important switch will affect the disassembler/decoder framework, which is currently still stuck on Arch.t. Just to be clear, after this work is finished and until BAP 3.0, and maybe even thereafter, Arch.t will still work as it used to work and no code will break or require updates. However, newly added architectures, such as AVR or PIC, i.e., those that could not be represented with Arch.t, will not be available for the code that still relies on it.

In addition to Theory.Target.t we add a few more abstractions and convenience functions, e.g., `Project.empty` and a completely new interface for Project.Input.t generation, which makes it easier to create projects from strings or other custom data, e.g., `Project.Input.from_string`. We also add Source, Language, and Compiler abstractions to the knowledge base Core Theory. These abstractions, together with Target, describe the full cycle of the program transformation using the compiler from source code in the given language to the program for the specified target (and the other way around).

The Target abstraction itself comes with a few more data types that describe various aspects of the target system, including file formats, ABI, floating-point ABI (FABI), endianness, which is no longer limited to the binary choice of little and big endianness, and an extensible data type for storing target-specific options. Finally, all targets are formed into hierarchies and families, which helps in controlling the vast zoo of computer architectures and devices. Target.t is an abstract, self-describing data type that includes enough information to describe all the details of the architecture.

We also provide four library modules, for arm, mips, powerpc, and x86, that expose the currently declared targets (there are about 80 targets currently and more will be added soonish, see `bap list targets` for the up-to-date list). Our LLVM backend is not yet precise enough to recognize many of the supported targets and we don't have analyses right now that will infer the target from the binary, but we will add the `--target` option everywhere in the next PRs (when we switch to Target.t). As usual, comments, questions, reviews are very welcome.
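The contrast between a closed architecture type and an abstract, extensible target with a parent hierarchy can be sketched as follows. This is an illustration only and not the Theory.Target interface: the `Target` module below, its `declare`, `name`, and `belongs` functions, and the target names are all hypothetical.

```ocaml
(* A closed variant: supporting AVR means editing this type
   and every match over it. *)
type arch = Arm | Mips | Ppc | X86

(* A hypothetical abstract target type: new targets are declared at
   runtime and may name a parent, which forms a hierarchy. *)
module Target : sig
  type t
  val declare : ?parent:t -> string -> t
  val name : t -> string
  val belongs : t -> t -> bool
end = struct
  type t = { name : string; parent : t option }

  let registry : (string, t) Hashtbl.t = Hashtbl.create 16

  (* Registers a new target and rejects duplicate names. *)
  let declare ?parent name =
    if Hashtbl.mem registry name
    then invalid_arg ("target already declared: " ^ name);
    let t = { name; parent } in
    Hashtbl.add registry name t;
    t

  let name t = t.name

  (* [belongs family t] holds if [t] is [family] or one of its descendants. *)
  let rec belongs family t =
    String.equal t.name family.name ||
    (match t.parent with
     | None -> false
     | Some p -> belongs family p)
end

let () =
  let arm = Target.declare "arm" in
  let armv7 = Target.declare ~parent:arm "armv7" in
  let avr = Target.declare "avr" in   (* no change to any existing type *)
  Printf.printf "%s in %s: %b\n"
    (Target.name armv7) (Target.name arm) (Target.belongs arm armv7);
  Printf.printf "%s in %s: %b\n"
    (Target.name avr) (Target.name arm) (Target.belongs arm avr)
```

Because the type is abstract, code that receives a target can only query it through the module's functions, so adding a target never breaks existing matches the way a new variant constructor would.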
ivg
added a commit
to ivg/bap
that referenced
this pull request
Sep 24, 2020
Introduces Theory.Target.t that superseeds Bap.Std.Arch.t. The old representation suffered from a few problems that we inherited from LLVM. The main issue is that Arch.t is not extensible and in order to add a new architecture the Bap.Std code shall be changed in a backward-compatibility-breaking manner. Arch.t is als unable to represent the whole variety of computing devices, which is especially relevant to micro-controllers (AVR, PIC) and IoT devices on which we are currently focusing. Finally, Arch.t is not precise enough to capture information that is necessary for code generation, the new venue that we are currently exploring. As the first attempt that didn't really work we introduced arch, sub, and other properties to the `core-theory:unit` class in BinaryAnalysisPlatform#1119. The problem with that approach was the stringly typed interface as `arch` was represented as a simple string. In addition, the proposed properties werent' able to describe uncommon architectures. Finally, it was very awkward to use, all fields were optional with no good defaults. This is the second attempt and it will be split into several pull requests. The first PR, this one, introduce the Theory.Target.t but still keeps Arch.t alive, i.e., it is used by all internal and external components of BAP. This is to ensure that switching to Target.t doesn't break any existing code. The consequent pull requests will gradually deprecated functions that use Arch.t and switch Target.t everywhere. The most important switch will affect the disassembler/decoder framework, which is currently still stuck on Arch.t. Just to be clear, after this work is finished and until BAP 3.0 and maybe even thereafter Arch.t will still work as it used to work and no code will break or require updates. However, newly added architectures, such as AVR or PIC, i.e., those that could not be represented with Arch.t will not be available for the code that still relies on it. 
In addition to Theory.Target.t we add a few more abstractions and convenience functions, e.g., `Project.empty` and a completely new interface for Project.Input.t generation, which makes it easier to create projects from strings or other custom data, e.g., `Project.Input.from_string` . We also add Source, Language, and Compiler abstractions to the knowledge base Core Theory. These abstractions, together with Target, describe the full cycle of the program transformation using the compiler from source code in the given language to the program for the specified target (and the other way around). The Target abstraction itself comes with a few more data types that describe various aspects of the target system, including file formats, ABI, floating-point ABI (FABI), endianness, which is no longer limited to the binary choice of little and big endianness, and an extensible data type for storing target-specific options. Finally, all targets are formed into hierarchies and families, which helps in controlling the vast zoo of computer architectures and devices. The Target.t is an abstract data type and is self-describing and includes enough information that describes all the details of the architecture. We also provide four library modules, for arm, mips, powerpc, and x86 that exposes the currenlty declared targets (there is about 80 targets currently and more will be added soonish, see `bap list targets` for the up-to-date list). Our LLVM backend is not yet precise enough to recongize many of the supported targets and we don't have analyses right now that will infer the target from the binary, but we will add the `--target` option in the next PRs (when we will switch to Target.t) everywhere. As usual, comments, questions, reviews are very welcome.
ivg
added a commit
to ivg/bap
that referenced
this pull request
Sep 24, 2020
Introduces Theory.Target.t that superseeds Bap.Std.Arch.t. The old representation suffered from a few problems that we inherited from LLVM. The main issue is that Arch.t is not extensible and in order to add a new architecture the Bap.Std code shall be changed in a backward-compatibility-breaking manner. Arch.t is als unable to represent the whole variety of computing devices, which is especially relevant to micro-controllers (AVR, PIC) and IoT devices on which we are currently focusing. Finally, Arch.t is not precise enough to capture information that is necessary for code generation, the new venue that we are currently exploring. As the first attempt that didn't really work we introduced arch, sub, and other properties to the `core-theory:unit` class in BinaryAnalysisPlatform#1119. The problem with that approach was the stringly typed interface as `arch` was represented as a simple string. In addition, the proposed properties werent' able to describe uncommon architectures. Finally, it was very awkward to use, all fields were optional with no good defaults. This is the second attempt and it will be split into several pull requests. The first PR, this one, introduce the Theory.Target.t but still keeps Arch.t alive, i.e., it is used by all internal and external components of BAP. This is to ensure that switching to Target.t doesn't break any existing code. The consequent pull requests will gradually deprecated functions that use Arch.t and switch Target.t everywhere. The most important switch will affect the disassembler/decoder framework, which is currently still stuck on Arch.t. Just to be clear, after this work is finished and until BAP 3.0 and maybe even thereafter Arch.t will still work as it used to work and no code will break or require updates. However, newly added architectures, such as AVR or PIC, i.e., those that could not be represented with Arch.t will not be available for the code that still relies on it. 
In addition to Theory.Target.t we add a few more abstractions and convenience functions, e.g., `Project.empty` and a completely new interface for Project.Input.t generation, which makes it easier to create projects from strings or other custom data, e.g., `Project.Input.from_string` . We also add Source, Language, and Compiler abstractions to the knowledge base Core Theory. These abstractions, together with Target, describe the full cycle of the program transformation using the compiler from source code in the given language to the program for the specified target (and the other way around). The Target abstraction itself comes with a few more data types that describe various aspects of the target system, including file formats, ABI, floating-point ABI (FABI), endianness, which is no longer limited to the binary choice of little and big endianness, and an extensible data type for storing target-specific options. Finally, all targets are formed into hierarchies and families, which helps in controlling the vast zoo of computer architectures and devices. The Target.t is an abstract data type and is self-describing and includes enough information that describes all the details of the architecture. We also provide four library modules, for arm, mips, powerpc, and x86 that exposes the currenlty declared targets. Our LLVM backend is not yet precise enough to recongize many of the supported targets and we don't have analyses right now that will infer the target from the binary, but we will add the `--target` option in the next PRs (when we will switch to Target.t) everywhere. As usual, comments, questions, reviews are very welcome.
ivg
added a commit
to ivg/bap
that referenced
this pull request
Sep 24, 2020
Introduces Theory.Target.t that superseeds Bap.Std.Arch.t. The old representation suffered from a few problems that we inherited from LLVM. The main issue is that Arch.t is not extensible and in order to add a new architecture the Bap.Std code shall be changed in a backward-compatibility-breaking manner. Arch.t is als unable to represent the whole variety of computing devices, which is especially relevant to micro-controllers (AVR, PIC) and IoT devices on which we are currently focusing. Finally, Arch.t is not precise enough to capture information that is necessary for code generation, the new venue that we are currently exploring. As the first attempt that didn't really work we introduced arch, sub, and other properties to the `core-theory:unit` class in BinaryAnalysisPlatform#1119. The problem with that approach was the stringly typed interface as `arch` was represented as a simple string. In addition, the proposed properties werent' able to describe uncommon architectures. Finally, it was very awkward to use, all fields were optional with no good defaults. This is the second attempt and it will be split into several pull requests. The first PR, this one, introduce the Theory.Target.t but still keeps Arch.t alive, i.e., it is used by all internal and external components of BAP. This is to ensure that switching to Target.t doesn't break any existing code. The consequent pull requests will gradually deprecated functions that use Arch.t and switch Target.t everywhere. The most important switch will affect the disassembler/decoder framework, which is currently still stuck on Arch.t. Just to be clear, after this work is finished and until BAP 3.0 and maybe even thereafter Arch.t will still work as it used to work and no code will break or require updates. However, newly added architectures, such as AVR or PIC, i.e., those that could not be represented with Arch.t will not be available for the code that still relies on it. 
In addition to Theory.Target.t we add a few more abstractions and convenience functions, e.g., `Project.empty` and a completely new interface for Project.Input.t generation, which makes it easier to create projects from strings or other custom data, e.g., `Project.Input.from_string` . We also add Source, Language, and Compiler abstractions to the knowledge base Core Theory. These abstractions, together with Target, describe the full cycle of the program transformation using the compiler from source code in the given language to the program for the specified target (and the other way around). The Target abstraction itself comes with a few more data types that describe various aspects of the target system, including file formats, ABI, floating-point ABI (FABI), endianness, which is no longer limited to the binary choice of little and big endianness, and an extensible data type for storing target-specific options. Finally, all targets are formed into hierarchies and families, which helps in controlling the vast zoo of computer architectures and devices. The Target.t is an abstract data type and is self-describing and includes enough information that describes all the details of the architecture. We also provide four library modules, for arm, mips, powerpc, and x86 that exposes the currenlty declared targets. Our LLVM backend is not yet precise enough to recongize many of the supported targets and we don't have analyses right now that will infer the target from the binary, but we will add the `--target` option in the next PRs (when we will switch to Target.t) everywhere. As usual, comments, questions, reviews are very welcome.
ivg
added a commit
to ivg/bap
that referenced
this pull request
Sep 25, 2020
Introduces Theory.Target.t that superseeds Bap.Std.Arch.t. The old representation suffered from a few problems that we inherited from LLVM. The main issue is that Arch.t is not extensible and in order to add a new architecture the Bap.Std code shall be changed in a backward-compatibility-breaking manner. Arch.t is als unable to represent the whole variety of computing devices, which is especially relevant to micro-controllers (AVR, PIC) and IoT devices on which we are currently focusing. Finally, Arch.t is not precise enough to capture information that is necessary for code generation, the new venue that we are currently exploring. As the first attempt that didn't really work we introduced arch, sub, and other properties to the `core-theory:unit` class in BinaryAnalysisPlatform#1119. The problem with that approach was the stringly typed interface as `arch` was represented as a simple string. In addition, the proposed properties werent' able to describe uncommon architectures. Finally, it was very awkward to use, all fields were optional with no good defaults. This is the second attempt and it will be split into several pull requests. The first PR, this one, introduce the Theory.Target.t but still keeps Arch.t alive, i.e., it is used by all internal and external components of BAP. This is to ensure that switching to Target.t doesn't break any existing code. The consequent pull requests will gradually deprecated functions that use Arch.t and switch Target.t everywhere. The most important switch will affect the disassembler/decoder framework, which is currently still stuck on Arch.t. Just to be clear, after this work is finished and until BAP 3.0 and maybe even thereafter Arch.t will still work as it used to work and no code will break or require updates. However, newly added architectures, such as AVR or PIC, i.e., those that could not be represented with Arch.t will not be available for the code that still relies on it. 
In addition to Theory.Target.t we add a few more abstractions and convenience functions, e.g., `Project.empty` and a completely new interface for Project.Input.t generation, which makes it easier to create projects from strings or other custom data, e.g., `Project.Input.from_string` . We also add Source, Language, and Compiler abstractions to the knowledge base Core Theory. These abstractions, together with Target, describe the full cycle of the program transformation using the compiler from source code in the given language to the program for the specified target (and the other way around). The Target abstraction itself comes with a few more data types that describe various aspects of the target system, including file formats, ABI, floating-point ABI (FABI), endianness, which is no longer limited to the binary choice of little and big endianness, and an extensible data type for storing target-specific options. Finally, all targets are formed into hierarchies and families, which helps in controlling the vast zoo of computer architectures and devices. The Target.t is an abstract data type and is self-describing and includes enough information that describes all the details of the architecture. We also provide four library modules, for arm, mips, powerpc, and x86 that exposes the currenlty declared targets. Our LLVM backend is not yet precise enough to recongize many of the supported targets and we don't have analyses right now that will infer the target from the binary, but we will add the `--target` option in the next PRs (when we will switch to Target.t) everywhere. As usual, comments, questions, reviews are very welcome.
ivg
added a commit
that referenced
this pull request
Sep 25, 2020
Introduces Theory.Target.t that superseeds Bap.Std.Arch.t. The old representation suffered from a few problems that we inherited from LLVM. The main issue is that Arch.t is not extensible and in order to add a new architecture the Bap.Std code shall be changed in a backward-compatibility-breaking manner. Arch.t is als unable to represent the whole variety of computing devices, which is especially relevant to micro-controllers (AVR, PIC) and IoT devices on which we are currently focusing. Finally, Arch.t is not precise enough to capture information that is necessary for code generation, the new venue that we are currently exploring. As the first attempt that didn't really work we introduced arch, sub, and other properties to the `core-theory:unit` class in #1119. The problem with that approach was the stringly typed interface as `arch` was represented as a simple string. In addition, the proposed properties werent' able to describe uncommon architectures. Finally, it was very awkward to use, all fields were optional with no good defaults. This is the second attempt and it will be split into several pull requests. The first PR, this one, introduce the Theory.Target.t but still keeps Arch.t alive, i.e., it is used by all internal and external components of BAP. This is to ensure that switching to Target.t doesn't break any existing code. The consequent pull requests will gradually deprecated functions that use Arch.t and switch Target.t everywhere. The most important switch will affect the disassembler/decoder framework, which is currently still stuck on Arch.t. Just to be clear, after this work is finished and until BAP 3.0 and maybe even thereafter Arch.t will still work as it used to work and no code will break or require updates. However, newly added architectures, such as AVR or PIC, i.e., those that could not be represented with Arch.t will not be available for the code that still relies on it. 
In addition to Theory.Target.t, we add a few more abstractions and convenience functions, e.g., `Project.empty`, and a completely new interface for generating Project.Input.t, e.g., `Project.Input.from_string`, which makes it easier to create projects from strings or other custom data. We also add Source, Language, and Compiler abstractions to the knowledge base Core Theory. These abstractions, together with Target, describe the full cycle of program transformation by a compiler, from source code in a given language to the program for a specified target (and the other way around). The Target abstraction itself comes with a few more data types that describe various aspects of the target system, including file formats, ABI, floating-point ABI (FABI), endianness (which is no longer limited to the binary choice between little and big endian), and an extensible data type for storing target-specific options. Finally, all targets are organized into hierarchies and families, which helps in controlling the vast zoo of computer architectures and devices. Target.t is an abstract, self-describing data type that includes enough information to describe all the details of the architecture. We also provide four library modules, for arm, mips, powerpc, and x86, that expose the currently declared targets. Our LLVM backend is not yet precise enough to recognize many of the supported targets, and we don't yet have analyses that infer the target from the binary, but we will add the `--target` option everywhere in the next PRs (when we switch to Target.t). As usual, comments, questions, and reviews are very welcome.
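The extensibility argument in this commit message can be illustrated with a small, self-contained model. This is only a sketch and not the Theory.Target interface: the `Target` module below, its `declare`, `bits`, and `belongs` functions, and the target names are all hypothetical and exist only to contrast a closed variant with targets that are declared at runtime and arranged into a hierarchy.

```ocaml
(* A closed variant: supporting AVR means editing this type
   and every pattern match over it. *)
type arch = Arm | Mips | Ppc | X86

(* Hypothetical model of an extensible target registry (not BAP's API). *)
module Target : sig
  type t
  val declare : ?parent:t -> ?bits:int -> string -> t
  val name : t -> string
  val bits : t -> int option
  val belongs : t -> t -> bool
end = struct
  type t = { name : string; parent : t option; bits : int option }

  let registry : (string, t) Hashtbl.t = Hashtbl.create 16

  let declare ?parent ?bits name =
    if Hashtbl.mem registry name then
      invalid_arg ("target already declared: " ^ name);
    (* a child inherits the word size of its parent unless it overrides it *)
    let bits =
      match bits, parent with
      | Some _, _ -> bits
      | None, Some p -> p.bits
      | None, None -> None in
    let t = { name; parent; bits } in
    Hashtbl.add registry name t;
    t

  let name t = t.name
  let bits t = t.bits

  (* [belongs family t] is true if [t] is [family] or one of its descendants *)
  let rec belongs family t =
    t.name = family.name ||
    (match t.parent with
     | None -> false
     | Some p -> belongs family p)
end

(* New targets are added without touching any existing definition. *)
let avr = Target.declare ~bits:8 "avr"
let atmega = Target.declare ~parent:avr "avr-atmega328"

let () =
  assert (Target.belongs avr atmega);
  assert (not (Target.belongs atmega avr));
  assert (Target.bits atmega = Some 8)
```

In this model, code written against a family (here `avr`) keeps working when a new member is declared, which is the property the closed `arch` variant cannot offer.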
This PR paves the road to the global knowledge base which will be storing information about several binaries (executables, libraries, object files, etc). It also brings a few new functional and convenience features. The quick summary is below:
- `bap.std` package to `bap`
- `bap compare` command together with the new collators extension point
New features
The `compare` command and `callgraph-collator`
This is a new command that showcases the new ability to store multiple files, whether the same or different, in the knowledge base. It comes with a new extension point called collators (sorry, the names compare and comparator are already too overloaded in OCaml). A collator is an object that takes two projects and compares them as it sees fit. Its interface is designed in such a way that we never need to keep more than one project in resident memory, or to store information about more than two projects, when we compare several files. As a demonstration, we implemented a collator that compares N binaries by comparing their callgraphs.
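The memory property described above follows naturally if a collator is modeled as a fold over projects. The sketch below is a self-contained illustration under that assumption, not the actual collator interface: the `project` record, the `collator` record with its `prepare`, `collate`, and `summary` fields, and the `run` driver are all hypothetical names.

```ocaml
module Edges = Set.Make (struct
    type t = string * string
    let compare = compare
  end)

(* A stand-in for a project: a name and its call edges (caller, callee). *)
type project = { name : string; calls : (string * string) list }

(* A collator is a fold: [prepare] digests the base project, [collate]
   folds in the i-th compared project, and [summary] reports the result. *)
type ('s, 'r) collator = {
  prepare : project -> 's;
  collate : int -> 's -> project -> 's;
  summary : 's -> 'r;
}

(* Projects arrive as thunks, so each one is loaded, folded into the
   state, and can be dropped before the next one is loaded. *)
let run c base others =
  let init = c.prepare (base ()) in
  let step (i, s) load = (i + 1, c.collate i s (load ())) in
  let _, s = List.fold_left step (1, init) others in
  c.summary s

let edges p = Edges.of_list p.calls

(* Keeps only the base edges and, per project, the number of call
   edges that differ from the base. *)
let callgraph_collator = {
  prepare = (fun base -> (edges base, []));
  collate = (fun i (base, diffs) p ->
      let e = edges p in
      let diff =
        Edges.cardinal (Edges.diff base e) +
        Edges.cardinal (Edges.diff e base) in
      (base, (i, p.name, diff) :: diffs));
  summary = (fun (_, diffs) -> List.rev diffs);
}

let base () = { name = "a.out"; calls = ["main", "f"; "f", "g"] }
let other () = { name = "b.out"; calls = ["main", "f"; "f", "h"] }

let () =
  let report = run callgraph_collator base [base; other] in
  assert (report = [(1, "a.out", 0); (2, "b.out", 2)])
```

The state here holds a digest of the base project plus per-project results, so comparing N binaries never requires more than the base digest and the project currently being folded in.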
Knowledge Base Rules and their introspection
Procedures stored in the knowledge base are totally opaque, so it is hard to guess which property is provided by which procedure and what its dependencies are. To enable transparency, we now explicitly and statically describe all rules using the new rule description eDSL. The description doesn't affect the semantics and is purely informational. To get all the rules stored in the knowledge base, just type `bap list rules`. This is, for example, how symbolizers are described in terms of the knowledge base rules:
Implementation Details
The problem
When a project is created using the `Project.create` function, it uses the internal knowledge base and stores all the information about the disassembled binary in it. When the `Project.create` function is called a second time, it will most likely end up in a conflicting state, since we index objects by their virtual addresses and it is quite probable that two binaries will have different instructions at the same addresses. In addition, our old information sources, such as `branchers`, `rooters`, and `symbolizers`, albeit deprecated, still play an important role in our infrastructure. They reflected their information directly into the knowledge base every time a new file was opened, so if several binaries were opened in a row, the sources kept computing roots and names for all of them at the same time, which eventually led to conflicting information.
Solution
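We represent program labels as knowledge base symbols, which were interned in the `core-theory` package. Instead, we can intern them in the current package and set the current-package variable to a different value for different projects. That enables the same addresses to be distinguished if they came from different files.
Scoping information sources is a bit harder. First of all, we introduced `promising` and `providing` operations to the Knowledge interface that temporarily store procedures in the knowledge base. It is not, however, always possible to delimit the right scope. Moreover, sometimes it is very convenient when a procedure is stored in the knowledge base indefinitely and provides information even after the project is reconstructed, and maybe even after the knowledge base is persisted and loaded again.
To achieve these goals, we introduced the notion of code units. A code unit denotes a set of instructions that share common properties, and each instruction is attributed to a unit. From the unit, we can obtain information about the file name, target architecture, and even the programming language from which the instruction was compiled. A well-behaved information provider, when it computes some instruction property, can now look into its unit, figure out its origin, and check whether it matches its own source of information. For example, when the radare2 symbolizer is reading its symbols from the file named `foo` and the address comes from a unit that belongs to a file named `bar`, it should not provide any symbolic information about that address. For the old information sources, such as rooter, symbolizer, and brancher, we enabled this behavior automatically by adding the path property to the sources. When the source is provided to the knowledge base, the stored procedure checks whether the path in the source matches the origin path of the address.
The objdump plugin is not using the old interface anymore and is fully rewritten. The idea is to devise an interface for providing information that will be used instead of the old symbolizers and rooters. It is still a work in progress that we have cherry-picked from the branch that delivers Ghidra support. The new implementation of the objdump plugin is a proof of concept that operates fully from the knowledge base (it does not even depend on the Bap.Std interface). For each address, it obtains its unit, and if it doesn't yet have any information about that unit, it opens and parses the corresponding file and then provides the obtained information as usual. It also provides a new service that is dual to the service that our symbolizers provide: it is able to resolve names to addresses. This latter service uncovered a long-standing bug in BAP that had gone unnoticed. Namely, the `--llvm-base` option wasn't properly respected by all our information providers. When this option was used, our llvm loader was rebasing the binary and, as a result, all addresses were different from what other information sources, which rely on the original data, were seeing. In the end, objdump, radare, and ida were providing information for the real addresses, not for the shifted ones. That also led to multiple conflicts.
The real culprit was the original design, as we shouldn't have this option at the llvm level at all, but should instead handle it globally. But it is too late to change anything, so we provided a more general solution. We introduced the bias property to our unit class, which denotes that all addresses in this unit are biased with respect to the real addresses. Then we updated our information providers to respect that bias, both when we need to go from biased to real addresses and back from real to biased. To minimize the impact, we added automated handling of biases to our old information sources: rooter, symbolizer, and brancher. We assume that they all provide unbiased addresses and automatically subtract the bias before passing the addresses to them. Only the information sources that were obtained from the image are considered biased, so for them this extra correction is not needed. Right now we do not allow users to create explicitly biased sources, as we don't see a real need for that, but we may publish this interface later.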
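The path check and the bias correction described above can be sketched together in a few lines. This is a simplified, self-contained model, not the code of the knowledge base or of Bap_disasm_source: addresses are plain integers, and the `unit_info` and `source` records and the `name_of` and `addr_of` functions are hypothetical names chosen for the illustration.

```ocaml
(* A code unit: the file an address comes from and how far it was rebased. *)
type unit_info = { path : string; bias : int }

(* An old-style symbolizer, reduced to a table of (real address, name). *)
type source = {
  origin : string;                (* the file the source was built from *)
  biased : bool;                  (* true if it already uses biased addresses *)
  symbols : (int * string) list;
}

(* Name of the (biased) address [addr] that belongs to the unit [u]. *)
let name_of src u addr =
  if src.origin <> u.path then None       (* not our file: stay silent *)
  else
    (* unbiased sources expect real addresses, so subtract the bias *)
    let key = if src.biased then addr else addr - u.bias in
    List.assoc_opt key src.symbols

(* The dual service: resolve a name to an address in the unit [u],
   adding the bias back when the source speaks real addresses. *)
let addr_of src u name =
  if src.origin <> u.path then None
  else
    match List.find_opt (fun (_, n) -> n = name) src.symbols with
    | None -> None
    | Some (a, _) -> Some (if src.biased then a else a + u.bias)

let u = { path = "foo"; bias = 0x1000 }
let objdump = { origin = "foo"; biased = false; symbols = [0x400, "main"] }

let () =
  assert (name_of objdump u 0x1400 = Some "main");
  assert (addr_of objdump u "main" = Some 0x1400);
  (* the same address in a unit that belongs to another file gets no name *)
  assert (name_of objdump { u with path = "bar" } 0x1400 = None)
```

A source derived from the image would set `biased = true` in this model, so its addresses pass through unchanged in both directions.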
--llvm-base
option wasn't properly addressed by all our information providers. When this option was used, our llvm loader was rebasing the binary and, as a result, all addresses were different from what other information sources that rely on the original data were seeing. At the end, objdump, radare, and ida, were providing information for the real addresses not for the shifted. That also led to multiple conflicts.The real culprit was the original design, as we shouldn't have this option at all on the llvm level, but instead handle it globally. But it is too late to change anything so we provided a more general solution. We introduced the bias property to our unit class, which denotes that all addresses in this unit are biased with respect to the real addresses. Then we updated our information providers to respect that bias, both when we need to go from biased to real addresses and back from real to biased. To minimize the impact, we added automated handling of biases to our old information sources, rooter, symbolizer, and brancher. We assume that they are all providing unbiased addresses and automatically subtract the bias before passing the addresses to them. Only the information sources that were obtained from the image source are considered biased, so this extra correction is not needed. Right now we do not allow users to create explicit biased sources as we don't see the real need for that, but later we may publish this interface.