Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 4 changed files with 361 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,118 @@ | ||
# Capstone Architecture overview | ||
|
||
## Architecture of Capstone | ||
|
||
TODO | ||
|
||
## Architecture of a Module | ||
|
||
An architecture module is split into two components. | ||
|
||
1. The disassembler logic, which decodes bytes to instructions. | ||
2. The mapping logic, which maps the result from component 1 to | ||
a Capstone internal representation and adds additional detail. | ||
|
||
### Component 1 - Disassembler logic | ||
|
||
The disassembler logic consists exclusively of code from LLVM. | ||
It uses: | ||
|
||
- Generated state machines, enums and the like for instruction decoding. | ||
- Handwritten disassembler logic for decoding instruction operands | ||
and controlling the decoding procedure. | ||
|
||
### Component 2 - Mapping logic | ||
|
||
The mapping component has three different tasks: | ||
|
||
1. Serving as a programmable interface between the Capstone core and the LLVM code. | ||
2. Mapping instructions decoded by LLVM to Capstone instructions. | ||
3. Adding detail to the Capstone instructions | ||
(e.g. operand `read/write` attributes). | ||
|
||
### Instruction representation | ||
|
||
There exist two structs which represent an instruction: | ||
|
||
- `MCInst`: The LLVM representation of an instruction. | ||
- `cs_insn`: The Capstone representation of an instruction. | ||
|
||
The `MCInst` is used by the disassembler component for storing the decoded instruction. | ||
The mapping component, on the other hand, uses the `MCInst` to populate the `cs_insn`. | ||
|
||
The `cs_insn` is meant to be used by the Capstone core. | ||
It is distinct from the `MCInst`: it uses different instruction identifiers, a different operand representation, | ||
and holds more details about an instruction. | ||
|
||
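To make the distinction concrete, here is a minimal Python sketch of the two representations and the direction of data flow between them. The classes `ToyMCInst` and `ToyCsInsn`, their fields, and the `OPCODE_TO_CS_ID` table are hypothetical simplifications made up for this illustration; they do not reflect the real `MCInst` or `cs_insn` layouts.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMCInst:
    """Hypothetical stand-in for the LLVM-side instruction."""
    opcode: int                                     # LLVM-internal opcode number
    operands: list = field(default_factory=list)    # raw decoded operand values


@dataclass
class ToyCsInsn:
    """Hypothetical stand-in for the Capstone-side instruction."""
    insn_id: int                                    # Capstone instruction identifier
    mnemonic: str = ""
    op_details: list = field(default_factory=list)  # (value, access) pairs


# Assumed mapping table: LLVM opcode -> (Capstone id, mnemonic).
OPCODE_TO_CS_ID = {17: (3, "add"), 42: (9, "ldr")}


def map_to_cs(mc: ToyMCInst) -> ToyCsInsn:
    """Populate the Capstone-side struct from the LLVM-side struct."""
    insn_id, mnemonic = OPCODE_TO_CS_ID[mc.opcode]
    insn = ToyCsInsn(insn_id=insn_id, mnemonic=mnemonic)
    # Toy rule for this sketch only: operand 0 is written, the rest are read.
    for i, value in enumerate(mc.operands):
        insn.op_details.append((value, "write" if i == 0 else "read"))
    return insn
```

The point of the sketch is only that the identifiers differ between the two structs and that the extra detail exists solely on the Capstone side.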
### Disassembling process | ||
|
||
There are two steps in disassembling an instruction. | ||
|
||
1. Decoding bytes to an `MCInst`. | ||
2. Decoding the assembler string for the `MCInst` and, in the same step, mapping it to a `cs_insn`. | ||
|
||
Here is a condensed explanation of these steps. | ||
|
||
**Step 1** | ||
|
||
``` | ||
Forward to | ||
getInstr(bytes) ┌───┐LLVM code ┌─────────┐ ┌──────────┐ | ||
┌──────────────────►│ A ├────────────► │ ├───────────►│ ├────┐ | ||
│ │ R │ │ LLVM │ │ LLVM │ │ Decode | ||
│ │ C │ │ │ │ │ │ Instr. | ||
│ │ H │ │ │decode(Op0) │ │◄───┘ | ||
┌────────┐ disasm(bytes) ┌──────────┴──┐ │ │ │ Disass- │ ◄──────────┤ Decoder │ | ||
│CS Core ├──────────────►│ ARCH Module │ │ │ │ embler ├──────────► │ State │ | ||
└────────┘ └─────────────┘ │ M │ │ │ │ Machine │ | ||
▲ │ A │ │ │decode(Op1) │ │ | ||
│ │ P │ │ │ ◄──────────┤ │ | ||
│ │ P │ │ ├──────────► │ │ | ||
│ │ I │ │ │ │ │ | ||
│ │ N │ │ │ │ │ | ||
└───────────────────┤ G │◄─────────────┤ │◄───────────┤ │ | ||
└───┘ └─────────┘ └──────────┘ | ||
``` | ||
|
||
In the first decoding step the instruction bytes get forwarded to the | ||
decoder state machine. | ||
After the instruction has been identified, the state machine calls decoder functions | ||
for each operand to extract the operand values from the bytes. | ||
|
||
The disassembler and the state machine are equivalent to what `llvm-objdump` uses | ||
(in fact they use the same files, except we translated them from C++ to C). | ||
|
||
**Step 2** | ||
|
||
``` | ||
printInst( | ||
MCInst, | ||
┌───┐ asm_buf) ┌────────┐ ┌──────────┐ | ||
┌───────────►│ A ├──────────────► │ ├───────────►│ ├──────┐ | ||
│ │ R │ │ LLVM │ │ LLVM │ │ Decode | ||
│ │ C │ add_cs_detail │ │ │ │ │ Mnemonic | ||
│ │ H │ (Op0) │ │ print(Op0) │ │◄─────┘ | ||
│ │ │ ◄──────────────┤ │ ◄──────────┤ │ | ||
printer(MCInst, │ │ ├──────────────► │ ├──────────► │ Asm- │ | ||
┌────────┐ asm_buf)┌──────────┴──┐ │ │ │ Inst │ │ Writer │ | ||
│CS Core ├────────────────►│ ARCH Module │ │ │ │ Printer│ │ State │ | ||
└────────┘ └─────────────┘ │ M │ add_cs_detail │ │ │ Machine │ | ||
▲ │ A │ (Op1) │ │ print(Op1) │ │ | ||
│ │ P │ ◄──────────────┤ │ ◄──────────┤ │ | ||
│ │ P ├──────────────► │ ├──────────► │ │ | ||
│ │ I │ │ │ │ │ | ||
│ │ N │ │ │ │ │ | ||
└────────────┤ G │◄───────────────┤ │◄───────────┤ │ | ||
└───┘ └────────┘ └──────────┘ | ||
``` | ||
|
||
The second decoding step passes the `MCInst` and a buffer to the printer. | ||
|
||
After determining the mnemonic, each operand is printed by using | ||
functions defined in the `InstPrinter`. | ||
|
||
Each time an operand is printed, the mapping component is called | ||
to populate the `cs_insn` with the operand information and details. | ||
|
||
Again, the `InstPrinter` and `AsmWriter` are code translated from LLVM, | ||
and therefore mirror the behavior of `llvm-objdump`. |
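
The two steps can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the one-byte opcode encoding, the `MNEMONICS` table and the function names other than `add_cs_detail` are invented, and the real decoder and printer are the translated LLVM state machines described above.

```python
# Toy encoding (an assumption for this sketch): byte 0 selects the
# instruction, every following byte holds one register operand.
MNEMONICS = {0x01: "add", 0x02: "mov"}


def decode_reg(b: int) -> int:
    """Operand decoder: keep the low 4 bits as the register number."""
    return b & 0x0F


def decode(raw: bytes) -> dict:
    """Step 1: decode bytes into an MCInst-like dict."""
    opcode = raw[0]
    if opcode not in MNEMONICS:
        raise ValueError(f"unknown opcode {opcode:#x}")
    # One decoder call per operand, as the state machine would issue.
    return {"opcode": opcode, "operands": [decode_reg(b) for b in raw[1:]]}


def print_inst(mc: dict, add_cs_detail) -> str:
    """Step 2: build the assembly text and report each operand to the mapping."""
    parts = []
    for index, reg in enumerate(mc["operands"]):
        add_cs_detail(index, reg)  # mapping hook, called once per printed operand
        parts.append(f"r{reg}")
    return f"{MNEMONICS[mc['opcode']]} " + ", ".join(parts)


def disassemble(raw: bytes):
    """Run both steps and return (asm_text, collected operand details)."""
    details = []
    mc = decode(raw)
    text = print_inst(mc, lambda i, reg: details.append({"index": i, "reg": reg}))
    return text, details
```

Note how the operand details are collected as a side effect of printing: this is why the assembler string and the `cs_insn` content are produced in the same step.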
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,190 @@ | ||
# Auto-Sync | ||
|
||
`auto-sync` is the update tool for Capstone. | ||
Its purpose is to automate as many steps as possible in the update | ||
procedure. | ||
|
||
You can find it in `suite/auto-sync`. | ||
|
||
This document is split into four parts. | ||
|
||
1. An overview of the update process and which subcomponents of `auto-sync` do what. | ||
2. Instructions on how to update an architecture which already supports `auto-sync`. | ||
3. Instructions on how to refactor an architecture to use `auto-sync`. | ||
4. Notes about how to add a new architecture to Capstone with `auto-sync`. | ||
|
||
Please read the section about architecture module design in | ||
[ARCHITECTURE.md](ARCHITECTURE.md) before proceeding. | ||
The architectural understanding is important for the following. | ||
|
||
## Update procedure | ||
|
||
As already described in the `ARCHITECTURE` document, Capstone uses translated | ||
and generated source code from LLVM. | ||
|
||
Because LLVM is written in C++ and Capstone in C, the update process is | ||
internally complicated, but almost completely automated. | ||
|
||
`auto-sync` categorizes source files of a module into three groups. Each group is updated differently. | ||
|
||
| File type | Update method | Edits by hand | | ||
|-----------------------------------|----------------------|------------------------| | ||
| Generated files | Generated by patched LLVM backends | Never/Not allowed | | ||
| Translated LLVM C++ files | `CppTranslator` and `Differ` | Only changes which are too complicated for automation. | | ||
| Capstone files | By hand | All | | ||
|
||
Let's look at the update procedure for each group in detail. | ||
|
||
**Generated files** | ||
|
||
Generated files always have the file extension `.inc`. | ||
|
||
There are generated files for the LLVM code and for Capstone. They can be distinguished by their names: | ||
|
||
- For Capstone: `<ARCH>GenCS<NAME>.inc`. | ||
- For LLVM code: `<ARCH>Gen<NAME>.inc`. | ||
|
||
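The naming convention can be checked mechanically. The following Python sketch classifies a file name according to the two patterns above; the function name `classify_inc` and the example file names in the usage note are illustrative assumptions, not part of `auto-sync`.

```python
import re

# <ARCH>GenCS<NAME>.inc -> generated for Capstone
# <ARCH>Gen<NAME>.inc   -> generated for the LLVM code
# Caveat of this sketch: an LLVM file whose <NAME> itself starts with "CS"
# would be misclassified as a Capstone file.
_INC_NAME = re.compile(
    r"^(?P<arch>[A-Za-z0-9]+?)Gen(?P<cs>CS)?(?P<name>[A-Za-z0-9]+)\.inc$"
)


def classify_inc(filename: str):
    """Return (arch, consumer, name), or None if the name is not a generated file."""
    m = _INC_NAME.match(filename)
    if m is None:
        return None
    consumer = "capstone" if m.group("cs") else "llvm"
    return m.group("arch"), consumer, m.group("name")
```

For example, `classify_inc("ARMGenCSInsnEnum.inc")` yields `("ARM", "capstone", "InsnEnum")`, while `classify_inc("ARMGenAsmWriter.inc")` yields `("ARM", "llvm", "AsmWriter")`.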
The files are generated by refactored [LLVM TableGen emitter backends](https://github.com/Rot127/llvm-capstone/tree/dev/llvm/utils/TableGen). | ||
|
||
The procedure looks roughly like this: | ||
|
||
``` | ||
┌──────────┐ | ||
1 2 3 4 │CS .inc │ | ||
┌───────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐ ┌─►│files │ | ||
│ .td │ │ │ │ │ │ Code- │ │ └──────────┘ | ||
│ files ├────►│ TableGen ├────►│ CodeGen ├────►│ Emitter ├──┤ | ||
└───────┘ └──────┬────┘ └───────────┘ └──────────┘ │ ┌──────────┐ | ||
│ ▲ └─►│LLVM .inc │ | ||
└─────────────────────────────────┘ │files │ | ||
└──────────┘ | ||
``` | ||
|
||
|
||
1. LLVM architectures are defined in `.td` files. They describe instructions, operands, | ||
features and other properties of an architecture. | ||
|
||
2. [LLVM TableGen](https://llvm.org/docs/TableGen/index.html) parses these files | ||
and converts them to an internal representation. | ||
|
||
3. Next, a TableGen component called [CodeGen](https://llvm.org/docs/CodeGenerator.html) | ||
abstracts these properties even further. | ||
The result is a representation which is _not_ specific to any architecture | ||
(e.g. the `CodeGenInstruction` class can represent a machine instruction of any architecture). | ||
|
||
4. The `Code-Emitter` uses the abstract representation of the architecture (provided by `CodeGen`) to | ||
generate state machines for instruction decoding. | ||
Architecture specific information (think of register names, operand properties etc.) | ||
is taken from `TableGen`'s internal representation. | ||
|
||
The result is emitted to `.inc` files. Those are included in the translated C++ files or Capstone code where necessary. | ||
|
||
**Translation of LLVM C++ files** | ||
|
||
We use two tools to translate C++ to C files. | ||
|
||
First the `CppTranslator` and afterward the `Differ`. | ||
|
||
The `CppTranslator` parses the C++ files and patches C++ syntax | ||
with its equivalent C syntax. | ||
|
||
_Note_: For details about this, check out `suite/auto-sync/CppTranslator/README.md`. | ||
|
||
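The idea of patching C++ syntax with C syntax can be illustrated with a small Python sketch. It is not the `CppTranslator`'s implementation (see its README for that); the regex-based `PATCHES` list and the `translate` function are hypothetical and only show the general shape of a rewrite pass.

```python
import re

# Each hypothetical patch is a (pattern, replacement) pair.
PATCHES = [
    # static_cast<T>(expr)  ->  (T)(expr)
    (re.compile(r"static_cast<([^<>]+)>\("), r"(\1)("),
    # nullptr  ->  NULL
    (re.compile(r"\bnullptr\b"), "NULL"),
    # MCInst &MI  ->  MCInst *MI
    (re.compile(r"\bMCInst\s*&\s*(\w+)"), r"MCInst *\1"),
]


def translate(cpp_source: str) -> str:
    """Apply every patch in order and return the patched source text."""
    c_source = cpp_source
    for pattern, replacement in PATCHES:
        c_source = pattern.sub(replacement, c_source)
    return c_source
```

The last patch also shows why such a pass cannot be complete: after turning a reference parameter into a pointer, every `MI.member` access in the function body would still have to become `MI->member`.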
Because the result of the `CppTranslator` is not perfect, | ||
we still have many syntax problems left. | ||
|
||
Those need to be fixed by hand. | ||
In order to ease this process we run the `Differ` after the `CppTranslator`. | ||
|
||
The `Differ` parses each _translated_ file and the corresponding source file _currently_ used in Capstone. | ||
It then compares specific nodes from the just translated file to the equivalent nodes in the old file. | ||
|
||
The user can choose whether to accept the version from the translated file or the one from the old file. | ||
This decision is saved for every node. | ||
If a saved decision exists for a node, it is automatically applied again. | ||
|
||
Every other syntax error must be solved manually. | ||
|
||
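The decision mechanism can be sketched as follows. This is a simplified model under stated assumptions, not the `Differ`'s actual code: nodes are modeled as a plain mapping from a name to source text, decisions are keyed by a hash of both versions, and `node_key`, `merge_nodes` and the `ask` callback are hypothetical names.

```python
import hashlib


def node_key(name: str, old_text: str, new_text: str) -> str:
    """Identify a decision by the node name and both versions of its text."""
    digest = hashlib.sha256((old_text + "\0" + new_text).encode()).hexdigest()
    return f"{name}:{digest[:16]}"


def merge_nodes(old_nodes: dict, new_nodes: dict, saved: dict, ask) -> dict:
    """Choose a version for every node of the new file, reusing saved decisions.

    old_nodes / new_nodes map a node name (e.g. a function name) to its text.
    saved maps node keys to "old" or "new" and is updated in place.
    ask(name, old_text, new_text) returns "old" or "new" for undecided nodes.
    """
    result = {}
    for name, new_text in new_nodes.items():
        old_text = old_nodes.get(name)
        if old_text is None or old_text == new_text:
            result[name] = new_text  # nothing to decide
            continue
        key = node_key(name, old_text, new_text)
        if key not in saved:
            saved[key] = ask(name, old_text, new_text)  # prompt only once
        result[name] = old_text if saved[key] == "old" else new_text
    return result
```

If `saved` is persisted between runs, a second run over the same pair of files asks no questions at all, which is the behavior described above.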
## Update an architecture | ||
|
||
To update an architecture do the following: | ||
|
||
Rebase `llvm-capstone` onto the new LLVM release (if not already done). | ||
``` | ||
# 1. Clone Capstone's LLVM | ||
git clone https://github.com/capstone-engine/llvm-capstone | ||
cd llvm-capstone | ||
# 2. Rebase onto the new LLVM release and resolve the conflicts. | ||
# 3. Build tblgen | ||
mkdir build | ||
cd build | ||
cmake -G Ninja -DLLVM_TARGETS_TO_BUILD=<ARCH> -DCMAKE_BUILD_TYPE=Debug ../llvm | ||
cmake --build . --target llvm-tblgen --config Debug | ||
# 4. Run git log and copy the hash of the release commit for the next step. | ||
git log | ||
# 5. Run the updater | ||
cd ../../suite/auto-sync/ | ||
mkdir build | ||
cd build | ||
../Update-Arch.sh <ARCH> <PATH-TO-LLVM> <LLVM-RELEASE_HASH> | ||
``` | ||
|
||
The update script will execute the steps described above and copy the new files to their directories. | ||
|
||
Afterward try to build Capstone and fix any build errors left. | ||
|
||
If new instructions or operands were added, add test cases for those | ||
(regression tests for instructions are located in `suite/MC/`). | ||
|
||
TODO: Operand and detail tests | ||
<!-- TODO: Wait until `cstest` is rewritten and add description about operand testing. --> | ||
|
||
## Refactor an architecture for `auto-sync` | ||
|
||
To refactor an architecture to use `auto-sync`, you need to add it to the configuration. | ||
|
||
1. Add the architecture to the supported architectures list in `Update-Arch.sh`. | ||
2. Configure the `CppTranslator` for your architecture (`suite/auto-sync/CppTranslator/arch_config.json`) | ||
|
||
Now, manually run the update commands within `Update-Arch.sh` but *skip* the `Differ` step. | ||
|
||
The task after this is to: | ||
|
||
- Replace leftover C++ syntax with its C equivalent. | ||
- Implement the `add_cs_detail()` handler in `<ARCH>Mapping` for each operand type. | ||
- Add any missing logic to the translated files. | ||
- Make it build and write tests. | ||
- Run the `Differ` again and always select the old nodes. | ||
|
||
**Notes:** | ||
|
||
- If you find yourself fixing the same syntax error multiple times, | ||
please consider adding a `Patch` to the `CppTranslator` for this case. | ||
|
||
- Please check out the implementation of ARM's `add_cs_detail()` before implementing your own. | ||
|
||
- Running the `Differ` after everything is done preserves your version of the syntax corrections, so the next user can auto-apply them. | ||
|
||
- Sometimes the LLVM code uses a single function from a larger source file. | ||
It is not worth it to translate the whole file just for this function. | ||
Bundle those lonely functions in `<ARCH>DisassemblerExtension.c`. | ||
|
||
- Some generated enums must be included in the `include/capstone/<ARCH>.h` header. | ||
At the position where the enum should be inserted, add a comment like this (don't remove the `<>` brackets): | ||
|
||
``` | ||
// generate content <FILENAME.inc> begin | ||
// generate content <FILENAME.inc> end | ||
``` | ||
|
||
The update script will insert the content of the `.inc` file at this place. | ||
|
||
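What such an insertion between the two marker comments could look like is sketched below in Python. The function `insert_generated` is hypothetical and is not taken from the update script; in particular, replacing whatever currently sits between the markers is an assumption of this sketch.

```python
def insert_generated(header: str, inc_name: str, inc_content: str) -> str:
    """Place inc_content between the begin/end markers for inc_name.

    Anything currently between the markers is replaced, so running the
    function twice with the same input does not duplicate the content.
    """
    begin = f"// generate content <{inc_name}> begin"
    end = f"// generate content <{inc_name}> end"
    out, skipping, seen_begin = [], False, False
    for line in header.splitlines():
        stripped = line.strip()
        if stripped == begin:
            out.append(line)
            out.extend(inc_content.splitlines())
            skipping, seen_begin = True, True
        elif stripped == end and skipping:
            out.append(line)
            skipping = False
        elif not skipping:
            out.append(line)  # lines outside the markers stay untouched
    if not seen_begin or skipping:
        raise ValueError(f"marker pair for {inc_name} not found")
    return "\n".join(out) + "\n"
```

Matching on the full marker text, including the `<>` brackets, is the reason the brackets must not be removed from the comment.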
## Adding a new architecture | ||
|
||
Adding a new architecture follows the same steps as above, with the exception that you need | ||
to implement all the Capstone files from scratch. | ||
|
||
Check out an architecture that already supports `auto-sync` for guidance, and open an issue if you need help. |