Add documentation about auto-sync.

capstone-engine · Apr 28, 2023 · 4780416 · 4780416
1 parent 5ffcb96
commit 4780416
Showing 4 changed files with 361 additions and 3 deletions.
diff --git a/HACK.TXT b/HACK.TXT
@@ -59,10 +59,32 @@ Coding style
 - C code follows Linux kernel coding style, using tabs for indentation.
 - Python code uses 4 spaces for indentation.
 
+Updating an Architecture
+------------------------
+
+The update tool for Capstone is called `auto-sync` and can be found in `suite/auto-sync`.
+
+Not all architectures are supported yet.
+Run `suite/auto-sync/Update-Arch.sh -h` to get a list of currently supported architectures.
+
+The documentation on how to update or refactor an architecture module with `auto-sync`
+can be found in [docs/AutoSync.md](docs/AutoSync.md).
+
+If a module does not support `auto-sync` yet, it is highly recommended to refactor it
+instead of attempting to update it manually.
+Refactoring takes less time and updates the module in the process.
+
+The one exception is `x86`, which is not compatible with `auto-sync` and
+currently can't be refactored to use it.
 
 Adding an architecture
 ----------------------
 
+If your architecture is supported in LLVM or one of its forks, you can use `auto-sync` to
+add the new module.
+
+<!-- TODO: Move this info to the auto-sync docs -->
+
 Obviously, you first need to write all the logic and put it in a new directory arch/newarch
 Then, you have to modify other files.
 (You can look for one architecture such as EVM in these files to get what you need to do)

diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
@@ -0,0 +1,118 @@
+# Capstone Architecture overview
+
+## Architecture of Capstone
+
+TODO
+
+## Architecture of a Module
+
+An architecture module is split into two components.
+
+1. The disassembler logic, which decodes bytes to instructions.
+2. The mapping logic, which maps the result from component 1 to
+a Capstone internal representation and adds additional detail.
+
+### Component 1 - Disassembler logic
+
+The disassembler logic consists exclusively of code from LLVM.
+It uses:
+
+- Generated state machines, enums and the like for instruction decoding.
+- Handwritten disassembler logic for decoding instruction operands
+and controlling the decoding procedure.
+
+### Component 2 - Mapping logic
+
+The mapping component has three different tasks:
+
+1. Serving as a programmable interface for the Capstone core to the LLVM code.
+2. Mapping LLVM decoded instructions to a Capstone instruction.
+3. Adding additional detail to the Capstone instructions
+(e.g. operand `read/write` attributes etc.).
+
+### Instruction representation
+
+There exist two structs which represent an instruction:
+
+- `MCInst`: The LLVM representation of an instruction.
+- `cs_insn`: The Capstone representation of an instruction.
+
+The `MCInst` is used by the disassembler component for storing the decoded instruction.
+The mapping component, on the other hand, uses the `MCInst` to populate the `cs_insn`.
+
+The `cs_insn` is meant to be used by the Capstone core.
+It is distinct from the `MCInst`: it uses different instruction identifiers, a different operand representation,
+and holds more details about an instruction.
+
+### Disassembling process
+
+There are two steps in disassembling an instruction.
+
+1. Decoding bytes to an `MCInst`.
+2. Decoding the assembler string for the `MCInst` AND mapping it to a `cs_insn` in the same step.
+
+Here is a boiled-down explanation of these steps.
+
+**Step 1**
+
+```
+                                                             Forward to               
+                                     getInstr(bytes)    ┌───┐LLVM code     ┌─────────┐            ┌──────────┐
+                                    ┌──────────────────►│ A ├────────────► │         ├───────────►│          ├────┐
+                                    │                   │ R │              │ LLVM    │            │ LLVM     │    │ Decode
+                                    │                   │ C │              │         │            │          │    │ Instr.
+                                    │                   │ H │              │         │decode(Op0) │          │◄───┘
+┌────────┐ disasm(bytes) ┌──────────┴──┐                │   │              │ Disass- │ ◄──────────┤ Decoder  │
+│CS Core ├──────────────►│ ARCH Module │                │   │              │ embler  ├──────────► │ State    │
+└────────┘               └─────────────┘                │ M │              │         │            │ Machine  │
+                                    ▲                   │ A │              │         │decode(Op1) │          │
+                                    │                   │ P │              │         │ ◄──────────┤          │
+                                    │                   │ P │              │         ├──────────► │          │
+                                    │                   │ I │              │         │            │          │
+                                    │                   │ N │              │         │            │          │
+                                    └───────────────────┤ G │◄─────────────┤         │◄───────────┤          │
+                                                        └───┘              └─────────┘            └──────────┘
+```
+
+In the first decoding step the instruction bytes get forwarded to the
+decoder state machine.
+After the instruction has been identified, the state machine calls decoder functions
+for each operand to extract the operand values from the bytes.
+
+The disassembler and the state machine are equivalent to what `llvm-objdump` uses
+(in fact they use the same files, except we translated them from C++ to C).
+
+**Step 2**
+
+```
+                                                        printInst(
+                                                               MCInst,
+                                                   ┌───┐       asm_buf) ┌────────┐            ┌──────────┐
+                                      ┌───────────►│ A ├──────────────► │        ├───────────►│          ├──────┐
+                                      │            │ R │                │ LLVM   │            │ LLVM     │      │ Decode
+                                      │            │ C │ add_cs_detail  │        │            │          │      │ Mnemonic
+                                      │            │ H │ (Op0)          │        │ print(Op0) │          │◄─────┘
+                                      │            │   │ ◄──────────────┤        │ ◄──────────┤          │
+           printer(MCInst,            │            │   ├──────────────► │        ├──────────► │ Asm-     │
+┌────────┐         asm_buf)┌──────────┴──┐         │   │                │ Inst   │            │ Writer   │
+│CS Core ├────────────────►│ ARCH Module │         │   │                │ Printer│            │ State    │
+└────────┘                 └─────────────┘         │ M │ add_cs_detail  │        │            │ Machine  │
+                                      ▲            │ A │ (Op1)          │        │ print(Op1) │          │
+                                      │            │ P │ ◄──────────────┤        │ ◄──────────┤          │
+                                      │            │ P ├──────────────► │        ├──────────► │          │
+                                      │            │ I │                │        │            │          │
+                                      │            │ N │                │        │            │          │
+                                      └────────────┤ G │◄───────────────┤        │◄───────────┤          │
+                                                   └───┘                └────────┘            └──────────┘
+```
+
+The second decoding step passes the `MCInst` and a buffer to the printer.
+
+After determining the mnemonic, each operand is printed by using
+functions defined in the `InstPrinter`.
+
+Each time an operand is printed, the mapping component is called
+to populate the `cs_insn` with the operand information and details.
+
+Again, the `InstPrinter` and `AsmWriter` are code translated from LLVM
+and therefore mirror the behavior of `llvm-objdump`.
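
The two decoding steps described in `docs/ARCHITECTURE.md` above can be illustrated with a toy model. This is only a sketch of the control flow, not Capstone or LLVM code: the table contents, the `CsInsn` class and every function body are hypothetical stand-ins, and only the names `MCInst`, `getInstr`, `printInst` and `add_cs_detail` are borrowed from the document.

```python
from dataclasses import dataclass, field

# Hypothetical decoder table: opcode byte -> (LLVM-style name, operand count).
# In the real project such tables are generated from LLVM .td files.
DECODER_TABLE = {0x01: ("ADDrr", 2), 0x02: ("NOTr", 1)}
MNEMONICS = {"ADDrr": "add", "NOTr": "not"}


@dataclass
class MCInst:
    """Stand-in for the LLVM representation of an instruction."""
    opcode: str
    operands: list = field(default_factory=list)


@dataclass
class CsInsn:
    """Stand-in for the Capstone representation (cs_insn)."""
    mnemonic: str = ""
    op_str: str = ""
    detail_operands: list = field(default_factory=list)


def get_instr(data: bytes) -> MCInst:
    """Step 1: identify the instruction, then decode each operand."""
    name, n_ops = DECODER_TABLE[data[0]]
    inst = MCInst(name)
    for i in range(n_ops):
        inst.operands.append(data[1 + i])  # corresponds to decode(OpN)
    return inst


def add_cs_detail(insn: CsInsn, reg: int) -> None:
    """Mapping hook: called once for every operand that gets printed."""
    insn.detail_operands.append({"type": "reg", "reg": reg})


def print_inst(inst: MCInst, insn: CsInsn) -> None:
    """Step 2: build the assembly text and fill in details in one pass."""
    insn.mnemonic = MNEMONICS[inst.opcode]
    parts = []
    for reg in inst.operands:
        add_cs_detail(insn, reg)  # mapping component populates the details
        parts.append(f"r{reg}")   # corresponds to print(OpN)
    insn.op_str = ", ".join(parts)


def disasm(data: bytes) -> CsInsn:
    """Run both steps, as the architecture module does for the core."""
    insn = CsInsn()
    print_inst(get_instr(data), insn)
    return insn
```

The point of the sketch is that operand details are collected as a side effect of printing, which is why the printer and the mapping component are drawn as tightly coupled in the Step 2 diagram.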
diff --git a/docs/AutoSync.md b/docs/AutoSync.md
@@ -0,0 +1,190 @@
+# Auto-Sync
+
+`auto-sync` is the update tool for Capstone.
+Its purpose is to automate as many steps as possible in the update
+procedure.
+
+You can find it in `suite/auto-sync`.
+
+This document is split into four parts.
+
+1. An overview of the update process and which subcomponents of `auto-sync` do what.
+2. Instructions on how to update an architecture which already supports `auto-sync`.
+3. Instructions on how to refactor an architecture to use `auto-sync`.
+4. Notes about how to add a new architecture to Capstone with `auto-sync`.
+
+Please read the section about architecture module design in
+[ARCHITECTURE.md](ARCHITECTURE.md) before proceeding.
+Understanding the module architecture is important for what follows.
+
+## Update procedure
+
+As already described in the `ARCHITECTURE` document, Capstone uses translated
+and generated source code from LLVM.
+
+Because LLVM is written in C++ and Capstone in C, the update process is
+internally complicated but almost completely automated.
+
+`auto-sync` categorizes source files of a module into three groups. Each group is updated differently.
+
+| File type                 | Update method                      | Edits by hand                                          |
+|---------------------------|------------------------------------|--------------------------------------------------------|
+| Generated files           | Generated by patched LLVM backends | Never/Not allowed                                      |
+| Translated LLVM C++ files | `CppTranslator` and `Differ`       | Only changes which are too complicated for automation. |
+| Capstone files            | By hand                            | All                                                    |
+
+Let's look at the update procedure for each group in detail.
+
+**Generated files**
+
+Generated files always have the file extension `.inc`.
+
+There are generated files for the LLVM code and for Capstone. They can be distinguished by their names:
+
+- For Capstone: `<ARCH>GenCS<NAME>.inc`.
+- For LLVM code: `<ARCH>Gen<NAME>.inc`.
+
+The files are generated by refactored [LLVM TableGen emitter backends](https://github.com/Rot127/llvm-capstone/tree/dev/llvm/utils/TableGen).
+
+The procedure looks roughly like this:
+
+```
+                                                                   ┌──────────┐
+    1               2                 3                4           │CS .inc   │
+┌───────┐     ┌───────────┐     ┌───────────┐     ┌──────────┐  ┌─►│files     │
+│ .td   │     │           │     │           │     │ Code-    │  │  └──────────┘
+│ files ├────►│ TableGen  ├────►│  CodeGen  ├────►│ Emitter  ├──┤
+└───────┘     └──────┬────┘     └───────────┘     └──────────┘  │  ┌──────────┐
+                     │                                 ▲        └─►│LLVM .inc │
+                     └─────────────────────────────────┘           │files     │
+                                                                   └──────────┘
+```
+
+
+1. LLVM architectures are defined in `.td` files. They describe instructions, operands,
+features and other properties of an architecture.
+
+2. [LLVM TableGen](https://llvm.org/docs/TableGen/index.html) parses these files
+and converts them to an internal representation.
+
+3. Next, a TableGen component called [CodeGen](https://llvm.org/docs/CodeGenerator.html)
+abstracts these properties even further.
+The result is a representation which is _not_ specific to any architecture
+(e.g. the `CodeGenInstruction` class can represent a machine instruction of any architecture).
+
+4. The `Code-Emitter` uses the abstract representation of the architecture (provided by `CodeGen`) to
+generate state machines for instruction decoding.
+Architecture-specific information (think of register names, operand properties etc.)
+is taken from `TableGen`'s internal representation.
+
+The result is emitted to `.inc` files. Those are included in the translated C++ files or Capstone code where necessary.
+
+**Translation of LLVM C++ files**
+
+We use two tools to translate C++ files to C.
+
+The `CppTranslator` runs first, the `Differ` afterward.
+
+The `CppTranslator` parses the C++ files and patches C++ syntax
+with its equivalent C syntax.
+
+_Note_: For details about this, check out `suite/auto-sync/CppTranslator/README.md`.
+
+Because the result of the `CppTranslator` is not perfect,
+we still have many syntax problems left.
+
+Those need to be fixed by hand.
+In order to ease this process we run the `Differ` after the `CppTranslator`.
+
+The `Differ` parses each _translated_ file and the corresponding source file _currently_ used in Capstone.
+It then compares specific nodes from the just translated file to the equivalent nodes in the old file.
+
+The user chooses whether to accept the version from the translated file or the one from the old file.
+This decision is saved for every node.
+If a saved decision exists for a node, it is automatically applied again.
+
+Every other syntax error must be solved manually.
+
+## Update an architecture
+
+To update an architecture do the following:
+
+Rebase `llvm-capstone` onto the new LLVM release (if not already done).
+```
+# 1. Clone Capstone's LLVM fork (inside the Capstone root directory) and enter it
+git clone https://github.com/capstone-engine/llvm-capstone && cd llvm-capstone
+
+# 2. Rebase onto the new LLVM release and resolve the conflicts.
+
+# 3. Build tblgen
+mkdir build
+cd build
+cmake -G Ninja -DLLVM_TARGETS_TO_BUILD=<ARCH> -DCMAKE_BUILD_TYPE=Debug ../llvm
+cmake --build . --target llvm-tblgen --config Debug
+
+# 4. Run git log and copy the hash of the release commit for the next step.
+git log
+
+# 5. Run the updater
+cd ../../suite/auto-sync/
+mkdir build
+cd build
+../Update-Arch.sh <ARCH> <PATH-TO-LLVM> <LLVM-RELEASE_HASH>
+```
+
+The update script will execute the steps described above and copy the new files to their directories.
+
+Afterward, try to build Capstone and fix any remaining build errors.
+
+If new instructions or operands were added, add test cases for those
+(regression tests for instructions are located in `suite/MC/`).
+
+TODO: Operand and detail tests
+<!-- TODO: Wait until `cstest` is rewritten and add description about operand testing. -->
+
+## Refactor an architecture for `auto-sync`
+
+To refactor an architecture to use `auto-sync`, you need to add it to the configuration.
+
+1. Add the architecture to the supported architectures list in `Update-Arch.sh`.
+2. Configure the `CppTranslator` for your architecture (`suite/auto-sync/CppTranslator/arch_config.json`)
+
+Now, manually run the update commands within `Update-Arch.sh` but *skip* the `Differ` step.
+
+The task after this is to:
+
+- Replace leftover C++ syntax with its C equivalent.
+- Implement the `add_cs_detail()` handler in `<ARCH>Mapping` for each operand type.
+- Add any missing logic to the translated files.
+- Make it build and write tests.
+- Run the `Differ` again and always select the old nodes.
+
+**Notes:**
+
+- If you find yourself fixing the same syntax error multiple times,
+please consider adding a `Patch` to the `CppTranslator` for this case.
+
+- Please check out the implementation of ARM's `add_cs_detail()` before implementing your own.
+
+- Running the `Differ` after everything is done preserves your syntax corrections, so the next user can auto-apply them.
+
+- Sometimes the LLVM code uses a single function from a larger source file.
+It is not worth it to translate the whole file just for this function.
+Bundle those lonely functions in `<ARCH>DisassemblerExtension.c`.
+
+- Some generated enums must be included in the `include/capstone/<ARCH>.h` header.
+At the position where the enum should be inserted, add a comment like this (don't remove the `<>` brackets):
+
+    ```
+    // generate content <FILENAME.inc> begin
+    // generate content <FILENAME.inc> end
+    ```
+
+The update script will insert the content of the `.inc` file at this place.
+
+## Adding a new architecture
+
+Adding a new architecture follows the same steps as above, with the exception that you need
+to implement all the Capstone files from scratch.
+
+Check out an architecture that already supports `auto-sync` for guidance and open an issue if you need help.
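
The `generate content` markers mentioned in the refactoring notes of `docs/AutoSync.md` can be illustrated with a small sketch. It is not the update script's actual implementation: the function name `insert_generated` and its behavior on a missing marker are assumptions made for illustration. Only the marker comment format is taken from the document.

```python
import re


def insert_generated(header: str, filename: str, content: str) -> str:
    """Replace everything between the begin/end markers for `filename`.

    Hypothetical helper: the real update script may work differently.
    """
    begin = f"// generate content <{filename}> begin"
    end = f"// generate content <{filename}> end"
    # Non-greedy match so only the region between this marker pair is replaced.
    pattern = re.compile(
        re.escape(begin) + r".*?" + re.escape(end), flags=re.DOTALL
    )
    replacement = f"{begin}\n{content.rstrip()}\n{end}"
    # A function is used as the replacement so that backslashes in the
    # generated content are inserted literally.
    new_header, count = pattern.subn(lambda _match: replacement, header)
    if count == 0:
        raise ValueError(f"no marker pair found for {filename}")
    return new_header
```

Because the markers themselves are kept, running the helper again with the same content leaves the header unchanged, which is the property that makes repeated updates safe.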