Add documentation about auto-sync.
Rot127 committed Apr 28, 2023
1 parent 5ffcb96 commit 4780416
Showing 4 changed files with 361 additions and 3 deletions.
22 changes: 22 additions & 0 deletions HACK.TXT
@@ -59,10 +59,32 @@ Coding style
- C code follows Linux kernel coding style, using tabs for indentation.
- Python code uses 4 spaces for indentation.

Updating an Architecture
------------------------

The update tool for Capstone is called `auto-sync` and can be found in `suite/auto-sync`.

Not all architectures are supported yet.
Run `suite/auto-sync/Update-Arch.sh -h` to get a list of currently supported architectures.

The documentation on how to update an architecture with `auto-sync`, or how to refactor an
architecture module, can be found in [docs/AutoSync.md](docs/AutoSync.md).

If a module does not support `auto-sync` yet, it is highly recommended to refactor it
instead of attempting to update it manually.
Refactoring takes less time and brings the module up to date in the process.

The one exception is `x86`, which is not compatible with `auto-sync` and
currently can't be refactored to use it.

Adding an architecture
----------------------

If your architecture is supported in LLVM or one of its forks, you can use `auto-sync` to
add the new module.

<!-- TODO: Move this info to the auto-sync docs -->

Obviously, you first need to write all the logic and put it in a new directory, arch/newarch.
Then, you have to modify other files.
(You can look for one architecture such as EVM in these files to get what you need to do)
118 changes: 118 additions & 0 deletions docs/ARCHITECTURE.md
@@ -0,0 +1,118 @@
# Capstone Architecture overview

## Architecture of Capstone

TODO

## Architecture of a Module

An architecture module is split into two components.

1. The disassembler logic, which decodes bytes to instructions.
2. The mapping logic, which maps the result from component 1 to
a Capstone internal representation and adds additional detail.

### Component 1 - Disassembler logic

The disassembler logic consists exclusively of code from LLVM.
It uses:

- Generated state machines, enums and the like for instruction decoding.
- Handwritten disassembler logic for decoding instruction operands
and controlling the decoding procedure.

### Component 2 - Mapping logic

The mapping component has three different tasks:

1. Serving as a programmable interface for the Capstone core to the LLVM code.
2. Mapping LLVM decoded instructions to a Capstone instruction.
3. Adding additional detail to the Capstone instructions
(e.g. operand `read/write` attributes etc.).

### Instruction representation

There exist two structs which represent an instruction:

- `MCInst`: The LLVM representation of an instruction.
- `cs_insn`: The Capstone representation of an instruction.

The `MCInst` is used by the disassembler component for storing the decoded instruction.
The mapping component, on the other hand, uses the `MCInst` to populate the `cs_insn`.

The `cs_insn` is meant to be used by the Capstone core.
It is distinct from the `MCInst`: it uses different instruction identifiers and a different operand representation,
and it holds more details about an instruction.

### Disassembling process

There are two steps in disassembling an instruction.

1. Decoding bytes to an `MCInst`.
2. Decoding the assembler string for the `MCInst` AND mapping it to a `cs_insn` in the same step.

Here is a boiled-down explanation of these steps.

**Step 1**

```
                                                              Forward to
                                      getInstr(bytes)   ┌───┐ LLVM code    ┌─────────┐            ┌──────────┐
                                    ┌──────────────────►│ A ├────────────► │         ├───────────►│          ├────┐
                                    │                   │ R │              │  LLVM   │            │ LLVM     │    │ Decode
                                    │                   │ C │              │         │            │          │    │ Instr.
                                    │                   │ H │              │         │decode(Op0) │          │◄───┘
┌────────┐ disasm(bytes) ┌──────────┴──┐                │   │              │ Disass- │ ◄──────────┤ Decoder  │
│CS Core ├──────────────►│ ARCH Module │                │   │              │ embler  ├──────────► │ State    │
└────────┘               └─────────────┘                │ M │              │         │            │ Machine  │
                                    ▲                   │ A │              │         │decode(Op1) │          │
                                    │                   │ P │              │         │ ◄──────────┤          │
                                    │                   │ P │              │         ├──────────► │          │
                                    │                   │ I │              │         │            │          │
                                    │                   │ N │              │         │            │          │
                                    └───────────────────┤ G │◄─────────────┤         │◄───────────┤          │
                                                        └───┘              └─────────┘            └──────────┘
```

In the first decoding step, the instruction bytes are forwarded to the
decoder state machine.
Once the instruction has been identified, the state machine calls decoder functions
for each operand to extract the operand values from the bytes.

The disassembler and the state machine are equivalent to what `llvm-objdump` uses
(in fact they use the same files, except we translated them from C++ to C).
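
The control flow of step 1 can be sketched in Python as follows. This is a toy model, not Capstone or LLVM code: the two-byte encoding, the `DECODER_TABLE` and all function names are invented for illustration. It only shows the shape of the process, in which a table-driven decoder identifies the instruction and then calls one decoder function per operand.

```python
# Toy encoding (an assumption for this sketch): byte 0 is the opcode,
# byte 1 packs two 4-bit register operands.

def decode_reg_hi(insn_bytes):
    """Operand decoder: register number from the high nibble of byte 1."""
    return ("reg", insn_bytes[1] >> 4)


def decode_reg_lo(insn_bytes):
    """Operand decoder: register number from the low nibble of byte 1."""
    return ("reg", insn_bytes[1] & 0x0F)


# Hypothetical stand-in for the generated decoder state machine:
# opcode byte -> (instruction name, operand decoder functions).
DECODER_TABLE = {
    0x01: ("ADD", [decode_reg_hi, decode_reg_lo]),
    0x02: ("MOV", [decode_reg_hi, decode_reg_lo]),
}


def get_instr(insn_bytes):
    """Identify the instruction, then decode each operand in turn."""
    entry = DECODER_TABLE.get(insn_bytes[0])
    if entry is None:
        return None  # unknown encoding
    name, operand_decoders = entry
    operands = [decode(insn_bytes) for decode in operand_decoders]
    # This dict plays the role of the MCInst in the sketch.
    return {"opcode": name, "operands": operands}


mc_inst = get_instr(bytes([0x01, 0x2A]))
```

In the real module the table and the dispatch logic are generated, while the per-operand decoder functions are the handwritten part of the disassembler component.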

**Step 2**

```
                                                         printInst(
                                                         MCInst,
                                                   ┌───┐ asm_buf)       ┌────────┐            ┌──────────┐
                                      ┌───────────►│ A ├──────────────► │        ├───────────►│          ├──────┐
                                      │            │ R │                │  LLVM  │            │ LLVM     │      │ Decode
                                      │            │ C │ add_cs_detail  │        │            │          │      │ Mnemonic
                                      │            │ H │ (Op0)          │        │ print(Op0) │          │◄─────┘
                                      │            │   │ ◄──────────────┤        │ ◄──────────┤          │
           printer(MCInst,            │            │   ├──────────────► │        ├──────────► │ Asm-     │
┌────────┐ asm_buf)        ┌──────────┴──┐         │   │                │ Inst   │            │ Writer   │
│CS Core ├────────────────►│ ARCH Module │         │   │                │ Printer│            │ State    │
└────────┘                 └─────────────┘         │ M │ add_cs_detail  │        │            │ Machine  │
                                      ▲            │ A │ (Op1)          │        │ print(Op1) │          │
                                      │            │ P │ ◄──────────────┤        │ ◄──────────┤          │
                                      │            │ P ├──────────────► │        ├──────────► │          │
                                      │            │ I │                │        │            │          │
                                      │            │ N │                │        │            │          │
                                      └────────────┤ G │◄───────────────┤        │◄───────────┤          │
                                                   └───┘                └────────┘            └──────────┘
```

The second decoding step passes the `MCInst` and a buffer to the printer.

After determining the mnemonic, each operand is printed by using
functions defined in the `InstPrinter`.

Each time an operand is printed, the mapping component is called
to populate the `cs_insn` with the operand information and details.

Again, the `InstPrinter` and `AsmWriter` are code translated from LLVM,
and therefore mirror the behavior of `llvm-objdump`.
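
The callback pattern of step 2 can be sketched in Python like this. It is a toy model under stated assumptions: `MappingSketch`, `print_operand`, `print_inst` and the rule "operand 0 is written, all others are read" are hypothetical and exist only in this example. The point is that the mapping component is called once for every operand the printer emits.

```python
class MappingSketch:
    """Hypothetical stand-in for the mapping component; collects operand detail."""

    def __init__(self):
        self.operands = []

    def add_cs_detail(self, op_index, operand):
        kind, value = operand
        # Assumption for this sketch only: operand 0 is written, the rest are read.
        access = "write" if op_index == 0 else "read"
        self.operands.append({"type": kind, "value": value, "access": access})


def print_operand(mc_inst, op_index, asm_buf, mapping):
    """Stand-in for an InstPrinter operand function."""
    operand = mc_inst["operands"][op_index]
    mapping.add_cs_detail(op_index, operand)  # call back into the mapping
    asm_buf.append(f"r{operand[1]}")


def print_inst(mc_inst, mapping):
    """Stand-in for the AsmWriter: determine the mnemonic, then print operands."""
    mnemonic = mc_inst["opcode"].lower()
    asm_buf = []
    for i in range(len(mc_inst["operands"])):
        print_operand(mc_inst, i, asm_buf, mapping)
    return mnemonic + " " + ", ".join(asm_buf)


mapping = MappingSketch()
text = print_inst({"opcode": "ADD", "operands": [("reg", 2), ("reg", 10)]}, mapping)
```

After the call, `text` holds the assembly string and `mapping.operands` holds one detail record per operand, which is the division of labour between the printer and the mapping component described above.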
190 changes: 190 additions & 0 deletions docs/AutoSync.md
@@ -0,0 +1,190 @@
# Auto-Sync

`auto-sync` is the update tool for Capstone.
Its purpose is to automate as many steps as possible in the update
procedure.

You can find it in `suite/auto-sync`.

This document is split into four parts.

1. An overview of the update process and which subcomponents of `auto-sync` do what.
2. Instructions on how to update an architecture which already supports `auto-sync`.
3. Instructions on how to refactor an architecture to use `auto-sync`.
4. Notes about how to add a new architecture to Capstone with `auto-sync`.

Please read the section about architecture module design in
[ARCHITECTURE.md](ARCHITECTURE.md) before proceeding.
This architectural understanding is important for what follows.

## Update procedure

As already described in the `ARCHITECTURE` document, Capstone uses translated
and generated source code from LLVM.

Because LLVM is written in C++ and Capstone in C, the update process is
internally complicated, but almost completely automated.

`auto-sync` categorizes source files of a module into three groups. Each group is updated differently.

| File type                 | Update method                      | Edits by hand                                          |
|---------------------------|------------------------------------|--------------------------------------------------------|
| Generated files           | Generated by patched LLVM backends | Never/Not allowed                                      |
| Translated LLVM C++ files | `CppTranslator` and `Differ`       | Only changes which are too complicated for automation. |
| Capstone files            | By hand                            | All                                                    |

Let's look at the update procedure for each group in detail.

**Generated files**

Generated files always have the file extension `.inc`.

There are generated files for the LLVM code and for Capstone. They can be distinguished by their names:

- For Capstone: `<ARCH>GenCS<NAME>.inc`.
- For LLVM code: `<ARCH>Gen<NAME>.inc`.
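
The naming convention above is regular enough to check mechanically. The following Python sketch classifies a file name by origin; the function name `classify_inc` and the example file names are illustrative assumptions, and the check is a heuristic that relies only on the two patterns listed above.

```python
import re

# Capstone-specific generated files contain "GenCS"; LLVM ones contain only "Gen".
# The Capstone pattern is tried first because "GenCS<NAME>" also fits "Gen<NAME>".
CS_INC = re.compile(r"^(?P<arch>[A-Za-z0-9]+?)GenCS(?P<name>\w+)\.inc$")
LLVM_INC = re.compile(r"^(?P<arch>[A-Za-z0-9]+?)Gen(?P<name>\w+)\.inc$")


def classify_inc(filename):
    """Return (origin, arch, name) for a generated file name, else None."""
    for origin, pattern in (("capstone", CS_INC), ("llvm", LLVM_INC)):
        match = pattern.match(filename)
        if match:
            return origin, match.group("arch"), match.group("name")
    return None


print(classify_inc("ARMGenCSInsnEnum.inc"))
print(classify_inc("ARMGenAsmWriter.inc"))
print(classify_inc("ARMMapping.c"))
```

A file that matches neither pattern, such as a handwritten `.c` file, yields `None`.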

The files are generated by refactored [LLVM TableGen emitter backends](https://github.com/Rot127/llvm-capstone/tree/dev/llvm/utils/TableGen).

The procedure looks roughly like this:

```
                                                                   ┌──────────┐
    1               2                 3                4           │CS .inc   │
┌───────┐     ┌───────────┐     ┌───────────┐     ┌──────────┐  ┌─►│files     │
│ .td   │     │           │     │           │     │ Code-    │  │  └──────────┘
│ files ├────►│ TableGen  ├────►│  CodeGen  ├────►│ Emitter  ├──┤
└───────┘     └──────┬────┘     └───────────┘     └──────────┘  │  ┌──────────┐
                     │                                 ▲        └─►│LLVM .inc │
                     └─────────────────────────────────┘           │files     │
                                                                   └──────────┘
```


1. LLVM architectures are defined in `.td` files. They describe instructions, operands,
features and other properties of an architecture.

2. [LLVM TableGen](https://llvm.org/docs/TableGen/index.html) parses these files
and converts them to an internal representation.

3. Next, a TableGen component called [CodeGen](https://llvm.org/docs/CodeGenerator.html)
abstracts these properties even further.
The result is a representation which is _not_ specific to any architecture
(e.g. the `CodeGenInstruction` class can represent a machine instruction of any architecture).

4. The `Code-Emitter` uses the abstract representation of the architecture (provided by `CodeGen`) to
generate state machines for instruction decoding.
Architecture-specific information (think of register names, operand properties etc.)
is taken from `TableGen`'s internal representation.

The result is emitted to `.inc` files. Those are included in the translated C++ files or Capstone code where necessary.

**Translation of LLVM C++ files**

We use two tools to translate C++ to C files.

First the `CppTranslator` and afterward the `Differ`.

The `CppTranslator` parses the C++ files and patches C++ syntax
with its equivalent C syntax.

_Note_: For details, check out `suite/auto-sync/CppTranslator/README.md`.
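
To give a feeling for what such a patch does, here is a deliberately simplified Python sketch. It uses regular expressions instead of parsing the C++ files, so it is not how the `CppTranslator` works internally, and the two rules shown (`nullptr` and `static_cast`) are illustrative assumptions rather than a list of the patches the tool actually applies.

```python
import re

# Each patch is (compiled pattern, replacement).
# Both rules are assumptions made for this example.
PATCHES = [
    # C has no nullptr keyword; use NULL instead.
    (re.compile(r"\bnullptr\b"), "NULL"),
    # static_cast<T>(x)  ->  (T)(x)
    (re.compile(r"\bstatic_cast<\s*([^<>]+?)\s*>\s*\("), r"(\1)("),
]


def apply_patches(cpp_source):
    """Apply every patch in order and return the C-flavoured result."""
    c_source = cpp_source
    for pattern, replacement in PATCHES:
        c_source = pattern.sub(replacement, c_source)
    return c_source


translated = apply_patches("unsigned r = static_cast<unsigned>(v); p = nullptr;")
print(translated)
```

Text-based rules like these break down quickly on nested templates or macros, which is one reason to work on a parsed representation of the source instead.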

Because the result of the `CppTranslator` is not perfect,
we still have many syntax problems left.

Those need to be fixed by hand.
In order to ease this process we run the `Differ` after the `CppTranslator`.

The `Differ` parses each _translated_ file and the corresponding source file _currently_ used in Capstone.
It then compares specific nodes from the just translated file to the equivalent nodes in the old file.

The user can choose whether to accept the version from the translated file or the one from the old file.
This decision is saved for every node.
If a saved decision exists for a node, it is automatically applied again.
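
The saved-decision behavior can be pictured with the following Python sketch. Everything in it is hypothetical: the node identifiers, the JSON file and the function names are assumptions for illustration and do not describe the `Differ`'s actual storage format.

```python
import json
from pathlib import Path


def choose_node(node_id, old_text, new_text, decisions, ask):
    """Return the text to keep for one node, reusing a saved decision if any."""
    if old_text == new_text:
        return new_text  # identical nodes need no decision
    choice = decisions.get(node_id)
    if choice is None:
        # No saved decision yet: ask the user; expected answers are "old" or "new".
        choice = ask(node_id, old_text, new_text)
        decisions[node_id] = choice
    return old_text if choice == "old" else new_text


def load_decisions(path):
    """Load saved decisions, or start empty if the file does not exist."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_decisions(path, decisions):
    """Persist decisions so a later run can apply them automatically."""
    Path(path).write_text(json.dumps(decisions, indent=2, sort_keys=True))
```

With this shape, the interactive `ask` callback runs only the first time a differing node is seen; later runs pick the same side without prompting.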

Every other syntax error must be solved manually.

## Update an architecture

To update an architecture do the following:

Rebase `llvm-capstone` onto the new LLVM release (if not already done).
```
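# 1. Clone Capstone's LLVM fork and enter it (step 5 assumes the clone sits in the Capstone repository root)
git clone https://github.com/capstone-engine/llvm-capstone
cd llvm-capstone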
# 2. Rebase onto the new LLVM release and resolve the conflicts.
# 3. Build tblgen
mkdir build
cd build
cmake -G Ninja -DLLVM_TARGETS_TO_BUILD=<ARCH> -DCMAKE_BUILD_TYPE=Debug ../llvm
cmake --build . --target llvm-tblgen --config Debug
# 4. Run git log and copy the hash of the release commit for the next step.
git log
# 5. Run the updater
cd ../../suite/auto-sync/
mkdir build
cd build
../Update-Arch.sh <ARCH> <PATH-TO-LLVM> <LLVM-RELEASE_HASH>
```

The update script will execute the steps described above and copy the new files to their directories.

Afterward, try to build Capstone and fix any remaining build errors.

If new instructions or operands were added, add test cases for those
(regression tests for instructions are located in `suite/MC/`).

TODO: Operand and detail tests
<!-- TODO: Wait until `cstest` is rewritten and add description about operand testing. -->

## Refactor an architecture for `auto-sync`

To refactor an architecture to use `auto-sync`, you need to add it to the configuration.

1. Add the architecture to the supported architectures list in `Update-Arch.sh`.
2. Configure the `CppTranslator` for your architecture (`suite/auto-sync/CppTranslator/arch_config.json`)

Now, manually run the update commands within `Update-Arch.sh` but *skip* the `Differ` step.

The remaining tasks are:

- Replace leftover C++ syntax with its C equivalent.
- Implement the `add_cs_detail()` handler in `<ARCH>Mapping` for each operand type.
- Add any missing logic to the translated files.
- Make it build and write tests.
- Run the `Differ` again and always select the old nodes.

**Notes:**

- If you find yourself fixing the same syntax error multiple times,
please consider adding a `Patch` to the `CppTranslator` for this case.

- Please check out the implementation of ARM's `add_cs_detail()` before implementing your own.

- Running the `Differ` after everything is done preserves your syntax corrections, so the next user can auto-apply them.

- Sometimes the LLVM code uses a single function from a larger source file.
It is not worth it to translate the whole file just for this function.
Bundle those lonely functions in `<ARCH>DisassemblerExtension.c`.

- Some generated enums must be included in the `include/capstone/<ARCH>.h` header.
At the position where the enum should be inserted, add a comment like this (don't remove the `<>` brackets):

```
// generate content <FILENAME.inc> begin
// generate content <FILENAME.inc> end
```

The update script will insert the content of the `.inc` file at this place.
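
The marker mechanism can be illustrated with a short Python sketch. It is not the code used by the update script; `insert_generated` and its arguments are hypothetical names, and the sketch assumes each marker sits on a line of its own.

```python
def insert_generated(header_text, inc_name, inc_content):
    """Replace everything between the begin/end markers with inc_content."""
    begin = f"// generate content <{inc_name}> begin"
    end = f"// generate content <{inc_name}> end"
    out, skipping, found = [], False, False
    for line in header_text.splitlines():
        stripped = line.strip()
        if stripped == begin:
            out.append(line)
            out.extend(inc_content.splitlines())
            skipping, found = True, True  # drop old content until the end marker
        elif stripped == end and skipping:
            out.append(line)
            skipping = False
        elif not skipping:
            out.append(line)
    if not found or skipping:
        raise ValueError(f"marker pair for {inc_name} not found")
    return "\n".join(out) + "\n"


header = (
    "enum example {\n"
    "// generate content <Foo.inc> begin\n"
    "OLD_VALUE,\n"
    "// generate content <Foo.inc> end\n"
    "};\n"
)
updated = insert_generated(header, "Foo.inc", "A,\nB,\n")
```

Because the markers stay in place and the old content between them is discarded, running the function again with the same input leaves the header unchanged.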

## Adding a new architecture

Adding a new architecture follows the same steps as above, with the exception that you need
to implement all the Capstone files from scratch.

Check out an architecture that already supports `auto-sync` for guidance, and open an issue if you need help.