Core Theory based ARM lifter #1174

Phosphorus15 · 2020-07-10T13:26:36Z

This PR presents an ARM lifter based on Core Theory/KB, intended to replace the legacy BIL lifter while providing more functionality and better scalability.

Currently the move and bits instructions have been refactored, while the rest remain to be implemented.

The new lifter is built around a standalone DSL module, an extension of the Core Theory operations that makes the semantics clearer not only for ARM itself but also for ARM Thumb (#1122) and even other lifters; see dsl_common.ml for the details.

Phosphorus15 · 2020-07-12T15:17:37Z

By now, all the instructions in the legacy ARM lifter have been transcribed and restructured into the Core Theory knowledge format. Only a few steps remain before this is a completely working lifter:

connect the instruction definitions with the llvm-mc input from the knowledge base
make branch instructions actually work with the knowledge base information, and try to handle switching to Thumb mode correctly
add VFP floating-point support; considering that we have a nice IEEE754 FP implementation in Core Theory, this should not be difficult

ivg · 2020-07-14T01:04:04Z

plugins/arm-lifter/arm_main.ml

+  | Some addr, Some insn, Some mem -> 
+    run_lifter label addr insn mem Lifter.lift_with >>= fun sema ->
+    provide_sematics label sema
+  | _ -> raise (Defs.Lift_Error "insufficient knowledge") (* should I have not raised an error here? *)


No-no, no raises: the lifter shall never raise an exception when it is unable to provide knowledge. It is totally normal not to know something. Basically, the idea is that each lifter provides as much knowledge as it can, but no more. Also, there is a simple rule in BAP for exceptions: exceptions are for indicating programmer errors.

TL;DR; when there is not enough information to provide the semantics, just return the bottom value of the semantics, e.g., Insn.empty which is the short alias for Knowledge.Value.empty Theory.Semantics.cls.
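
A minimal sketch of this rule, using a toy model rather than the real BAP API: the `semantics` type, `empty`, and `lift` below are hypothetical stand-ins (in BAP the bottom value would be Insn.empty). It only illustrates that missing inputs map to the empty value instead of an exception.

```ocaml
(* Toy model: a semantics is a list of printed statements, and [] plays
   the role of the bottom value. All names here are hypothetical. *)
type semantics = string list

let empty : semantics = []

(* Provide semantics only when every piece of knowledge is available. *)
let lift (addr : int option) (insn : string option) (mem : bytes option)
  : semantics =
  match addr, insn, mem with
  | Some a, Some i, Some _ -> [Printf.sprintf "0x%x: %s" a i]
  | _ -> empty (* not enough knowledge: provide nothing, raise nothing *)
```

A caller can then treat `empty` as "nothing known yet" and let another provider fill the gap.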

Phosphorus15 · 2020-07-14

Is there a way to embed the error message into some kind of empty knowledge, like BIL.special does?

By the way, speaking of BIL.special, how do I express the same semantics as BIL.cpuexn (or is it even necessary)?

ivg · 2020-07-14T01:15:31Z

OK, so I turned the code into a lifter that is integrated with the rest of BAP (see the last commit message for the details). To be able to use it, you will need to remove the current arm plugin since you're using the same module names, e.g.,

bapbundle remove arm.plugin

and now we can see the new lifter at work,

$ bap mc --arch=armv7 --show-bil  -- 00 B0 8D E2
{
  if (1) {
    r11 := sp
  }
}

The general plan would be to overhaul the old lifter with the new code, but I think that for some time we will need them both, at least for testing purposes. One of the testing strategies is to compare the results of the two lifters. This is a long road, but the first step would be renaming the modules in this plugin so that they do not conflict with the existing plugin. I suggest adding an ng suffix, e.g., armng_main.ml, as a working approach. We may later rename them back when we integrate the new arm plugin with the existing arm code.
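
The comparison strategy could be sketched roughly as follows. This is a toy harness, not BAP code: both lifters are modelled as hypothetical functions from an encoded instruction to a printed semantics, and `disagreements` is an illustrative name.

```ocaml
(* Toy differential test. Each "lifter" maps an encoded instruction to a
   printed semantics; the result lists every input on which they differ,
   together with both outputs. All names are hypothetical. *)
let disagreements
    (old_lift : string -> string)
    (new_lift : string -> string)
    (inputs : string list) : (string * string * string) list =
  List.filter_map
    (fun insn ->
       let o = old_lift insn and n = new_lift insn in
       if String.equal o n then None else Some (insn, o, n))
    inputs
```

In practice the comparison would likely need to be semantic rather than textual, since two lifters may print equivalent code differently.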

I will provide some more feedback soon. Otherwise, really impressive work. Many of the things that we need are currently in #1119; it is where I am currently paving the road towards arm/thumb switches, so I need to finish it first. I think that we are good on this PR and you can continue working and write tests in the vein of the x86 test suite. And one of the first steps would be getting rid of that if(1) :)
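
Folding away a constant guard such as that if (1) can be sketched on a toy statement type. The `exp` and `stmt` types below are hypothetical and much simpler than BIL; the sketch only shows the shape of the rewrite.

```ocaml
(* Toy AST, not BIL: just enough structure to show the rewrite. *)
type exp = Int of int | Var of string

type stmt =
  | Move of string * exp
  | If of exp * stmt list * stmt list

(* Replace an [If] with a constant condition by the branch that is taken. *)
let rec simplify (stmts : stmt list) : stmt list =
  List.concat (List.map simplify_stmt stmts)

and simplify_stmt (s : stmt) : stmt list =
  match s with
  | If (Int 0, _, no) -> simplify no
  | If (Int _, yes, _) -> simplify yes
  | If (c, yes, no) -> [If (c, simplify yes, simplify no)]
  | Move _ -> [s]
```

For the output shown earlier, `simplify [If (Int 1, [Move ("r11", Var "sp")], [])]` leaves only the assignment.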

ivg · 2020-07-16T16:40:07Z

Let's start some discussion on interworking here, for lack of a better place. Interworking, in ARM terminology, enables the mixing of two different instruction sets; in particular, the ARM32 architecture supports two instruction sets, A32 and T32. From the point of view of the ARM designers, we have to distinguish between the architecture and the instruction set. This point of view is not really shared by llvm or bap (but we can at least change our point of view).

I have started a branch that adds interworking to our disassembling framework git@github.com:ivg/bap#enable-interworking so we can switch the decoder between arm/thumb modes on the fly. What is left is to write the analysis that will identify the architecture of destination based on the call site (which is easier than it sounds) as well as an analysis that will identify the architecture of the function starts (which ended up harder than it sounded).

So let's focus on the latter. When we have a function start provided by our user, we also need to guess the correct architecture. If we start wrong, the whole graph will turn into a mess. We know from the specs that thumb functions will have the least significant bit set in the table. Unfortunately, llvm is unsetting this bit (and according to the llvm-objdump output, llvm really doesn't have any option to restore it, though I am still researching). Another potential source would be radare2, e.g., we can use ahf (@XVilka, how reliable is it?) to get the classification, but there is a major caveat: with interworking binaries radare2 breaks, so just enabling radare2 already messes things up. E.g., given the binary from #951, we can see that the most recent radare2 for some reason provides incorrect information even about plt entries, e.g.,

$ r2 -version
radare2 4.5.0-git 24948 @ linux-x86-64 git.4.4.0-429-ga933ba8be
commit: a933ba8bebab7c97b8ffdb56ee8bb5394cfbab2e build: 2020-07-16__11:05:33
$ r2 test
 -- This is an unregistered copy.
[0x00010a38]> is | grep read
482  0x00000f40 0x00010f40 GLOBAL FUNC   176      spec_read
490  0x00000ff0 0x00010ff0 GLOBAL FUNC   220      spec_fread
538  0x00005008 0x00015008 GLOBAL FUNC   72       BZ2_bzread
20   0x000006a8 0x000106a8 GLOBAL FUNC   16       imp.read

but read@plt is actually at 0x106ac, not 0x106a8 (an off-by-4 error)

$ objdump -d test  | grep read@plt
000106ac <read@plt>:
   10e8a:       f7ff fc0f       bl      106ac <read@plt>

(also confirmed independently in Ghidra)

So far, the most reliable source of information about the ISA is to call either objdump or readelf. I am currently investigating whether we can somehow grab this information from the llvm backend, or maybe revive our elf-loader to get the unmangled STT_FUNC entries. But so far, the story is surprisingly unpleasant, e.g., what I was assuming would be a no-brainer ended up being a major hassle.

As an alternative direction, I am also thinking about employing the byteweight algorithm just to guess the instruction set of the function start.
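
The least-significant-bit convention mentioned above can be sketched as follows, assuming the raw (unmangled) symbol value is available. The names `isa` and `classify_symbol` are hypothetical, and the addresses in the usage note are made up for illustration.

```ocaml
(* Instruction set of a function start. *)
type isa = A32 | T32

(* Split a raw function symbol value into (instruction set, address).
   A set bit 0 marks a Thumb function; the real address has it cleared. *)
let classify_symbol (value : int) : isa * int =
  if value land 1 = 1 then (T32, value land lnot 1)
  else (A32, value)
```

For example, a raw value of 0x8001 would be classified as a T32 function at 0x8000, while 0x8000 would be an A32 function at the same address.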

Thoughts?

Phosphorus15 · 2020-07-17T09:27:18Z

One noticeable thing here is that we should probably enable the lifters to tell whether some known destinations (addresses) should be executed in Thumb mode, which can give an exact answer with respect to the current instruction chain.

However, only 2 of the 5 BX-series instructions (which alter the Thumb mode flag) always have a statically determined target address (so we know whether the target instruction is executed in Thumb mode), while the rest depend on the lsb of the target register.

As for the byteweight algorithm, I just checked it here (tell me if I got it wrong). For my two cents, while it is not a bad idea to recognize whether a function should be executed in Thumb mode from its instruction patterns, I do think this could introduce unwanted probabilistic factors into the lifter and the knowledge base.

XVilka · 2020-07-20T04:40:21Z

I opened a bug to track this in radare2: radareorg/radare2#17300
When we fix it we will release 4.5.1. As for how reliable this information is in general, that's a big question. Moreover, you shouldn't trust this kind of information for mangled binaries: malware, some exotic platforms with major differences in ELF format structure (e.g. QNX), CTF tasks, etc.

ivg · 2020-07-21T13:12:27Z

Status update: I managed to extract the original symbol table values from the llvm backend, so we now have reliable roots tagged as arm or thumb. Now I am finishing the work on enabling interworking inside the disassembler. Looks more or less promising.

ivg

Same requests as for the thumb lifter, we need to get rid of PC and capitalize all the other registers.

ivg · 2020-07-21T21:20:41Z

plugins/arm-lifter/armng_env.ml

+  let qf = Theory.Var.define bit "qf"
+  let ge = Theory.Var.define half_byte "ge"
+  (* psuedo register storing temporary computations *)
+  let tmp = Theory.Var.define value "tmp"


we should get rid of this, and use Theory.Var.fresh or, when possible, Theory.Var.scoped.
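
To illustrate why a fresh temporary per use is safer than one shared tmp, here is a toy name generator. It does not use the BAP API; `make_fresh` is a hypothetical helper that only mimics the idea behind Theory.Var.fresh.

```ocaml
(* Toy stand-in for fresh variables: every call yields a new name, so two
   computations in the same instruction can never clobber each other's
   temporary, which a single global "tmp" cannot guarantee. *)
let make_fresh () =
  let n = ref 0 in
  fun () -> incr n; Printf.sprintf "#%d" !n

let fresh = make_fresh ()
let t1 = fresh ()
let t2 = fresh ()
```

Here `t1` and `t2` are distinct names even though both were requested for the same purpose.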

ivg · 2020-07-21T21:22:04Z

plugins/arm-lifter/armng_env.ml

+  let memory = Theory.Var.define heap_type "mem"
+
+  (** define grps *)
+  let r0 = Theory.Var.define value "r0"


Please synchronize the register names with what we had in the old lifter, i.e., capitalize them. Although I also dislike the caps, let's keep the tradition (the real reason: otherwise we may break a lot of downstream analyses and make comparison with the old versions hard).

ivg · 2020-07-21T21:23:27Z

plugins/arm-lifter/armng_env.ml

+  let r12 = Theory.Var.define value "r12"
+
+  let lr = Theory.Var.define value "lr"
+  let pc = Theory.Var.define value "pc"


The PC register is not really a register so it should be removed from here. All accesses to the PC register should be resolved to static constants that are equal to the address of the current instruction + some ISA specific offset. See how it was done in all other lifters.
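
A minimal sketch of that resolution, assuming the usual ARM convention that a PC read yields the instruction address plus 8 in ARM state and plus 4 in Thumb state. The names `isa`, `pc_offset`, and `pc_value` are hypothetical, and the word alignment that some Thumb instructions apply to the PC value is not modelled.

```ocaml
type isa = A32 | T32

(* ISA-specific distance between the instruction address and the value
   observed when the instruction reads PC. *)
let pc_offset : isa -> int = function
  | A32 -> 8
  | T32 -> 4

(* The static constant that replaces a read of PC. *)
let pc_value (isa : isa) (insn_addr : int) : int =
  insn_addr + pc_offset isa
```

With this, a read of PC in an A32 instruction at 0x1000 becomes the constant 0x1008, and no PC variable is needed.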

Phosphorus15 · 2020-07-22

For this, one of the problems I'm encountering is that, unlike BIL, Core Theory represents data effects and control effects with different type parameters (Theory.data and Theory.ctrl) that can only be unified with blk, which seems to me to represent a program block (like that of LLVM IR). This leads me to the conclusion that it can only be produced once per instruction (tell me if that's not right).

So, currently, to keep the lifter subroutines (one per instruction) smooth, I made the functions that have both effects return a (data effect, ctrl effect) tuple and finally resolve them with blk. I felt comfortable with this because only a small set of instructions needed the trick, and they did not much affect the overall consistency.

However, considering that we are now abolishing the PC register variable, and that before ARMv8 any instruction with full GPR access can access PC, that small set extends to almost every single ARM instruction (a catastrophe for the current implementation).

I do think the previous approach should be abandoned, but I hesitate over how to properly resolve this. One thought is to define the DSL over a BiMonad to carry both effects, but I'm afraid this could complicate things.
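
The tuple approach described above can be pictured with a toy model. None of this is the Core Theory API: `data`, `ctrl`, `blk`, and `mov` are hypothetical stand-ins, with effects reduced to printed strings.

```ocaml
(* Toy effects: data is a list of assignments, ctrl is at most one jump. *)
type data = string list
type ctrl = string option
type blk = { body : data; jump : ctrl }

(* Unify both effects once, at the end of the instruction. *)
let blk ((d, c) : data * ctrl) : blk = { body = d; jump = c }

(* Every handler returns both effects. A write to PC turns an ordinary
   move into a control effect, which is why almost any instruction with
   a general-purpose destination may need the pair. *)
let mov (dst : string) (src : string) : data * ctrl =
  if String.equal dst "PC" then ([], Some ("jmp " ^ src))
  else ([dst ^ " := " ^ src], None)
```

So `blk (mov "R0" "R1")` carries only a data effect, while `blk (mov "PC" "LR")` carries only a jump.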

Another minor problem is how we can make this substitution possible in the Core Theory/KB system:

(** Substitute PC with its value *)
let resolve_pc mem = Stmt.map (object(self)
    inherit Stmt.mapper as super
    method! map_var var =
      if Var.(equal var CPU.pc) then
        Bil.int (CPU.addr_of_pc mem)
      else super#map_var var
  end)
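
The same substitution can be sketched without a mapper object, on a toy expression type. This is only an illustration of the idea: the `exp` type and this `resolve_pc` are hypothetical and unrelated to the Core Theory/KB interfaces.

```ocaml
(* Toy expression language with a variable named "PC". *)
type exp =
  | Var of string
  | Int of int
  | Add of exp * exp
  | Load of exp

(* Replace every occurrence of the PC variable with a known constant. *)
let rec resolve_pc (pc : int) (e : exp) : exp =
  match e with
  | Var "PC" -> Int pc
  | Var _ | Int _ -> e
  | Add (a, b) -> Add (resolve_pc pc a, resolve_pc pc b)
  | Load a -> Load (resolve_pc pc a)
```

For instance, `resolve_pc 0x1008 (Load (Add (Var "PC", Int 4)))` yields a load from a fully constant address expression.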

Phosphorus15 · 2020-07-27T11:17:43Z

I believe this lifter is up to standard for the next stage. Note that a ref of a bitvec is introduced to share the current PC address within the DSL, so putting/reading it has side effects, as in the following snippet:

bap/plugins/arm-lifter/armng_main.ml

Lines 39 to 44 in d89f7a7

let lift_move (insn : Defs.insn ) ops address =
  let ( !% ) list = DSL.expand list in
  let open Mov in
  let () = Mov.DSL.put_addr address in
  match insn, ops with
  | `MOVi,  [|dest; src; cond; _; wflag|] ->

Phosphorus15 · 2020-07-29T10:35:00Z

Another important thing about this ARM32 lifter is that we're going to add VFP support.

I was previously worried that the default bap-mc provides no attributes to let llvm-mc decode VFP instructions, but bap-mc actually works fine with them:

phosphorus@phosphorus-virtual-machine:~$ bap mc --arch=armv7 --show-insn -- 0x17 0x2b 0x53 0xec
VMOVRRD(R2,R3,D7,0xe,Nil)

So the only thing I'm not sure about for now is the representation of single-precision (fp32) and double-precision (fp64) registers in BIR. The fp32 registers overlap with the fp64 registers, e.g., the fp64 register D0 consists of the two fp32 registers S0 and S1, which share the same Bitvec, so naturally we would like to make them the same Theory.Var. However, I'm worried that this would make the BIR representation of single-precision operations too complicated for human readers and even for analysis passes. For instance, a simple

S0 := abs(S0)

might eventually become something like:

D0 := float(concat(bits(high(D0, 32)), bits(abs(low(D0, 32)))), 64)

which is an absolute living hell even for such simple semantics.

ivg · 2020-08-03T14:12:39Z

So the only thing I'm not sure about for now is the representation of single-precision (fp32) and double-precision (fp64) registers in BIR. The fp32 registers overlap with the fp64 registers, e.g., the fp64 register D0 consists of the two fp32 registers S0 and S1, which share the same Bitvec, so naturally we would like to make them the same Theory.Var. However, I'm worried that this would make the BIR representation of single-precision operations too complicated for human readers and even for analysis passes. For instance, a simple

The traditional approach is to define a variable that will cover the whole register and express operations on its parts via extract and concat, so yes, it will look like D0 := float(concat(bits(high(D0, 32)), bits(abs(low(D0, 32)))), 64) but with let expressions, it would be a little bit more readable, e.g.,

#1 = extract:31:16[D0]
#2 = extract:15:0[D0]
D0 := #1.abs(#2)

or, using let-scoped expressions,

D0 := let $1 = extract:31:16[D0] in
      let $2 = extract:15:0[D0] in
      $1.abs($2)
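
The extract-and-concat pattern can be checked on plain 64-bit integers. This is a sketch under the assumption that S0 is the low 32 bits of D0 and S1 the high 32 bits; the helper names are hypothetical, and abs is modelled as clearing the IEEE 754 sign bit.

```ocaml
(* D0 as a raw 64-bit pattern; S0 is bits 31..0, S1 is bits 63..32. *)
let mask32 = 0xFFFF_FFFFL

let low32 (d : int64) : int64 = Int64.logand d mask32
let high32 (d : int64) : int64 = Int64.shift_right_logical d 32

(* concat hi lo: hi occupies bits 63..32, lo occupies bits 31..0. *)
let concat (hi : int64) (lo : int64) : int64 =
  Int64.logor (Int64.shift_left hi 32) (Int64.logand lo mask32)

(* abs on a binary32 bit pattern: clear the sign bit (bit 31). *)
let fabs32 (bits : int64) : int64 = Int64.logand bits 0x7FFF_FFFFL

(* S0 := abs(S0), expressed as an update of the whole register D0. *)
let abs_s0 (d0 : int64) : int64 =
  concat (high32 d0) (fabs32 (low32 d0))
```

The high half passes through untouched, which is exactly what the let-bound form expresses.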

XVilka · 2020-11-04T03:07:49Z

@ivg what should be done with this PR? Am I right that it was subsumed by the merged Thumb one?

A lifter is a piece of code that promises `Theory.Program.Semantics.slot`. I had to remove the address, because `bap mc` doesn't provide the address for instruction (sic). Not a big deal, as we have the address in the memory chunk. We will update the upstream later, but for now, let's just not ask for the address (note this is only specific to `bap mc`, for normal `bap` the address is provided).

ivg · 2020-11-06T18:56:16Z

@ivg what should be done with this PR? Am I right that it was subsumed by the merged Thumb one?

Not really; the plan is that the thumb plugin will eventually be subsumed by this one. Ideally, we would like to have all our lifters rewritten using the Core Theory representation. The thumb lifter PR showed that it is possible and actually easy. The problem with the thumb PR, which I discovered too late, is that there was no need for it :) That is because every thumb instruction could be recoded as an ARM instruction. Therefore we need only one ARM lifter, which will handle both ARM and Thumb instructions. With that said, we decided to keep the thumb plugin, so far, as an independent entity, since it is a core theory implementation of a small subset of the ARM Theory. The next step would be to add more instructions to it and eventually merge it with the ARM lifter. But this is after the 2.2.0 release. And after I get some vacation; I haven't had one for a couple of years and it looks like I need one :)

ivg · 2021-04-02T20:11:26Z

Closing as it is now possible (and is much easier and more productive) to write lifters in Primus Lisp. Many thanks to all involved in this PR!

Phosphorus15 force-pushed the arm-refactor branch 2 times, most recently from 4ea1780 to c5a0638 Compare July 13, 2020 08:52

ivg reviewed Jul 14, 2020

Phosphorus15 mentioned this pull request Jul 14, 2020

enables ARM Thumb support #1122

Closed

Phosphorus15 force-pushed the arm-refactor branch from a216417 to 5f358d6 Compare July 14, 2020 10:26

ivg mentioned this pull request Jul 16, 2020

BAP does not support well for thumb instruction set #951

Closed

XVilka mentioned this pull request Jul 16, 2020

Elf - incorrect information about PLT entries radareorg/radare2#17300

Closed

ivg requested changes Jul 21, 2020

XVilka mentioned this pull request Jul 22, 2020

Provide a standalone static IR JonathanSalwan/Triton#473

Closed

Phosphorus15 requested a review from ivg July 27, 2020 11:15

Phosphorus15 mentioned this pull request Jul 29, 2020

Adding ARM Vector Floating Point (VFP) support #1193

Closed

Phosphorus15 force-pushed the arm-refactor branch from f66f583 to a0d3a87 Compare August 24, 2020 10:42

Phosphorus15 force-pushed the arm-refactor branch from a0d3a87 to 00c268e Compare September 4, 2020 08:01

Phosphorus15 force-pushed the arm-refactor branch from 00c268e to a27b8de Compare September 16, 2020 07:48

Phosphorus15 force-pushed the arm-refactor branch from a27b8de to b9405aa Compare September 28, 2020 11:38

Phosphorus15 added 5 commits November 6, 2020 12:02

+ arm lifter refactor draft

6dbdcb2

+ arm dsl

simplified dsl extension

2b847c0

+ dsl update

4c04bb9

+ complete cond and shift utils

completed move-like instructions

424806e

+ completed bits operations

324f664

+ extended dsl with loop-gen

Phosphorus15 and others added 16 commits November 6, 2020 12:03

+ enabled clz with repeat+ fix typo

cb1c361

multiplication instructions

1d5e163

mem, multi-mem and special instructions

5167b4a

arm branch instructions (functionally incomplete)

fc88f58

VFP support boilerplate

44c62fb

+ lifter KB interfaces+ move instructions integration

a8d712a

eliminated branch with constant condition

9ba93bd

rename lifter files & modules to avoid conflicts

d24bcbc

completed instruction sets

f2f1e89

lift multiple mem instructions

256ee6b

adds arch check

d3d9bda

+ rename grps

98d4180

specialized PC grp's R/W in DSL semantics

469de88

get rid of use of Env.pc

131b6eb

top-level exception handling

3f078d3

Phosphorus15 force-pushed the arm-refactor branch from b9405aa to 3f078d3 Compare November 6, 2020 04:03

XVilka mentioned this pull request Dec 10, 2020

PLT stub names not being resolved properly rizinorg/rizin#153

Open

ivg closed this Apr 2, 2021
