-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core Theory based ARM lifter #1174
Core Theory based ARM lifter #1174
Conversation
By now, all the instructions in legacy ARM lifter has been transcript and re-structured into Core Theory knowledge format, there's just steps from being a completely working lifter:
|
4ea1780
to
c5a0638
Compare
plugins/arm-lifter/arm_main.ml
Outdated
| Some addr, Some insn, Some mem -> | ||
run_lifter label addr insn mem Lifter.lift_with >>= fun sema -> | ||
provide_sematics label sema | ||
| _ -> raise (Defs.Lift_Error "insufficient knowledge") (* should I have not raised an error here? *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no-no, no raises, the lifter shall never raise an exception in case if it is unable to provide knowledge. It is totally normal not to know something. Basically, the idea is that each lifter is providing as much knowledge as it can, but no more. Also, the simple rule in BAP for exceptions - exceptions are for indicating programmer errors.
TL;DR; when there is not enough information to provide the semantics, just return the bottom value of the semantics, e.g., Insn.empty
which is the short alias for Knowledge.Value.empty Theory.Semantics.cls
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to embed the error message into some kinds of empty knowledge like BIL.special
do ?
TL;DR; when there is not enough information to provide the semantics, just return the bottom value of the semantics, e.g.,
Insn.empty
which is the short alias forKnowledge.Value.empty Theory.Semantics.cls
.
By the way, speaking of BIL.special
, how do I (or is it necessary to) express the same semantics as BIL.cpuexn
?
OK, so I turned the code into a lifter that is integrated with the rest of BAP (see the last commit message for the details). To be able to use it, you will need to remove the current arm plugin since you're using the same module names, e.g.,
and now we can see the new lifter in work,
The general plan would be to overhaul the old lifter with the new code, but I think that for some time we will need them both at least for testing purposes. One of the testing strategies is to compare the results of the two lifters. This is a long road, but the first step would be renaming the modules in this plugin so that they are not in conflict with the existing plugin. I suggest adding I will provide some more feedback soon. Otherwise, really impressive work. Many of the things that we need are currently in #1119, it is where I am currently paving the road towards arm/thumb switches, so I need to finish it first. I think that are good on this PR and you can continue working and write tests in the vein of the x86 test suite. And one of the first steps would be getting rid of that |
a216417
to
5f358d6
Compare
Let's start some discussion on interworking for the lack of a better place. Interworking, in ARM terminology, enables the mixing of two different instruction sets, in particular, the ARM32 architecture supports two instruction sets A32 and T32. From the point of view of ARM designers, we have to distinguish between the architecture and the instruction set. This point of view is not really shared by llvm or bap (but we can at least change our point of view). I have started a branch that adds interworking to our disassembling framework git@github.com:ivg/bap#enable-interworking so we can switch the decoder between arm/thumb modes on the fly. What is left is to write the analysis that will identify the architecture of destination based on the call site (which is easier than it sounds) as well as an analysis that will identify the architecture of the function starts (which ended up harder than it sounded). So let's focus on the latter. When we have a function start provided by our user we also need to guess the correct architecture. If we will start wrong the whole graph will turn up into a mess. We know from the specs that thumb functions will have the least significant bit set in the table. Unfortunately, llvm is unsetting this bit (and according to the llvm-objdump output, llvm really doesn't have any option to restore it, though I am still researching). Another potential source would be radare2, e.g., we can use
but
(also confirmed independently in Ghidra) So far, the most reliable source of information about the ISA is either to call objdump or readelf. I am currently investigating if we can somehow grab this information from llvm backend, or maybe revive our elf-loader to get the unmangled STT_FUNC entries. But so far, the story is surprisingly unpleasant, e.g., what I was assuming would be a no brainer ended up to be a major hassle. As an alternative direction, I am also thinking about employing byteweight algorithm just to guess the instruction set of the function start. Thoughts? |
One of the noticeable things here is that, we should probably enable the lifters to tell if some known destinations (addresses) should be executed in Thumb mode. Which can gives an exact answer in respect of current instructions chain. However, only 2 of the 5 As for the byteweight algorithm, I just check it here (tell me if I get it wrong), and for my two cents, while it is not a bad idea to recognize whether a function should be executed in Thumb mode by its instructions' pattern, I do think this could introduce unwanted probabilistic factors into the lifter and knowledge base. |
I opened a bug to track this in radare2 - radareorg/radare2#17300 |
Status update, I managed to extract the original symbol table values from the llvm backend so we now have the reliable roots tagged as arm or thumb, no I am finishing the work on enabling interworking inside the disassembler. Looks more or less promising. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same requests as for the thumb lifter, we need to get rid of PC and capitalize all the other registers.
plugins/arm-lifter/armng_env.ml
Outdated
let qf = Theory.Var.define bit "qf" | ||
let ge = Theory.Var.define half_byte "ge" | ||
(* psuedo register storing temporary computations *) | ||
let tmp = Theory.Var.define value "tmp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should get rid of this, and use Theory.Var.fresh
or, when possible, Theory.Var.scoped
.
plugins/arm-lifter/armng_env.ml
Outdated
let memory = Theory.Var.define heap_type "mem" | ||
|
||
(** define grps *) | ||
let r0 = Theory.Var.define value "r0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please, synchronize the register names with what we had in the old lifter, i.e., capitalize, although I also dislike the caps, let's keep the tradition (the real reason, we may break a lot of downstream analysis and make comparison with the old versions hard)
plugins/arm-lifter/armng_env.ml
Outdated
let r12 = Theory.Var.define value "r12" | ||
|
||
let lr = Theory.Var.define value "lr" | ||
let pc = Theory.Var.define value "pc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PC register is not really a register so it should be removed from here. All accesses to the PC register should be resolved to static constants that are equal to the address of the current instruction + some ISA specific offset. See how it was done in all other lifters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this, one of the problem I'm encountering now is that: different from BIL, Core Theory represents data effect and control effect with different polymorphic variable (Theory.data
and Theory.ctrl
), that could only be unified with blk
, which seems to me to be representing a program block
(like that of LLVM IR), leads me to the conclusion that this can only be produced once each instruction. (tell me if it's not right)
So, currently, to make the lifter sub-routines(for each instruction) smooth, I made those functions with both effects return a (data effect, ctrl effect)
tuple and finally resolve them with blk
, I'm feeling comfortable with this because only a small set of instructions needs this trick, and they do not much affect the over-all consistency.
However, considering that we are now abolishing the PC
register var, and that any instruction with full GRPs access can access PC
before ARMv8, the small set almost extends to every single ARM instruction (a catastrophe for current implemetation).
I do think the previous approach should be abolished, while hesitate over how to properly resolve this, one of the thoughts is to define the DSL over a BiMonad to carry both effects, but I'm afraid this could complicate things.
Another minor problem is that how can we make this substitution possible in Core Theory/KB system:
(** Substitute PC with its value *)
let resolve_pc mem = Stmt.map (object(self)
inherit Stmt.mapper as super
method! map_var var =
if Var.(equal var CPU.pc) then
Bil.int (CPU.addr_of_pc mem)
else super#map_var var
end)
I believe this lifter is up-to-standard for the next stage, to be noticed that a bap/plugins/arm-lifter/armng_main.ml Lines 39 to 44 in d89f7a7
|
Another important thing about this ARM32 lifter is that we're going to add As I was previously worried that the default phosphorus@phosphorus-virtual-machine:~$ bap mc --arch=armv7 --show-insn -- 0x17 0x2b 0x53 0xec
VMOVRRD(R2,R3,D7,0xe,Nil) So, the only thing I'm not sure about for now is the representation of single-precision(fp32) and double-precision(fp64) in
might eventually become something like:
which is absolutely a living hell even for such simple semantics |
The traditional approach is to define a variable that will cover the whole register and express operations on its parts via extract and concat, so yes, it will look like
or, using let-scoped expressions,
|
f66f583
to
a0d3a87
Compare
a0d3a87
to
00c268e
Compare
00c268e
to
a27b8de
Compare
a27b8de
to
b9405aa
Compare
@ivg what should be done with this PR? Am I right that is was subsumed by the merged Thumb one? |
+ arm dsl
+ complete cond and shift utils
+ extended dsl with loop-gen
A lifter is a piece of code that promises `Theory.Program.Semantics.slot`. I had to remove the address, because `bap mc` doesn't provide the address for instruction (sic). Not a big deal, as we have the address in the memory chunk. We will update the upstream later, but for now, let's just not ask for the address (note this is only specific to `bap mc`, for normal `bap` the address is provided).
b9405aa
to
3f078d3
Compare
Not really, we plan in the future that the thumb plugin will be eventually subsumed by this one. Ideally, we would like to have all our lifters rewritten using the Core Theory representation. The thumb lifter PR showed that it is possible and actually easy. The problem with the thumb PR that I had discovered too late is that there was no need for it :) That is because every thumb instruction could be recoded as an ARM instruction. Therefore we need only one ARM lifter which will handle both ARM and Thumb instructions. With that said, we decided to keep the thumb plugin, so far, as an independent entity, since it is a core theory implementation of a small subset of the ARM Theory. The next step would be to add more instructions to it and eventually merge it with the ARM lifter. But this is after the 2.2.0 release. And after I will get some vacation, I didn't have one for a couple of years and it looks like that I need them :) |
Closing as it is now possible (and is much easier and more productive) to write lifters in Primus Lisp. Many thanks to all involved in this PR! |
This pr presents an ARM lifter based on Core Theory/KB, intended to replace the legacy BIL lifter and provides more function & scalability than it does.
Currently the
move
andbits
instructions are refactored respectively, while the rests remains implemented.The new lifter code architecture is with a DSL module on its own, which is a extension of Core Theory operations to enables not only ARM itself but ARM Thumb (#1122) or even other lifters clearer in semantics, see the details in dsl_common.ml.