Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Structuring switch statements #13

Open
maximumspatium opened this issue Jan 4, 2018 · 9 comments
Open

Structuring switch statements #13

maximumspatium opened this issue Jan 4, 2018 · 9 comments

Comments

@maximumspatium
Copy link
Contributor

maximumspatium commented Jan 4, 2018

Hello Paul,

I recently picked up another real-world code snippet that makes use of the switch control operator.
You'll find its Pseudo-C code here.

Recovering switch statements is another big task, so I decided to start somewhere and to conduct a couple of experiments.

Compiler usually employ at least the following two strategies in order to convert (aka lower) high-level switch blocks into assembly:

  • balanced binary search trees if case values are sparse or contain large gaps
  • jump table (if case values are dense)

The above mentioned snippet utilizes the former strategy. Although SABl does currently emit the correct C-output (what's the good thing you can be proud of!), humans will have hard time understanding that if-else spaghetti.

Because I wasn't able to find any suitable algorithm in the (de)compilation literature I tried to code up a recognition of such structures. Here is what I got so far.

Although I'm now getting all switch-related CFG nodes recognized and classified properly, I have no success trying to collapse all these nodes into a generic switch control class. I replace the header node with Switch(BBlock), remove all related nodes from CFG and then connect the header node with the switch successor node. This unfortunately seems to screw up the graph.

It looks like I'd need to implement another collapsing strategy because switch cases can contain other nested control structures to be structured first.

I'm thinking about removing all search nodes (i.e. the nodes required for binary search) and construct a n-way node branching to the case nodes. That would be probably the result of converting a jump table into CFG.

Do you have any suggestions on how to proceed?

@pfalcon
Copy link
Owner

pfalcon commented Jan 5, 2018

Some quick replies:

Recovering switch statements is another big task

My attitude towards that was: "well, I'm ok with having just if/if else chains" ;-).

balanced binary search trees if case values are sparse or contain large gaps

I believe I read up on that somewhere, and it makes good sense. But in all fairness, I don't remember ever seeing such a code - for balanced binary tree. Well, maybe because it's indeed spaghetti to human eye, and I didn't recognize an optimize search tree in cases I've seen.

I replace the header node with Switch(BBlock), remove all related nodes from CFG and then connect the header node with the switch successor node. This unfortunately seems to screw up the graph.

The reason for that would be that you're doing something wrong ;-). And the usual way to resolve it would be: 1) make a sample file with by minifying number of branches to process in it; 2) remove other unimportant code from it; 3) NEW! (of the last month) contribute as a testcase; 4) patiently work towards resolving issues with it.

Re: 3), I've added f15358b which might address some concerns with contributing real-world code as testcases. It's quite adhoc, and btw, I recommend to generate PseudoC files with addresses prefixed to each instruction anyway (or it's pretty hard to review stages of processing).

It looks like I'd need to implement another collapsing strategy because switch cases can contain other nested control structures to be structured first.

Yes, what should be structured first, should be structured first. SABl currently does it in a typical monte-carlo way, which is neither optimal nor ideal. But that "allows" one to tighten up matching patterns (like a case you witnessed, where While matcher matched completely unrelated node and created broken graph, I bet you have a similar case). Refactoring that to a proper post-order matching is in my todo list, which requires first growing test corpus.

@pfalcon
Copy link
Owner

pfalcon commented Jan 5, 2018

I'm thinking about removing all search nodes (i.e. the nodes required for binary search) and construct a n-way node branching to the case nodes.

Sure, that's how it should be done. No idea if you can find such nodes all at once, and then convert all at once, or would collapse one by one into Switch node. Majority of current structuring passes are such incremental (I believe I can put it that way).

I'd also suggest to start with handling just "proper structural switch", without C hacks like missing break's.

@pfalcon
Copy link
Owner

pfalcon commented Jan 5, 2018

Btw, I played once with handling jumptables, yeah (was quite adhoc).

@maximumspatium
Copy link
Contributor Author

make a sample file with by minifying number of branches to process in it
contribute as a test case

Here is a testcase...

I suppose that the output diff should contain the desired switch statement despite the test breakage, right?

@pfalcon
Copy link
Owner

pfalcon commented Jan 6, 2018

Here is a testcase...

Looks simple enough to be grokkable by a human. Please submit as a PR.

I suppose that the output diff should contain the desired switch statement despite the test breakage, right?

No-no, the idea is that testsuite always passes and enables fear-free refactoring. So, initial submission should contain whatever master produces, and it should be updated as master updates (hopefully with improvements to output, not regressions).

@pfalcon
Copy link
Owner

pfalcon commented Jan 6, 2018

Btw, @maximumspatium, didn't you want to improve/elaborate .dot CFG output? ;-) (just trying to shove more work items from my queue to yours ;-) ).

@pfalcon
Copy link
Owner

pfalcon commented Jan 6, 2018

@maximumspatium : Continuing from #19 (comment), did you see: https://github.com/pfalcon/ScratchABlock/blob/master/decomp.py#L176 ?

That should look familiar, and can be summarized as: SABl already contains passes to structure sequences of if statements as a switch like statement ("well, I'm ok with having just if/if else chains"). Exception they work with "linear" sequences, not trees.

New pass(es) you write would thus subsume that processing (error in #19 (comment) hints about that). Of course, in my list, first my passes should be "recovered" to work as expected (they did once upon a time). The testcase for that is https://github.com/pfalcon/ScratchABlock/tree/master/tests/decomp/030-fun_bfffd0eb

@pfalcon
Copy link
Owner

pfalcon commented Jan 19, 2018

@maximumspatium, When dealing with if's, pay attention to abnormal selection patterns. (Google "abnormal selection" structure if in doubt. The general idea is that all branches of if/else should be dominated by the if header. If not, it's abnormal selection.) Currently, SABl doesn't test for abnormal selection thoroughly, which leads to incorrect decompilation, e.g. https://github.com/pfalcon/ScratchABlock/blob/master/tests/decomp/_xtos_set_interrupt_handler_arg/_xtos_set_interrupt_handler_arg.lst.exp.clean

@pfalcon
Copy link
Owner

pfalcon commented Jan 20, 2018

Generally, there's a gap with true structuredness and C-structuredness, which is more expressive. It may seem that it applies to loops, but it turns out it applies even to if too. "True" structuredness is when each structure element has one entry and one exit, up to a complete function. C-structuredness allow multiple exits (but not entries).

For example, following nice-looking C if is only nice-looking thanks to 2 returns:

if (cond1) {
  // do sth
  if (cond2) {
    // do sth
    return 1;
  }
}
return 0;

That however contains abnormal selection. Converted to fully structured shape, it's more verbose and less nicely looking.

So, the point is that SABl so far follows "fully structured" way, which means boring splitting nodes to resolve abnormal selection. But to get nice concise C-structuredness, that will later need to be reverted ;-I.

Or maybe not, maybe we can detect ladder-style abnormal selection (when one branch is dominated by header, and another - not) and apply multiple-return transformation. It's hard to decide what to do when and even harder to actually do and prove that it's right.

So, apparently stick to fully structured way for now. At least that makes it MISRA-compliant decompiler, LOL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants