-
-
Notifications
You must be signed in to change notification settings - Fork 413
Operator Precedence
In Python you can write expressions, sometimes in various ways. Consider:
a*x**2 + b*x + c
This is a concise way of writing:
(a*(x**2)) + (b*x) + c
In the more concise expression, parentheses do not appear. It is understood that when an expression with exponentiation (**
) occurs combined with an expression with multiplication (*
), do the exponentiation first and then combine with the multiplication. Similarly, in the absence of parentheses dictating otherwise, perform multiplication (*
) before addition (+
).
Decompilation outputs expressions with parenthesis omitted when they are not necessary. The process by which the semantic actions, or tree phase of the decompiler figures this out is by using a precedence table which specifies which operator to perform when it is appears next to another operator.
Again, operator precedence is used to decide whether or not to add parenthesis around an expression. Note that an operator like multiplication is involved here and in the context of another expression. In the expression:
sin(2 * theta) + pi
the multiplication appears in the expression 2 * theta
, but that expression doesn't butt up with another expression. Instead the expression is used as the argument in a function call. The call is used as the first operand of an addition operation, but the important thing is that two expressions with operators don't appear next to each other as was the case in a*x**2 + b*x + c
.
To recap, in a*x**2
the expression with a multiplication operator appears next to an expression with an exponentiation operator. How can we specify whether this means (a*x)**2
or a*(x**2)
to someone who does not know this?
In Python's the reference manual operators are separated into a hierarchy. Here is an excerpt of a much larger table:
operator | Notes |
---|---|
.. . | ... |
** |
Exponentiation |
+x , -x , ~x
|
Positive, negative, bitwise NOT |
* , @ , / , // , %
|
Multiplication, matrix multiplication, division, floor division, remainder |
... | ... |
In this table, we see that **
appears in a table entry line before *
. And that means to put the expression with **
in a group before grouping an expression with *
when the two expressions are adjacent.
Using the table above, grouping behavior would be identical if we had written a@x**2
, a/x**2
, etc.
The way this is specified in the decompiler though is to give for each operator a numeric value. Here is the corresponding code in consts.py
that captures the above:
PRECEDENCE = {
# ...
"BINARY_POWER": 4, # Exponentiation: **
"unary_op": 6, # Positive, negative, bitwise NOT: +x, -x, ~x
"BINARY_MULTIPLY": 8, # *
"BINARY_MATRIX_MULTIPLY": 8, # @
"BINARY_DIVIDE": 8, # /
"BINARY_FLOOR_DIVIDE": 8, # //
"BINARY_MODULO": 8, # Remainder, %
# ...
}
In the actual dictionary in the Python code, we list things from high number (8) to low (4) rather than as shown above. I ordered things this way just to make the correspondences with the Python reference manual more clear.
The thing to note is that lower numbers, like 4 group first, and higher numbers like 8, group last.
Above, we described how we see things from the human side, and how to specify precedence in a way the decompiler can act on. But from the standpoint of the decompiler, how does this all work?
We will describe this in terms of some examples before summarizing with the general process. Using the example a*x**2
, The decompiler in the semantic or tree phase sees:
1 expr
2 bin_op (BINARY_MULTIPLY, precedence 8)
3 0. expr
4 L. 1 0 LOAD_NAME a
5 1. expr
6 bin_op (BINARY_POWER, precedence 4)
7 0. expr
8 2 LOAD_NAME x
9 1. expr
10 constant
11 4 LOAD_CONST 2
12 2. binary_operator
13 6 BINARY_POWER
14 2. binary_operator
15 8 BINARY_MULTIPLY
...
As we traverse from the top-level tree for expr
at line 1 down, four more expr
s are encountered below it. The expr
at line 3 we don't have to worry about since that covers LOAD_NAME
at line 4 and LOAD_NAME
is not an operator. However, the tree node for the expr
at line 5 we do have to take into account operator precedence.
At the expr
tree node at line 5, the precedence value set from line 2 in BINARY_MULTIPLY
is passed down. The expr
first child in line 6 is bin_op
, an operator which has a precedence of 4. Since parent precedence value 8 is greater than child precedence value 4, the expression with the parent should group first. When the parent groups first, we don't have to add parenthesis.
The other expr
s under the second bin_op
we don't have to worry about either since, again, those involve LOAD_NAME
or LOAD_CONST
which are not operators.
Now let's compare that with (a*x)**2
:
The decompiler in the semantic or tree phase sees:
1 expr
2 bin_op (BINARY_POWER, precedence 4)
3 0. expr
4 bin_op (BINARY_MULTIPLY, precedence 8)
5 0. expr
6 L. 1 0 LOAD_NAME a
7 1. expr
8 2 LOAD_NAME x
9 2. binary_operator
10 4 BINARY_MULTIPLY
11 1. expr
12 constant
13 6 LOAD_CONST 2
14 2. binary_operator
15 8 BINARY_POWER
Again, as we descend the top-level tree for expr
at line 1, our more expr
s are seen: the first expr
at line 3 involving another bin_op
at line 4.
Passed down to the expr
at line 3 is the precedence BINARY_POWER
with value 4. But in contrast to its bin_op
parent at line 2, the bin_op
child at line 4 has a precedence value 8. Since the parent value 4 is less than the child value 8, the child expression needs to group first and that is done by adding a parentheses in the handling of the expr
at line 3.
In general, in the function that handles expressions, n_expr
, a precedence value is passed to it via self.prec
, That is compared to the expression's single child, if that happens to be an operator with a precedence value. If the precedence value of the child is less than the precedence value of the parent, then a set parenthesis needs to be added.
There are situations where operator composition appears in the tree in a more implicit way. This occurs when the template engine expands strings that have operators mentioned in a template string.
Going through an example will make this more clear.
Consider this Python code:
x = "system" if not "site-packages" in file else "local"
We have a grammar rule if_exp_not
that corresponds roughly to IfExp(test=[UnaryOp(...)])
in a Python AST.
The output template an if_exp_not
node of a parse tree is:
"if_exp_not":
( "%p if not %p else %p",
...
(0, "expr", PRECEDENCE["unary_not"]),
...)
The way to read this is that if the nonterminal type of the parse tree is a if_exp_not
node, then the first child of the parse tree is an expression (or "expr") that will get expanded in between the words "not" and "else".
The thing to note though is that in contrast to the Python AST for this, where there is an explicit UnaryOp, the UnaryOp operator has been folded into the nonterminal if_exp_not
. But we still need a way to know whether we need parenthesis around the expression. In the example given above, ... if not "site-packages" in file else ...
, we don't need additional parenthesis.
However if the expression is a or b
, then, yes, do need parenthesis around this expression. Otherwise Python will interpret ... if not a or b else ...
as ... if (not a) or b else ...
rather than ... if not (a or b) else ...
.
The way indicate the correct meaning we intend is to set the precedence of the parent precedence for that particular "expr" to be the precedence of "unary_not": logically that is the surrounding context that this expression is getting used. Other "expr"s that might appear as placeholders in other parts of the template can have different precedences or no precedence.
In particular, there probably will never be a parenthesis required by the expression before "if not" or the expression after "else". So that precedence can be set to a high value.
If we change first and last "%p" to "%c", the precedence value is whatever was passed down to us from above. Rather than rely on that we set that to be something equal to what matches the precedence of an if ... else
expression.
%P description to be continued...