-
-
Notifications
You must be signed in to change notification settings - Fork 413
How does this code work?
Table of Contents
- Introduction
- A Tale of Two Kinds of Decompilers
- Use of a Context-Free Grammar Parser
- Decompilation Phases
- Scanning
- Parsing
- Semantic Analysis
- Footnotes
Also see my talk given at BlackHat Asia 2024. Slides and text intermingled are here.
When I first looked at uncompyle2, I was a little confused as to how this code worked, even though I was very familiar with compilers and compiler technology.
Upon reflection, I think the idea of treating decompilation as a traditional compiler is a good one: it moves this kind of technology towards automation and into the realm of better-understood theory and technology where there has been a lot of research and engineering.
The success of uncompyle6 and decompyle3 I think is due to this aspect. Using this approach it has been able to cover 30 or so separate versions of Python spanning over two decades. And it is able to produce such good results, while largely been maintained and updated by one person.
But from the standpoint of most decompilers, this approach is not mainstream and not the way most decompilers work. I think that if it were more mainstream, the state of decompiler coverage across programming languages and quality of decompilation would be improved.
And from the standpoint of traditional compilers, as described in the ground-breaking dragon book, this compiler is a little bit different. And therefore a little foreign and unfamiliar. Here, one has to take a step back and understand some of the more foundational principles of compilers, look at things a little bit differently, and adapt a little bit.
From the earliest days, there wasn't much written about how this decompiler worked, let along the broader idea of how compiling technology needed to be adapted for decompilation. To the extent this was understood by original authors, it was understood implicitly and not widely communicated.
I hope to rectify that here.
If you look at decompiler on Wikipedia, there are sections on data-flow analysis and type analysis; these are not needed in any Python decompiler for Python bytecode. That is because objects in a Python bytecode interpreter are the objects of Python; the types don't need to be resolved. Variables have the same name, scope and use as you find in the Python program.
The Wikipedia article is focused more on general purpose decompilers. These decompilers start from some machine language and via a single intermediate representation, or IR, give you some result that doesn't take much or any consideration of the programming language you started out with.
In order to do this, the result is often in a "high-level" language of the decompiler-writer's choice. There may be some slight syntactic sugar to make that language look more like a particular high-level language, but you probably won't be able to run the result unmodified through a compiler or interpreter for that high-level language.
An advantage of this approach is that it can result in reduced maintenance cost from the decompiler writer's point of view. One language fits all, putting the additional burden of learning this language onto the person using the decompiler.
In contrast, this Python decompiler decompiles to Python, and specifically the Python language variant or version that the bytecode says it was compiled from. Although there is a common instruction format, there is no single intermediate representation language or IR. Each Python version is free to define whatever instructions it prefers[^1].
[^1]: Lee Benfield reports that the Java Decompiler Cfr works this way too. See his article Anatomy of a Java Decompiler.
In practice the drift in Python bytecode is similar to the drift in the programming language. These two are loosely coupled — a big change in the language will result in a big change in bytecode instructions and possibly how the instructions are generated. But there is no hard dependency here: either side can change without a corresponding change on the other side.
I also find interesting the parallel and symmetry in the way the compiler and this decompiler work: just as the standard compiler for CPython doesn't use one of the "common" intermediate instruction representations such as the one by LLVM, or WASM, or JVM, so this decompiler doesn't use it either. (In theory a decompiler could be written this way.) By far, Python isn't the only programming language that takes this approach of using its own custom intermediate language.
To summarize, there are decompilers that use a common intermediate language internally and output source text in a common end-target language which is different from the source-code language. But this decompiler while it does not enforce a common intermediate language, it does require that the end target language is the same as the source language. If this decompiler produces something syntactically invalid in for the version of Python specified in the Python bytecode, that is a bug.
There are a number of advantages to an approach where the decompiler end result is in the same language as the source code that produced the bytecode. The most obvious one is that anyone using this kind of decompiler does not need to learn another programming language understand the result, i.e. a language invented by the decompiler writer. But related to this are a number of other benefits. The decompiled program, if it is correct, can then be fed right back into the compiler/interpreter. This too has obvious benefits for someone using the decompiler.
Additionally, this gives a means for checking accuracy of the decompiler. For example, there shouldn't be any syntactic errors in the result produced. That should be true in the situation where the output is in some sort of decompiler output language. But an individual would be hard pressed to determine the output is legal, since typically decompilers do not come with a language checker let alone an interpreter for the language they've invented.
More significantly, if we were to decompile an entire Python module or Python package and run it, it should behave exactly as it did before decompilation. And this makes any of the tens of thousand Python packages out there that come with tests (which is just about all of them), potential tests of the decompilation system. Typically Python packages come with rigorous tests to ensure their integrity. And there are some pretty large Python packages out there that depend on other huge Python packages. When you consider everything making up say SciPy, or Mathics-omnibus (an open-source Wolfram Language clone) which pulls in SciPy, Django, NLTK, NumPy and SciKit, the number of lines of Python code that is testable is easily in the 100's of thousands of lines.
This second kind of decompiler — a decompiler for a high-level dynamic-language interpreter that produces a high-level intermediate results — is in a class that is not yet described or circumcised in the Wikipedia article of 2022. I think of it as in the class of bytecode for high-level dynamic languages. Lua, Ruby and Emacs Lisp are also in this category. Possibly some versions of Java and JavaScript, e.g. V8, have aspects that are like this too. I write about this a little in Decompilation at Runtime and the Design of a Decompiler for Dynamic Interpreted Languages.
In Python, or any given version of Python, there is more or less one translation from Python to bytecode. PyPy alters or adds a few opcodes, but for the most part, its translation to bytecode is exactly the same as CPython's. Graal and Jython are different and compile to JVM. This decompiler doesn't make an attempt to deparse JVM back to Python.
Where we sometimes see variation is in the interpreter. Specifically Pyston and PyPy are JITin'g interpreters. Because that is below the level of Python bytecode, decompilation for them works the same (or pretty much the same for PyPy) as it does for CPython.
We use a grammar to parse Python bytecode back to Python.
There are a number of factors that make this possible and "easy" to do:
- For any given version of Python there is basically one kind of translation from Python to Python bytecode, specifically the CPython translation.
- Python bytecode is pretty high-level
- There are relatively short grammars for both Python and Python's AST which we can use to guide us in writing such a grammar for Python bytecode back to Python.
Although we use a grammar, it is an ambiguous grammar. I repeat: the grammar is not LALR(1), LR, SLR, LL(1), etc. It is not something you can use in Yacc or Bison.
For the most part however, the grammar is an LR grammar. The particular Earley algorithm parser used is pretty efficient at handling LR-like grammars. We have though a few non-LR, right-recursive rules instead of LR-favored left-recursive rules.
See again Decompilation at Runtime and the Design of a Decompiler for Dynamic Interpreted Languages for more details on this kind of parser and potential problems with it.
The use of an ambiguous grammar is frowned on in designing a programming language.
In writing code in a programming language there shouldn't be ambiguity in what is intended. However when you decompile bytecode or machine language there can be several valid representations of that bytecode in Python. Something might decompile to:
if a and b:
y = 1
or
if a:
if b:
y = 1
And both of these are equally valid.
Even more extreme, it can be impossible to disambiguate which source code was used. The Python 3.8 bytecode for:
[x for x in 'mother' if x != 'e' if x != 'r']
[x for x in 'mother' if x != 'e' and x != 'r']
is exactly the same for the two list comprehensions. This is despite the fact that the source code is distinct even up to Python's AST which tends to smooth out and canonicalize insignificant differences.
Writing a decompilation grammar is hard enough without the burden of trying to make it unambiguous, if that can be done at all.
In some situations, even though it is theoretically possible to write an unambiguous grammar for some kinds of things, using an ambiguous grammar makes the grammar simpler and easier to understand.
Here is an example that is usually mentioned in traditional compiler books. It has to do with expressions and precedence of operators.
Consider this standard LR grammar for expressions with binary plus and times:
expr ::= expr + factor | factor
factor ::= factor * term | term | (expr)
This grammar is expresses that binary "times" operator (*
) binds more tightly than the binary "plus" operator (+
). The rule expr ::= factor
is key to expressing this.
A simpler but ambiguous grammar for this removes the nonterminal factor
:
expr ::= expr + expr | expr * expr | (expr) | term
To see that this is ambiguous, if you parse term + term * term
, the first two expr
s around the addition could be grouped together in a single expr
first, or we could group the last two expr
s into a single expr
first. The standard and often-used suggestion is to handle operator precedence after a tree has been built.
In this decompiler, something similar arises in the handling of control flow.
For Python 3.8, we have a set of rules for list comprehension with if
expressions as part of the comprehension iteration. Furthermore, the "if" conditionals might start off with a not
expression or might have an or
operator in the expression. Here is an excerpt of the grammar:
comp_if_not ::= expr POP_JUMP_IF_TRUE comp_iter
comp_if ::= expr POP_JUMP_IF_TRUE comp_iter
comp_iter ::= comp_if_or
comp_if_or ::= expr POP_JUMP_IF_FALSE_LOOP expr ... comp_iter
comp_if_or ::= expr POP_JUMP_IF_TRUE_LOOP expr ... comp_iter
The above is ambiguous because we have the same rules for comp_if_not
and comp_if
. And semantically, these mean opposite things.
What distinguishes which one is valid requires looking at at further derivations via comp_if_or
: if the jump to the iteration loop is when the "or" condition is false, then we have a simple "if" condition, and a comp_if
rule selected should be selected.
On the other hand, if the jump to the loop is when the "or" condition is true, the comp_if_not
rule should be selected.
We could have avoided this in the grammar by having two variants of nonterminals for comp_if
, e.g. comp_if_true_loop
and comp_if_false_loop
, and then
we would have to propagate that true/false behavior to comp_iter
, replacing that rule with rules for comp_iter_true_loop
and comp_iter_false_loop
.
All of this increases grammar complexity.
Instead, with live with comp_if
and comp_if_not
rules that are the same, even though only one can be correct for any situation. Which rule is the correct one is decided at the time of the reduction rule by looking down the into the parse tree fragment associated with that reduction.
This decompiler has 3 distinct phases which roughly correspond to a traditional compiler:
-
Scanning Bytecode instructions to a Token stream
-
Parsing a Token Stream to a Parse Tree
-
Python Code Generation from a Parse Tree to Python Source Code
These are described below. Each phase has subparts which we will describe in their respective sections.
The process is kicked off by calling Code Generation which first calls the scanner, gets a token stream back from that and passes that on to the parsing, gets a parse tree back from that and then produces Python source code text.
Before we get to scanning, a little bit about Python bytecode which is a big part of the raw data that is decompiled.
Python bytecode in is just an array of bytes. It is theco_code
field of a Python code object.
Instructions are formed by carving out consecutive sequences of bytes of the bytearray block. Before Python 3.6, these consecutive sequences of bytes were either one or three bytes long depending on the opcode which appears first in the sequence. In more recent Python, instructions are always two bytes. A bytecode instruction is an opcode and a single operand field that appears somewhere in the byte array or co_code
field of a code object. The operand field sometimes encodes several pieces of information as flags bits of the operand or by packing into the operand several logical distinct fields. When an operand value needs to be outside what can be encoded in one byte, there is an EXTENDED_ARG
instruction before it gets applied to the operand of the instruction.
In contrast to languages which compile to machine code, dynamic language interpreters have no "linking" or "binding" phase. Instead, everything is bound together as part of executing bytecode.
Therefore there are bytecode instructions that create functions, classes, methods in classes and so on.
From a decompiler's standpoint then decompilation has to follow this same kind of process: the top level module or file is decompiled and in the process of doing that will be bytecode instructions for creating classes, functions, lambdas, generators, or comprehensions. In a number of cases information that is seen in the source code is split between information in the code object and information in bytecode instructions that create or "link" the code object. For example, in a list comprehension the code object it uses just knows that it has a parameter named .0
which is the parameter of the list collection it iterates over. The actual parameter name, the list collection, is found only through the function call. If you use the one of the options that shows details of decompilation like --asm
,or --tree++
you will see the parameter name .0
shown.
The implication of this is that a decompiler has to meld information from the caller and the from the code object produce source text that hides the underlying function call. This is done in semantics/gencomp.py
. In this or other decompilers, when you see .0
as a parameter name in the code it is a sign that the decompiler has failed to meld in this information.
Another decompiler, pycdc
, does not do this. It is instructive to use its output to given an example of what I mean here and to show how Python works for certain kinds of comprehensions.
Consider bytecode for this:
[c for c in __file__]
Here is an abbreviated disassembly of the 3.1 bytecode for this:
# Method Name: <module>
# Names:
# 0: __file__
1: 0 LOAD_CONST (<Code3 code object <listcomp> at 0x7f74934d0910, file lc.py>, line 1)
3 MAKE_FUNCTION (0 default parameters)
6 LOAD_NAME (__file__)
9 GET_ITER
10 CALL_FUNCTION (1 positional, 0 named)
13 POP_TOP
14 LOAD_CONST (None)
17 RETURN_VALUE
# Method Name: <listcomp>
# Varnames:
# .0, c
# Positional arguments:
# .0
# Local variables:
# 1: c
1: 0 BUILD_LIST 0
3 LOAD_FAST (.0)
>> 6 FOR_ITER (to 21)
1: 9 STORE_FAST (c)
12 LOAD_FAST (c)
15 LIST_APPEND 2
18 JUMP_ABSOLUTE (to 6)
>> 21 RETURN_VALUE
Here is how way pycdc
decompiles this bytecode:
(lambda .0: [ c for c in .0 ])(__file__)
The lambda function parameter, .0
, isn't a valid variable name However, if you were to change that parameter into something valid like x
:
(lambda x [ c for c in x ])(__file__)
you will find that, indeed, this does exactly the same thing as the original: it turns the file name of the Python code into a list of characters in the name.
"Scanner" is a term used in programming-language compilers and usually refers to the token recognition from ASCII or Unicode characters. Often tokens are described by regular expressions. Typical kinds of tokens you'll find are identifiers, reserved words, strings, numbers and so on.
In the decompiler, the input to the scanner is not ASCII characters, but CPython bytecode which we first disassemble into bytecode instructions. In contrast to a flat array of bytes, instructions are structures that have instruction information parsed out already.
In Python, the modules load and unmarshal from the marshal library assist in the process of extracting bytecode; load
reads a CPython-compiled file and unmarshal
extracts from the module object "code" objects. To convert bytecode into instructions, the Python module dis can be used.
However these modules assume that the bytecode you want to disassemble is the same bytecode as the running Python interpreter. When this is not the case, we cannot use these modules. Furthermore, earlier versions of the dis module do not provide the convenient instruction abstraction. And when using a Python interpreter that is running bytecode different from the bytecode that is to be decompiled, we need to deal with the fact that Python the "code" object can vary. We handle all of these problems using a cross-version disassembly library called xdis.
As the term "disassembly" (in contrast to the compiler-centric word "scanning") implies, and as can be gathered from the above description, there is more to this than what could easily be done reading a file and running regular-expression on that alone. For one thing, regular-expression matching can be done in a single pass over the input. Routines in the xdis module make more than a single pass over the extracted bytecode. In particular, a pass is made over the instruction sequence to find the targets of jump instructions.
Newer versions of the decompiler, like decompyle3 in particular, use the richer instruction format that xdis provides. However in uncompyle6 there are still remnants of the Python bytecode format in various scanners for the older Python versions. Until someone rewrites this code, this tends to make the older code a little more awkward to work with. This is also true if you look at the scanner code in alternate but similar decompilers like uncompyle2.
As an example, the Python 2.6 scanner that is in uncompyle6, scanner26.py
, you may see a reference to self.code[offset+3]
. The magic value offset+3
at that point in the code refers to the offset of next instruction. The value 3 to increment is determined by value of the opcode at self.code[offset]
. In this place in the code, the opcode is JUMP_ABSOLUTE
; the JUMP_ABSOLUTE
opcode with its two-byte operand is a total of 3 bytes in Python 2.6.
Newer code works with instructions rather than bytecode. This simplifies things: at instruction i
the next instruction is always self.insts[i+1]
no matter what the opcode value is or what Python version the bytecode comes from.
When xdis converts bytecode to bytecode instructions, EXTENDED_ARG
opcodes and information associated with that, i.e. its offset value, is folded into the instruction operand that follows.
After converting the bytecode stream into bytecode instructions, the bytecode instructions are munged or ingested into a token stream. In this stream, the name part of the token is used in matching in the parser grammar. Both the bytecode instruction stream and the token stream use the same instruction format (or an extension of that) that xdis uses to produces the bytecode instructions.
We describe next the difference between a bytecode instruction and a token (instruction).
In a traditional compiler, the end result of scanning ASCII or Unicode characters is a sequence of tokens which classifies each of the string fragments seen in the source code while also retaining other information. The idea here is that at the next level, parsing, you don't care about whether the string segment was the string "3" or "5", but rather that both of these are of the same category, here, a number. A similar thing happens with a variable name: the classification into a variable is sufficient, and the variable name is not needed in parsing. For punctuation though, comma, colon, and semicolon are different so those do need to be separate.
A tokenizer assigns a token category and a value to each token that it finds. The token class in a tokenizer for a conventional programming language might be string, reserved word, or identifier. For punctuation often there is no value: most punctuation comes in one form. And from the parser's standpoint, matching is done on the token category, called the token's name, not its specific value, e.g. "3" or "5". Actually, the value here for a number would be the number representation not its string representation.
Theoretically, the categorization process isn't strictly necessary, e.g. you could add additional grammar rules for a number for all the different numbers that are possible. Clearly this isn't practical though. Judicious classification of tokens, makes the grammar smaller and easier to follow in the compilation process.
Note that even though the grammar parses on the token category or name, it still can get access to the specific value that was seen to create the token. In particular, a compiler needs to access this value when the compiler goes to produces machine-code instructions. But the way it would produce code for working with the number 3 versus 5 would, for the most part, be the same except the the value in an operand would be different, e.g. an operand value of 3 versus 5.
In this decompiler, the token name is often the instruction's opcode name as found in the bytecode. However, there are some important tweaks that we need to pay attention to.
Some bytecode opcode names are more general than the decompiler's needs; there is benefit by classifying such opcode names further.
An example of this is Python's LOAD_CONST
instruction. The operand here can refer to a number, a string, a dictionary, a code object, an assertion, a comprehension argument, or other things.
In some grammar rules, the constant has to be of a specific type. For example, the rule for calling functions needs a code object at a particular place in the rule. While it is acceptable to use LOAD_CONST
for that code object in the grammar, things are clearer and less ambiguous (and thus faster) if we can match only against constants that are code objects. So when the scanner detects a LOAD_CONST
instruction with a code object operand, it will use the token name LOAD_CODE
instead of the opcode name LOAD_CONST
in the token instruction.
The LOAD_CONST
opcode can also be turned into:
ADD_VALUE
: used in long list, dictionary or set literals
LOAD_ARG
: used in comprehensions and generators which have a single argument .0
LOAD_ASSERT
: used in assert
statements; the operand is an AssertionError
name
LOAD_GENEXPR
: used in generator expressions
LOAD_LAMBDA
: used in lambda
expressions; the operand is a code object with method name "lambda"
LOAD_LISTCOMP
: used in list comprehension; the operand is a code object with method name <listcomp>
LOAD_SETCOMP
: used in set comprehension; the operand is a code object with method name <setcomp>
LOAD_STR
: used in format strings; the operand is a string
Here are other opcode name specializations that can happen: JUMP_ABSOLUTE
can turn into CONTINUE
, JUMP_FORWARD
, JUMP_LOOP
, or BREAK
as appropriate. Currently in Python, Jump instructions to an offset lower than offset the instruction is positioned at, currently is an indicator of a looping jump. CONTINUE
and BREAK
also have to appear in loops.
To speed parsing up greatly, and to disambiguate unnecessary parsing ambiguity, the decompiler creates custom grammar rules per code object (function, lambda, generator, or comprehension). These custom rules are triggered when the scanner finds specific instructions in the instruction stream. Some of the terminal symbols in these custom rules often contain a token name that is fabricated from the opcode plus information that is only available in the operand.
An example of this is found in bytecode to build a list. The opcode name used in the bytecode here is BUILD_LIST
. If a particular build-list instruction has an operand value of 2 (this pops two items off of the evaluation stack and the list will contain those two values), then the decompiler scanner will emit a token instruction whose token name is the opcode and the operand value folded into the the resulting token name. Here, the token name would be BUILD_LIST_2
which effectively replaces the BUILD_LIST
instruction. Below when we talk about MAKE_FUNCTION
we'll see there is a special grammar rule included for this.
In massaging the bytecode instruction stream into a token stream, a common theme we see is that information from somewhere is made explicit in the token name. Above, this information has been derived from the operand value of an instruction.
There is one broad area, especially in newer versions of Python, where it is very helpful to have additional information made explicit in the token stream. This broad area is in expressing control flow information. And this can't be obtained simply by looking at operand information in isolation.
In the early days of Python, code generation was so template driven that it was easy to parse control structures. For example even though:
if a and b:
y = 1
and:
if a:
if b:
y = 1
are semantically exactly the same, the template-driven system would produce different opcode sequences. In the second situation with the nested if
's, instructions might be added to address the fact that in the general case a new scope might be added, even though that is not the situation above.
These extra instructions generated by the Python could then be used to reconstruct control flow. However quickly this became insufficient.
Currently control flow was then handled by adding pseudo token instructions called COME_FROM
. The operand of the instruction indicates where the instruction that jumps to this location. For each jump, there is a separate COME_FROM
instruction.
Since the parser was under my control, I had the idea of doing additional checks before a rule is reduced. This works well, but to my knowledge something that is generally done in traditional parsers.
Just before a parser reduces a sequence of symbols, the parser can be instructed to call code to check the validity of the rule. This check has access to the token stream including these pseudo instructions. The call back function is registered based on the non-terminal name, but passed to the registered callback function is the exact rule that is to take place, and the callback function can decide what to do based on the specific rule. When registering the callback function, a Boolean indicates whether the callback function needs to inspect the tree fragment that would get created.
Without first analyzing the code for control-flow information, the coding of the rules can be slow and cumbersome to write. In the future, extended basic block and control flow information will be available from the extended instructions. The COME_FROM
pseudo-instructions will be replaced by basic-block and control-flow dominator begin and end information.
As just described, "scanning" or ingesting has already done a fair bit of set up to the tokens that get fed into the parser. Even with all of this, the code allows for even more parser setup before the parsing occurs. These are called "custom parse rules".
CPython bytecode is stack based: instructions push and pop values from an evaluation stack. For example CPython bytecode for a + b
is
0 LOAD_NAME 0 (a)
3 LOAD_NAME 1 (b)
6 BINARY_ADD
and the grammar rule that covers this is:
bin_op ::= expr expr binary_op
Where bin_op
roughly corresponds the the AST node BinOp
.
Above, bin_op
uses two stack arguments. That makes sense since we think of the binary "+" as having two operands or expressions. Some expressions though like a call instruction or the function definition can have an arbitrary number of operands or parameters.
This translates in bytecode to some opcodes accepting an arbitrary number of stack entries.
The semantics for the opcode MAKE_FUNCTION
is such an instruction that operates on an arbitrary number stacked entries. As mentioned in the last section, before parsing proper, we have changed the opcode to include the number of required positional arguments, e.g. MAKE_FUNCTION_0
. For a program to parse correctly, every instruction sequence has to be covered by a grammar rule.
But what grammar rule would you write for an instruction with an arbitrary number of arguments like MAKE_FUNCTION
?
We don't want to handle these with grammar rules like this:
mkfunc ::= LOAD_CODE MAKE_FUNCTION_0
mkfunc ::= expr LOAD_CODE MAKE_FUNCTION_1
mkfunc ::= expr expr MAKE_FUNCTION_2
...
which would slow down parsing immensely. More of that is described in the next section. And we'd also have to put a limit on the number of rules with MAKE_FUNCTION_x
. So instead we add the grammar rules that are specific to those kinds of calls that will be encountered in the instruction sequence to be parsed. That is, we will add:
mkfunc ::= expr LOAD_CODE MAKE_FUNCTION_1
only if we encounter a MAKE_FUNCTION
in the instruction sequence that requires exactly one positional argument. And we add that grammar rule only once, no matter how many single parameter functional calls there are.
Interpreters prefer the reverse Polish way of doing things: operand data is pushed on an evaluation stack before an operation opcode , like BINARY_ADD
, for those operands is encountered. In Polish Prefix or Infix order, an interpreter needs to do bookkeeping for separate boundaries of operands.
However for context free parsers this approach is slow when the number of operands is large. Consider this sequence for building a list:
8 LOAD_CONST 1
10 LOAD_CONST 2
12 LOAD_CONST 3
14 BUILD_LIST_3 3
We have a custom rule like list ::= expr expr expr BUILD_LIST_3
. However the poor parser on seeing the 3rd LOAD_CONST
, and just before seeing BUILD_LIST_3
needs to consider the possiblity we might have any of the following situations involving expr expr expr
It can be part of:
<expr expr expr BUILD_LIST_3>
or instead:
expr <expr expr expr BUILD_LIST_3>
or instead:
expr expr <expr expr expr BUILD_LIST_3>
or instead:
expr expr expr <expr expr expr BUILD_LIST_3>
The longer the list, the worse this problem of keeping track of possibilities becomes. And the more kinds of different-length lists, the number of possibilities also grows.
In this particular kind of case where we list items are always a fixed length (here length one), we can handle this in the scanning phase by looking back when hitting an opcode like BUILD_LIST
.
If all the entires are "LOAD_CONST" we can then add pseudo-instruction to mark the beginning of the list.
We don't do this for small lists of size three, but for a list of size 5 this becomes:
8_0 COLLECTION_START 0 'CONST_LIST'
8 ADD_VALUE 1 1
10 ADD_VALUE 2 2
12 ADD_VALUE 3 3
14 ADD_VALUE 4 4
16 ADD_VALUE 5 5
18 BUILD_CONST_LIST 5
Adding that COLLECTION_START
make parsing efficient; the grammar rules become:
add_consts ::= ADD_VALUE*
const_list ::= COLLECTION_START add_consts BUILD_CONST_LIST
Now there can be no ambiguity in where in the collection we are when we see anther entry to the list, and so parsing state that needs to be maintained is reduced.
By changing LOAD_CONST
to a unique token like ADD_VALUE
, we further reduce the possibility of interaction with other rules.
The semantic analysis pass generates Python source code or source code fragments from the parse tree produced by the parser. You can see the tree build using the --tree
(short: -t
) or --tree++
(short: -T
) option.
There has been some confusion over what comes out from the parse. Is it an "abstract syntax tree" or AST just like Python's AST?
I would say that what we have is a "parse tree", not an AST. it is true that in the process of building the tree, there is some of compaction going on which we describe later. But these changes are just to make the tree more readable, compact, and thus easier to work with.
However the essential difference in my mind is that an AST removes some of the tokens that were originally found in the token stream. In conventional languages these would be punctuation symbols like colon, comma, or semicolon. In our case, every instruction that appears in the input appears in the parse tree. And this is useful in fragment decompilation when we want to know where in the source code when a certain instruction is being processed.
That said, the parser tree at the upper levels does resemble Python's AST. We try to keep the names the same. But to emphasize that these are different we change the Python's AST CamelCase to snake_case. For example Pythons AST FunctionDef
appears in the parse tree as function_def
.
Certain list-like projection are turned into a list head with its items instead of a chain of lists each with two items. For example, for
stmts ::= stmts stmt
stmts ::= stmt
Instead of producing a tree:
stmts
stmts
stmts
stmt
stmt
stmt
we produce:
stmts
stmt
stmt
stmt
The list of nonterminals that is handled like this is in variable nt_list
inside class PythonParser in module uncompyle6.parser
.
Another simplication performed is removing singleton rules when they add any value.
An example of this
stmt ::= ifelsestmt
_stmts ::= stmt
suite_stmts ::= _stmts
In the tree going from suite_stmts
to ifelsestmt
is simpler and clearer. The variable singleton
in the same place that nt_list
is defined is what controls this.
Before turning the parse tree into source text there is a "transform" phase which transforms idioms of the tree (and hence source code) into other idioms (and hence presumably more idiomatic source code).
Here are some kinds of transformations that occur:
if x:
a
else:
if y:
b
becomes:
if x:
a
elif y:
b
Or:
if not a:
raise AssertionError
becomes:
assert a
Docstrings that are found in the code object are added as a child node of the thing the docstring is documenting.
In uncompyle/decompile the class that walks the parse tree to produce source text is the Walker
class in the Walker module. In uncompyle6 there are two modules pysource
and fragments
and the main classes that walk the tree are called SourceWalker
and FragmentsWalker
. The two are similar and FragmentsWalker is a subclass of SourceWalker. However since FragmentsWalker's needs are a little different, it is a separate class and module.
As with parsing, the semantic analysis phase allows for custom rules to added before walking the tree. In particular, semantic actions for each of the custom function rules need a semantic-action counterpart.
The Walker classes are subclasses of GenericASTTraversal which is included in the spark
module. The main function in there is the
method to traverse the tree. It is called, somewhat erroneously, "preorder" For each node with typestring name name, if the instance has a method called n_name, call that before walking
children. If there is no method defined, the method self.default() is called instead. the node has a method called name_exit, that is called after all children have been called. So in this sense this
function is both preorder and postorder combined.
For a detailed explanation of the machinations in semantic analysis see Table‐driven AST‐to‐Python Translation.