Skip to content

Deparsing technology and its use in exact location reporting

R. Bernstein edited this page Dec 20, 2015 · 29 revisions

One of my pet peeves has been the vagueness of reporting a line number as a position in a program. Go's [Pos type](https://golang.org/pkg/go/token/#Pos type) that is used in Go's AST is more informed in that it can encode the line number, file name, column number and byte offset all very compactly as a pointer. Even better, though, would be extend this to an interval or a pointer to a tuple of Pos'es. What makes this work or makes it sparse is that the number of "interesting locations" in a program, whether a single point or a pair of points, is sparse compared to the number of bytes of a program.

I used this idea in the debugger I did for Go. It would be cool if Go used its Pos idea more pervasively.

But now here's another totally different idea. In any language you can always get something like a program counter. In CPython it is a virtual-machine instruction offset. The packages uncompyle or uncompyle2, go back over a decade. They take bytecode and turn it back into Python. The code itself is written in a pretty cool way as it is all pattern-match driven, using a grammar-parsing tool written by John Aycock.

Suppose my CPython byte code is:

 LOAD_CONST 0
 STORE_NAME a

A scanner turns this into a token sequence "LOAD_CONST" and "STORE_NAME" with attributes 0, and "a" respectively attached to the two tokens. There is grammar rule, "assign", that matches LOAD_CONST STORE_NAME (via grammar rules for "expr" and "designator").

From those tokens (or rather, CPython instructions), we can derive non-terminal grammar symbol "assign". And then we can associate that with Python code: a = 0 using the abstract syntax tree nodes (and their attributes) rooted at that the "assign" nonterminal. Of course, close to the grammar "start" symbol there's a grammar for Python that has the expected Python grammar derivations:

    stmts ::= stmts stmt
    stmt ::= assign |  ...

But one thing to notice is that if the code generation in later Python releases changes, the top-level grammar generally doesn't. So we don't need to change those grammar rules above describing the kinds of Python statements.

And vice versa. If new grammar is added to Python, as it probably uses the same kinds of code generation at lower levels for assignment statements so we don't have to change those patterns either.

I'm waving my hands and simplifying a lot. See here and here for the more of the gory details of bytecode deparsing. I omitted, for example, operator precedence, and template-driven semantic actions.

But one eerie thing about this code is that people, myself included, are amazed at how exactly the program can be reconstructed. That is probably because Python code generation isn't very sophisticated and doesn't do a lot of optimization: it always handles a single Python construct in a single way.

But now back to my use. The existing uncompyle2 code just reads bytecodes and produces a string. For my purpose, for an offset I need a fragment of Python code around that. So I need to the keep the intermediate results, or at least from the instruction back up to the root of the parsed tree.

In sum if my code is:

def fib(x):
    if x <= 1: return 1
    return fib(x-1) + fib(x-2)

I can give results like:

opcode: CALL_FUNCTION_1
return fib(x - 1) + fib(x - 2)
                    ----------
Contained in...
return fib(x - 1) + fib(x - 2)
       -----------------------
parsed type: binary_expr

The reason I give the "opcode" or "parsed type" is that I realized that

 fib(x - 2)

can be ambiguous. Am I in the middle of the call as above, or am I about to use that value in the binary_expr? The opcode/parsed type makes it clear.

I should say this has been an incredible amount of work to pull off, and there are still lots of bugs. But the bugs don't necessarily impair the program's usefulness for my purposes; however they are more serious for the original use. For me, the program does what's needed in most circumstances, although possibly not all.