Table‐driven AST‐to‐Python Translation
Table of Contents
- Introduction
- Mapping Parse Tree Nodes to a Format Table
- Format specifiers
- Additional Fragment Format Specifiers
A number of semantic-action rules are specified via pattern-matching rules in dictionaries. This greatly reduces the need to write methods for semantic actions that are very similar. Furthermore, if you know the codes, you can get a more compact description of the actions to be performed.
See Writing Semantic-action Rules for a general overview of semantic actions. Here we focus on how we have customized the framework for deparsing. Specifically, we describe the kinds of tables used and how the table entries direct what to output and how to traverse the AST tree.
The instance method gen_source() is the top-level routine for walking the AST produced by parsing. But the instance method default() is the routine called for each node when walking the AST. It introspects the class it is in for nonterminal instance methods and consults some tables to determine what to do. Instance-method definitions that start with n_ and end in the nonterminal name are considered first. When no such method is found, the tables are consulted.
When a table entry is found using the nonterminal name as the key, that entry and the node are passed as parameters to the template_engine() method. Entries in tables are tuples: the first item is a string, often with %-escaped format specifiers; after that come the arguments to be used by the format specifiers. The number and types of the arguments after the first item are determined by the format string. This is similar to how C and other languages handle formatted output.
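To make the dispatch concrete, here is a minimal sketch, not the actual uncompyle6 code, of the lookup order just described; how tables other than TABLE_DIRECT are selected is omitted, and the attribute names used are assumptions for illustration.
def default(self, node):
    # Minimal sketch of the dispatch order described above (not the
    # real implementation): a custom n_<nonterminal> method wins;
    # otherwise a matching table entry is handed to template_engine().
    key = node.kind                      # terminal/nonterminal name
    method = getattr(self, "n_" + key, None)
    if method is not None:
        method(node)                     # custom semantic action
    elif key in self.TABLE_DIRECT:       # other tables are consulted similarly
        self.template_engine(self.TABLE_DIRECT[key], node)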
Some examples of valid format rules are given below. The only things you need to know below are that:
- %c requires one integer index argument, or a tuple of an integer index and a string nonterminal name,
- %% takes no argument, and
- %p requires a tuple of two integers, or a tuple of an integer index, a string nonterminal name, and another integer.
'INPLACE_MODULO': ( '%%=',),
'INPLACE_OR': ( '|=' ,),
'binary_expr': ( '%c %c %c', 0,
(-1, 'binary_op'),
( 1, 'expr' ) ),
'unary_not': ( 'not %c', (0, 'expr') ),
'slice1': ( '%c[%p:]', (0, 'expr'), (1, 100) ),
Here are some invalid rules:
'INPLACE_MODULO': ( '%=',), # there is no %= specifier
'INPLACE_OR': ( '|=' , 0), # no argument has been called for in the string
'binary_expr': ( '%c %c %c', 0, -1 ), # not enough arguments
'unary_not': ( 'not %c', 'a' ), # expecting an integer argument
'slice1': ( '%c[%p:]', 0, 1 ), # %p expects a tuple
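To make the mechanics concrete, here is a toy sketch of how a template engine might consume a format string and its arguments. It is not uncompyle6's template_engine(): it handles only %% and %c, and the write and traverse callables are stand-ins for the deparser's own output and traversal machinery.
import re

def toy_template_engine(entry, node, write, traverse):
    # Scan the format string: literal text is written as-is, and each
    # specifier pulls its argument(s) from the table entry in order.
    fmt, args = entry[0], entry[1:]
    arg_i = 0
    for m in re.finditer(r"%(.)|([^%]+)", fmt):
        spec, literal = m.group(1), m.group(2)
        if literal is not None:
            write(literal)               # plain text between specifiers
        elif spec == "%":
            write("%")                   # %% takes no argument
        elif spec == "c":
            arg = args[arg_i]
            arg_i += 1
            index = arg[0] if isinstance(arg, tuple) else arg
            traverse(node[index])        # recurse into the indexed child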
In some cases, an instance method may do some setup and call another nonterminal method, or call template_engine directly.
Aside from default() and template_engine, another routine that is called a lot is prune(). This is called when processing for a node is finished. It raises a "prune" exception which may percolate up several calls.
Here is an example of that kind of situation. There is a method for the async_call_function nonterminal called n_async_call_function(). It simply needs to prepend the word "async" at the front of a call. After outputting "async", it changes the type of its node to call_function and then calls template_engine() with this modified node. Since template_engine() has already made a pass over the node, after template_engine() finishes, self.prune() is called to ensure another pass over the node doesn't occur.
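A hedged sketch of what such a method might look like follows; the body and helpers such as self.write() are illustrative assumptions rather than the exact source.
def n_async_call_function(self, node):
    # Illustrative only; the real method differs in detail.
    self.write("async ")                 # prepend "async" to the call
    node.kind = "call_function"          # reuse the call_function template
    self.template_engine(self.TABLE_DIRECT["call_function"], node)
    self.prune()                         # no second pass over this node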
When we don't have a custom method for a nonterminal, we look up the action to perform on the nonterminal via one of two tables.
The format of the rules in each table entry is the same in both tables. What differs is where the AST node the rule applies to is located with respect to the nonterminal name we look up.
Below, "K" is the nonterminal key name we look up, and "N" is the node that the rule should be applied to.
The tables, and how they locate the node relative to the nonterminal key, are:
        N&K                   N
      / | ... \             / | ... \
     O  O      O           O  O      K

   TABLE_DIRECT             TABLE_R
The dictionary values in the tables are format specifiers, which are described in a later section. But before that we describe the table types in more detail and give examples of each.
By far, the most used table is TABLE_DIRECT. Here both the nonterminal key name to look up and the node to apply that rule to are in the same place: the parent node.
The degenerate case is where a grammar item has no children. In other words, it is an instruction that has text associated with it that needs to be output.
Examples include instructions like BINARY_ADD (+), INPLACE_ADD (+=), or UNARY_POSITIVE (+).
What about instructions that sometimes don't have text associated with them, like POP_TOP? They simply don't appear in a table, and the templates of the rules they appear in don't reference them.
A common situation you may encounter when adding a new instruction is that text that should be associated with it doesn't show up in the output. Typically that is because an entry for it needs to be added to TABLE_DIRECT.
An example of a true nonterminal that has a child is unary_not. A fragment of an AST for unary_not would look like this:
unary_not (2)
0. expr
L. 1 0 LOAD_NAME 'a'
and the TABLE_DIRECT
entry for this would look like:
'unary_not': ( 'not %c', 0 ),
With TABLE_R, the nonterminal key appears as the last child among its siblings, while the node to apply the rule to is its parent.
The reason this kind of thing happens, and the reason we are interested in the last child, is that instructions are in reverse Polish order: operands come first, followed by the operator, and here it is the operator we are interested in.
An example of something that uses this is del_stmt
. The AST fragment
for del sys.a
is:
del_stmt (2)
0. expr
L. 1 0 LOAD_NAME 'sys'
1. 2 DELETE_ATTR 'a'
And the TABLE_R
entry for this is
'DELETE_ATTR': ( '%|del %c.%[-1]{pattr}\n', 0 )
However, we need to record in del_stmt that we are using the TABLE_R pattern. The way that is done is by entering it in MAP like this:
MAP = {
'del_stmt': MAP_R, # ...
}
Something similar goes on for storing attributes, as found in the x.a of x.a = 5; those are grouped under designator.
The current Python grammar doesn't use this very much.
In this section we describe the meaning of specific format specifiers like %c or %|.
The %
may optionally be followed by a number in square brackets,
which makes the engine walk down to that node before evaluating the
escape code. A few TABLE_R
entries make use of this. For example
'STORE_ATTR': ( '%c.%[1]{pattr}', 0),
'DELETE_ATTR': ( '%|del %c.%[-1]{pattr}\n', 0 ),
%c is probably the most common specifier. Its argument is a single integer representing a node index, or a tuple whose first entry is an integer representing a node index and whose second entry is a string giving the node's 'kind' field (the terminal/nonterminal name) that this node is asserted to be. That is, if a string is there, it is checked against the node.
The specifier indicates that the node and its children should be descended into. For example, for a unary expression -val we might have the AST fragment:
unary_expr (2)
0. expr
L. 2 4 LOAD_NAME 'val'
1. unary_op
6 UNARY_NEGATIVE
The template rule below for unary_expr indicates that we should descend into the unary_op subtree (the second child) first, and then into expr:
'unary_expr': ( '%c%c', 1, 0 ),
There is just one grammar rule for unary_expr
, and it is:
unary_expr ::= expr unary_op
Therefore, a more stringent template format specifier can be used:
'unary_expr': ( '%c%c',
(1, 'unary_op'),
(0, 'expr') ),
The above specifier documents that node 0 is an 'expr' (expression) node, while node 1 is a 'unary_op' node. Not only is this documented in the code, but the template engine will also check this at runtime when encountering this template. Sure, the assert check slows things down a little, but not noticeably, and I think there is great benefit here.
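For illustration only, the runtime check for the (index, name) form of a %c argument amounts to something like the following; the helper name here is made up.
def resolve_child(node, arg):
    # Validate the stricter (index, name) form before descending.
    if isinstance(arg, tuple):
        index, expected = arg
        # A node's "kind" field holds its terminal/nonterminal name.
        assert node[index].kind == expected, (
            "expected %s, got %s" % (expected, node[index].kind))
        return node[index]
    return node[arg]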
%p is like %c but sets the default operator precedence for all the nodes under the one specified. Its argument is a tuple of the node index and the precedence value (an integer), or a triple of the node index, the node name, and a precedence.
Operator precedence can change the output by requiring grouping symbols or by changing where they appear.
For example, in arithmetic expressions it determines whether we need parentheses or not.
Consider this tree:
binary_expr (3)
0. expr
binary_expr (3)
0. expr
L. 2 4 LOAD_NAME 'x'
1. expr
6 LOAD_NAME 'y'
2. binary_op
8 BINARY_MULTIPLY
1. expr
10 LOAD_NAME 'z'
2. binary_op
12 BINARY_ADD
The tree is unambiguous that "x" and "y" should be multiplied before "z" is added. That also matches Python's default, non-parenthesized operator precedence.
When the precedence value of BINARY_MULTIPLY is less than that of BINARY_ADD, as it is in the code, we are indicating that BINARY_MULTIPLY, in the absence of parentheses, groups first and most closely. So the output for this portion will be x * y + z and no parentheses are needed.
However, suppose we instead set the precedence of BINARY_MULTIPLY to a larger value than that of BINARY_ADD. In this case the code will think that the default meaning of x * y + z is equivalent to x * (y + z), but the tree indicates that the multiplication must be done first. So in the output the code will add explicit parentheses, giving (x * y) + z.
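The grouping decision can be illustrated with a small standalone sketch; the numeric values and the helper below are made up for illustration and are not part of the deparser. A lower value binds more tightly, and a child whose precedence exceeds that of its context gets explicit parentheses.
PRECEDENCE = {"BINARY_MULTIPLY": 4, "BINARY_ADD": 6}   # illustrative values

def maybe_parenthesize(text, child_prec, context_prec):
    # Parenthesize a child that binds less tightly than its context.
    return "(%s)" % text if child_prec > context_prec else text

# "x * y" rendered inside an addition: 4 <= 6, so no parentheses.
print(maybe_parenthesize("x * y",
                         PRECEDENCE["BINARY_MULTIPLY"],
                         PRECEDENCE["BINARY_ADD"]) + " + z")   # x * y + z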
The above is all nice and good, but alas it doesn't explain how %p is used specifically in the code. The above is all handled by setting values for BINARY_ADD and BINARY_MULTIPLY in the PRECEDENCE dictionary.
So let me describe how %p is typically used in the code. Often it is used to set a node to a very high value, which ensures that the node groups last and grabs the most. For example, in slices like a[5:] we have the rule:
'slice1': ( '%c[%p:]', 0, (1, 100) ),
I am not sure this is strictly needed. Perhaps someone can come up with a better example and description.
%C is used for list-like nodes that can have an arbitrary number of similar children. It needs a tuple of 3 items: a starting node index, the maximum value of an ending node index, and a string to be inserted between sibling children. Consider the Python statement:
a, b, c = (1, 2, 3)
The AST fragment for this is:
assign (2)
0. expr
L. 1 0 LOAD_CONST (1, 2, 3)
1. designator
unpack (4)
0. 2 UNPACK_SEQUENCE_3 3
1. designator
4 STORE_NAME 'a'
2. designator
6 STORE_NAME 'b'
3. designator
8 STORE_NAME 'c'
Notice that here we want "," added between "a" and "b", and between "b" and "c".
The TABLE_DIRECT
entry for unpack
then is:
'unpack': ( '%C%,', (1, maxint, ', ') )
This indicates that we skip the first node, UNPACK_SEQUENCE_3. To indicate that we want all remaining items, we give a high number, maxint, which is defined in the program. And finally we indicate that terms are separated with ",".
Notice there is also %, which is described next.
%, appends a comma (",") if the last %C only printed one item. This is mostly for tuples on the left-hand side of an assignment statement, since BUILD_TUPLE_n pretty-prints other tuples. This specifier takes no arguments.
%P is similar to %C, but you also include an additional precedence value, just as you do when going from %c to %p. As with %C, I don't have a good explanation yet of why it is needed. But here is an example from the code:
'build_tuple2': ( '%P', (0, -1, ', ', 100) )
Note that index -1 is used to indicate the last item in the list.
%D is the same as %C but is for left-recursive lists like kwargs, where the rule goes to epsilon at the beginning. If we were to use %C, an extra separator for the epsilon would appear at the beginning.
Example:
'kwargs': ( '%D', (0, maxint, ', ') )
%| inserts spaces up to the current indentation level. It takes no arguments.
%+ increases the current indentation level. It takes no arguments.
%- decreases the current indentation level. It takes no arguments.
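These three typically work together in templates for compound statements. An illustrative (not verbatim) TABLE_DIRECT entry for an if statement, assuming child 0 is the test and child 1 the body, might look like:
'ifstmt': ( '%|if %c:\n%+%c%-', 0, 1 ),
Here %| indents the "if" line itself, %+ indents the body one extra level, and %- restores the previous level afterwards.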
%{expr} runs Python eval(expr) in the context of the current parent node. This specifier takes no arguments.
The most common and simplest situation is to pick out the printable attribute field of an instruction. For example, in Python 2.7 for from os import path as p, the syntax tree fragment generated is:
alias (2)
0. 9 IMPORT_FROM 1 'path'
1. store
12 STORE_NAME 2 'p'
Here we want to pull out the printable attribute of IMPORT_FROM and STORE_NAME. So the TABLE_DIRECT entries are:
'IMPORT_FROM': ( '%{pattr}', ),
'STORE_NAME': ( '%{pattr}', ),
The resulting strings for the above are path and p, while the alias pattern joins these together to get path as p.
A more involved case is Python 3.6 f-string format specifiers. The AST fragment for f'{abc!s}' is:
fstring_single (2)
0. expr
L. 1 0 LOAD_NAME 'abc'
1. 2 FORMAT_VALUE 1 '!s'
The TABLE_DIRECT
entry for fstring_single
is:
'fstring_single': ( "f'{%c%{conversion}}'", 0),
There is also an n_fstring_method() which gets called first; it stores into the conversion attribute.
%[n]{expr} runs Python eval(expr) in the context of node[n]. This specifier takes no arguments.
This is similar to the above, but instead of working on the parent node we work on one of the child nodes.
Example:
Here is the table-driven fragment for the attribute
grammar rule:
'attribute': (
'%c.%[1]{pattr}',
(0, 'expr')
),
So when we see this:
attribute (2): ('%c.%[1]{pattr}', (0, 'expr'))
0. expr
L. 40 70 LOAD_FAST 'self'
1. 72 LOAD_ATTR dry_run
node[1].pattr
is evaluated giving dry_run
. The entire rule produces self.dry_run
.
%[n]{%X} evaluates/recurses on child n using the specifier %X, where X can be any one of the above format specifiers, e.g. %c, %p, etc. The arguments are the same as for the specifier used.
Example:
For an async for
pattern rule we have
list_afor (2): (
" async for %[1]{%c} in %c%[1]{%c}",
(1, "store"),
(0, "get_aiter"),
(3, "list_iter"))
When we see async for i in f
this might get parsed in Python 3.7 as:
list_afor (2): (' async for %[1]{%c} in %c%[1]{%c}', (1, 'store'), (0, 'get_aiter'), (3, 'list_iter'))
0. get_aiter (2)
0. expr
...
1. 26 GET_AITER
1. list_afor2 (11)
0. func_async_prefix (5)
...
1. store
36 STORE_FAST 'i'
2. func_async_middle (9)
...
3. list_iter: ('%c', 0)
We want to pick out node[1][1]
which should be a "store" rule, and node[1][3]
which should be a list_iter
rule.
The specifier %c
is applied to both of these children.
%% writes a %. It takes no arguments. It is used, for example, when we want to output the in-place modulo operator %=.
When we are handling fragment deparsing, all of the above format specifiers are used, but we have additional work to do in associating instruction offsets with nodes. To accomplish this, additional format specifiers are needed.
%x takes a tuple: the first item is a node index (the source node), and the second is a tuple of node indexes indicating all of the nodes to which the information from the source node should be propagated.
For example in:
'importstmt': ( '%|import %c%x\n', 2, (2,(0,1)), ),
information from node 2 is propagated to nodes 0 and 1.
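The idea can be sketched with a hypothetical helper; the helper name and the start and finish attributes are assumptions here, not the fragment deparser's actual API.
def propagate_offsets(node, source_index, target_indexes):
    # Copy the source node's offset range onto the listed target nodes
    # so they map back to the same instructions.
    source = node[source_index]
    for i in target_indexes:
        node[i].start = source.start     # beginning of the associated range
        node[i].finish = source.finish   # end of the associated range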
%r recursively associates location information with the string that follows.
For example in:
'break_stmt': ( '%|%rbreak\n', ),
The node will be associated with the text break, excluding the trailing newline.
Note that we normally associate the accumulated text with the node; we just don't do it recursively, which is where offsets are probably located.
%b associates the text from the previous start node up to what we have now.
For example in:
'importmultiple': ( '%|import%b %c%c\n', 0, 2, 3 ),
The node at position 0 will be associated with "import".