Include parentheses around expressions in token range #11

Open · yurikhan opened this issue Jan 12, 2018 · 27 comments

@yurikhan

I am trying to use asttokens for source-level transformations in a code editor. For example, I would like to position the cursor on a binary operation, press a key, and have the operands transposed: (a+b) * c becomes c * (a+b).

The way I’m trying to do that is:

  1. Parse a small portion of the file into an asttokens.ASTTokens. (The smaller the better because the whole file cannot be guaranteed to be syntactically correct all the time. For now, I operate at the level of function definitions.)
  2. Find the target node by the cursor position. Check if it’s a BinOp.
  3. Determine the character ranges corresponding to the left and right operands, and swap them. (This is to preserve as much of the original code formatting as possible. For instance, re-generating the whole function using astor will strip all comments and possibly mess up spacing and line structure.)

However, I find that at step 3, if either operand is enclosed in parentheses, the parentheses do not count towards that operand’s token range:

import asttokens

tokens = asttokens.ASTTokens('(a + b) * c', parse=True)
print('\n'.join(
    '# {} at {}: {!r}'.format(node.__class__.__name__,
                              tokens.get_text_range(node),
                              tokens.get_text(node))
    for node in asttokens.util.walk(tokens.tree)))

# Module at (0, 11): '(a + b) * c'
# Expr at (0, 11): '(a + b) * c'
# BinOp at (0, 11): '(a + b) * c'
# BinOp at (1, 6): 'a + b'
# Name at (1, 2): 'a'
# Name at (5, 6): 'b'
# Name at (10, 11): 'c'

so if I swap the corresponding character ranges, I get (c) * a + b which is wrong.

It would be nice if any parentheses that are not part of the parent node’s syntax were included in the token range of the child nodes:

# Module at (0, 11): '(a + b) * c'
# Expr at (0, 11): '(a + b) * c'
# BinOp at (0, 11): '(a + b) * c'
# BinOp at (0, 7): '(a + b)'
# Name at (1, 2): 'a'
# Name at (5, 6): 'b'
# Name at (10, 11): 'c'

(Or am I overlooking some other Obvious Way to Do It?)

@mbdevpl commented Jan 13, 2018

Since the parens of lhs and rhs seem to be included in the parent expression range, can't you partition the code at the operator and swap the strings? A bad implementation would be...

lhs, operator, rhs = code.partition(operator_str)
code = ''.join([rhs, operator, lhs])

@yurikhan (Author)

Splitting the whole parent expression on the first occurrence of the parent expression’s operator will obviously not work:

(a * b + c) * d
^^^ ~~~~~~~~~~~
lhs         rhs

 b + c) * d*(a 

What can work is to take the text between the end of lhs and the start of rhs, re.fullmatch it against r'(.*)\s*{}\s*(.*)'.format(re.escape(op)) (where op is the operator’s source text), and then adjust the range of lhs to (start of parent, end of lhs + len(match[1])) and the range of rhs to (start of rhs - len(match[2]), end of parent). A code sketch follows the diagrams below.

~~~~~~~~~~~~~~~  parent
(a * b + c) * d
 ^^^^^^^^^    ^
 lhs        rhs  as reported by asttokens

          ~~~~   fullmatch against (.*)\s*\*\s*(.*)
(a * b + c) * d
          ^      match[1]  (match[2] is empty)
^^^^^^^^^^^   ^
lhs         rhs  adjusted

Not the most elegant way, but it will do for now.
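
A rough sketch of that adjustment, assuming op is the operator’s source text and using asttokens’ get_text_range; the non-greedy first group keeps the whitespace around the operator out of lhs, matching the diagrams above. Untested, and it inherits the regex caveat raised in the next comment:

import re

import asttokens

def adjusted_operand_ranges(atokens, binop, op):
    """Expand lhs/rhs character ranges over the padding around the operator."""
    p_start, p_end = atokens.get_text_range(binop)
    l_start, l_end = atokens.get_text_range(binop.left)
    r_start, r_end = atokens.get_text_range(binop.right)
    between = atokens.text[l_end:r_start]
    # Non-greedy first group so trailing whitespace stays with the operator.
    m = re.fullmatch(r'(.*?)\s*{}\s*(.*)'.format(re.escape(op)), between, re.S)
    return ((p_start, l_end + len(m.group(1))),
            (r_start - len(m.group(2)), p_end))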

@mbdevpl commented Jan 13, 2018

Well, I wrote "A bad implementation would be...", so of course my naive implementation doesn't work in all cases ;) And by the way, the regex doesn't work 100% either: after all, an expression can also contain arbitrary whitespace and comments.

E.g. what about this expression:

(a # this is my lhs
# *** end of lhs ***
*
# *** start of rhs ***
b # this is my rhs
)

I wouldn't worry about parentheses in particular. Instead, I'd find the actual token of the operator and take its location from there. If comments and whitespace between the sub-expressions and the operator are not included in the sub-expressions' text, then you'd still need custom code to handle this.

lhs_last = node.left.last_token
rhs_first = node.right.first_token
node_tokens = list(tokens.get_tokens(node))

Then take the sublist between lhs_last and rhs_first, find the token of the right type (i.e. tokenize.OP) and the right value, and get the correct location from it.
It seems like a lot of work, but if you want to preserve all formatting properly, and anticipate all possible formatting, it seems like the only approach ;)
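
A minimal sketch of that scan, assuming the operator's source text is known up front (hard-coded here as '*'); token_range and match_token are my reading of asttokens' public API, untested:

import tokenize

import asttokens

def find_operator_token(atokens, binop, op_string='*'):
    lhs_last = binop.left.last_token
    rhs_first = binop.right.first_token
    # Walk the coding tokens between the operands; the first matching
    # OP token is the operator itself.
    for tok in atokens.token_range(lhs_last, rhs_first):
        if asttokens.util.match_token(tok, tokenize.OP, op_string):
            return tok  # tok.startpos / tok.endpos give the text offsets
    return None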

@yurikhan (Author)

Ugh. I played a bit with your example. It’s much uglier than you expected. The BinOp spans from the a up to and including b. The comment # this is my rhs is considered to be outside the BinOp. So token-based transposition (without breaking out of the operator’s token range into Expr) yields this result:

(
# *** start of rhs ***
b*a # this is my lhs
# *** end of lhs ***
 # this is my rhs
)

This is syntactically correct but will not meet any reasonable user’s expectations.

@mbdevpl commented Jan 15, 2018

I see :D Well, in any case the authors of such convoluted expressions are probably not the target audience of your extension ;) Good luck with your project! Is it on GitHub?

@yurikhan (Author)

Not yet, I’m still assessing if it’s viable.

@dsagal (Member) commented Jan 16, 2018

Parentheses are a bit special in that they don't create extra nodes in the AST (they only affect the order of operations), so they need a bit of extra work to capture. Luckily, ASTTokens provides .prev_token() and .next_token() methods which can skip non-coding tokens such as comments. Putting this together, here's a proposal:

import token

import asttokens

def range_with_parens(atokens, node):
  """Returns (startpos, endpos) of node, expanded over any surrounding parentheses."""
  first = node.first_token
  last = node.last_token
  while (asttokens.util.match_token(atokens.prev_token(first), token.OP, '(') and
         asttokens.util.match_token(atokens.next_token(last), token.OP, ')')):
    first = atokens.prev_token(first)
    last = atokens.next_token(last)
  return (first.startpos, last.endpos)

def text_with_parens(atokens, node):
  """Returns text of node including any surrounding parentheses."""
  first, last = range_with_parens(atokens, node)
  return atokens.text[first:last]
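
As a hypothetical usage sketch (the helper names are from the proposal above), this should recover the parenthesised operand from the opening example:

atok = asttokens.ASTTokens('(a + b) * c', parse=True)
left = atok.tree.body[0].value.left   # the BinOp 'a + b'
print(text_with_parens(atok, left))   # expected: '(a + b)'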

@yurikhan (Author) commented Jan 18, 2018

Thanks, this kinda works.

#!/usr/bin/python3

import sys
import textwrap
import token

import asttokens


def conservative_extent(atokens, node):
    prev = atokens.prev_token(node.first_token)
    next = atokens.next_token(node.last_token)
    if (asttokens.util.match_token(prev, token.OP, '(') and
        asttokens.util.match_token(next, token.OP, ')')):
        return prev.endpos, next.startpos
    return node.first_token.startpos, node.last_token.endpos


def greedy_extent_limited(atokens, node, start_limit, end_limit):
    first = node.first_token
    last = node.last_token
    while True:
        prev = atokens.prev_token(first)
        next = atokens.next_token(last)
        if not (asttokens.util.match_token(prev, token.OP, '(') and
                asttokens.util.match_token(next, token.OP, ')') and
                prev.startpos >= start_limit and
                next.endpos <= end_limit):
            return (atokens.next_token(prev, True).startpos,
                    atokens.prev_token(next, True).endpos)
        first = prev
        last = next


def test_transpose(text):
    print(text)
    print('↓')
    atokens = asttokens.ASTTokens(text, parse=True)
    node = atokens.tree.body[0].value
    start, end = conservative_extent(atokens, node)
    l_start, l_end = greedy_extent_limited(atokens, node.left, start, end)
    r_start, r_end = greedy_extent_limited(atokens, node.right, start, end)
    print(''.join((text[:l_start],
                   text[r_start:r_end],
                   text[l_end:r_start],
                   text[l_start:l_end],
                   text[r_end:])))
    print()


def main():
    test_transpose(textwrap.dedent('''\
        ((a * b + c) * d)'''))
    test_transpose(textwrap.dedent('''\
        (a # this is my lhs
        # *** end of lhs ***
        *
        # *** start of rhs
        b # this is my rhs
        )'''))


if __name__ == '__main__':
    sys.exit(main() or 0)

Output:

((a * b + c) * d)
↓
(d * (a * b + c))

(a # this is my lhs
# *** end of lhs ***
*
# *** start of rhs
b # this is my rhs
)
↓
(
# *** start of rhs
b # this is my rhs
*a # this is my lhs
# *** end of lhs ***
)

The common case is nice and correct. The ugly case is still ugly but correct.

@dsagal (Member) commented Jan 18, 2018

Good point to limit to the parent's extent. You could probably simplify the interface a bit to still just take one node, and expand its range to include parentheses, limiting the range to the range of the node's parent.

Although I am a little confused: in which case does limiting to the parent actually cause different behavior? Is one of your examples affected?

@yurikhan (Author)

As Mateusz pointed out, it’s not only about parentheses. I do want to expand each sub-node’s range to include parentheses, but I also want to include comments that are within the parent’s parentheses.

Consider the “ugly example” above. I will tokenize it for clarity and mark out the various ranges involved. I will also add a few extra pairs of parentheses.

Here’s the starting position. The original parent range is too small so limiting to it will not let me extend over the comment after b:

    ┌─────────────────────parent──────────────────────┐
 ( ( ( a ) COMMENT NL COMMENT NL * NL COMMENT NL ( b ) COMMENT NL ) )
      └─┘ .left                                   └─┘ .right

If I extend each range to include each matching pair of parentheses and only those, I get this:

┌───────────────────────greedy_extent(parent)────────────────────────┐
 ( ( ( a ) COMMENT NL COMMENT NL * NL COMMENT NL ( b ) COMMENT NL ) )
    └─────┘ .left                               └─────┘ .right

Still not greedy enough. So I extend the operands’ ranges all the way up to but not including the nearest significant tokens that do not form a pair of parentheses:

 ( ( ( a ) COMMENT NL COMMENT NL * NL COMMENT NL ( b ) COMMENT NL ) )
    └───────────.left───────────┘ └────────────.right────────────┘ 

After drawing these schematics, I do believe the parent range is not strictly necessary. Still, I’m inclined to keep it simply because it will remind my future self what the maximum possible scope of the transformation is.

    ┌────────────────conservative_extent(parent)─────────────────┐
 ( ( ( a ) COMMENT NL COMMENT NL * NL COMMENT NL ( b ) COMMENT NL ) )
    └───────────.left───────────┘ └────────────.right────────────┘ 

@dsagal (Member) commented Jan 18, 2018

Makes sense. The convention in Python is that end-of-line comments are associated with the code on that line, and lines that only contain a comment are associated with the following line. So a helpful utility might expand a node's token range to include any end-of-line comments AND any preceding comment-only lines (or maybe only preceding comment-only lines until you hit a blank line), and also include surrounding parentheses.

(One difference from your proposal is if you have an end-of-line comment after the operator * -- it seems most natural to let it stay there, rather than be considered part of the following node.)
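
If it helps, here is a rough sketch of the end-of-line half of that heuristic, assuming asttokens' next_token(..., include_extra=True) and the tokenize-style .start/.end positions on tokens; untested and illustrative only:

import tokenize

def expand_to_eol_comment(atokens, node):
    """Extend node's last token over a trailing same-line comment, if any."""
    last = node.last_token
    nxt = atokens.next_token(last, include_extra=True)
    if nxt.type == tokenize.COMMENT and nxt.start[0] == last.end[0]:
        return nxt  # the comment sits on the line where the node ends
    return last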

There was just a discussion of a similar requirement in #10 -- about finding end-of-line comments after a token.

Also, to make this completely general-purpose, we'd run into a problem with parentheses and function calls. E.g. in foo(a + b), we do NOT want the argument expansion to be (a + b).

@yurikhan (Author)

if you have an end-of-line comment after the operator * -- it seems most natural to let it stay there, rather than be considered part of the following node

I have no problem with that. In my code above, I’m expanding right’s range to include the NL immediately following the * because that’s what is easy to do with existing means, but it would be better to stop at the COMMENT.

( ( a ) COMMENT NL COMMENT NL * NL COMMENT NL ( b ) COMMENT NL )
 └───────────.left───────────┘    └──────────.right───────────┘

in foo(a + b), we do NOT want the argument expansion to be (a + b)

Yes, that’s what I had in mind when I said “parentheses that are not part of the parent node’s syntax” in the original comment. However, when a function is called with a single argument, that argument’s expression can be wrapped in an arbitrary number of pairs of parentheses, and only the outermost is syntactic:

foo ( ( ( a + b ) ) )
     └──.args[0]───┘ after expansion

@dsagal (Member) commented Jan 18, 2018

That all makes sense. Well, if you think such a utility belongs in asttokens, I would welcome a pull request. And if you'd rather keep it in your own code, that's fine too, and we can revisit if anyone else has a similar need.

@yurikhan (Author)

I will try out the concept privately to gain some hands-on experience with it. If/when it proves useful, I may propose a pull request.

Let me see if I get this right: you want it as a utility only, not as a change to the default behavior, right?

@dsagal (Member) commented Jan 19, 2018

I guess different use cases might want different notions of which range of tokens comprises a node. E.g. in a * (b+c) # foo, the "sum" node could be b+c or (b+c) or (b+c) # foo. To make all of these possible, I suggest keeping the default behavior of marking a node with the minimal range (b+c in this case), and adding a utility to expand it to include parentheses and optionally comments.

@alexmojaki (Contributor)

While working on #36 / #28, I've found that in Python 3.8 the parentheses surrounding a tuple are now being included, because the col_offset of the node has moved to cover them. You can see it by running this script on 3.7 and 3.8:

from asttokens import ASTTokens

source = """
(1, 2)
"""
atok = ASTTokens(source, parse=True)
t = atok.tree.body[0].value
print(t)
print(atok.get_text(t))
print(t.col_offset)

There's still no difference for the source (1 + 2), so maybe this isn't the right issue to talk about this, but it seems related.

So at least for tuples, the choices are:

  1. Have differing behaviour from 3.8 onwards.
  2. Explicitly remove the parentheses.
  3. Ensure the parentheses are always included.

I'm in favour of 3 because I think it's more intuitive. People tend to think of tuples as being defined by parentheses even if they're not, which is probably why Python has made this change. Of course this will mean changing the behaviour of the library, which may cause problems for some people. @dsagal what do you think?

@alexmojaki (Contributor)

This issue with tuples and parentheses is also causing other bugs. asttokens tries to include trailing commas within a tuple, so when the tuple already comes with its surrounding parentheses it ends up including commas outside of the tuple. This causes test_fixture8 to fail. Here's a minimal example:

import ast
from asttokens import ASTTokens

source = """
foo((x,),)
"""
tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.Tuple):
        print(node, node.lineno, node.col_offset)
atok = ASTTokens(source, tree=tree)
t = tree.body[0].value.args[0]
print(t)
print(atok.get_text(t))

In Python 3.7, get_text correctly returns x,. In 3.8 it returns (x,),.

@dsagal (Member) commented Oct 7, 2019

I agree it makes sense to change behavior to always include tuple parens, even pre-3.8. I don't know if anyone relies on them NOT being included, but since it's a change, it should perhaps get a major version bump.

We'd need to keep in mind, however, that sometimes tuples exist without parens.

@ssbr (Contributor) commented Oct 31, 2019

I relied on them not being included so that I could have minimal diffs. (This change broke my tests. :]) I can work around it though.

I don't agree that it makes sense. Whether parentheses are required is a property of the substitution site, not the expression. Suppose I take a statement like x = [$y] and replace it with x = $y; x = [x]. If the original statement was x = [(a, b)], then the parentheses are rendered unnecessary: ideally this should become x = a, b; x = [x] (not x = (a, b); x = [x], and also not x = a, b; x = [(x)]).

So whether I want to eat as many parens as possible, or as few as possible, depends on what I am doing: if I'm deleting something and replacing it with something else, I probably want to delete all the parens and decide only how many I need for my replacement expression. And when I am inserting an expression into a blank spot where an expression belongs, I only want to add as many parens as are necessary, and no more.

(The way I handle this during substitution is to try adding parentheses and see if it changes the parse tree. If it does, those parens are necessary to preserve the intended effect.)
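
A minimal version of that check, assuming the substitution site can be rendered as a format-string template (the function and template here are hypothetical stand-ins, not ssbr's actual tool):

import ast

def parens_needed(site_template, expr_src):
    """True if expr_src must be parenthesised at this substitution site."""
    wrapped = ast.dump(ast.parse(site_template.format('({})'.format(expr_src))))
    try:
        bare = ast.dump(ast.parse(site_template.format(expr_src)))
    except SyntaxError:
        return True  # the bare form does not even parse here
    # A differing tree means the parentheses carry meaning at this site.
    return bare != wrapped

# parens_needed('x = [{}]', 'a, b')   -> True   ([a, b] != [(a, b)])
# parens_needed('x = [{}]', 'a + b')  -> False  (the parens are redundant)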

IMO the "ideal" is for asttokens to allow the user to decide what they want: as many parens as can possibly be attributed to this expression, or none of them. Doing it inconsistently between expression types definitely doesn't sound great, and choosing only one or the other forces the user to work around it somehow (although really that isn't too hard.)

@alexmojaki (Contributor)

If the original statement was x = [(a, b)], then the parentheses are rendered unnecessary: ideally this should become x = a, b; x = [x]. (not x = (a, b); x = [x]

I disagree; I think more people would write x = (a, b) rather than x = a, b in general. Tuples are a bit of a weird case in that parentheses are often included even when they are not needed.

@ssbr (Contributor) commented Oct 31, 2019

Maybe, but I hope it's clear that making an opinionated decision here makes it less convenient for tools that build on top of asttokens to make different decisions.

Edit: and we can probably agree on it entirely for other kinds of expressions, like arithmetic: (a + b) * c vs x = a + b; x * c.

@dsagal (Member) commented Nov 1, 2019

That's an interesting point, and sorry, @ssbr, about creating hassle for you with the backward-incompatible change.

In your example, in the expression x = [(a, b)], the Tuple node is definitely (a, b). In an expression like x = [(a + b)], the parenthesized node is Expr and it is (a + b) (but in the first example, there is no containing Expr node). This library tells you what input code corresponds to the nodes, and I think the inclusion of parens in tuples made this correspondence more correct.

What you were relying on wasn't a very useful behavior. In fact, I imagine to get what you want, you needed to add your own handling to strip unneeded parentheses from an expression like (a + b).

So I think adding an option for whether to include parens solves a very particular need, and doesn't solve it very well. At least given your description, a more useful feature would be a function that takes an AST node and returns concise code for it. But that already exists, I think -- e.g. astroid's node.as_string() method can help with that.

@alexmojaki (Contributor)

In an expression like x = [(a + b)], the parenthesized node is Expr and it is (a + b) (but in the first example, there is no containing Expr node).

Not sure what you were trying to say here @dsagal? There is no Expr node anywhere in x = [(a + b)].

In fact, I imagine to get what you want, you needed to add your own handling to strip unneeded parentheses from an expression like (a + b).

If (a + b) exists as its own statement, that statement will include the parentheses. But usually it will be part of another expression and the BinOp node will have .get_text(node) == 'a + b'. asttokens doesn't include those parentheses, that's what this issue was originally about.

I think the current behaviour is a good default behaviour which matches people's natural expectations for simple uses of the library. But clearly there are at least two people who want an option for different behaviour, albeit in opposite directions: @yurikhan wants to be able to include them more, and @ssbr less. Including such behaviour will at the very least require proposing an API, and will probably also need a PR as implementing this stuff is not trivial.

@dsagal (Member) commented Nov 1, 2019

Doh, you are right. Sorry to comment based only on recollections which were incorrect :) And without even looking carefully at which issue the thread is about.

So scratch my comments, and listen to @alexmojaki instead.

If anyone wants to have a go at offering an option for this, it seems to me that expanding token ranges to include parens, or stripping parens, is fine to do after parsing. I.e. any node's range could be expanded or stripped of surrounding parentheses with no change in meaning (might be wrong on this, but can't think of anything). In this case something along the lines of the _gobble_parens() internal helper would do the job. And the question remains what API to offer to make this convenient and useful.
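
As a rough sketch of what such a stripping helper might look like (this is not the actual _gobble_parens; the pairing check guards against stripping the unmatched outer parens of something like (a) + (b); untested):

import token

import asttokens

def strip_parens(atokens, node):
    """Shrink node's range by removing balanced surrounding parentheses."""
    first, last = node.first_token, node.last_token
    while (asttokens.util.match_token(first, token.OP, '(') and
           asttokens.util.match_token(last, token.OP, ')') and
           _outer_pair_matches(atokens, first, last)):
        first = atokens.next_token(first)
        last = atokens.prev_token(last)
    return first.startpos, last.endpos

def _outer_pair_matches(atokens, first, last):
    # The leading '(' pairs with the trailing ')' iff the paren depth
    # never returns to zero before the final token (cf. '(a) + (b)').
    depth = 0
    for tok in atokens.token_range(first, last):
        if asttokens.util.match_token(tok, token.OP, '('):
            depth += 1
        elif asttokens.util.match_token(tok, token.OP, ')'):
            depth -= 1
            if depth == 0 and tok.index != last.index:
                return False
    return True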

@alexmojaki (Contributor)

@ssbr do you have any thoughts on #50 ?

@ssbr (Contributor) commented Jan 19, 2020

@alexmojaki Thanks for the ping, will leave a comment on that issue.

FWIW I am absolutely happy and fine with any decision asttokens makes. The worst it will do is break my tests and maybe make diffs include slightly more parens than they would've otherwise. I'll try to open source my thing this year instead of just using it internally at work so that I have lines of code I can point at. :)

@PeterJCLaw (Collaborator)

I don't know if anyone relies on them NOT being included, but since it's a change, it should perhaps get a major version bump.

Not quite rely, but I'm pretty sure that tuck assumes that this will be the case in a few places. A major version bump around this would be appreciated.

That said, tuck already has its own logic for finding the outermost set of parens which belong to an expression: https://github.com/PeterJCLaw/tuck/blob/a542aacd000d25c409c6d68b8e62ca18c9375b8e/tuck/wrappers.py#L30-L62
That logic aims to account for things like functions which take a single non-parenthesised generator argument (for example func(x for x in '')) as well as explicitly parenthesised things that don't need the parens (e.g. func(('something'))).

I suspect the concern the original author had here (and one I share) is that the inclusion of parens sometimes feels inconsistent between node types. I can't immediately point to an example, and I'm not sure how much of this is a result of Python's AST itself rather than what ASTTokens does, though I definitely recall being tripped up in this area in the past.

An alternative that would be non-breaking might be to have a util that can walk outwards from a token/pair of tokens/ast node and find the outer bounds, assuming we can find a useful semantic for what the outcome of that is. (I see there was some discussion about that above, though I'll admit I've not completely followed it all).

If the spelling that I've linked to above would be useful either as-is or as the basis for something I'd be happy to see it upstreamed into ASTTokens.
