Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Real-world code snippets which libcst fails to parse #930

Closed
Zac-HD opened this issue May 26, 2023 · 8 comments
Closed

Real-world code snippets which libcst fails to parse #930

Zac-HD opened this issue May 26, 2023 · 8 comments
Labels
bug Something isn't working parsing Converting source code into CST nodes

Comments

@Zac-HD
Copy link
Contributor

Zac-HD commented May 26, 2023

@DRMacIver and I have been on a bit of a bughunt lately: the following cases are accepted by Python 3.10's compile() function, but cause libcst 1.0.0 to raise an exception:

import libcst

def compiles(source: str):
    try:
        compile(source, "<string>", "exec")
        return True
    except Exception:
        return False

def libcst_parses(source):
    try:
        libcst.parse_module(source)
        return True
    except Exception:
        return False

for s in [
    # Assignment expression while indexing
    '_[_:=0]',

    # Complex f-string expression
    'f"{_:{_:}{_}}"',

    # Expecting indentation?
    '(\n \\\n)',
    '(\n    """\n""")',
    'if _:\n    """\n)"""',
    'if _:\n    ("""\n""")',
    'if _:\n     """\n  """',
    'if _:\n        """\n    """ ',

    # (legally) missing whitespace around keywords
    'lambda*_:_',
    'lambda**_:_',
    '_ if 0else _',
    'with _ as():_',
    '(a:=()for _ in _)',

    '_ if _ else""if _ else _',
    '_ if _ else()if _ else _',
    '(_ if _ else 0for _ in _)',
    '(_ if _ else""for _ in _)',
    '(_ if _ else()for _ in _)',

    '(lambda:()for _ in _)',
    'def _():return()if _ else _',
]:
    assert compiles(s), repr(s)
    assert not libcst_native_parses(s), repr(s)

Note that each was extracted from code in the wild, so there's already more user impact than from (say) #446.

I was hoping to finish this before 1.0 was released, but you're too fast 💖

@zsol zsol added bug Something isn't working parsing Converting source code into CST nodes labels May 26, 2023
zsol added a commit that referenced this issue May 27, 2023
zsol added a commit that referenced this issue May 27, 2023
For an expression like `f"{one:{two:}{three}}"`, `three` is not in an f-string spec, and should be tokenized accordingly.

This PR fixes the `format_spec_count` bookkeeping in the tokenizer, so it properly decrements it when a closing `}` is encountered but only if the `}` closes a format_spec.

Reported in #930.
zsol added a commit that referenced this issue May 27, 2023
This is an obscure one.

`_ if 0else _` failed to parse with some very weird errors. It turns out that the tokenizer tries to parse `0else` as a single number, but when it encounters `l` it realizes it can't be a single number and it backtracks.

Unfortunately the backtracking logic was broken, and it failed to correctly backtrack one of the offsets used for whitespace parsing (the byte offset since the start of the line). This caused whitespace nodes to refer to incorrect parts of the input text, eventually resulting in the above behavior.

This PR fixes the bookkeeping when the tokenizer backtracks.

Reported in #930.
zsol added a commit that referenced this issue May 27, 2023
Python accepts code where `lambda` follows a `*`, so this PR relaxes validation rules for Lambdas.

Raised in #930.
@zsol
Copy link
Member

zsol commented May 27, 2023

OK I've got a fix for some of these, and have a pretty good idea about how to fix the remaining whitespace validation issues (similar to the lambda situation).

I haven't looked into the indentation issues yet.

About to head out on vacation, so if anyone wants to jump in on these, feel free to 😝

zsol added a commit that referenced this issue May 28, 2023
For example in `_ if _ else""if _ else _`.

Raised in #930
zsol added a commit that referenced this issue May 28, 2023
For example in `_ if _ else""if _ else _`.

Raised in #930. Also fixes #854.
zsol added a commit that referenced this issue May 28, 2023
zsol added a commit that referenced this issue May 28, 2023
Like in `[_:=''for _ in _]`

Raised in #930.
zsol added a commit that referenced this issue May 28, 2023
Like in `[lambda:()for _ in _]`

Reported in #930.
zsol added a commit that referenced this issue May 28, 2023
When the input doesn't have a trailing newline, but the last line had
exactly the amount of bytes as the current indentation level, the
tokenizer didn't emit a fake newline, causing parse errors (the grammar
expects newlines to conform with the Python spec).

I don't see any reason for fake newlines to be omitted in these cases,
so this PR removes that condition from the tokenizer.

Reported in #930.
zsol added a commit that referenced this issue May 28, 2023
When the input doesn't have a trailing newline, but the last line had
exactly the amount of bytes as the current indentation level, the
tokenizer didn't emit a fake newline, causing parse errors (the grammar
expects newlines to conform with the Python spec).

I don't see any reason for fake newlines to be omitted in these cases,
so this PR removes that condition from the tokenizer.

Reported in #930.
zsol added a commit that referenced this issue May 28, 2023
When the input doesn't have a trailing newline, but the last line had
exactly the amount of bytes as the current indentation level, the
tokenizer didn't emit a fake newline, causing parse errors (the grammar
expects newlines to conform with the Python spec).

I don't see any reason for fake newlines to be omitted in these cases,
so this PR removes that condition from the tokenizer.

Reported in #930.
@zsol
Copy link
Member

zsol commented May 28, 2023

With the stack at #939 and #940, all of these parse successfully 🎉

@DRMacIver
Copy link

DRMacIver commented May 30, 2023

BTW owing to the way we found these, it's likely that many of them are actually different bugs than the ones we found in the wild and we may have a bunch more for you once we've retested with #939 and #940. Will let you know when we do!

(It's a test-case reduction thing called "slippage": Automatically transforming a test case into a smaller one subject only to the condition that it still has a bug in it doesn't guarantee that it's the same bug. In particular I think probably these no-whitespace-between-tokens ones are something that was introduced by deleting the whitespace rather than whatever the original bug was)

zsol added a commit that referenced this issue Jun 7, 2023
#936, #935, #934, #933, #932, #931)

* Allow walrus in slices

See python/cpython#23317

Raised in #930.

* Fix parsing of nested f-string specifiers

For an expression like `f"{one:{two:}{three}}"`, `three` is not in an f-string spec, and should be tokenized accordingly.

This PR fixes the `format_spec_count` bookkeeping in the tokenizer, so it properly decrements it when a closing `}` is encountered but only if the `}` closes a format_spec.

Reported in #930.

* Fix tokenizing `0else`

This is an obscure one.

`_ if 0else _` failed to parse with some very weird errors. It turns out that the tokenizer tries to parse `0else` as a single number, but when it encounters `l` it realizes it can't be a single number and it backtracks.

Unfortunately the backtracking logic was broken, and it failed to correctly backtrack one of the offsets used for whitespace parsing (the byte offset since the start of the line). This caused whitespace nodes to refer to incorrect parts of the input text, eventually resulting in the above behavior.

This PR fixes the bookkeeping when the tokenizer backtracks.

Reported in #930.

* Allow no whitespace between lambda keyword and params in certain cases

Python accepts code where `lambda` follows a `*`, so this PR relaxes validation rules for Lambdas.

Raised in #930.

* Allow any expression in comprehensions' evaluated expression


This PR relaxes the accepted types for the `elt` field in `ListComp`, `SetComp`, and `GenExp`, as well as the `key` and `value` fields in `DictComp`.

Fixes #500.

* Allow no space around an ifexp in certain cases

For example in `_ if _ else""if _ else _`.

Raised in #930. Also fixes #854.

* Allow no spaces after `as` in a contextmanager in certain cases

Like in `with foo()as():pass`

Raised in #930.

* Allow no spaces around walrus in certain cases

Like in `[_:=''for _ in _]`

Raised in #930.

* Allow no whitespace after lambda body in certain cases

Like in `[lambda:()for _ in _]`

Reported in #930.
zsol added a commit that referenced this issue Jun 7, 2023
When the input doesn't have a trailing newline, but the last line had
exactly the amount of bytes as the current indentation level, the
tokenizer didn't emit a fake newline, causing parse errors (the grammar
expects newlines to conform with the Python spec).

I don't see any reason for fake newlines to be omitted in these cases,
so this PR removes that condition from the tokenizer.

Reported in #930.
@zsol
Copy link
Member

zsol commented Jun 7, 2023

Sounds good. Closing this for now, as all fixes have landed. Thanks for reporting!

@zsol zsol closed this as completed Jun 7, 2023
@Zac-HD
Copy link
Contributor Author

Zac-HD commented Jun 8, 2023

I installed libcst == 1.0.1, and reproduced the following examples:

handcrafted = [
    # looks like an attribute lookup, actually keyword + float
    '_ in.0',
    'while.0:_',
    '_ if _ else.0',

    # escaped indentation of some kind?
    'fr"\\\n"',
    'if _:\n _\n\\\nif _:\ _',

    # mixed tabs and spaces; many variations here
    'if _:\n if _:\n\t _',

    # various ways to look like you're calling a keyword
    '(_ if _ else _)in _',
    '_ if _ else(lambda:_)',
    '_ if(_ if _ else _) else _',
    '_ if _ else(_ if _ else _)',
    'def _(): return(lambda:_)',
    'def _(): return(_ if _ else _)',
]
for s in handcrafted:
    assert compiles(s), repr(s)
    assert not libcst_native_parses(s), repr(s)

@zsol
Copy link
Member

zsol commented Jun 8, 2023

Some of these trigger an existential crisis in me. Let me open separate issues. How do you find these?

@DRMacIver
Copy link

How do you find these?

I think part of this is that we're accidentally fuzzing libcst with our test-case reducer.

Short version is:

  1. We've got a large corpus of Python code from real world datasets and are checking for examples in it that parse with compile but not libcst
  2. We've got a test-case reducer (takes strings satisfying some condition, turns them into smaller strings that still satisfy that condition) we're using to try to turn those examples into small ones we can report.
  3. One of the things that test-case reducer does is delete individual characters from the string.
  4. Unfortunately that often causes it to "slip" off the original bug, where it finds a string that compiles but doesn't parse with libcst because Python turns out to have a lot of cases where you don't need a space.

Which is why we get monstrosities like 'while.0:_' - almost certainly nobody has ever deliberately written while.0, but there was some real world example they had written that triggers a bug, but when you delete characters from it turns into something exhibiting this issue.

I'll have a go at trying to write a more precise condition that tries to capture the original bug, but it's necessarily a bit tricky to do.

@DRMacIver
Copy link

To my total bafflement, it turns out that no this isn't actually slippage from our reducer and people really do write code like this. e.g. 0.15 if 20<x<30 else.2 is a real piece of code found in the wild that lead to one of these examples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working parsing Converting source code into CST nodes
Projects
None yet
Development

No branches or pull requests

3 participants