Real-world code snippets which `libcst` fails to parse #930

Zac-HD · 2023-05-26T21:36:26Z

@DRMacIver and I have been on a bit of a bughunt lately: the following cases are accepted by Python 3.10's compile() function, but cause libcst 1.0.0 to raise an exception:

import libcst

def compiles(source: str):
    try:
        compile(source, "<string>", "exec")
        return True
    except Exception:
        return False

def libcst_parses(source):
    try:
        libcst.parse_module(source)
        return True
    except Exception:
        return False

for s in [
    # Assignment expression while indexing
    '_[_:=0]',

    # Complex f-string expression
    'f"{_:{_:}{_}}"',

    # Expecting indentation?
    '(\n \\\n)',
    '(\n    """\n""")',
    'if _:\n    """\n)"""',
    'if _:\n    ("""\n""")',
    'if _:\n     """\n  """',
    'if _:\n        """\n    """ ',

    # (legally) missing whitespace around keywords
    'lambda*_:_',
    'lambda**_:_',
    '_ if 0else _',
    'with _ as():_',
    '(a:=()for _ in _)',

    '_ if _ else""if _ else _',
    '_ if _ else()if _ else _',
    '(_ if _ else 0for _ in _)',
    '(_ if _ else""for _ in _)',
    '(_ if _ else()for _ in _)',

    '(lambda:()for _ in _)',
    'def _():return()if _ else _',
]:
    assert compiles(s), repr(s)
    assert not libcst_parses(s), repr(s)

Note that each was extracted from code in the wild, so there's already more user impact than from (say) #446.
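One way to see that the missing-whitespace cases are ordinary Python written compactly is to compare their ASTs with conventionally spaced spellings. The sketch below uses only the standard-library `ast` module (libcst is not involved). The spaced spellings in `PAIRS` and the helper `same_ast` are additions for illustration, not part of the original report, and the cases that put a keyword directly after a number (such as `0else`) are left out because, as far as I know, newer CPython versions emit a warning for them.

```python
import ast

# (compact spelling from the report, conventionally spaced spelling)
PAIRS = [
    ("lambda*_:_", "lambda *_: _"),
    ("lambda**_:_", "lambda **_: _"),
    ("with _ as():_", "with _ as (): _"),
    ("(a:=()for _ in _)", "(a := () for _ in _)"),
    ('_ if _ else""if _ else _', '_ if _ else "" if _ else _'),
    ("(lambda:()for _ in _)", "(lambda: () for _ in _)"),
    ("def _():return()if _ else _", "def _(): return () if _ else _"),
]


def same_ast(compact: str, spaced: str) -> bool:
    # ast.dump() omits line/column attributes by default, so two sources
    # that differ only in whitespace produce identical dumps.
    return ast.dump(ast.parse(compact)) == ast.dump(ast.parse(spaced))


for compact, spaced in PAIRS:
    print(f"{compact!r:35} same AST as {spaced!r}: {same_ast(compact, spaced)}")
```

If every pair reports the same AST, a parser that rejects the compact form is being stricter about whitespace than the language itself.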

I was hoping to finish this before 1.0 was released, but you're too fast 💖

Allow walrus in slices: see python/cpython#23317. Raised in #930.

Fix parsing of nested f-string specifiers: for an expression like `f"{one:{two:}{three}}"`, `three` is not in an f-string spec, and should be tokenized accordingly. This PR fixes the `format_spec_count` bookkeeping in the tokenizer, so that the count is decremented only when a closing `}` actually closes a format_spec. Reported in #930.
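To make the nesting concrete, here is a small standard-library sketch of how CPython itself treats that expression. The values bound to `one`, `two`, and `three` are made up for illustration. It shows that the format spec of `one` is assembled from two nested replacement fields, so `three` is an ordinary expression and not literal spec text.

```python
import ast

one, two, three = 3.14159, ".2", "f"

# The spec applied to `one` is built from two nested replacement fields:
# {two:} contributes ".2" and {three} contributes "f", giving ".2f".
rendered = f"{one:{two:}{three}}"
print(rendered)  # 3.14

# The AST has the same shape: the outer field's format_spec contains two
# nested fields, one for `two` and one for `three`.
expr = ast.parse('f"{one:{two:}{three}}"', mode="eval").body
outer = expr.values[0]
spec_fields = [
    field.value.id
    for field in outer.format_spec.values
    if isinstance(field, ast.FormattedValue)
]
print(outer.value.id, spec_fields)
```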

Fix tokenizing `0else`. This is an obscure one: `_ if 0else _` failed to parse with some very weird errors. It turns out that the tokenizer tries to parse `0else` as a single number, but when it encounters `l` it realizes it can't be a single number and it backtracks. Unfortunately the backtracking logic was broken, and it failed to correctly backtrack one of the offsets used for whitespace parsing (the byte offset since the start of the line). This caused whitespace nodes to refer to incorrect parts of the input text, eventually resulting in the above behavior. This PR fixes the bookkeeping when the tokenizer backtracks. Reported in #930.
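The failure mode is easier to picture with a toy scanner. The following is a minimal sketch, not libcst's actual tokenizer: `Cursor` and `scan_number` are hypothetical names, and the number grammar is reduced to digits plus an optional exponent. The point it illustrates is that a backtrack has to restore every offset the cursor tracks, including the per-line one.

```python
from dataclasses import dataclass


@dataclass
class Cursor:
    text: str
    pos: int = 0  # absolute offset into the text
    col: int = 0  # offset since the start of the current line

    def peek(self) -> str:
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def advance(self) -> None:
        ch = self.text[self.pos]
        self.pos += 1
        self.col = 0 if ch == "\n" else self.col + 1


def scan_number(cur: Cursor) -> str:
    start = cur.pos
    while cur.peek().isdigit():
        cur.advance()
    if cur.peek() in ("e", "E"):
        # Tentatively treat the "e" as the start of an exponent; remember
        # all cursor state so that all of it can be restored.
        saved = (cur.pos, cur.col)
        cur.advance()
        if cur.peek() in ("+", "-"):
            cur.advance()
        if cur.peek().isdigit():
            while cur.peek().isdigit():
                cur.advance()
        else:
            # Not an exponent after all (e.g. "0else"): restore both offsets.
            cur.pos, cur.col = saved
    return cur.text[start:cur.pos]


cur = Cursor("0else _")
print(scan_number(cur), cur.pos, cur.col)  # 0 1 1
```

Restoring only `pos` and leaving `col` advanced would be the analogue of the bug described above: later whitespace bookkeeping would point at the wrong part of the line.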

Allow no whitespace between lambda keyword and params in certain cases: Python accepts code where `lambda` is immediately followed by a `*`, so this PR relaxes validation rules for Lambdas. Raised in #930.

zsol · 2023-05-27T19:25:52Z

OK I've got a fix for some of these, and have a pretty good idea about how to fix the remaining whitespace validation issues (similar to the lambda situation).

I haven't looked into the indentation issues yet.

About to head out on vacation, so if anyone wants to jump in on these, feel free to 😝

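Allow no space before an ifexp in certain cases: for example in `_ if _ else""if _ else _`. Raised in #930.

Allow no space around an ifexp in certain cases: for example in `_ if _ else""if _ else _`. Raised in #930. Also fixes #854.

Allow no spaces after `as` in a contextmanager in certain cases: like in `with foo()as():pass`. Raised in #930.

Allow no spaces around walrus in certain cases: like in `[_:=''for _ in _]`. Raised in #930.

Allow no whitespace after lambda body in certain cases: like in `[lambda:()for _ in _]`. Reported in #930.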

Fix parsing of code without trailing newlines: when the input doesn't have a trailing newline, but the last line has exactly as many bytes as the current indentation level, the tokenizer didn't emit a fake newline, causing parse errors (the grammar expects newlines to conform with the Python spec). I don't see any reason for fake newlines to be omitted in these cases, so this PR removes that condition from the tokenizer. Reported in #930.
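For comparison, CPython's own tokenizer synthesizes a NEWLINE token at the end of input in this situation. The sketch below runs the standard-library `tokenize` module (not libcst) on one of the snippets from the report, where the last line is exactly as long as the current indentation.

```python
import io
import tokenize

# One of the reported snippets: no trailing newline, and the last line
# ('  """', 5 bytes) is exactly as long as the 5-space indentation.
source = 'if _:\n     """\n  """'

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

The NEWLINE that follows the STRING token has no corresponding character in the input; it is the "fake newline" the grammar relies on.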

zsol · 2023-05-28T21:04:33Z

With the stack at #939 and #940, all of these parse successfully 🎉

DRMacIver · 2023-05-30T13:01:32Z

BTW owing to the way we found these, it's likely that many of them are actually different bugs than the ones we found in the wild and we may have a bunch more for you once we've retested with #939 and #940. Will let you know when we do!

(It's a test-case reduction thing called "slippage": Automatically transforming a test case into a smaller one subject only to the condition that it still has a bug in it doesn't guarantee that it's the same bug. In particular I think probably these no-whitespace-between-tokens ones are something that was introduced by deleting the whitespace rather than whatever the original bug was)

#936, #935, #934, #933, #932, #931)

* Allow walrus in slices

  See python/cpython#23317 Raised in #930.

* Fix parsing of nested f-string specifiers

  For an expression like `f"{one:{two:}{three}}"`, `three` is not in an f-string spec, and should be tokenized accordingly. This PR fixes the `format_spec_count` bookkeeping in the tokenizer, so it properly decrements it when a closing `}` is encountered but only if the `}` closes a format_spec. Reported in #930.

* Fix tokenizing `0else`

  This is an obscure one. `_ if 0else _` failed to parse with some very weird errors. It turns out that the tokenizer tries to parse `0else` as a single number, but when it encounters `l` it realizes it can't be a single number and it backtracks. Unfortunately the backtracking logic was broken, and it failed to correctly backtrack one of the offsets used for whitespace parsing (the byte offset since the start of the line). This caused whitespace nodes to refer to incorrect parts of the input text, eventually resulting in the above behavior. This PR fixes the bookkeeping when the tokenizer backtracks. Reported in #930.

* Allow no whitespace between lambda keyword and params in certain cases

  Python accepts code where `lambda` is immediately followed by a `*`, so this PR relaxes validation rules for Lambdas. Raised in #930.

* Allow any expression in comprehensions' evaluated expression

  This PR relaxes the accepted types for the `elt` field in `ListComp`, `SetComp`, and `GenExp`, as well as the `key` and `value` fields in `DictComp`. Fixes #500.

* Allow no space around an ifexp in certain cases

  For example in `_ if _ else""if _ else _`. Raised in #930. Also fixes #854.

* Allow no spaces after `as` in a contextmanager in certain cases

  Like in `with foo()as():pass` Raised in #930.

* Allow no spaces around walrus in certain cases

  Like in `[_:=''for _ in _]` Raised in #930.

* Allow no whitespace after lambda body in certain cases

  Like in `[lambda:()for _ in _]` Reported in #930.

zsol · 2023-06-07T11:41:08Z

Sounds good. Closing this for now, as all fixes have landed. Thanks for reporting!

Zac-HD · 2023-06-08T06:30:59Z

I installed libcst == 1.0.1, and reproduced the following examples:

handcrafted = [
    # looks like an attribute lookup, actually keyword + float
    '_ in.0',
    'while.0:_',
    '_ if _ else.0',

    # escaped indentation of some kind?
    'fr"\\\n"',
    'if _:\n _\n\\\nif _:\ _',

    # mixed tabs and spaces; many variations here
    'if _:\n if _:\n\t _',

    # various ways to look like you're calling a keyword
    '(_ if _ else _)in _',
    '_ if _ else(lambda:_)',
    '_ if(_ if _ else _) else _',
    '_ if _ else(_ if _ else _)',
    'def _(): return(lambda:_)',
    'def _(): return(_ if _ else _)',
]
for s in handcrafted:
    assert compiles(s), repr(s)
    assert not libcst_parses(s), repr(s)

zsol · 2023-06-08T11:17:47Z

Some of these trigger an existential crisis in me. Let me open separate issues. How do you find these?

DRMacIver · 2023-06-08T11:55:16Z

> How do you find these?

I think part of this is that we're accidentally fuzzing libcst with our test-case reducer.

Short version is:

* We've got a large corpus of Python code from real world datasets and are checking for examples in it that parse with compile but not libcst
* We've got a test-case reducer (takes strings satisfying some condition, turns them into smaller strings that still satisfy that condition) we're using to try to turn those examples into small ones we can report.
* One of the things that test-case reducer does is delete individual characters from the string.
* Unfortunately that often causes it to "slip" off the original bug, where it finds a string that compiles but doesn't parse with libcst because Python turns out to have a lot of cases where you don't need a space.

Which is why we get monstrosities like 'while.0:_'. Almost certainly nobody has ever deliberately written while.0, but someone did write a real-world example that triggers a bug, and deleting characters from it turns it into something exhibiting this issue.

I'll have a go at trying to write a more precise condition that tries to capture the original bug, but it's necessarily a bit tricky to do.

DRMacIver · 2023-06-08T14:59:39Z

To my total bafflement, it turns out that no, this isn't actually slippage from our reducer, and people really do write code like this. E.g. `0.15 if 20<x<30 else.2` is a real piece of code found in the wild that led to one of these examples.
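For anyone doubting that this is valid, the standard-library `tokenize` module shows how CPython reads that line: `else.2` is the keyword `else` followed by the float `.2`, not an attribute lookup. The sketch below only exercises CPython; the values chosen for `x` are arbitrary.

```python
import io
import tokenize

source = "0.15 if 20<x<30 else.2"

# Collect (token type name, token text) pairs from CPython's tokenizer.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
print(tokens[-4:-2])  # [('NAME', 'else'), ('NUMBER', '.2')]

for x in (25, 5):
    print(x, eval(source, {"x": x}))
# 25 0.15
# 5 0.2
```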

zsol added the bug (Something isn't working) and parsing (Converting source code into CST nodes) labels May 26, 2023

zsol added a commit that referenced this issue May 27, 2023

Allow walrus in slices

110109a

See python/cpython#23317 Raised in #930.

This was referenced May 27, 2023

Allow walrus in slices #931

Closed

Fix parsing of nested f-string specifiers #932

Closed

zsol mentioned this issue May 27, 2023

Fix tokenizing 0else #933

Closed

zsol added a commit that referenced this issue May 27, 2023

Allow no whitespace between lambda keyword and params in certain cases

744d6fd

Python accepts code where `lambda` is immediately followed by a `*`, so this PR relaxes validation rules for Lambdas. Raised in #930.

zsol mentioned this issue May 27, 2023

Allow no whitespace between lambda keyword and params in certain cases #934

Closed

zsol mentioned this issue May 28, 2023

Allow no space around an ifexp in certain cases #935

Closed

zsol added a commit that referenced this issue May 28, 2023

Allow no space before an ifexp in certain cases

f7182bb

For example in `_ if _ else""if _ else _`. Raised in #930

zsol added a commit that referenced this issue May 28, 2023

Allow no space around an ifexp in certain cases

aa51671

For example in `_ if _ else""if _ else _`. Raised in #930. Also fixes #854.

zsol added a commit that referenced this issue May 28, 2023

Allow no spaces after as in a contextmanager in certain cases

4ce3660

Like in `with foo()as():pass` Raised in #930.

zsol mentioned this issue May 28, 2023

Allow no spaces after as in a contextmanager in certain cases #937

Closed

zsol added a commit that referenced this issue May 28, 2023

Allow no spaces around walrus in certain cases

862b702

Like in `[_:=''for _ in _]` Raised in #930.

This was referenced May 28, 2023

Allow no spaces around walrus in certain cases #938

Closed

Allow no whitespace after lambda body in certain cases #939

Merged

zsol added a commit that referenced this issue May 28, 2023

Allow no whitespace after lambda body in certain cases

ea4ea34

Like in `[lambda:()for _ in _]` Reported in #930.

zsol mentioned this issue May 28, 2023

Fix parsing of code without trailing newlines #940

Merged

zsol closed this as completed Jun 7, 2023

This was referenced Jun 8, 2023

Parse error when no space between keyword and float #950

Open

Parse error when non-standard indentation is encountered #951

Open

Parse error when keyword adjacent to parenthesis #952

Open
