Real-world code snippets which libcst fails to parse #930
Allow walrus in slices. See python/cpython#23317. Raised in #930.
For an expression like `f"{one:{two:}{three}}"`, the expression `three` is not format-spec text and should be tokenized as ordinary code. This PR fixes the `format_spec_count` bookkeeping in the tokenizer so that it decrements the count only when a closing `}` actually closes a format spec. Reported in #930.
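As a concrete illustration of the semantics (hypothetical values, not from the PR): nested replacement fields are formatted first and their results concatenated into the outer field's format spec, so `two` and `three` are evaluated as ordinary expressions:

```python
# The spec of `one` is built at runtime from the nested fields:
# format(two, "") + format(three, "") == ".2" + "f" == ".2f"
one, two, three = 3.14159, ".2", "f"
formatted = f"{one:{two:}{three}}"
assert formatted == "3.14"
```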
This is an obscure one. `_ if 0else _` failed to parse with some very weird errors. It turns out that the tokenizer tries to parse `0else` as a single number, but when it encounters `l` it realizes it can't be a single number and it backtracks. Unfortunately the backtracking logic was broken, and it failed to correctly backtrack one of the offsets used for whitespace parsing (the byte offset since the start of the line). This caused whitespace nodes to refer to incorrect parts of the input text, eventually resulting in the above behavior. This PR fixes the bookkeeping when the tokenizer backtracks. Reported in #930.
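The construct itself is easy to check against CPython (note that recent CPython versions emit a SyntaxWarning for a number immediately followed by a keyword, so this sketch suppresses warnings):

```python
import warnings

# `1 if 0else 2` is valid: the tokenizer reads `0else` as the number `0`
# followed by the keyword `else`.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = eval("1 if 0else 2")
assert result == 2  # 0 is falsy, so the else branch is taken
```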
Python accepts code where `lambda` follows a `*`, so this PR relaxes validation rules for Lambdas. Raised in #930.
OK I've got a fix for some of these, and have a pretty good idea about how to fix the remaining whitespace validation issues (similar to the lambda situation). I haven't looked into the indentation issues yet. About to head out on vacation, so if anyone wants to jump in on these, feel free to 😝
For example, in `_ if _ else""if _ else _`. Raised in #930.
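Conditional expressions associate to the right, so the snippet parses as `_ if _ else ("" if _ else _)`; a quick check with made-up values:

```python
# No whitespace is needed between `else` and a string literal, or
# between `""` and `if`; the tokens still separate cleanly.
a, b = 0, 5
value = a if a else""if b else b
assert value == ""  # a is falsy, then b is truthy, so we get ""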
Like in `with foo()as():pass`. Raised in #930.
Like in `[_:=''for _ in _]`. Raised in #930.
Like in `[lambda:()for _ in _]`. Reported in #930.
When the input doesn't have a trailing newline, but the last line had exactly the amount of bytes as the current indentation level, the tokenizer didn't emit a fake newline, causing parse errors (the grammar expects newlines to conform with the Python spec). I don't see any reason for fake newlines to be omitted in these cases, so this PR removes that condition from the tokenizer. Reported in #930.
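For comparison, CPython's own `tokenize` module synthesizes the trailing NEWLINE token when the source doesn't end with one (a sketch of the standard library's behavior, not libcst's tokenizer):

```python
import io
import tokenize

# "x = 1" has no trailing newline, yet the token stream still
# contains a NEWLINE token, so the grammar can rely on it.
toks = list(tokenize.generate_tokens(io.StringIO("x = 1").readline))
names = [tokenize.tok_name[t.type] for t in toks]
assert "NEWLINE" in names
```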
BTW, owing to the way we found these, it's likely that many of them are actually different bugs than the ones we found in the wild, and we may have a bunch more for you once we've retested with #939 and #940. Will let you know when we do! (It's a test-case reduction thing called "slippage": automatically transforming a test case into a smaller one, subject only to the condition that it still has a bug in it, doesn't guarantee that it's the same bug. In particular, I think these no-whitespace-between-tokens ones were probably introduced by deleting the whitespace, rather than being whatever the original bug was.)
(Fixes #936, #935, #934, #933, #932, #931.)

* Allow walrus in slices. See python/cpython#23317. Raised in #930.
* Fix parsing of nested f-string specifiers. For an expression like `f"{one:{two:}{three}}"`, `three` is not in an f-string spec, and should be tokenized accordingly. This PR fixes the `format_spec_count` bookkeeping in the tokenizer, so it properly decrements it when a closing `}` is encountered, but only if the `}` closes a format_spec. Reported in #930.
* Fix tokenizing `0else`. This is an obscure one: `_ if 0else _` failed to parse with some very weird errors. The tokenizer tries to parse `0else` as a single number, but when it encounters `l` it realizes it can't be a single number and backtracks. Unfortunately the backtracking logic was broken, and it failed to correctly backtrack one of the offsets used for whitespace parsing (the byte offset since the start of the line). This caused whitespace nodes to refer to incorrect parts of the input text. This PR fixes the bookkeeping when the tokenizer backtracks. Reported in #930.
* Allow no whitespace between the lambda keyword and params in certain cases. Python accepts code where `lambda` follows a `*`, so this PR relaxes validation rules for lambdas. Raised in #930.
* Allow any expression in a comprehension's evaluated expression. This PR relaxes the accepted types for the `elt` field in `ListComp`, `SetComp`, and `GenExp`, as well as the `key` and `value` fields in `DictComp`. Fixes #500.
* Allow no space around an ifexp in certain cases. For example in `_ if _ else""if _ else _`. Raised in #930. Also fixes #854.
* Allow no spaces after `as` in a contextmanager in certain cases. Like in `with foo()as():pass`. Raised in #930.
* Allow no spaces around walrus in certain cases. Like in `[_:=''for _ in _]`. Raised in #930.
* Allow no whitespace after lambda body in certain cases. Like in `[lambda:()for _ in _]`. Reported in #930.
Sounds good. Closing this for now, as all fixes have landed. Thanks for reporting!
I installed it and ran:

```python
handcrafted = [
    # looks like an attribute lookup, actually keyword + float
    '_ in.0',
    'while.0:_',
    '_ if _ else.0',
    # escaped indentation of some kind?
    'fr"\\\n"',
    'if _:\n _\n\\\nif _:\ _',
    # mixed tabs and spaces; many variations here
    'if _:\n if _:\n\t _',
    # various ways to look like you're calling a keyword
    '(_ if _ else _)in _',
    '_ if _ else(lambda:_)',
    '_ if(_ if _ else _) else _',
    '_ if _ else(_ if _ else _)',
    'def _(): return(lambda:_)',
    'def _(): return(_ if _ else _)',
]
for s in handcrafted:
    assert compiles(s), repr(s)
    assert not libcst_native_parses(s), repr(s)
```
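The snippet relies on two helpers from the author's harness that aren't shown; a minimal sketch of what they plausibly look like (the names and the use of `libcst.parse_module` as the native-parser entry point are assumptions):

```python
import warnings

def compiles(src: str) -> bool:
    """True if CPython's compile() accepts src (warnings ignored)."""
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            compile(src, "<test>", "exec")
        return True
    except SyntaxError:
        return False

def libcst_native_parses(src: str) -> bool:
    """True if libcst (native parser, the default in 1.0) accepts src."""
    try:
        import libcst
        libcst.parse_module(src)
        return True
    except Exception:
        return False
```

For example, `compiles('while.0:_')` is true because CPython tokenizes it as `while`, `.0`, `:`, `_`.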
Some of these trigger an existential crisis in me. Let me open separate issues. How do you find these?
I think part of this is that we're accidentally fuzzing libcst with our test-case reducer: it shrinks failing inputs subject only to the condition that libcst still errors, which is why we get monstrosities like these. I'll have a go at trying to write a more precise condition that tries to capture the original bug, but it's necessarily a bit tricky to do.
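This is not the reducer actually used here, but a toy greedy reducer makes the slippage failure mode concrete: with a predicate as loose as "the parser still errors", shrinking can wander to a different bug than the one you started from.

```python
def reduce_case(src: str, still_fails) -> str:
    """Greedily delete chunks of src while still_fails(src) holds."""
    assert still_fails(src)
    changed = True
    while changed:
        changed = False
        for size in (max(len(src) // 2, 1), max(len(src) // 4, 1), 1):
            i = 0
            while i < len(src):
                candidate = src[:i] + src[i + size:]
                if candidate != src and still_fails(candidate):
                    src = candidate  # keep the smaller failing input
                    changed = True
                else:
                    i += size
    return src

# Stand-in predicate for "libcst fails on s". Any input containing
# "bug" satisfies it, so reduction converges to just "bug" -- even if
# the surrounding text was essential to the *original* failure.
buggy = lambda s: "bug" in s
assert reduce_case("xx bug yy", buggy) == "bug"
```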
To my total bafflement, it turns out that no, this isn't actually slippage from our reducer; people really do write code like this, e.g.
@DRMacIver and I have been on a bit of a bughunt lately: the following cases are accepted by Python 3.10's `compile()` function, but cause `libcst` 1.0.0 to raise an exception. Note that each was extracted from code in the wild, so there's already more user impact than from (say) #446.
I was hoping to finish this before 1.0 was released, but you're too fast 💖