Skip to content

bpo-40661: Fix segfault when parsing invalid input #20165

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 18, 2020

Conversation

lysnikolaou
Copy link
Member

@lysnikolaou lysnikolaou commented May 17, 2020

Fix segfaults when parsing very complex invalid input, like
import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½.

Initially reported by @pablogsal.

https://bugs.python.org/issue40661

Fix segfaults when parsing very complex invalid input, like
`import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½`.

Initially reported by @pablogsal.
This is not really needed, but it's nice to have, so that we don't
unnecessariy call `_PyPegen_fill_token` when there are errors and
the parser will abort anyway.
Copy link
Contributor

@aeros aeros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was experimenting with the reproducer locally and might have discovered a more minimalist solution. I'm not certain as to whether this approach is any better or worse (my experience and knowledge w/ the parser is very limited), but this also seems to address the segfault:

diff --git a/Lib/test/test_peg_parser.py b/Lib/test/test_peg_parser.py
index 9614e45799..05de8582e7 100644
--- a/Lib/test/test_peg_parser.py
+++ b/Lib/test/test_peg_parser.py
@@ -591,6 +591,7 @@ FAIL_TEST_CASES = [
     ("f-string_single_closing_brace", "f'}'"),
     ("from_import_invalid", "from import import a"),
     ("from_import_trailing_comma", "from a import b,"),
+    ("import_very_complex", "import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½"),
     # This test case checks error paths involving tokens with uninitialized
     # values of col_offset and end_col_offset.
     ("invalid indentation",
diff --git a/Parser/pegen/pegen.c b/Parser/pegen/pegen.c
index 7f3e4561de..25b8db0982 100644
--- a/Parser/pegen/pegen.c
+++ b/Parser/pegen/pegen.c
@@ -758,6 +758,9 @@ _PyPegen_lookahead(int positive, void *(func)(Parser *), Parser *p)
 Token *
 _PyPegen_expect_token(Parser *p, int type)
 {
+    if (p->error_indicator) {
+        return NULL;
+    }
     if (p->mark == p->fill) {

Would it make any sense to check the error indicator at the beginning of _PyPegen_expect_token()?

@lysnikolaou
Copy link
Member Author

lysnikolaou commented May 18, 2020

IMHO the solution Pablo found might be a bit more verbose and add 3 lines of code per alternative, but it makes more sense and is something we should have done, when we initially replaced setjmp with p->error_indicator. We generally want the parser to abort as quickly as possible, when there's an error, and that includes not going through the other alternatives of the rule that fails and all those above it in call tree.

@gvanrossum
Copy link
Member

@aeros What would really be helpful here is to understand the sequence of events that happens when the parser encounters that input, and compare it to what happens on similar ASCII input like import a ?. What path (presumably involving the tokenizer) is taken differently?

@pablogsal
Copy link
Member

pablogsal commented May 18, 2020

What path (presumably involving the tokenizer) is taken differently?

The difference is that when the import is not ascii, this code path is taken:

https://github.com/python/cpython/blob/master/Parser/pegen/pegen.c#L90-L105

and the call to id2 = _PyObject_FastCall(p->normalize, args, 2); is the one that fails because when we enter that call, we already have an exception set from tokenize_error, which is the exception that should be reported.

So basically: for ASCII characters there is one call less that makes the error surface without problems, but with non-ASCII characters, there is some call into Python code that fails because is done with an exception already set.

This can happen in so many other ways, so I think this PR is the safest thing to do because we make sure that nothing more is tried (and therefore no other calls can occur) when the error indicator is set.

@@ -591,6 +591,7 @@ def f(*a, b):
("f-string_single_closing_brace", "f'}'"),
("from_import_invalid", "from import import a"),
("from_import_trailing_comma", "from a import b,"),
("import_non_ascii_syntax_error", "import ä £"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add this to test_syntax as well? This file will likely go away when we don't have two parsers to compare against anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed this myself in commit 7353b6c

@pablogsal pablogsal merged commit 7b7a21b into python:master May 18, 2020
@lysnikolaou lysnikolaou deleted the fix-segfault branch May 18, 2020 17:33
arturoescaip pushed a commit to arturoescaip/cpython that referenced this pull request May 24, 2020
Fix segfaults when parsing very complex invalid input, like `import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½`.

Co-authored-by: Guido van Rossum <guido@python.org>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants