bpo-40661: Fix segfault when parsing invalid input #20165

lysnikolaou · 2020-05-17T23:11:05Z

Fix segfaults when parsing very complex invalid input, like `import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½`.

Initially reported by @pablogsal.

https://bugs.python.org/issue40661

The second commit, which also checks for errors at the beginning of each rule, is not really needed, but it's nice to have, so that we don't unnecessarily call `_PyPegen_fill_token` when there are errors and the parser will abort anyway.

aeros

I was experimenting with the reproducer locally and might have discovered a more minimalist solution. I'm not certain as to whether this approach is any better or worse (my experience and knowledge w/ the parser is very limited), but this also seems to address the segfault:

diff --git a/Lib/test/test_peg_parser.py b/Lib/test/test_peg_parser.py
index 9614e45799..05de8582e7 100644
--- a/Lib/test/test_peg_parser.py
+++ b/Lib/test/test_peg_parser.py
@@ -591,6 +591,7 @@ FAIL_TEST_CASES = [
     ("f-string_single_closing_brace", "f'}'"),
     ("from_import_invalid", "from import import a"),
     ("from_import_trailing_comma", "from a import b,"),
+    ("import_very_complex", "import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½"),
     # This test case checks error paths involving tokens with uninitialized
     # values of col_offset and end_col_offset.
     ("invalid indentation",
diff --git a/Parser/pegen/pegen.c b/Parser/pegen/pegen.c
index 7f3e4561de..25b8db0982 100644
--- a/Parser/pegen/pegen.c
+++ b/Parser/pegen/pegen.c
@@ -758,6 +758,9 @@ _PyPegen_lookahead(int positive, void *(func)(Parser *), Parser *p)
 Token *
 _PyPegen_expect_token(Parser *p, int type)
 {
+    if (p->error_indicator) {
+        return NULL;
+    }
     if (p->mark == p->fill) {

Would it make any sense to check the error indicator at the beginning of `_PyPegen_expect_token()`?

lysnikolaou · 2020-05-18T10:45:36Z

IMHO the solution Pablo found might be a bit more verbose, adding 3 lines of code per alternative, but it makes more sense and is something we should have done when we initially replaced `setjmp` with `p->error_indicator`. We generally want the parser to abort as quickly as possible when there's an error, and that includes not going through the other alternatives of the rule that fails, or of any rule above it in the call tree.
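To make the "abort as quickly as possible" idea concrete, here is a toy model in Python. It is only a sketch, not the generated C parser: `ToyParser`, `expect`, `rule`, and `ALTERNATIVES` are hypothetical names, and the `"ERROR"` token is an assumed stand-in for a tokenizer failure that sets the error indicator.

```python
from dataclasses import dataclass, field


@dataclass
class ToyParser:
    """Hypothetical stand-in for the C `Parser` struct; not CPython code."""
    tokens: list
    mark: int = 0
    error_indicator: bool = False
    tried: list = field(default_factory=list)  # names of alternatives attempted

    def expect(self, kind):
        """Return the next token if it matches `kind`, else None."""
        if self.mark >= len(self.tokens):
            return None
        tok = self.tokens[self.mark]
        if tok == "ERROR":
            # Stand-in for a tokenizer failure: flag the error and fail.
            self.error_indicator = True
            return None
        if tok != kind:
            return None
        self.mark += 1
        return tok

    def rule(self, alternatives):
        """Try alternatives in order, aborting as soon as an error is flagged."""
        if self.error_indicator:  # check at the beginning of the rule
            return None
        start = self.mark
        for name, alt in alternatives:
            self.tried.append(name)
            result = alt(self)
            if result is not None:
                return result
            if self.error_indicator:  # check after each failed alternative
                return None
            self.mark = start  # ordinary failure: backtrack, try the next one
        return None


ALTERNATIVES = [
    ("import NAME NEWLINE",
     lambda p: p.expect("import") and p.expect("NAME") and p.expect("NEWLINE")),
    ("import NAME",
     lambda p: p.expect("import") and p.expect("NAME")),
]

ok = ToyParser(["import", "NAME"])
bad = ToyParser(["import", "NAME", "ERROR"])
print(ok.rule(ALTERNATIVES), ok.tried)
print(bad.rule(ALTERNATIVES), bad.tried)
```

On an ordinary mismatch the toy parser backtracks and tries the second alternative. Once the error flag is set, it returns immediately and the second alternative is never attempted, which is the behaviour the per-alternative check is meant to guarantee.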

gvanrossum · 2020-05-18T16:26:19Z

@aeros What would really be helpful here is to understand the sequence of events that happens when the parser encounters that input, and compare it to what happens on similar ASCII input like `import a ?`. What path (presumably involving the tokenizer) is taken differently?

pablogsal · 2020-05-18T16:52:33Z

> What path (presumably involving the tokenizer) is taken differently?

The difference is that when the imported name is not ASCII, this code path is taken:

https://github.com/python/cpython/blob/master/Parser/pegen/pegen.c#L90-L105

and the call to `id2 = _PyObject_FastCall(p->normalize, args, 2);` is the one that fails, because when we enter that call we already have an exception set from tokenize_error, which is the exception that should be reported.

So basically: for ASCII characters there is one fewer call, which lets the error surface without problems, but with non-ASCII characters there is a call into Python code that fails because it is made with an exception already set.

This can happen in many other ways, so I think this PR is the safest thing to do: it makes sure that nothing more is tried (and therefore no other calls can occur) once the error indicator is set.
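The extra call on the non-ASCII path can be sketched in Python. This assumes that `p->normalize` refers to `unicodedata.normalize` and that the form used is NFKC (the form the language reference specifies for identifiers). `new_identifier` and `trace` are hypothetical names for illustration, not the C function.

```python
import unicodedata


def new_identifier(name: str, trace: list) -> str:
    """Toy counterpart of identifier creation: ASCII names skip normalization."""
    if name.isascii():
        trace.append("fast path")  # no call into Python-level code
        return name
    # Extra call taken only for non-ASCII names. Per the comment above, in the
    # C parser this is the call that was entered with an exception already set.
    trace.append("normalize")
    return unicodedata.normalize("NFKC", name)


trace = []
print(new_identifier("a", trace))
print(new_identifier("\ufb01le", trace))  # U+FB01 is the "fi" ligature
print(trace)
```

The sketch only shows why ASCII and non-ASCII names take different paths. It does not reproduce the crash, which depends on C-level exception state.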

pablogsal · 2020-05-18T16:56:04Z

Lib/test/test_peg_parser.py

@@ -591,6 +591,7 @@ def f(*a, b):
    ("f-string_single_closing_brace", "f'}'"),
    ("from_import_invalid", "from import import a"),
    ("from_import_trailing_comma", "from a import b,"),
+    ("import_non_ascii_syntax_error", "import ä £"),


Could you add this to test_syntax as well? This file will likely go away when we don't have two parsers to compare against anymore.

I have pushed this myself in commit 7353b6c
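For reference, a standalone regression check in the same spirit could look like the sketch below. It is not the test that was added to test_syntax: the class and method names are hypothetical, and it only asserts that both the ASCII input from the discussion and the non-ASCII input from the hunk above raise `SyntaxError` instead of crashing.

```python
import unittest


class NonAsciiSyntaxErrorTest(unittest.TestCase):  # hypothetical name
    def test_invalid_import_raises_syntax_error(self):
        sources = (
            "import a ?",            # ASCII counterpart from the discussion
            "import \u00e4 \u00a3",  # the string as shown in the hunk above
        )
        for source in sources:
            with self.subTest(source=source):
                with self.assertRaises(SyntaxError):
                    compile(source, "<test>", "exec")


if __name__ == "__main__":
    unittest.main()
```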

Fix segfaults when parsing very complex invalid input, like `import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½`.

Co-authored-by: Guido van Rossum <guido@python.org>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>

bpo-40661: Fix segfault when parsing invalid input

a1a9d28

Fix segfaults when parsing very complex invalid input, like `import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½`. Initially reported by @pablogsal.

lysnikolaou requested a review from pablogsal as a code owner May 17, 2020 23:11

the-knights-who-say-ni added the CLA signed label May 17, 2020

bedevere-bot added the awaiting review label May 17, 2020

lysnikolaou added the skip news label May 17, 2020

Check for errors at the beginning of each rule as well

9ec03ba

This is not really needed, but it's nice to have, so that we don't unnecessarily call `_PyPegen_fill_token` when there are errors and the parser will abort anyway.

aeros reviewed May 18, 2020

gvanrossum approved these changes May 18, 2020

bedevere-bot added awaiting merge and removed awaiting review labels May 18, 2020

Update Lib/test/test_peg_parser.py

b7be72c

pablogsal reviewed May 18, 2020

Add test also to test_syntax

7353b6c

pablogsal approved these changes May 18, 2020

pablogsal merged commit 7b7a21b into python:master May 18, 2020

bedevere-bot removed the awaiting merge label May 18, 2020

lysnikolaou deleted the fix-segfault branch May 18, 2020 17:33
