-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
bpo-40661: Fix segfault when parsing invalid input #20165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fix segfaults when parsing very complex invalid input, like `import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½`. Initially reported by @pablogsal.
This is not really needed, but it's nice to have, so that we don't unnecessariy call `_PyPegen_fill_token` when there are errors and the parser will abort anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was experimenting with the reproducer locally and might have discovered a more minimalist solution. I'm not certain as to whether this approach is any better or worse (my experience and knowledge w/ the parser is very limited), but this also seems to address the segfault:
diff --git a/Lib/test/test_peg_parser.py b/Lib/test/test_peg_parser.py
index 9614e45799..05de8582e7 100644
--- a/Lib/test/test_peg_parser.py
+++ b/Lib/test/test_peg_parser.py
@@ -591,6 +591,7 @@ FAIL_TEST_CASES = [
("f-string_single_closing_brace", "f'}'"),
("from_import_invalid", "from import import a"),
("from_import_trailing_comma", "from a import b,"),
+ ("import_very_complex", "import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½"),
# This test case checks error paths involving tokens with uninitialized
# values of col_offset and end_col_offset.
("invalid indentation",
diff --git a/Parser/pegen/pegen.c b/Parser/pegen/pegen.c
index 7f3e4561de..25b8db0982 100644
--- a/Parser/pegen/pegen.c
+++ b/Parser/pegen/pegen.c
@@ -758,6 +758,9 @@ _PyPegen_lookahead(int positive, void *(func)(Parser *), Parser *p)
Token *
_PyPegen_expect_token(Parser *p, int type)
{
+ if (p->error_indicator) {
+ return NULL;
+ }
if (p->mark == p->fill) {
Would it make any sense to check the error indicator at the beginning of _PyPegen_expect_token()
?
IMHO the solution Pablo found might be a bit more verbose and add 3 lines of code per alternative, but it makes more sense and is something we should have done, when we initially replaced |
@aeros What would really be helpful here is to understand the sequence of events that happens when the parser encounters that input, and compare it to what happens on similar ASCII input like |
The difference is that when the import is not ascii, this code path is taken: https://github.com/python/cpython/blob/master/Parser/pegen/pegen.c#L90-L105 and the call to So basically: for ASCII characters there is one call less that makes the error surface without problems, but with non-ASCII characters, there is some call into Python code that fails because is done with an exception already set. This can happen in so many other ways, so I think this PR is the safest thing to do because we make sure that nothing more is tried (and therefore no other calls can occur) when the error indicator is set. |
@@ -591,6 +591,7 @@ def f(*a, b): | |||
("f-string_single_closing_brace", "f'}'"), | |||
("from_import_invalid", "from import import a"), | |||
("from_import_trailing_comma", "from a import b,"), | |||
("import_non_ascii_syntax_error", "import ä £"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add this to test_syntax
as well? This file will likely go away when we don't have two parsers to compare against anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed this myself in commit 7353b6c
Fix segfaults when parsing very complex invalid input, like `import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½`. Co-authored-by: Guido van Rossum <guido@python.org> Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
Fix segfaults when parsing very complex invalid input, like
import äˆ ð£„¯ð¢·žð±‹á”€ð””ð‘©±å®ä±¬ð©¾\nð—¶½
.Initially reported by @pablogsal.
https://bugs.python.org/issue40661