Skip to content

The new parser segfaults when parsing invalid input #84838

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pablogsal opened this issue May 17, 2020 · 13 comments
Closed

The new parser segfaults when parsing invalid input #84838

pablogsal opened this issue May 17, 2020 · 13 comments
Labels
3.9 only security fixes deferred-blocker interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@pablogsal
Copy link
Member

BPO 40661
Nosy @gvanrossum, @vstinner, @lysnikolaou, @Zac-HD, @pablogsal
PRs
  • bpo-40661: Fix segfaults when parsing invalid input #20159
  • bpo-40661: Fix segfault when parsing invalid input #20165
  • Files
  • reproducer.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-05-18.17:32:26.774>
    created_at = <Date 2020-05-17.16:01:10.114>
    labels = ['interpreter-core', 'deferred-blocker', '3.9', 'type-crash']
    title = 'The new parser segfaults when parsing invalid input'
    updated_at = <Date 2020-05-18.17:32:26.773>
    user = 'https://github.com/pablogsal'

    bugs.python.org fields:

    activity = <Date 2020-05-18.17:32:26.773>
    actor = 'pablogsal'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-05-18.17:32:26.774>
    closer = 'pablogsal'
    components = ['Interpreter Core']
    creation = <Date 2020-05-17.16:01:10.114>
    creator = 'pablogsal'
    dependencies = []
    files = ['49169']
    hgrepos = []
    issue_num = 40661
    keywords = ['patch']
    message_count = 13.0
    messages = ['369132', '369133', '369134', '369136', '369142', '369144', '369164', '369176', '369181', '369235', '369249', '369263', '369280']
    nosy_count = 5.0
    nosy_names = ['gvanrossum', 'vstinner', 'lys.nikolaou', 'Zac Hatfield-Dodds', 'pablogsal']
    pr_nums = ['20159', '20165']
    priority = 'deferred blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue40661'
    versions = ['Python 3.9']

    @pablogsal
    Copy link
    Member Author

    The new peg parser segfaults when parsing the attached reproducer.

    this is due to the fact that the exception set by tokenizer_error is not properly propagated.

    @pablogsal pablogsal added release-blocker 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump labels May 17, 2020
    @pablogsal
    Copy link
    Member Author

    I think we may need to test for the error indicator (and maybe PyErr_Ocurred for safety) before every alternative. Something like:

    diff --git a/Tools/peg_generator/pegen/c_generator.py b/Tools/peg_generator/pegen/c_generator.py
    index 8f9972bb41..61cb694628 100644
    --- a/Tools/peg_generator/pegen/c_generator.py
    +++ b/Tools/peg_generator/pegen/c_generator.py
    @@ -468,10 +468,6 @@ class CParserGenerator(ParserGenerator, GrammarVisitor):
    memoize = self._should_memoize(node)

             with self.indent():
    -            self.print("if (p->error_indicator) {")
    -            with self.indent():
    -                self.print("return NULL;")
    -            self.print("}")
                 self.print(f"{result_type} _res = NULL;")
                 if memoize:
                     self.print(f"if (_PyPegen_is_memoized(p, {node.name}_type, &_res))")
    @@ -685,6 +681,12 @@ class CParserGenerator(ParserGenerator, GrammarVisitor):
         def visit_Alt(
             self, node: Alt, is_loop: bool, is_gather: bool, rulename: Optional[str]
         ) -> None:
    +        self.print("if (p->error_indicator == 1 || PyErr_Occurred()) {")
    +        with self.indent():
    +            self.print("p->error_indicator = 1;")
    +            self.print("return NULL;")
    +        self.print("}")
    +
             self.print(f"{{ // {node}")
             with self.indent():
                 # Prepare variable declarations for the alternative

    @pablogsal
    Copy link
    Member Author

    Indeed, that diff solves the problem

    @gvanrossum
    Copy link
    Member

    How costly is PyErr_Occurred()? That worries me most, otherwise I’d accept this right away.

    @pablogsal
    Copy link
    Member Author

    A quick benchmark using xxl.py:

    Base time (master):
    Time: 9.386 seconds on an average of 20 runs

    With the patch in this issue:
    Time: 9.498 seconds on an average of 20 runs

    Sadly I could not test with PGO/LTO and without CPU isolation, so it would be great if someone could double-check these numbers.

    Also, I will be unable to do a PR until this night/tomorrow morning (London time) :(

    @gvanrossum
    Copy link
    Member

    I see almost no time difference for 'make time_stdlib': before 3.471, after 3.451. But I see a serious difference for 'make time_compile': before 3.474, after 4.996. That's over 40% slower (on the extreme case, xxl.py).

    I'll prepare a PR just in case.

    @Zac-HD
    Copy link
    Mannequin

    Zac-HD mannequin commented May 18, 2020

    I understand from Paul Ganssle that this bug was found using Hypothesmith in my stdlib property tests (reported at Zac-HD/stdlib-property-tests#14).

    As discussed in we-like-parsers#91 and https://pyfound.blogspot.com/2020/05/property-based-testing-for-python.html I'm keen to help out how I can, so if there's anything more specific than "write tools, write test, and wait" please let me know!

    Best,
    Zac

    @gvanrossum
    Copy link
    Member

    Zac: The reproducer here apparently uses a long string of weird accented characters. I'm not sure how to generalize from that to other things that Hyothes* could find. But maybe this helps: #20106 (comment)

    @Zac-HD
    Copy link
    Mannequin

    Zac-HD mannequin commented May 18, 2020

    I know what else it might find either, but I still think it's worth running property-based tests in CI to find out! The demo I wrote for my language summit talk doesn't have any parser tests, but still would have caught this bug in the pull request that introduced it.

    The specific reproducer here is odd, because it's reported as an internal error in Hypothesmith - I use the compile() builtin to check that strings valid against an approximate grammar are actually valid.

    It's structurally less complex than typical outputs because it's only a fragment of the tree being generated, but because shrinking doesn't run for generation-time errors it's also much harder to interpret than usual.

    @gvanrossum
    Copy link
    Member

    Unfortunately, I do not understand enough about how Hypothes* works to be helpful here, other than by offering encouragement. (And please don't try to educate me. I have too many other things. Sorry.)

    @vstinner
    Copy link
    Member

    I don't think that such bug should block Python 3.9 beta1 (I'm talking about the "release blocker" priority). There is still time before 3.9 final to fix it.

    @gvanrossum
    Copy link
    Member

    Okay, deferring the blocker. But I'll still land Lysandros' fix ASAP.

    @pablogsal
    Copy link
    Member Author

    New changeset 7b7a21b by Lysandros Nikolaou in branch 'master':
    bpo-40661: Fix segfault when parsing invalid input (GH-20165)
    7b7a21b

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes deferred-blocker interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants