Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault when parsing with TOML_EXCEPTIONS=0 #65

Closed
sneves opened this issue Oct 9, 2020 · 14 comments
Closed

Segfault when parsing with TOML_EXCEPTIONS=0 #65

sneves opened this issue Oct 9, 2020 · 14 comments
Assignees
Labels
bug Something isn't working

Comments

@sneves
Copy link

sneves commented Oct 9, 2020

A very simple case, once again: parsing "#\xf1\x63" when disabling exceptions. The crash occurs in parser::parse_document when trying to continue parsing after reader.read_next() returns nullptr.

I'm not entirely sure why it happens; it looks like consume_comment should return false on advance_and_return_if_error({}), but somehow it does return true instead.

@sneves sneves added the bug Something isn't working label Oct 9, 2020
@marzer
Copy link
Owner

marzer commented Oct 9, 2020

Interesting. The test suite does run builds with TOML_EXCEPTIONS=0, so I need more information. Environment, compiler flags, etc., if you have them handy. (the bug report template does ask for these things)

I'll investigate on my machine all the same, but every bit helps.

@marzer
Copy link
Owner

marzer commented Oct 9, 2020

Oh, you know what, scratch that! I see what the issue is immediately. The core parse loop doesn't account for the possibility of a comment or malformed UTF-8 right at EOF. Should be easy to fix, thanks.

@marzer marzer closed this as completed in 6255dd7 Oct 9, 2020
@sneves
Copy link
Author

sneves commented Oct 9, 2020

Another input resulting in a null pointer dereference: "1= 0x6cA#+\xf1". This one fails when setting the error string, in to_sv(*cp) with a null cp.

I should mentioned that this happens after applying the 6255dd7 fix.

@marzer
Copy link
Owner

marzer commented Oct 9, 2020

Fixed, thanks @sneves!

Since you seem to enjoy digging a little bit, I'll explain what was actually happening here.

  1. The string contained a malformed utf8 byte sequence so the stream reader's read_next() captured an error and returned a null codepoint
  2. consume_comment() was detecting the error state and returning false (which is the intended behaviour; an error while consuming something is a failure to consume it)
  3. parse_document() was assuming the only false results here meant "nothing to consume", not taking into account the possibility for an error at this level.

Adding a single return_if_error_or_eof() to parse_document() was sufficient to handle the error propagation.

Worth noting that all this complexity evaporates if the library is built with exceptions, so if you can use them I highly recommend it (especially since all the explicit error-handling code dissapearing means the parser is a decent bit faster).

There's now a unit test to catch this, too:

SECTION("feedback - github/issues/65")
{
// see: https://github.com/marzer/tomlplusplus/issues/65
// this tests two things:
// - a comment at EOF
// - a malformed UTF-8 sequence
//
// it should fail to parse, but correctly issue an error (not crash!)
parsing_should_fail(FILE_LINE_ARGS, "#\xf1\x63");
}

@marzer
Copy link
Owner

marzer commented Oct 9, 2020

Another input resulting in a null pointer dereference: "1= 0x6cA#+\xf1". This one fails when setting the error string, in to_sv(*cp) with a null cp.

Yup, that will be the same issue. \xf1 is in-and-of-itself not a complete utf8 byte sequence.

@marzer
Copy link
Owner

marzer commented Oct 9, 2020

I should mentioned that this happens after applying the 6255dd7 fix.

Oh 😓

@marzer
Copy link
Owner

marzer commented Oct 9, 2020

These are pretty weird inputs @sneves; out of curiosity, where are you getting them from? Are you just making them up to stress-test the library (which I appreciate!), or is there some example garbage TOML you're plucking it from?

@marzer
Copy link
Owner

marzer commented Oct 9, 2020

Alright, should be good now. I also audited the rest of the to_sv(*cp) callsites for the possibility of cp being nullptr in the event of a stream error, and added a null-safe version instead.

This is good stuff, keep it coming!

@sneves
Copy link
Author

sneves commented Oct 9, 2020

These are inputs from fuzzing; they're not "natural" inputs.

@marzer
Copy link
Owner

marzer commented Oct 9, 2020

Ah, makes sense. Fuzzing has been on my to-do list for a while; next time I get a batch of more serious toml++ time (beyond basic bugfixes) I'd like to add a fuzzer to the CI.

@sneves
Copy link
Author

sneves commented Oct 9, 2020

Against the latest master, the following input crashes in consume_comment: "p=06:06:06#\x0b\xff".

I managed to fix it by adding return_if_error({}); after if (consume_line_break()) return true; but no idea if that's a proper fix.

@marzer
Copy link
Owner

marzer commented Oct 9, 2020

Ah, thanks. I'm heading out for the night but will have a look into this tomorrow (that and any other broken inputs you find).

@sneves
Copy link
Author

sneves commented Oct 10, 2020

A couple more cases:

  • "''''d' 't' '+o\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\r\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0cop1\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c' 'ml'\n\n%\x87" causes an infinite loop in parse_literal_string. This is actually caught with a debug assert, so I fixed it as
-                                       assert_not_error();
+                                       return_if_error({});
  • "bwit = [ 9,2, 3, "four", 5.0 ]\n t =[ 9, 2, 1,"r", 9999999999999999999999999999999999999999999999999999999999999995.0 ]" results in a stack overflow in parse_float, with chars[64] getting written to. I fixed this by moving the length check to immediately before the writing:
@@ -9911,10 +9912,6 @@ TOML_IMPL_NAMESPACE_START
                                                else if (!is_match(*prev, U'e', U'E'))
                                                        set_error_and_return_default("expected exponent digit, saw '"sv, to_sv(*cp), "'"sv);
                                        }
-                                       else if (length == sizeof(chars))
-                                               set_error_and_return_default(
-                                                       "exceeds maximum length of "sv, static_cast<uint64_t>(sizeof(chars)), " characters"sv
-                                               );
                                        else if (is_decimal_digit(*cp))
                                        {
                                                if (!seen_decimal)
@@ -9928,6 +9925,11 @@ TOML_IMPL_NAMESPACE_START
                                        else
                                                set_error_and_return_default("expected decimal digit, saw '"sv, to_sv(*cp), "'"sv);
 
+                                       if (length == sizeof(chars))
+                                               set_error_and_return_default(
+                                                       "exceeds maximum length of "sv, static_cast<uint64_t>(sizeof(chars)), " characters"sv
+                                               );
+
                                        chars[length++] = static_cast<char>(cp->bytes[0]);
                                        prev = cp;
                                        advance_and_return_if_error({});

Other than this, nothing's popped up for quite a while, so I think the low-hanging fruit could be over..

PS: As before, I can't swear by the correctness of these fixes, only that they made the immediate problem go away.

marzer added a commit that referenced this issue Oct 10, 2020
@marzer
Copy link
Owner

marzer commented Oct 10, 2020

Fantastic, thanks @sneves! I've applied fixes for these in master.

As before, I can't swear by the correctness of these fixes, only that they made the immediate problem go away.

Your fixes are sound. The parser is deliberately procedural and "C-like" so there's not a lot of hidden voodoo that goes into fixes and modifications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants