-
-
Notifications
You must be signed in to change notification settings - Fork 6.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow to set error handler for decoding errors #1314
Conversation
Proof of concept; currently only as parameter to the internal dump_escaped function; that is, not yet exposed to the dump function.
Test every prefix of Unicode sequences against the different dump functions.
Looks great!! Out of curiosity, I have added some tests for the "correct" number of replacement characters (as in Unicode 11, Section 3.9 -- U+FFFD Substitution of Maximal Subparts). All tests pass. Great work! SECTION("U+FFFD Substitution of Maximal Subparts")
{
// Some tests (mostly) from
// https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf
// Section 3.9 -- U+FFFD Substitution of Maximal Subparts
auto test = [&](std::string const& input, std::string const& expected) {
json j = input;
CHECK(j.dump(-1, ' ', true, json::error_handler_t::replace) == "\"" + expected + "\"");
};
test("\xC2", "\\ufffd");
test("\xC2\x41\x42", "\\ufffd" "\x41" "\x42");
test("\xC2\xF4", "\\ufffd" "\\ufffd");
test("\xF0\x80\x80\x41", "\\ufffd" "\\ufffd" "\\ufffd" "\x41");
test("\xF1\x80\x80\x41", "\\ufffd" "\x41");
test("\xF2\x80\x80\x41", "\\ufffd" "\x41");
test("\xF3\x80\x80\x41", "\\ufffd" "\x41");
test("\xF4\x80\x80\x41", "\\ufffd" "\x41");
test("\xF5\x80\x80\x41", "\\ufffd" "\\ufffd" "\\ufffd" "\x41");
test("\xF0\x90\x80\x41", "\\ufffd" "\x41");
test("\xF1\x90\x80\x41", "\\ufffd" "\x41");
test("\xF2\x90\x80\x41", "\\ufffd" "\x41");
test("\xF3\x90\x80\x41", "\\ufffd" "\x41");
test("\xF4\x90\x80\x41", "\\ufffd" "\\ufffd" "\\ufffd" "\x41");
test("\xF5\x90\x80\x41", "\\ufffd" "\\ufffd" "\\ufffd" "\x41");
test("\xC0\xAF\xE0\x80\xBF\xF0\x81\x82\x41", "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\x41");
test("\xED\xA0\x80\xED\xBF\xBF\xED\xAF\x41", "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\x41");
test("\xF4\x91\x92\x93\xFF\x41\x80\xBF\x42", "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\x41" "\\ufffd""\\ufffd" "\x42");
test("\xE1\x80\xE2\xF0\x91\x92\xF1\xBF\x41", "\\ufffd" "\\ufffd" "\\ufffd" "\\ufffd" "\x41");
} |
@abolz Thanks for the tests, I shall add them to the test suite. FYI: I used the library's Unicode test suite to systematically create a 7.5 million valid and invalid byte sequences and compared the dump outputs with those of Python. Good to know that I also covered those from the Unicode spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM apart from a typo. I must say however that I'm not really familiar mit the test macros used e.g. in the unit-unicode.cpp (in particular CAPTURE()
)
|
Thanks everyone! |
Fixes #1198