Soften the landing when dumping non-UTF8 strings (type_error.316 exception) #1198
Comments
I have slapped together a general-purpose UTF-8 trimming function that you may find useful. It takes the unsafe std::string as an argument and removes all invalid UTF-8 sequences from it.

#include <string>
#include <vector>

void trim_utf8(std::string& hairy) {
    std::vector<bool> results;
    std::string smooth;
    size_t len = hairy.size();
    results.reserve(len);
    smooth.reserve(len);
    const unsigned char *bytes = (const unsigned char *) hairy.c_str();
    auto read_utf8 = [](const unsigned char *bytes, size_t len, size_t *pos) -> unsigned {
        int code_unit1 = 0;
        int code_unit2, code_unit3, code_unit4;
        if (*pos >= len) goto ERROR1;
        code_unit1 = bytes[(*pos)++];
        if (code_unit1 < 0x80) return code_unit1;
        else if (code_unit1 < 0xC2) goto ERROR1; // continuation or overlong 2-byte sequence
        else if (code_unit1 < 0xE0) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; // 2-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            return (code_unit1 << 6) + code_unit2 - 0x3080;
        }
        else if (code_unit1 < 0xF0) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; // 3-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            if (code_unit1 == 0xE0 && code_unit2 < 0xA0) goto ERROR2; // overlong
            if (*pos >= len) goto ERROR2;
            code_unit3 = bytes[(*pos)++];
            if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
            return (code_unit1 << 12) + (code_unit2 << 6) + code_unit3 - 0xE2080;
        }
        else if (code_unit1 < 0xF5) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; // 4-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            if (code_unit1 == 0xF0 && code_unit2 < 0x90) goto ERROR2; // overlong
            if (code_unit1 == 0xF4 && code_unit2 >= 0x90) goto ERROR2; // > U+10FFFF
            if (*pos >= len) goto ERROR2;
            code_unit3 = bytes[(*pos)++];
            if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
            if (*pos >= len) goto ERROR3;
            code_unit4 = bytes[(*pos)++];
            if ((code_unit4 & 0xC0) != 0x80) goto ERROR4;
            return (code_unit1 << 18) + (code_unit2 << 12) + (code_unit3 << 6) + code_unit4 - 0x3C82080;
        }
        else goto ERROR1; // > U+10FFFF
    ERROR4:
        (*pos)--;
    ERROR3:
        (*pos)--;
    ERROR2:
        (*pos)--;
    ERROR1:
        // map the offending byte into the low surrogate range (0xDC00..0xDCFF) as an error marker
        return code_unit1 + 0xDC00;
    };
    unsigned c;
    size_t pos = 0;
    size_t pos_before;
    size_t inc = 0;
    bool valid;
    for (;;) {
        pos_before = pos;
        c = read_utf8(bytes, len, &pos);
        inc = pos - pos_before;
        if (!inc) break; // End of string reached.
        valid = false;
        if (   (c <= 0x00007F)
            || (c >= 0x000080 && c <= 0x0007FF)
            || (c >= 0x000800 && c <= 0x000FFF)
            || (c >= 0x001000 && c <= 0x00CFFF)
            || (c >= 0x00D000 && c <= 0x00D7FF)
            || (c >= 0x00E000 && c <= 0x00FFFF)
            || (c >= 0x010000 && c <= 0x03FFFF)
            || (c >= 0x040000 && c <= 0x0FFFFF)
            || (c >= 0x100000 && c <= 0x10FFFF)) valid = true;
        if (c >= 0xDC00 && c <= 0xDCFF) {
            valid = false;
        }
        do results.push_back(valid); while (--inc);
    }
    size_t sz = results.size();
    for (size_t i = 0; i < sz; ++i) {
        if (results[i]) smooth.append(1, hairy.at(i));
    }
    hairy.swap(smooth);
}
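For example, a caller could sanitize a string before handing it to the library. This is only a minimal usage sketch (trim_utf8 as defined above; the byte sequence is just an arbitrary invalid example):

#include <nlohmann/json.hpp>
#include <string>

// trim_utf8 as defined above

int main() {
    std::string user_input = "log line \xF0\xA4\xAD\xC0 from the outside world"; // contains invalid UTF-8
    trim_utf8(user_input);              // the invalid bytes are removed in place
    nlohmann::json j = user_input;      // safe to store...
    std::string s = j.dump();           // ...and to dump without type_error.316
}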
Thanks! I'll check. I already have some code to translate from UTF-16 and UTF-32 to UTF-8 to parse from wide strings. But this would only help if your string is properly encoded with these encodings.
Yeah, I can't agree that the correct solution is to remove the invalid UTF-8. And I'm also strongly against the practice of "rolling your own Unicode conversion routines." If the Unicode Consortium had to withdraw the much-used source code that forms the basis of 80% of (the better) conversions due to it being wildly inaccurate with respect to today's Unicode, I dread the corner cases that would arise from something you wrote yourself. The only library I trust is boost::nowide (available as a separate, non-boost package), and that does not handle UTF-32.
See http://unicode.org/forum/viewtopic.php?f=9&t=90 for more information on the ConvertUTF8.h library and its withdrawal.
... excerpt from http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G7404 (page 122)
See page 126 (4 pages down) for the tables in question.
At this point, I'd be much more concerned with the requirement to have an additional library (even without boost, it's still a few files - though purely .hpp). I think this falls under "author's discretion."
Right now, the library only supports UTF-8. (It recently added support to parse from wide strings, but internally, everything is stored as UTF-8). We had the issue that invalid UTF-8 was stored in the library and dumped without notice, and the issue was only apparent when this value was parsed again. Therefore, we added an exception when dumping invalid UTF-8. One may argue that it would be even better to throw when invalid UTF-8 is stored in the first place. I am not sure what the best approach would be.
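To make the behavior under discussion concrete, here is a minimal sketch (assuming the library's current default behavior; the byte sequence is just an arbitrary invalid example):

#include <nlohmann/json.hpp>
#include <string>

int main() {
    // invalid UTF-8 is accepted silently on assignment...
    nlohmann::json j = std::string("\xF0\xA4\xAD\xC0");
    try {
        std::string s = j.dump(); // ...but serializing it throws [json.exception.type_error.316]
    } catch (const nlohmann::json::type_error&) {
        // the invalid value is only discovered here, possibly far from where it was stored
    }
}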
@nlohmann exactly! Thanks for at least getting my point. I would definitely argue that if invalid UTF-8 will throw, then it should throw during assignment, not during dumping. @sfinktah If you didn't know, JSON has no formal meaning for non-UTF-8 byte sequences. It's as if JSON was color-blind to a certain group of byte sequences. Do color-blind people get thrown an exception every time they look at something that is either red or green? NO. It would be silly. Do you suffer a seizure when infrared light gets into your eyes? Think of nlohmann's json library the same way. Its main purpose is to make the developer's life easier, not harder. Stripping out invalid UTF-8 sequences is the way to go. I already proposed a std::nothrow version for the assignment, so if you love your exceptions so much you'd have to do nothing, as the proposed solution would be opt-in. I would advise you to rethink your position.
@1Hyena Traffic lights are arranged in a specific color sequence precisely so that people who suffer red/green color blindness can determine what "color" a light is. In fact, I believe the positioning/order of the colors is actually a world-wide standard. JSON also has no formal definition for capital letters, French, or spacing between words. @nlohmann I've completed my proof of concept, showing how nowide can be of use. Please do not confuse this with a recommendation that you use nowide. It was just a trusted tool that I had to hand. You may check out the function proof-of-concept at https://github.com/sfinktah/utftest/ It is coded a bit absurdly, but it proves a point. Namely, that an input of
can be encoded without any loss, to
using (purely as an example) the boost::nowide library as a filter for valid UTF-8 sequences. And, FTR, I'm a subscriber of the UTF-8 Everywhere manifesto. I despise UTF-16.

// g++ -I. -std=c++1z testutf2.cpp -o testutf2 && ./testutf2
#include <algorithm>
#include <array>
#include <cstdio>
#include <cstdlib>
#include <nowide/convert.hpp>
#include <string>
#include <vector>

using namespace std::string_literals;
using namespace nowide;

std::string escape_json_char(int c) {
    using T = char;
    static auto escape_code = [](auto d) { return "\\"s + (T)d; };
    static auto unicode_escape = [](auto d) {
        static const std::array<char, 16> hexify = {
            {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'}
        };
        auto esc = escape_code('u');
        // this will misbehave on numbers greater than 0xffff
        for (int i = 0; i < 4; ++i)
            esc += (T)hexify[(d & 0xf000) >> 12], d <<= 4;
        return esc;
    };
    switch (c) {
        case '\b': return escape_code('b'); // 0x08
        case '\t': return escape_code('t'); // 0x09
        case '\n': return escape_code('n'); // 0x0a
        case '\r': return escape_code('r'); // 0x0d
        case '\\': return escape_code('\\');
        case '"':  return escape_code('"');
    }
    if (c >= 32 && c <= 126)
        return std::basic_string<T>(1, (T)c); // one copy of the printable character
    return unicode_escape(c);
}

template <class T> auto escape_json(T const *p) { return escape_json(std::basic_string<T>(p)); }
template <class T> auto escape_json(std::basic_string<T> s) {
    std::vector<std::string> escaped;
    std::transform(s.begin(), s.end(), std::back_inserter(escaped), escape_json_char);
    return escaped;
}

int main() {
    std::string invalid("\u0006\t\t\u2514\u2500\u00e4\u2518\u00b4\u00d8\u008d\"");
    auto _wide   = widen(invalid);
    auto _narrow = narrow(_wide);
    printf("%s\n", invalid.c_str()); fflush(stdout);
    printf("%s\n", _narrow.c_str()); fflush(stdout);
    printf("%s\n", invalid == _narrow ? "==" : "!="); fflush(stdout);
    printf("narrow was %zu characters, wide was %zu characters.\n", _narrow.length(), _wide.length());
    auto escaped = escape_json(_wide);
    for (const auto& e : escaped)
        printf("%s", e.c_str());
    printf("\n\ntest complete.\n"); fflush(stdout);
    exit(0);
}
/*
� └─ä┘´Ø�"
� └─ä┘´Ø�"
==
narrow was 21 characters, wide was 11 characters.
\u0006\t\t\u2514\u2500\u00e4\u2518\u00b4\u00d8\u008d\"

test complete.
*/
/* vim: set ts=4 sts=4 sw=4 et: */
@sfinktah the traffic lights comparison does not make any sense. I am asking you a simple question. Your data format is JSON and you are given a string that contains byte sequences that can't be expressed in JSON. What are you going to do? You have to do something; you just can't shut down the application. You also can't ignore the input, because you need as much information as possible. You can't reject the input either. How are you going to solve this problem in the REAL WORLD, which is often dirty and not perfect? You are appearing as someone living in a glass castle on a puffy cloud. Sure, it's easy to whine and moan about abstract ideals like a dogmatic zealot of some particular programming religion. But in the real world, problems have to be solved in the best possible way. Sacrifices and compromises are a normal thing in the real world. And I see nlohmann::json as a library for the real world, not for a mathematical utopia. So think again: is it really a bad idea to discard the byte sequences that couldn't ever be expressed in JSON anyway, especially when the programmer has deliberately indicated that it is okay by calling the nothrow version of string assignment? As for UTF-16 and anything else, I agree with you. UTF-8 is the only sensible character encoding and anything else is a cancer. Call me Hitler, I don't care. I am proudly a UTF-8 supremacist.
For what it's worth: I think filtering invalid UTF-8 sequences is not a good idea.
@nlohmann any better alternative? If you're building an application that must overcome an exceptional situation and still give the best possible performance, then I would say this is the only ideal solution.
I think it is out of scope of the library to fix invalid UTF-8. I further do not think that silently fixing invalid inputs makes the world a better place; rather, such inputs should be rejected with as much noise and as early as possible.
@nlohmann I understand your frustration that JSON isn't compatible with non-UTF-8 strings; this indeed causes a lot of mess, but we have to deal with this mess in the best possible way. The header is already a giant; adding one more function to it won't be much of a problem, but it saves a ton of lines of code for the devs using your lib. Besides, with my proposed solution you can get rid of this exception once and for all. When assigning a non-UTF-8 string to the json object, the contained string will be shorter than the string that was originally assigned. That makes it possible for the dev to quickly check whether the original string contained any invalid UTF-8 sequences. You already have UTF-related code in the header anyway, so "out of scope" doesn't really apply here.

    std::string invalid_utf8 = "\xF0\xA4\xAD\xC0";
    nlohmann::json json = invalid_utf8; // This would work and would never throw.
    // Oh, but I want to know if the input was invalid UTF-8? What should I do?
    if (json.length() != invalid_utf8.length()) {
        // we have a problem
    }

In this situation, if you rely on try/catch blocks, every string assignment would have to be surrounded with them, littering the code and making it ugly with no benefit. If I have to write extra code EITHER WAY, then I might as well trim the strings myself. So no one would ever actually need your exception to begin with. I hope this finally explains my reasoning, because it seems as if we have a situation where some of us have developed a Stockholm syndrome regarding exceptions. Don't just sweep the problem under the carpet, solve it. Assume that the devs using this lib are not going to surround each string assignment with try/catch blocks no matter how much you would like it. And you already agreed that throwing from dumping is an even worse fallacy than throwing from string assignments, so it pretty much leaves you with no other sane solution than to trim the UTF-8. Prove me wrong or admit your bias.
Listen, I do not need to prove anyone wrong or admit anything.
@1Hyena or you could just do this, which also avoids the exceptions:
I think it's perfectly reasonable for this library to assume the strings it gets are valid (i.e., perfect UTF-8). Imagine the chaos if programmers claimed that
Sure, this library is about text parsing and needs to be aligned towards text-processing things, but that doesn't mean this library should be burdened with a grab-bag of Unicode transcoding functions. There are libraries whose sole purpose is to do exactly that, and they'll do it far better, far faster, and with far fewer bugs than if this library took those on as a nice-to-have side project. If the programmer is unsure whether a string may or may not be valid UTF-8 at any given point in the program, the onus is on them to make sure their code translates them to "valid" inputs for whatever functions they call (and gregmarr provides a great example for how to do that in his comment).
@gregmarr yes, that would work, although it would be only slightly better/prettier than surrounding each string assignment with try/catch blocks. @jaredgrubb contrary to @nlohmann's response, I respect yours for the reason that you actually bothered to explain. However, I still cannot completely agree with you. Do you know what overexposure is in photography? Well, it gets handled the same way as trimming UTF-8 would be. If you want to insist that UTF-8 related code has no place in this lib, then go to its source code and search for the word
All I'm trying to say is that the current implementation of this json lib is fertile ground for shooting yourself in the foot, and I was hoping to find a way to avoid it. I like your reasoning though; @nlohmann should try to learn from it. I think this situation is similar to escaping in general, for example to avoid SQL injections. Should the lib escape the potentially harmful string for the dev or not? It depends, but most of the time such escaping is not done automatically, and in that light I tend to agree with you. If I am absolutely 100% sure that my strings are always UTF-8, then of course such trimming would be pointless overhead. Yeah, in the end, after giving this a second thought, I think the correct approach would be to throw from string assignments but never from dump.
You do realize that this is his project, and he's free to do with it as he wishes, right? For you to come in here and insist that he must "prove" or "admit" or "learn" anything just because you say so is extremely rude.
It's part of sanitizing user input. The user input you're getting may be improper in any number of ways. It's up to you to make sure that it is properly formed before passing it along to any other tool. There could be any number of ways for the library to respond to invalid strings, and not one of them is proper for every use case. Your application is the best one to know whether an invalid string should result in an error being returned to the user, or a warning being returned to the user with a best-effort result, or no error being returned to the user with the invalid portions just removed in an arbitrary fashion.
That's not what was said. What was said was that code which takes things that are supposed to be valid UTF-8 but are not, and somehow fixes them to be UTF-8, such as by silently dropping those bytes, does not belong here.
@1Hyena Could you please point out where I was "rude to begin with"? I would also like to remind you of the Code of Conduct of this project.
I try to summarize some options we have in this situation. Not all of them were mentioned, but I try to get a picture of all possible ways to proceed:
Did I miss anything?
String assignments with unknown, unvalidated contents, yes.
Uh, no:
There's been plenty of discussion here.
It's a fairly common practice for personal projects to use the github username to avoid namespace conflicts. |
For completeness' sake, pointing out that this is the previous behavior, which people didn't like because it meant that the library could create a string that it couldn't parse.
This is a core idea of input sanitization: you should know 100% whether your strings are valid or not. The areas of your code where you allow uncertainty should be at the "edge" and isolated to as small a location as possible so that the rest of your code can be simple and assume validity. If you need to propagate "maybe-valid" strings deeper into a program, then create a class whose sole purpose is to hold them (such that you get type safety and cannot accidentally send possibly-invalid strings into APIs that require valid ones; you can do the same thing with sensitive strings like passwords). That class can provide unbox methods to turn the "maybe-valid" into valid strings using whatever error recovery/substitution is appropriate for the use case. There is no one obvious way to sanitize strings and encodings (as nlohmann is showing by enumerating all the options). It's up to the client to do that validation before they attempt to use this library. Also, please be more civil in the discourse. It's ok to disagree, argue, and storm out and fork your own vision. But @nlohmann has been a great project maintainer (with thicker skin than I would have in his shoes!). Personal attacks are not helpful and have no place in GitHub. If this entreaty doesn't work, just remember that these comments are forever and a future employer may stumble on them one day -- put your best image forward!
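A minimal sketch of that wrapper idea, with purely illustrative names (is_valid_utf8 and strip_invalid_utf8 stand in for whatever validation routine the application already trusts; none of this is an API of this library):

#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical validation helpers, assumed to exist elsewhere in the application.
bool is_valid_utf8(const std::string& bytes);
std::string strip_invalid_utf8(const std::string& bytes);

// Holds bytes that have NOT yet been validated as UTF-8, so they cannot be
// passed by accident to an API that expects valid strings.
class MaybeUtf8 {
public:
    explicit MaybeUtf8(std::string bytes) : bytes_(std::move(bytes)) {}

    // Unboxing forces the caller to choose an error-recovery policy.
    std::string or_throw() const {
        if (!is_valid_utf8(bytes_)) throw std::invalid_argument("not valid UTF-8");
        return bytes_;
    }
    std::string or_stripped() const { return strip_invalid_utf8(bytes_); }

private:
    std::string bytes_;
};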
Oh dear. cough... The Unicode Standard (http://www.unicode.org/versions/Unicode11.0.0/) expressly forbids the fixing or repair of broken Unicode; there are significant security issues with doing so. One may take the view that removing invalid portions of an input string is "repairing." There are certainly still security issues involved. E.g., you filter out files with an extension of "exe"; someone packages a file called "trojan.ex\xd8\xd8e"; it passes the filter, but by the time it's recorded in the database, the file has been silently altered to "trojan.exe". @nlohmann Yes, I believe you did miss something. Or mislabeled it... My provided POC encoded binary strings that were UTF-8 invalid. It did not fix the UTF-8. It merely encoded the non-UTF-8 values as individual bytes. The resultant decode would prove identical to the broken input, therefore nothing has been repaired. It has simply been encoded correctly (if you accept "the way
The fact that the encoded result will decode without modification through the existing decoder is both a bonus and supporting evidence that this is indeed the correct way to encode non-conformant strings. @1Hyena if my JSON library couldn't support encoding the strings I was using, I would use another library or encoding method. P.S. did nobody notice that I wrote a spec-compliant, fully functional JSON string encoder in less than 38 lines of code? n.b.: actual function may not be fully functional, spec-compliant, and should never be used in any code, ever. If design simplicity has any correlation with correctness (and I'm fairly sure it doesn't) ...
Ok, forget this... it's actually nonsense: E.g., serializing |
Proof of concept; currently only as parameter to the internal dump_escaped function; that is, not yet exposed to the dump function.
I opened a work-in-progress branch (https://github.com/nlohmann/json/compare/feature/codec_errors) for the idea sketched in #1198 (comment). Feedback is more than welcome! The branch currently only handles serialization.
I feel the term "ignore" is unclear about whether the invalid sequence is preserved or dropped. Perhaps "strip" or "remove"? OTOH, I think you're copying Python's terms and that is a valuable thing to do -- so maybe just update the comments to say that "ignore" means the invalid bytes are skipped.
@jaredgrubb not sure if you meant my opening comment, but I added a small edit to clarify what I meant by ignoring. Later I proposed a trim function which always produces a valid UTF-8 string. Should probably have named it
@jaredgrubb I took the names from Python. You're right, the public API needs better documentation. But I would like to focus on the implementation first.
Oh, sorry for the confusion from my comment; it was a reply to @nlohmann's code sketch.
@nlohmann oh wow, thanks for putting in the work, this looks exactly like what I imagined. It turns out that in our case we will probably just not store the offending data in JSON in the future, because we do actually need the broken sequences. Still, I think this would be a great addition to your library and make it much more robust.
I reworked the code and added some test cases. The error handler is now a new optional parameter to the dump function.
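A rough usage sketch, assuming the parameter order from that branch (indent, indent character, ensure_ascii, then the new error handler enum):

#include <nlohmann/json.hpp>
#include <string>

int main() {
    nlohmann::json j = std::string("\xF0\xA4\xAD\xC0");

    // strict (the default): throws type_error.316, as before
    // std::string s0 = j.dump();

    // replace: invalid byte sequences become U+FFFD (REPLACEMENT CHARACTER)
    std::string s1 = j.dump(-1, ' ', false, nlohmann::json::error_handler_t::replace);

    // ignore: invalid byte sequences are silently dropped
    std::string s2 = j.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);
}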
One good thing about handling this situation in the dump function (rather than during assignment) is that it respects the principle of lazy evaluation. We could argue that not all json instances end up getting dumped, and thus checking for invalid UTF-8 in those does not make any difference (it only wastes CPU).
Test every prefix of Unicode sequences against the different dump functions.
I now think I finished the implementation of the error handlers for the dump function, see #1314. It would be great if someone could have a look.
I just think it should support some other natural-language encodings, like GB18030 (GB2312/GBK), Big5, etc. If the text is only English letters it looks like UTF-8, but otherwise parsing will throw too. This has caused a lot of trouble for developers working with other natural languages.
And jsoncpp can successfully parse such input, so...
@nlohmann I just want it to support parsing other natural-language encodings.
@fawdlstty This library only supports UTF-8, sorry.
When dumping a json object that contains a std::string, the type_error.316 exception can be thrown if the string contains invalid UTF-8 sequences. While this sounds like a reasonable thing to do, it actually causes more problems than it fixes. In all the cases where user-entered (unsafe) strings end up stored in nlohmann::json objects, the developer now has to "do something" before assigning a string value to some nlohmann::json variable. This effectively renders the whole string assignment functionality unsafe and defeats its purpose.
Below is the wrapper code I had to write in order to investigate the random crashes my application went through due to the 316 exception.
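A hypothetical sketch of the general shape of such a wrapper (names and the fallback value are illustrative only; it catches the exception and records where the call was made using __builtin_FILE()/__builtin_LINE(), as described further down):

#include <nlohmann/json.hpp>
#include <cstdio>
#include <string>

// Hypothetical wrapper: dump, but report the call site instead of crashing.
std::string safe_dump(const nlohmann::json& j,
                      const char* file = __builtin_FILE(),
                      int line = __builtin_LINE()) {
    try {
        return j.dump();
    } catch (const nlohmann::json::type_error& e) {
        std::fprintf(stderr, "dump failed at %s:%d: %s\n", file, line, e.what());
        return "null"; // illustrative fallback only
    }
}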
This led me to the discovery of the code in my application that was sending JSON-formatted log events to the log server. The log event has to store user-entered data, and I would expect the dump function to deal with invalid UTF-8 sequences.
If I have to use my json dump wrapper everywhere in my application code, then of what use is the original dump function to begin with? What is more, I'd actually have to enhance the wrapper function to iterate over all strings stored in the json object and do something about the possible invalid UTF-8 sequences (a sketch of that follows below). Not very convenient.
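A sketch of what that enhancement might look like, recursing over the value and applying a sanitizer such as the trim_utf8 function proposed earlier in this thread (names are illustrative; object keys are not touched here):

#include <nlohmann/json.hpp>
#include <string>

void trim_utf8(std::string& hairy); // e.g. the trimming function proposed above

// Recursively sanitize every string stored anywhere in the JSON value.
void sanitize_strings(nlohmann::json& j) {
    if (j.is_string()) {
        std::string s = j.get<std::string>();
        trim_utf8(s);
        j = s;
    } else if (j.is_object() || j.is_array()) {
        for (auto& element : j) {
            sanitize_strings(element);
        }
    }
}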
Therefore, I would propose that the default behavior of the dump function be to simply ignore (skip or replace) invalid UTF-8 sequences and NOT throw. Or perhaps add a nothrow parameter to the string assignment = operator. If left like that, I probably won't be the last developer who assumed that the dump method is safe to use. After all, if the lib allows me to assign a value to the json object, then how can it be that it lets me assign values that later invalidate the whole json? This is illogical.
One more thing. Why not include the file and line from where the exception originated in the exception description? You can see in my above wrapper function how I am using __builtin_LINE() and __builtin_FILE() to store the file and line information from where the wrapper call was made. It is super useful when debugging, and exception descriptions are all about debugging.