
Soften the landing when dumping non-UTF8 strings (type_error.316 exception) #1198

Closed · 1Hyena opened this issue Aug 17, 2018 · 67 comments
Labels: release item: ✨ new feature · solution: proposed fix (a fix for the issue has been proposed and waits for confirmation)
Milestone: Release 3.4.0

@1Hyena commented Aug 17, 2018

When dumping a JSON object that contains an std::string, a type_error.316 exception is thrown if the string contains invalid UTF-8 sequences. While this sounds like a reasonable thing to do, it actually causes more problems than it fixes. In every case where user-entered (unsafe) strings end up stored in nlohmann::json objects, the developer now has to "do something" before assigning a string value to a nlohmann::json variable. This effectively renders the whole string-assignment functionality unsafe and defeats its purpose.

Below is the wrapper code I had to write in order to investigate the random crashes my application went through due to the 316 exception.

// Declaration:
std::string dump_json(const nlohmann::json &json, const int indent = -1, const char* file = __builtin_FILE(), int line = __builtin_LINE()) const;

// Definition
std::string MANAGER::dump_json(const nlohmann::json &json, const int indent, const char* file, int line) const {
    std::string result;
    try {
        result = json.dump(indent);
    }
    catch (nlohmann::json::type_error& e) {
        bug("%s: %s (%s:%d)", __FUNCTION__, e.what(), file, line);
    }
    return result;
}

This led me to discover the code in my application that was sending JSON-formatted log events to the log server. The log event has to store user-entered data, and I would expect the dump function to deal with invalid UTF-8 sequences.

If I have to use my JSON dump wrapper everywhere in my application code, then of what use is the original dump function to begin with? What is more, I'd actually have to enhance the wrapper function to iterate over all strings stored in the json object and do something about possible invalid UTF-8 sequences. Not very convenient.

Therefore, I propose that the default behavior of the dump function be to simply ignore (skip or replace) invalid UTF-8 sequences and NOT throw. Or perhaps add a nothrow parameter to the string assignment operator=.

If this is left as it is, I probably won't be the last developer to assume that the dump method is safe to use. After all, if the library allows me to assign a value to a json object, how can it let me assign values that later invalidate the whole JSON? This is illogical.

// nothrow assignment would look like this
nlohmann::json j_nothrow = (std::nothrow) "\xF0\xA4\xAD\xC0"; // Does not throw.
// I am not 100% sure if this can be done, but since the `new` operator can have such
// a parameter I would suppose the assignment operator = could also be customized
// like that.

std::string dumped = j_nothrow.dump(); // Never throws, just works.

nlohmann::json j_throw;
try {
    j_throw = "\xF0\xA4\xAD\xC0"; // Throws immediately.
}
catch (nlohmann::json::type_error& e) {
    // handle the invalid UTF8 string
}

One more thing: why not include the file and line from where the exception originated in the exception description? You can see in my wrapper function above how I am using __builtin_LINE() and __builtin_FILE() to store the file and line information from where the wrapper call was made. It is super useful when debugging, and an exception description is all about debugging.

@1Hyena (Author) commented Aug 17, 2018

I have slapped together a general-purpose UTF-8 trimming function that you may find useful.

It takes the unsafe std::string as an argument and removes all invalid UTF-8 sequences from it.

void trim_utf8(std::string& hairy) {
    std::vector<bool> results;
    std::string smooth;
    size_t len = hairy.size();
    results.reserve(len);
    smooth.reserve(len);
    const unsigned char *bytes = (const unsigned char *) hairy.c_str();

    auto read_utf8 = [](const unsigned char *bytes, size_t len, size_t *pos) -> unsigned {
        int code_unit1 = 0;
        int code_unit2, code_unit3, code_unit4;

        if (*pos >= len) goto ERROR1;
        code_unit1 = bytes[(*pos)++];

             if (code_unit1 < 0x80) return code_unit1;
        else if (code_unit1 < 0xC2) goto ERROR1; // continuation or overlong 2-byte sequence
        else if (code_unit1 < 0xE0) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; //2-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            return (code_unit1 << 6) + code_unit2 - 0x3080;
        }
        else if (code_unit1 < 0xF0) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; // 3-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            if (code_unit1 == 0xE0 && code_unit2 < 0xA0) goto ERROR2; // overlong
            if (*pos >= len) goto ERROR2;
            code_unit3 = bytes[(*pos)++];
            if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
            return (code_unit1 << 12) + (code_unit2 << 6) + code_unit3 - 0xE2080;
        }
        else if (code_unit1 < 0xF5) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; // 4-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            if (code_unit1 == 0xF0 && code_unit2 <  0x90) goto ERROR2; // overlong
            if (code_unit1 == 0xF4 && code_unit2 >= 0x90) goto ERROR2; // > U+10FFFF
            if (*pos >= len) goto ERROR2;
            code_unit3 = bytes[(*pos)++];
            if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
            if (*pos >= len) goto ERROR3;
            code_unit4 = bytes[(*pos)++];
            if ((code_unit4 & 0xC0) != 0x80) goto ERROR4;
            return (code_unit1 << 18) + (code_unit2 << 12) + (code_unit3 << 6) + code_unit4 - 0x3C82080;
        }
        else goto ERROR1; // > U+10FFFF

        ERROR4:
        (*pos)--;
        ERROR3:
        (*pos)--;
        ERROR2:
        (*pos)--;
        ERROR1:
        return code_unit1 + 0xDC00;
    };

    unsigned c;
    size_t pos = 0;
    size_t pos_before;
    size_t inc = 0;
    bool valid;

    for (;;) {
        pos_before = pos;
        c = read_utf8(bytes, len, &pos);
        inc = pos - pos_before;
        if (!inc) break; // End of string reached.

        valid = false;

        if ( (                 c <= 0x00007F)
        ||   (c >= 0x000080 && c <= 0x0007FF)
        ||   (c >= 0x000800 && c <= 0x000FFF)
        ||   (c >= 0x001000 && c <= 0x00CFFF)
        ||   (c >= 0x00D000 && c <= 0x00D7FF)
        ||   (c >= 0x00E000 && c <= 0x00FFFF)
        ||   (c >= 0x010000 && c <= 0x03FFFF)
        ||   (c >= 0x040000 && c <= 0x0FFFFF)
        ||   (c >= 0x100000 && c <= 0x10FFFF) ) valid = true;

        if (c >= 0xDC00 && c <= 0xDCFF) {
            valid = false;
        }

        do results.push_back(valid); while (--inc);
    }

    size_t sz = results.size();
    for (size_t i = 0; i < sz; ++i) {
        if (results[i]) smooth.append(1, hairy.at(i));
    }

    hairy.swap(smooth);
}
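For illustration, a rough usage sketch (the surrounding names are hypothetical; trim_utf8 is the function above):

// Hypothetical usage: sanitize user input before it ever reaches a json value.
std::string user_input = "hello \xF0\xA4\xAD\xC0 world"; // the \xF0... run is not valid UTF-8
trim_utf8(user_input);                  // removes the ill-formed bytes in place
nlohmann::json event;
event["message"] = user_input;          // safe to assign now
std::string payload = event.dump();     // no type_error.316, the stored string is valid UTF-8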

@nlohmann (Owner)

Thanks! I'll check. I already have some code to translate from UTF-16 and UTF-32 to UTF-8 to parse from wide strings. But this would only help if your string is properly encoded with these encodings.
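For reference, parsing from wide strings looks roughly like this (a sketch assuming the recently added wide-string overloads; the literals are just examples):

#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

std::wstring   wide  = L"{\"key\": \"value\"}";
std::u16string utf16 = u"[1, 2, 3]";

json j1 = json::parse(wide);   // converted from the wide encoding to UTF-8 internally
json j2 = json::parse(utf16);  // same for UTF-16 input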

@sfinktah

Yeah, I can't agree that the correct solution is to remove the invalid UTF-8. And I'm also strongly against the practice of "rolling your own unicode conversion routines."

If the unicode consortium had to withdraw the much-used source-code that forms the basis of 80% of (the better) conversions due to it being wildly inaccurate with respect to today's unicode, I dread the corner cases that would arise from something you wrote yourself.

The only library I trust is boost::nowide (available as a separate, non-Boost package), and that does not handle UTF-32.

@sfinktah

See http://unicode.org/forum/viewtopic.php?f=9&t=90 for more information on the ConvertUTF8.h library and its withdrawal.

@sfinktah

... excerpt from http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G7404 (page 122)

• The UTF-8 code unit sequence <41 C3 B1 42> is well-formed, because it can be
partitioned into subsequences, all of which match the specification for UTF-8
in Table 3-7. It consists of the following minimal well-formed code unit subsequences:
<41>, <C3 B1>, and <42>.

• The UTF-8 code unit sequence <41 C2 C3 B1 42> is ill-formed, because it contains
one ill-formed subsequence. There is no subsequence for the C2 byte
which matches the specification for UTF-8 in Table 3-7.

See page 126 (4 pages down) for the tables in question.

@sfinktah

boost::nowide appears to be capable of lossless conversion of invalid code sequences between UTF-8 and UTF-16. Though as mentioned, it has no UTF-32 support. This may not be an issue, since simple iteration through the std::wstring UTF-16 version should allow one to distinguish between valid and invalid UTF, and encode appropriately.

At this point, I'd be much more concerned with the requirement to have an additional library (even without boost, it's still a few files - though purely .hpp).

I think this falls under "author's discretion."

#include <cstdio>
#include <cstdlib>
#include <string>
#include <nowide/convert.hpp>

using namespace nowide;

int main(int argc, const char **argv) {
    std::string invalid("\u0006\u00e4\u00b4\u00d8\u008d\"");

    auto _wide = widen(invalid);
    auto _narrow = narrow(_wide);

    printf("%s\n", invalid.c_str());                  fflush(stdout);
    printf("%s\n", _narrow.c_str());                  fflush(stdout);
    printf("%s\n", invalid == _narrow ? "==" : "!="); fflush(stdout);
    printf("test complete.\n");                       fflush(stdout);

    printf("narrow was %zu characters, wide was %zu characters.\n",
            _narrow.length(), _wide.length());
    exit(0);
}

/*
�ä´Ø�"
�ä´Ø�"
==
test complete.
narrow was 10 characters, wide was 6 characters.
*/

https://github.com/artyom-beilis/nowide

@nlohmann (Owner)

Right now, the library only supports UTF-8. (It recently added support for parsing from wide strings, but internally, everything is stored as UTF-8.) We had the issue that invalid UTF-8 was stored in the library and dumped without notice, and the problem only became apparent when that value was parsed again. Therefore, we added an exception when dumping invalid UTF-8. One may argue that it would be even better to throw when invalid UTF-8 is stored in the first place.

I am not sure how nowide would help here.
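To make the status quo concrete, a minimal sketch (the byte string is just one example of an ill-formed sequence):

#include <nlohmann/json.hpp>
#include <string>

int main() {
    nlohmann::json j = "\xF0\xA4\xAD\xC0";   // ill-formed UTF-8; the assignment itself does not complain
    try {
        std::string s = j.dump();            // throws type_error.316 (invalid UTF-8 byte)
    } catch (const nlohmann::json::type_error& e) {
        // e.id == 316; e.what() names the offending byte
    }
}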

@1Hyena (Author) commented Aug 26, 2018

@nlohmann Exactly! Thanks for at least getting my point. I would definitely argue that if invalid UTF-8 is going to throw, then it should throw during assignment, not during dumping.

@sfinktah In case you didn't know, JSON has no formal meaning for non-UTF-8 byte sequences; it's as if JSON were color-blind to a certain group of byte sequences. Do color-blind people get thrown an exception every time they look at something that is either red or green? No, that would be silly. Do you suffer a seizure when infrared light gets into your eyes? Think of nlohmann's json library the same way. Its main purpose is to make the developer's life easier, not harder. Stripping out invalid UTF-8 sequences is the way to go. I already proposed a std::nothrow version of the assignment, so if you love your exceptions that much you would have to do nothing, as the proposed solution would be opt-in. I would advise you to rethink your position.

@sfinktah commented Aug 27, 2018

@1Hyena Traffic lights are arranged in a specific color sequence precisely so that people who suffer red/green color blindness can determine what "color" a light is. In fact, I believe the positioning/order of the colors is actually a world-wide standard.

JSON also has no formal definition for capital letters, French, or spacing between words.

@nlohmann I've completed my proof of concept, showing how nowide can be of use. Please do not confuse this with a recommendation that you use nowide. It was just a trusted tool that I had to hand.

You may check out the function proof-of-concept at https://github.com/sfinktah/utftest/

It is coded a bit absurdly, but it proves a point. Namely, that an input of

std::string invalid("^F     └─ä┘´Ø<8d>");

can be encoded without any loss, to

\u0006\t\t\u2514\u2500\u00e4\u2518\u00b4\u00d8\u008d\"

Using (purely as an example) the boost::nowide library as a filter for valid UTF-8 sequences.

And, FTR, I'm a subscriber to the UTF-8 Everywhere manifesto. I despise UTF-16.

    // g++ -I. -std=c++1z testutf2.cpp -o testutf2 && ./testutf2
    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <cstdlib>
    #include <nowide/convert.hpp>
    #include <string>
    #include <vector>
    
    using namespace std::string_literals;
    using namespace nowide;
    
    std::string escape_json_char(int c) {
        using T                 = char;
        static auto escape_code = [](auto d) { return "\\"s + (T)d; };
    
        static auto unicode_escape = [](auto d) {
            static const std::array<char, 16> hexify = {
                {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'}
            };
            auto esc = escape_code('u');
    
            // this will spaz on numbers greater than 0xffff
            for (int i = 0; i < 4; ++i)
                esc += (T)hexify[(d & 0xf000) >> 12], d <<= 4;
    
            return esc;
        };
    
        switch (c) {
            case '\b': return escape_code('b');  // 0x08
            case '\t': return escape_code('t');  // 0x09
            case '\n': return escape_code('n');  // 0x0a
            case '\r': return escape_code('r');  // 0x0d
            case '\\': return escape_code('\\');
            case '"':  return escape_code('"');
        }
    
        if (c >= 32 && c <= 126)
            return std::basic_string<T>(1, (T)c);   // (count, char): one copy of the character
    
        return unicode_escape(c);
    }
    
    template <class T> auto escape_json(T const *p) { return escape_json(std::basic_string<T>(p)); }
    template <class T> auto escape_json(std::basic_string<T> s) {
        std::vector<std::string> escaped;
        std::transform(s.begin(), s.end(), std::back_inserter(escaped), escape_json_char);
        return escaped;
    }
    
    int main() {
        std::string invalid("\u0006\t\t\u2514\u2500\u00e4\u2518\u00b4\u00d8\u008d\"");
    
        auto _wide   = widen(invalid);
        auto _narrow = narrow(_wide);
    
        printf("%s\n", invalid.c_str()); fflush(stdout);
        printf("%s\n", _narrow.c_str()); fflush(stdout);
        printf("%s\n", invalid == _narrow ? "==" : "!="); fflush(stdout);
    
        printf("narrow was %lli characters, wide was %lli characters.\n", _narrow.length(), _wide.length());
    
        auto escaped = escape_json(_wide);
    
        for (const auto& e : escaped)
            printf("%s", e.c_str());
    
        printf("\n\ntest complete.\n"); fflush(stdout);
        exit(0);
    }
    
    /*
    �		└─ä┘´Ø�"
    �		└─ä┘´Ø�"
    ==
    narrow was 21 characters, wide was 11 characters.
    \u0006\t\t\u2514\u2500\u00e4\u2518\u00b4\u00d8\u008d\"
    
    test complete.
    */
    /* vim: set ts=4 sts=4 sw=4 et: */

@1Hyena (Author) commented Aug 27, 2018

@sfinktah The traffic lights comparison does not make any sense. I am asking you a simple question. Your data format is JSON and you are given a string that contains byte sequences that can't be expressed in JSON. What are you going to do? You have to do something; you just can't shut down the application. You also can't ignore the input, because you need as much information as possible. You can't reject the input either. How are you going to solve this problem in the REAL WORLD, which is often dirty and not perfect? You are appearing as someone living in a glass castle on a puffy cloud. Sure, it's easy to whine and moan about abstract ideals like a dogmatic zealot of some particular programming religion. But in the real world, problems have to be solved in the best possible way. Sacrifices and compromises are a normal thing in the real world. And I see nlohmann::json as a library for the real world, not for a mathematical utopia. So think again: is it really a bad idea to discard the byte sequences that could never be expressed in JSON anyway, especially when the programmer has deliberately indicated that it is okay by calling the nothrow version of string assignment?

As for UTF-16 and the rest, I agree with you. UTF-8 is the only sensible character encoding and anything else is a cancer. Call me Hitler, I don't care. I am proudly a UTF-8 supremacist.

@nlohmann (Owner)

For what it's worth: I think filtering invalid UTF-8 sequences is not a good idea.

@1Hyena (Author) commented Aug 27, 2018

@nlohmann Any better alternative?

If you're building an application that must overcome an exceptional situation and still give the best possible performance, then I would say this is the only ideal solution.

@nlohmann (Owner)

I think it is out of scope for the library to fix invalid UTF-8. I further do not think that silently fixing invalid inputs makes the world a better place; rather, such inputs should be rejected with as much noise and as early as possible.

@1Hyena (Author) commented Aug 27, 2018

@nlohmann
You want the devs to litter their codebase with try/catch blocks around each string assignment just because you think UTF-8 trimming is out of scope? It's enough of a headache that you rely on exceptions as much as you already do, not to mention that the namespace nlohmann:: is tediously long.

I understand your frustration that JSON isn't compatible with non-UTF-8 strings; this indeed causes a lot of mess, but we have to deal with this mess in the best possible way. The header is already a giant; adding one more function to it won't be much of a problem, but it would save a ton of lines of code for the devs using your lib.

Besides, with my proposed solution you could get rid of this exception once and for all. When assigning a non-UTF-8 string to a json object, the contained string would be shorter than the string that was originally assigned. That makes it possible for the dev to quickly check whether the original string contained any invalid UTF-8 sequences. You already have UTF-related code in the header anyway, so "out of scope" doesn't really apply here.

std::string invalid_utf8 = "\xF0\xA4\xAD\xC0";
nlohmann::json json = invalid_utf8; // This would work and would never throw.

// Oh but I want to know if the input was invalid utf8? What should I do?
if (json.get<std::string>().length() != invalid_utf8.length()) {
    // we have a problem
}

In this situation, if you rely on try/catch blocks, every string assignment has to be surrounded by them, littering the code and making it ugly with no benefit. If I have to write extra code EITHER WAY, then I might as well trim the strings myself, so no one would ever actually need your exception to begin with. I hope this finally explains my reasoning, because it seems as if some of us have developed a Stockholm syndrome regarding exceptions.

Don't just sweep the problem under the carpet, solve it. Assume that the devs using this lib are not going to surround each string assignment with try/catch blocks, no matter how much you would like them to. And you already agreed that throwing from dumping is an even worse fallacy than throwing from string assignments, so it pretty much leaves you with no other sane solution than to trim the UTF-8. Prove me wrong or admit your bias.

@nlohmann (Owner)

Listen, I do not need to prove anyone wrong or admit anything.

nlohmann added the "state: please discuss" label (please discuss the issue or vote for your favorite option) Aug 27, 2018
@gregmarr (Contributor)

@1Hyena or you could just do this, which also avoids the exceptions:

std::string invalid_utf8 = "\xF0\xA4\xAD\xC0";
nlohmann::json json = fix_my_invalid_string(invalid_utf8); // This would work and would never throw.

@jaredgrubb (Contributor)

I think it's perfectly reasonable for this library to assume the strings it gets are valid (i.e., perfect UTF-8).

Imagine the chaos if programmers claimed that strdup should fix UTF-8 for you, or that printf and fopen and std::regex_match and all other library functions needed to become encoding-aware and must also sanitize their string-inputs better. No, on the contrary, each library function has a precondition that the programmer must meet, and this library says that "strings are valid UTF-8" -- which is perfectly reasonable.

Sure, this library is about text parsing and needs to be aligned towards text-processing things, but that doesn't mean this library should be burdened with a grab-bag of Unicode transcoding functions. There are libraries whose sole purpose is to do exactly that, and they'll do it far better, far faster, and with far fewer bugs than if this library took those on as a nice-to-have side-project.

If the programmer is unsure whether a string may or may not be valid UTF-8 at any given point in the program, the onus is on them to make sure their code translates them to "valid" inputs for whatever functions they call (and gregmarr provides a great example for how to do that in his comment).

@1Hyena (Author) commented Aug 27, 2018

@gregmarr Yes, that would work, although it would be only slightly better/prettier than surrounding each string assignment with try/catch blocks.

@jaredgrubb Contrary to @nlohmann 's response, I respect yours for the reason that you actually bothered to explain. However, I still cannot completely agree with you. strdup and the other comparisons you brought up do not need to know about UTF-8, but JSON specifically says it can only store UTF-8 strings.

Do you know what overexposure is in photography? Well, it gets handled the same way as trimming UTF-8 would be.

If you want to insist that UTF-8-related code has no place in this lib, then go to its source code and search for the word "utf". IT IS FULL OF IT. So it is annoyingly hypocritical to see some of the participants here arguing that stripping out invalid UTF-8 bytes does not belong in this context while the context is already full of UTF-8 code.

If the programmer is unsure whether a string may or may not be valid UTF-8 at any given point in the program, the onus is on them to make sure their code translates them to "valid" inputs for whatever functions they call (and gregmarr provides a great example for how to do that in his comment).

All I'm trying to say is that the current implementation of this JSON lib is fruitful ground for shooting yourself in the foot, and I was hoping to find a way to avoid it. I like your reasoning though; @nlohmann should try to learn from it. I think this situation is similar to escaping in general, for example escaping to avoid SQL injection. Should the lib escape the potentially harmful string for the dev or not? It depends, but most of the time such escaping is not done automatically, and in that light I tend to agree with you. If I am absolutely 100% sure that my strings are always UTF-8, then of course such trimming would be pointless overhead.

Yeah, in the end, after giving this a second thought, I think the correct approach would be to throw from string assignments but never from dump(), though the named UTF-8 trimming function could still be useful for this lib. At least I would find it useful :P

@gregmarr (Contributor)

@jaredgrubb Contrary to @nlohmann 's response, I respect yours for the reason that you actually bothered to explain.
I like your reasoning though; @nlohmann should try to learn from it.

You do realize that this is his project, and he's free to do with it as he wishes, right? For you to come in here and insist that he must "prove" or "admit" or "learn" anything just because you say so is extremely rude.

@gregmarr (Contributor)

@gregmarr Yes, that would work, although it would be only slightly better/prettier than surrounding each string assignment with try/catch blocks.

It's part of sanitizing user input. The user input you're getting may be improper in any number of ways. It's up to you to make sure that it is properly formed before passing it along to any other tool.

There could be any number of ways for the library to respond to invalid strings, and not one is proper for every use case. Your application is the best one to know whether an invalid string should result in an error being returned to the user, or a warning being returned to the user with a best effort result, or no error being returned to the user with the invalid portions just removed in an arbitrary fashion.

@gregmarr (Contributor)

If you want to insist that UTF-8-related code has no place in this lib, then go to its source code and search for the word "utf". IT IS FULL OF IT. So it is annoyingly hypocritical to see some of the participants here arguing that stripping out invalid UTF-8 bytes does not belong in this context while the context is already full of UTF-8 code.

That's not what was said. What was said was that code which takes things that are supposed to be valid UTF-8 but are not, and somehow fixes them to be valid UTF-8, such as by silently dropping bytes, does not belong here.

@1Hyena

This comment has been minimized.

@nlohmann (Owner)

@1Hyena Could you please point out where I was "rude to begin with"? I would also like to remind you of the Code of Conduct of this project.

@nlohmann (Owner)

Let me try to summarize the options we have in this situation. Not all of them were mentioned, but I am trying to get a picture of all possible ways to proceed:

  • Assignment of invalid Unicode:

    • Option 1: The library should "fix" invalid Unicode input (e.g. by removing invalid parts) and not throw.
    • Option 2: The library should throw an exception.
    • Option 3: The library should just assign the string (this is the status quo).
  • Serialization of previously stored invalid Unicode:

    • Option 1: This should be made impossible, e.g. by options 1 or 2 above.
    • Option 2: The library should dump the invalid Unicode as is.
    • Option 3: The library should "fix" the invalid Unicode (e.g. by not dumping invalid parts) and proceed.
    • Option 4: The library should throw an exception (this is the status quo).

Did I miss anything?

@gregmarr (Contributor)

you do realize that string assignments in its current form are dangerous?

String assignments with unknown, unvalidated contents, yes.

And speaking of @nlohmann he's the one who was rude to begin with,

Uh, no:

You are appearing as someone living in a glass castle on a puffy cloud. Sure, it's easy to whine and moan about abstract ideals like a dogmatic zealot of some particular programming religion.
not to mention that the namespace nlohmann:: is tediously long.
Don't just sweep the problem under the carpet, solve it.
it pretty much leaves you with no other sane solutions than to trim the utf8
Prove me wrong or admit your bias.

A fork should be the last resort, which will unfortunately happen more often if some of the parties is arrogant enough to not discuss things.

There's been plenty of discussion here.

And of course speaking of egomania, who names their lib after themselves? Not only it is long and tedious to write it, it's also pathetic.

It's a fairly common practice for personal projects to use the GitHub username to avoid namespace conflicts.
If you don't want to write it, then C++ has this wonderful feature called `using` which will get rid of it for you.
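For example, a short sketch of that:

#include <nlohmann/json.hpp>

using json = nlohmann::json;   // the alias the library's own examples tend to use
// or: namespace nl = nlohmann;

json j = "no long namespace needed anymore";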

@gregmarr (Contributor)

Option 2: The library should dump the invalid Unicode as is.

For completeness' sake, pointing out that this is the previous behavior, which people didn't like because it meant that the library could create a string that it couldn't parse.

@jaredgrubb (Contributor)

If I am absolutely 100% sure that my strings are always utf8 then of course such trimming would be pointless overhead.

This is a core idea of input sanitization: you should know 100% whether your strings are valid or not. The areas of your code where you allow uncertainty should be at the "edge" and isolated to as small a location as possible, so that the rest of your code can be simple and assume validity. If you need to propagate "maybe-valid" strings deeper into a program, then create a class whose sole purpose is to hold them (so that you get type safety and cannot accidentally send possibly-invalid strings into APIs that require valid ones; you can do the same thing with sensitive strings like passwords). That class can provide unbox methods to turn the "maybe-valid" into valid strings using whatever error recovery/substitution is appropriate for the use case.

There is no one obvious way to sanitize strings and encodings (as nlohmann is showing by enumerating all the options). It's up to the client to do that validation before they attempt to use this library.
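A rough sketch of what such a wrapper might look like (the class name, and the choice of the trim_utf8 function from earlier in this thread as the recovery policy, are just illustrative assumptions):

#include <string>
#include <utility>

void trim_utf8(std::string& s);   // e.g. the trimming function posted earlier in this thread

// Holds bytes that may or may not be valid UTF-8; type safety keeps them at the "edge".
class MaybeUtf8 {
public:
    explicit MaybeUtf8(std::string raw) : raw_(std::move(raw)) {}

    // Unbox into a guaranteed-valid string using whatever recovery policy fits the use case.
    std::string validated() const {
        std::string copy = raw_;
        trim_utf8(copy);          // here: drop ill-formed bytes
        return copy;
    }

    const std::string& raw() const { return raw_; }   // untouched bytes, e.g. for logging

private:
    std::string raw_;
};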

Also, please be more civil in the discourse. It's OK to disagree, argue, and storm out and fork your own vision. But @nlohmann has been a great project maintainer (with thicker skin than I would have in his shoes!). Personal attacks are not helpful and have no place on GitHub. If this entreaty doesn't work, just remember that these comments are forever and a future employer may stumble on them one day -- put your best image forward!

@sfinktah commented Aug 28, 2018

Oh dear.

cough... The Unicode Standard (http://www.unicode.org/versions/Unicode11.0.0/) expressly forbids the fixing or repair of broken Unicode; there are significant security issues with doing so.

One may take the view that removing invalid portions of an input string is "repairing." There are certainly still security issues involved. E.g., you filter out files with an extension of "exe"; someone packages a file called "trojan.ex\xd8\xd8e"; it passes the filter, but by the time it's recorded in the database, the file has been silently altered to "trojan.exe".

@nlohmann Yes, I believe you did miss something. Or mislabeled it...

My provided POC encoded binary strings that were UTF-8 invalid. It did not fix the UTF-8. It merely encoded the non-UTF-8 values as individual bytes.

The resultant decode would prove identical to the broken input, therefore nothing has been repaired.

It has simply been encoded correctly (if you accept "the way python -mjson.tool would encode it" as the definition of correct.)

The fact that the encoded result will decode without modification to the existing decoder is both a bonus, and supporting evidence that this is indeed the correct way to encode non-conformant strings.

@1Hyena if my JSON library couldn't support encoding the strings I was using, I would use another library or encoding method.

P.S. Did nobody notice that I wrote a spec-compliant, fully functional JSON string encoder in less than 38 lines of code? N.B.: the actual function may not be fully functional or spec-compliant, and should never be used in any code, ever.

If design simplicity has any correlation with correctness (and I'm fairly sure it doesn't) ...

@abolz (Contributor) commented Oct 16, 2018

It is already possible to get an invalid UTF-8 sequence in strings while decoding a JSON file ("\\u0080" is a valid JSON string), so the most consistent option would probably be to replace invalid UTF-8 sequences with \u-escaped sequences instead of throwing an exception while serializing strings. This would result in valid (UTF-8 encoded) JSON files, and invalid UTF-8 (or other binary data) would round-trip. It would then be up to the user to check the strings stored in a JSON value for correct encodings.

Ok, forget this... it's actually nonsense:

E.g., serializing \x80 would result in \\u0080. While deserializing \\u0080, the resulting character U+0080 will be encoded in UTF-8 and result in \xC2\x80.

nlohmann added a commit that referenced this issue Oct 16, 2018
Proof of concept; currently only as parameter to the internal dump_escaped function; that is, not yet exposed to the dump function.
@nlohmann (Owner)

I opened a work-in-progress branch (https://github.com/nlohmann/json/compare/feature/codec_errors) for the idea sketched in #1198 (comment). Feedback is more than welcome! The branch currently only handles serialization.

@jaredgrubb (Contributor)

I feel the term "ignore" is unclear about whether the invalid sequence is preserved or dropped. Perhaps "strip" or "remove"? OTOH, I think you're copying Python's terms and that is a valuable thing to do -- so maybe just update the comments to say that "ignore" means the invalid bytes are skipped.

@1Hyena (Author) commented Oct 17, 2018

@jaredgrubb Not sure if you meant my opening comment, but I added a small edit to clarify what I meant by ignoring. Later I proposed a trim function which always produces a valid UTF-8 string. Should probably have named it shave though :P (can't have strings if you're not shaved, right?)

@nlohmann (Owner)

@jaredgrubb I took the names from Python. You're right that the public API needs better documentation, but I would like to focus on the implementation first.

@jaredgrubb (Contributor)

Oh, sorry for the confusion from my comment; it was a reply to @nlohmann's code sketch.

@niklas88

@nlohmann Oh wow, thanks for putting in the work; this looks exactly like what I imagined. It turns out that in our case we will probably just not store the offending data in JSON in the future, because we actually do need the broken sequences. Still, I think this would be a great addition to your library and would make it much more robust.

nlohmann added a commit that referenced this issue Oct 21, 2018
@nlohmann (Owner)

I reworked the code and added some test cases. The error handler is now a new optional parameter to the dump function. I am not sure whether I implemented Python's behavior exactly, so more test cases may be required.
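For anyone following along, usage looks roughly like this (a sketch assuming the signature on the work-in-progress branch, which is slated for the 3.4.0 release):

#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

json j = "\xF0\xA4\xAD\xC0";   // ill-formed UTF-8

// strict (the default): throws type_error.316, same as before
// std::string s = j.dump();

// replace: substitutes U+FFFD (the replacement character) for ill-formed bytes
std::string replaced = j.dump(-1, ' ', false, json::error_handler_t::replace);

// ignore: silently skips the ill-formed bytes
std::string ignored = j.dump(-1, ' ', false, json::error_handler_t::ignore);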

@1Hyena (Author) commented Oct 21, 2018

One good thing about handling this situation in the dump function (rather than during assignment) is that it respects the principle of lazy evaluation. One could argue that not all json instances end up getting dumped, so checking those for invalid UTF-8 would make no difference and just waste CPU.

nlohmann added a commit that referenced this issue Oct 21, 2018
Test every prefix of Unicode sequences against the different dump functions.
nlohmann added a commit that referenced this issue Oct 22, 2018
@nlohmann (Owner)

I think I have now finished the implementation of the error handlers for the dump function; see #1314. It would be great if someone could have a look.

nlohmann added the "solution: proposed fix" label (a fix for the issue has been proposed and waits for confirmation) and removed the "state: please discuss" label Oct 23, 2018
nlohmann added a commit that referenced this issue Oct 23, 2018
nlohmann added this to the Release 3.4.0 milestone Oct 28, 2018
nlohmann self-assigned this Oct 28, 2018
@fawdlstty

I just think it should support some other natural-language encodings, like GB18030 (GB2312/GBK), BIG5, etc. If the text is only English letters, those encodings look like UTF-8, but otherwise parsing will throw too. This has caused a lot of trouble for developers working in other natural languages.

@fawdlstty

And jsoncpp can parse such input successfully, so...

@fawdlstty

@nlohmann I just want it to support parsing other natural-language encodings.

@nlohmann (Owner)

@fawdlstty This library only supports UTF-8, sorry.
