Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues parsing a previously encoded binary (non-UTF8) string. #1211

Closed
sfinktah opened this issue Aug 23, 2018 · 6 comments
Closed

Issues parsing a previously encoded binary (non-UTF8) string. #1211

sfinktah opened this issue Aug 23, 2018 · 6 comments
Labels
solution: proposed fix a fix for the issue has been proposed and waits for confirmation state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated

Comments

@sfinktah
Copy link

This is a bit of an edge case, but it seems to be handled in other platforms (to varying degrees).

The simplest explanation I can give you is just to show you this code:

#include <string>
#include "json.hpp"
using json = nlohmann::json;

int main(int argc, const char **argv) {
    std::string encoded("\u0006\u00e4\u00b4\u00d8\u008d\"");
    auto j3 = json::parse(encoded);
}

Which produced this exception:

JSON Exception: [json.exception.parse_error.101] parse error at 1: syntax error - invalid literal; last read: '<U+0006>'test complete.

Note: the same error occurs when nlohmann has done the previous encoding of the string (as shown below).

Whilst I fully understand the reason for the error, I don't believe that it is valid to assume all strings will be valid UTF-8, merely because the JSON spec includes \u0000 unicode escape sequences. I might be wrong, I haven't read the RFC lately.

Ignore everything below this line, as it's just more details. :)

Expanded tests

int main(int argc, const char **argv) {
    // binary string, not valid utf-8
    std::string ugly("\x06\xe4\xb4\xd8\x8d\x22");

    // will encode okay
    json j = {{ "text", ugly.c_str() }};
    printf("encoding complete.\n");                fflush(stdout);
    
    // 2.1.1 can dump okay
    // dumping: {"text":"\u0006▒؍\""}
    
    // 3.2.0 will trigger exception when dumping
    printf("dumping: %s\n", j.dump().c_str());     fflush(stdout);

    // both will trigger exception when parsing
    auto j2 = json::parse(j.dump());

    printf("test complete.\n");                    fflush(stdout);

Some tests with other platforms showed that php will not encode a raw non-UTF8 binary string, JavaScript (v8) had no issues, and python actually produced the "every character escaped" version I used in the initial example.

    // JavaScript tests (performed in Chrome console window)
    // helper functions from: https://nt4.com/js/hexdump.js
    
    /*
     * > j = JSON.stringify(sprintf("\x06\xe4\xb4\xd8\x8d\x22"))
     * < ""\u0006ä´Ø�\"""
     * 
     * > hexdump(j)
     * < 0000000: 22 5c 75 30 30 30 36 e4 b4 d8 8d 5c 22 22          "\u0006....\""
     * 
     * > JSON.parse(j)
     * < "�ä´Ø�""
     * 
     * > hexdump(JSON.parse(j))
     * < 0000000: 06 e4 b4 d8 8d 22                                  ....."
     */

PHP test:

    // PHP test (performed from bash shell, with the aid of a private helper)
    // (encode fails, but decode succeeds)

    /*
     * root@ec js $ echo -ne "\x06\xe4\xb4\xd8\x8d\x22" | phpparse - json_encode
     * false
     * 
     * root@ec js $ echo -n '"\u0006ä´Ø�\""' | phpparse - json_decode
     * ä´Ø�"
     */
}
@nlohmann
Copy link
Owner

The JSON text in your initial program needs to be properly escaped. Here is a version with a raw string literal that works:

#include <string>
#include "json.hpp"

using json = nlohmann::json;

int main(int argc, const char **argv) {
    std::string encoded(R"("\u0006\u00e4\u00b4\u00d8\u008d")");
    auto j3 = json::parse(encoded);
}

@nlohmann nlohmann added the solution: proposed fix a fix for the issue has been proposed and waits for confirmation label Aug 25, 2018
@sfinktah
Copy link
Author

sfinktah commented Aug 26, 2018

I did not know that. Unfortunately, there still remains an issue with your library - though fortunately, the solution is clearer now.

#include <string>
#include "json.hpp"

using json = nlohmann::json;

int main(int argc, const char **argv) {
    // binary string, not valid utf-8
    std::string ugly("\x06\xe4\xb4\xd8\x8d\x22");

    // will encode okay
    json j = {{ "text", ugly.c_str() }};
    printf("encoding complete.\n");                fflush(stdout);
    
    // 2.1.1 can dump okay
    // dumping: {"text":"\u0006▒؍\""}
    
    // 3.2.0 will trigger exception when dumping
    printf("dumping: %s\n", j.dump().c_str());     fflush(stdout);

    // both will trigger exception when parsing
    auto j2 = json::parse(j.dump());

    printf("test complete.\n");                    fflush(stdout);

More details in original post about how other languages handle this.

I have a proposed solution, not sure if I will be able to whip up a PR for it - I haven't checked what you use for unicode translation. But essentially, there seem two open paths.

JavaScript (well, V8 anyway) passes through the invalid UTF-8 binary sequences as-is, or python escapes at a byte level when the UTF-8 encoding is not valid.

Your library will decode the python version, but not the JavaScript version, which can be tested thusly:

int main(int argc, const char **argv) {
    std::string encoded1(  "\\u0006\u00e4\u00b4\u00d8\u008d\"");
    auto j3 = json::parse(encoded);

This time the \u and \\u are used correctly (hopefully), and are intended to convey:

> j = JSON.stringify(sprintf("\x06\xe4\xb4\xd8\x8d\x22"))
< ""\u0006ä´Ø�\"""

@nlohmann
Copy link
Owner

Related: #1198

@sfinktah
Copy link
Author

Quite. As someone who has always treated JSON as alternative to uuencode or base64_encode or similar utilities that can serialize and deserialize any binary string, I do feel strongly that any encoded input should be identical after encoding and decoding, even if only within the same library.

I'll take a look at the code now.

@sfinktah
Copy link
Author

sfinktah commented Aug 28, 2018

@nlohmann This is quite interesting. A small alternation to my original test code, to properly encapsulate the string inside a key/value object in JSON format, and the library seems to be able to decode, re-encode, and dump, without apparent issue.

Is this all just a storm in a tea-cup? Do you already have the functionality in place to properly handle binary data, and it simply doesn't come into play when using the library for simple string encoding?

encoded string: {"test":"\u0006\u00e4\u00b4\u00d8\u008d\""}
decoding complete.
dumping: {"test":"\u0006ä´Ø�\""}
encoding complete.
dumping: {"test":"\u0006ä´Ø�\""}
re-encoded complete.
dumping: {"test":"\u0006ä´Ø�\""}
#include <string>
#include "json320.hpp"

#define JSON_TRY(...)                                                       \
    try {                                                                   \
        __VA_ARGS__;                                                        \
    }                                                                       \
    catch (json::exception e)                                               \
    {                                                                       \
        fprintf(stderr, "%s: JSON Exception: %s", __FUNCTION__, e.what());  \
        return 1;                                                           \
    }

using json = nlohmann::json;

int main(int argc, const char **argv) {
    json j;
    std::string encoded1(R"({"test":"\u0006\u00e4\u00b4\u00d8\u008d\""})");
    printf("encoded string: %s\n", encoded1.c_str());  fflush(stdout);
    JSON_TRY(
        j = json::parse(encoded1);
        printf("decoding complete.\n");                fflush(stdout);
    )

    JSON_TRY(
        printf("dumping: %s\n", j.dump().c_str());     fflush(stdout);
    )

    std::string encoded;

    JSON_TRY(
        encoded = j.dump();
        printf("encoding complete.\n");                fflush(stdout);
    )
    
    JSON_TRY(
        printf("dumping: %s\n", j.dump().c_str());     fflush(stdout);
    )

    json j2;
    JSON_TRY(
        j2 = json::parse(encoded);
        printf("re-encoded complete.\n");              fflush(stdout);
    )

    JSON_TRY(
        printf("dumping: %s\n", j2.dump().c_str());    fflush(stdout);
    )
}

@stale
Copy link

stale bot commented Sep 27, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Sep 27, 2018
@stale stale bot closed this as completed Oct 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
solution: proposed fix a fix for the issue has been proposed and waits for confirmation state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated
Projects
None yet
Development

No branches or pull requests

2 participants