Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect parsing of indefinite length CBOR strings. #961

Closed
johnfb opened this issue Feb 6, 2018 · 7 comments
Closed

Incorrect parsing of indefinite length CBOR strings. #961

johnfb opened this issue Feb 6, 2018 · 7 comments
Assignees
Labels
aspect: binary formats BSON, CBOR, MessagePack, UBJSON confirmed kind: bug solution: proposed fix a fix for the issue has been proposed and waits for confirmation
Milestone

Comments

@johnfb
Copy link

johnfb commented Feb 6, 2018

I found this library and thought it was very interesting and like a lot about it and decided to brows the source code a little bit. I am thinking I will try using this in my next project. In my browsing I found the following problem:

case 0x7F: // UTF-8 string (indefinite length)

According to section 2.2.2 of the CBOR RFC 7049 an indefinite length string is a list of definite length strings terminated by the terminal byte 0xFF. As such the following code snippet should print the same string twice.

    using json = nlohmann::json;
    std::vector<uint8_t> v_cbor = {
        0x7F,
            0x64,
                'a', 'b', 'c', 'd',
            0x63,
                '1', '2', '3',
            0xFF
    };
    json j = json::from_cbor(v_cbor);
    std::cout << "\"abcd123\"\n";
    std::cout <<  j << std::endl;

But as of version 3.1.0 it prints:

"abcd123"
"dabcdc123"

Changing that case to something like:

            case 0x7F: // UTF-8 string (indefinite length)
            {
                std::vector<string_t> result_list;
                std::size_t size = 0;
                while (get() != 0xFF)
                {
                    unexpect_eof();
                    result_list.push_back(get_cbor_string());
                    size += result_list.back().size();
                }
                string_t result;
                result.reserve(size);
                for (auto r: result_list) { result.append(r); }
                return result;
            }

Fixes it. I'm not sure if this is the way you want to do it though.

@nlohmann nlohmann added kind: bug confirmed aspect: binary formats BSON, CBOR, MessagePack, UBJSON labels Feb 6, 2018
@nlohmann
Copy link
Owner

nlohmann commented Feb 6, 2018

I can confirm your observation, and I also checked with the CBOR playground:

cbor

I honestly must have missed that part and implemented indefinite length strings just like maps and arrays. Thanks for checking the code so thoroughly!

I think your fix is valid - I just have the feeling there is an STL algorithm (accumulate?) to merge the strings. Alternatively, one could just add the parsed strings directly to the result string.

@johnfb
Copy link
Author

johnfb commented Feb 6, 2018

Simply appending to the string directly certainly is simpler, and only a one line change:

            case 0x7F: // UTF-8 string (indefinite length)
            {
                string_t result;
                while (get() != 0xFF)
                {
                    unexpect_eof();
                    result.append(get_cbor_string());
                }
                return result;
            }

Indefinite length strings are probably such a rare case that doing things like reserving capacity to save allocation as the first one does... premature optimization and therefore evil. 😜

@gregmarr
Copy link
Contributor

gregmarr commented Feb 6, 2018

With the secondary array, you're going to end up with guaranteed two string copies, one allocation, and one deallocation per string in the indefinite string (unless those strings fit in the small string optimization). If you do it as.... as @johnfb just wrote as I was writing this... you will have the same count as upper bound, and better performance in general.

@nlohmann
Copy link
Owner

nlohmann commented Feb 6, 2018

Thanks! And after all, you would also need allocations to copy the strings into the vector.

@nlohmann
Copy link
Owner

nlohmann commented Feb 6, 2018

Kudos for @gregmarr to always be so fast ;)

@nlohmann
Copy link
Owner

nlohmann commented Feb 6, 2018

Alright - I'll add a fix later today.

@nlohmann nlohmann self-assigned this Feb 6, 2018
@nlohmann nlohmann added this to the Release 3.1.1 milestone Feb 6, 2018
nlohmann added a commit that referenced this issue Feb 6, 2018
Beside the fix discussed in #961, we also had to re-adjust a test case. It seems that it was failing before, and I "fixed" it to work with the broken implementation...
@nlohmann nlohmann added the solution: proposed fix a fix for the issue has been proposed and waits for confirmation label Feb 6, 2018
@nlohmann
Copy link
Owner

nlohmann commented Feb 6, 2018

Thanks everybody!

@nlohmann nlohmann closed this as completed Feb 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aspect: binary formats BSON, CBOR, MessagePack, UBJSON confirmed kind: bug solution: proposed fix a fix for the issue has been proposed and waits for confirmation
Projects
None yet
Development

No branches or pull requests

3 participants