Use of the binary type in CBOR and Message Pack #601

Closed
Type1J opened this issue Jun 1, 2017 · 13 comments
Labels
aspect: binary formats BSON, CBOR, MessagePack, UBJSON

Comments

@Type1J
Contributor

Type1J commented Jun 1, 2017

For types such as std::vector<uint8_t>, the CBOR and Message Pack array type is currently used and each element is written as a separate numeric value. For byte-sized element types this inflates the output, because an element costs 2 bytes whenever it exceeds 23 in CBOR or 127 in Message Pack.
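To make the CBOR overhead concrete, here is a minimal sketch that hand-encodes the same bytes once as a CBOR array of unsigned integers (major type 4) and once as a CBOR byte string (major type 2). It uses only the standard library; the function names and the sample values are made up for illustration and are not part of this library's API.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Write a CBOR head: 3-bit major type plus an unsigned argument n.
// Arguments below 24 fit into the initial byte; larger ones follow it
// as 1, 2, or 4 big-endian bytes (8-byte arguments are left out here).
void write_cbor_head(std::vector<std::uint8_t>& out, unsigned major_type, std::uint32_t n)
{
    const unsigned mt = major_type << 5;
    if (n < 24) {
        out.push_back(static_cast<std::uint8_t>(mt | n));
    } else if (n <= 0xff) {
        out.push_back(static_cast<std::uint8_t>(mt | 24));
        out.push_back(static_cast<std::uint8_t>(n));
    } else if (n <= 0xffff) {
        out.push_back(static_cast<std::uint8_t>(mt | 25));
        out.push_back(static_cast<std::uint8_t>(n >> 8));
        out.push_back(static_cast<std::uint8_t>(n & 0xff));
    } else {
        out.push_back(static_cast<std::uint8_t>(mt | 26));
        for (int shift = 24; shift >= 0; shift -= 8) {
            out.push_back(static_cast<std::uint8_t>((n >> shift) & 0xff));
        }
    }
}

// Array form: every element is its own unsigned integer (major type 0),
// so any element above 23 takes 2 bytes.
std::vector<std::uint8_t> encode_cbor_array(const std::vector<std::uint8_t>& values)
{
    if (values.size() > 0xffffffffu) throw std::length_error("sketch is limited to 32-bit lengths");
    std::vector<std::uint8_t> out;
    write_cbor_head(out, 4, static_cast<std::uint32_t>(values.size()));
    for (std::uint8_t v : values) write_cbor_head(out, 0, v);
    return out;
}

// Byte string form: one head, then the raw bytes, 1 byte per element.
std::vector<std::uint8_t> encode_cbor_byte_string(const std::vector<std::uint8_t>& values)
{
    if (values.size() > 0xffffffffu) throw std::length_error("sketch is limited to 32-bit lengths");
    std::vector<std::uint8_t> out;
    write_cbor_head(out, 2, static_cast<std::uint32_t>(values.size()));
    out.insert(out.end(), values.begin(), values.end());
    return out;
}
```

For the sample values {3, 68, 234}, the array form is 83 03 18 44 18 ea (6 bytes) and the byte string form is 43 03 44 ea (4 bytes).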

I'd like to propose that the to_* functions for binary formats take an additional bool argument that causes array types that are known to be numeric and byte-sized to serialize using the binary string type of the respective format. The from_* functions should accept either the current style array of numeric types or the binary.

This proposal might suggest that the nlohmann::json C++ type be augmented with a bytearray discriminator in addition to the normal array discriminator. However there might be an easier way to know that the array is an array of numeric 8-bit values. To be clear, the JSON form would still be an array, so the discriminator would only be set to bytearray if the values given to the array were numeric and inside of the range [0, 255].

Thoughts?

@nlohmann nlohmann added the aspect: binary formats BSON, CBOR, MessagePack, UBJSON label Jun 1, 2017
@nlohmann
Owner

nlohmann commented Jun 1, 2017

Could be related to #373.

@nlohmann
Owner

nlohmann commented Jun 1, 2017

Could you please provide a concrete example?

@Type1J
Contributor Author

Type1J commented Jun 1, 2017

I'm not sure if this is what you meant, but:

The Message Pack and CBOR formats both only have 1 typed array form: a binary string. Using it eliminates the per element size overhead that arrays carry for byte-sized data.

I have an app that has a 131072 byte array (128KiB). There's about 2KiB worth of other data in the object containing that array. It is sent over a binary websocket in Message Pack. If I serialize it using the reference Message Pack library and use the binary string type for that 1 value, and use the same values that this library uses for all other types, then I get a little over 130KiB. Serializing it with this library using Message Pack, I get a little under 260KiB, which is still much smaller and quicker to parse than the JSON version which has commas and 1 to 3 bytes per value (somewhere around 450KiB), but it could be 130KiB by using a small bit of hint information about the type. FYI, I have to send this data in Message Pack often, but periodically I send it to a JSON REST web service as well, so I need to be able to quickly convert between these formats, and this library thankfully offers me the capability to do that with minimal additional code, but I'd like to get my size down (AWS I/O costs per byte, and this data is sent many times per hour).
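These figures can be roughly reproduced from the Message Pack format rules alone. The following is a back-of-the-envelope sketch, not a measurement: the helper names are hypothetical, lengths above 32 bits are ignored, and the roughly 2KiB of other data is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Header size of a Message Pack array: fixarray (1 byte), array 16 (3), array 32 (5).
std::size_t msgpack_array_header_size(std::size_t n)
{
    return n <= 15 ? 1 : (n <= 0xffff ? 3 : 5);
}

// Header size of a Message Pack bin value: bin 8 (2 bytes), bin 16 (3), bin 32 (5).
std::size_t msgpack_bin_header_size(std::size_t n)
{
    return n <= 0xff ? 2 : (n <= 0xffff ? 3 : 5);
}

// Encoded size when every byte is written as its own integer element:
// 0..127 is a 1-byte positive fixint, 128..255 is a 2-byte uint 8.
std::size_t msgpack_array_size(const std::vector<std::uint8_t>& values)
{
    std::size_t size = msgpack_array_header_size(values.size());
    for (std::uint8_t v : values) size += (v <= 0x7f) ? 1 : 2;
    return size;
}

// Encoded size when the same bytes are written as a single bin value.
std::size_t msgpack_bin_size(const std::vector<std::uint8_t>& values)
{
    return msgpack_bin_header_size(values.size()) + values.size();
}
```

For 131072 bytes that are all 128 or larger, the array form comes to 262149 bytes (about 256KiB) and the bin form to 131077 bytes (about 128KiB), which is in line with the "a little under 260KiB" and "a little over 130KiB" figures once the other data is added.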

I'm thinking that the deserialized value must be a normal array, which is then manually hinted (after deserialization) as a byte array for when you want to serialize it. The part I'm not sure about is where that hint should live. It could be kept in the nlohmann::json object, either as an additional discriminator type or as a std::set<std::string> holding the keys of the values that may use the byte string type (checked on serialization). Alternatively, a std::set<std::string> of JSON pointers or plain keys could be passed to the to_msgpack() or to_cbor() functions as an additional, optional argument.

Basically, my idea about how the binary string formats should be used is that they are only hints, and if you don't think about them, then everything should still work as you expect. If you do think about them, then giving these hints will result in much smaller sizes for the binary formats, and my AWS bill will be smaller.

@nlohmann
Owner

nlohmann commented Jun 1, 2017

So what you mean is support for MessagePack's bin format with start bytes 0xc4, 0xc5, and 0xc6?

@Type1J
Contributor Author

Type1J commented Jun 1, 2017

Yes

@nlohmann
Owner

nlohmann commented Jun 1, 2017

Hm. This is tricky, because there is no JSON type for which a serialization to bin is natural. I wonder what an interface would look like that tells the serialization to use bin for one specific array. Also, the binary representation of the elements is not clear to me.

So what does your array look like as a JSON value? What kind of data is this, and what would a binary encoding look like?

@Type1J
Contributor Author

Type1J commented Jun 1, 2017

The data is 8-bit, normalized sensor data. It would look like the following in JSON: {"time": 353971, "temp": 103.2, "dist": [3,68,234,140,74,110,37,190]}. The "dist" key would be hinted. For that array, all values will always fit in a uint8_t.

@Type1J
Contributor Author

Type1J commented Jun 1, 2017

The binary version (of just the dist array value) would be c4 08 03 44 ea 8c 4a 6e 25 be.

@nlohmann
Owner

nlohmann commented Jun 1, 2017

OK, now I understand. So the current implementation yields 98 03 44 cc ea cc 8c 4a 6e 25 cc be. The issue is that you have values in the range 0..255 and MessagePack only encodes the non-negative values 0..127 in 1 byte. That's why 234, 140, and 190 need two bytes each.
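The two encodings of the dist array can be reproduced with a small hand-written encoder. This is only a sketch under simplifying assumptions (at most 15 elements for the array form, at most 255 bytes for the bin form); the function names are hypothetical and this is not how the library's to_msgpack() is implemented.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Array form: a fixarray header (0x90 | length), then one integer per element.
// Values 0..127 are a 1-byte positive fixint; 128..255 need the uint 8
// marker 0xcc followed by the value.
std::vector<std::uint8_t> encode_as_fixarray(const std::vector<std::uint8_t>& values)
{
    if (values.size() > 15) throw std::length_error("fixarray holds at most 15 elements");
    std::vector<std::uint8_t> out;
    out.push_back(static_cast<std::uint8_t>(0x90 | values.size()));
    for (std::uint8_t v : values) {
        if (v > 0x7f) out.push_back(0xcc);
        out.push_back(v);
    }
    return out;
}

// bin 8 form: marker 0xc4, a 1-byte length, then the raw bytes.
std::vector<std::uint8_t> encode_as_bin8(const std::vector<std::uint8_t>& values)
{
    if (values.size() > 0xff) throw std::length_error("bin 8 holds at most 255 bytes");
    std::vector<std::uint8_t> out;
    out.push_back(0xc4);
    out.push_back(static_cast<std::uint8_t>(values.size()));
    out.insert(out.end(), values.begin(), values.end());
    return out;
}
```

For {3, 68, 234, 140, 74, 110, 37, 190} the array form is 12 bytes and the bin 8 form is 10 bytes. The fixed 2-byte bin 8 header is paid once, whereas the array form pays one extra byte for every element above 127.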

@Type1J
Contributor Author

Type1J commented Jun 1, 2017

Yes

@Type1J
Contributor Author

Type1J commented Jun 1, 2017

So, there is a JSON type for which serialization to bin is natural (array), but it must be constrained to numeric values in the range 0..255.

I don't think that it's practical to scan an array to see if it meets those requirements, but I do think that a set of JSON pointers or just a set of key strings could be given on serialization to make the serializer attempt to use bin, and if something prevents that assumption from holding, then throw.

On deserialization, bring the bin type in as an array of numerics, and make no assumptions about it (if it's reserialized, it doesn't use bin) unless the hint is given again for those fields.

A JSON pointer could hint that the root element was a bin candidate, so I'm leaning toward that, but I haven't used JSON pointers yet, so I don't know if checking a set of them has any unwanted overhead.
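A rough sketch of the key-set variant of this idea, using only the standard library: the serializer receives a set of hinted keys, writes hinted arrays as bin 8, and throws if a hinted array holds a value outside 0..255. Everything here (encode_field, decode_bin8_as_array, the size limits, restricting unhinted arrays to 0..255 as well) is a hypothetical simplification, not an existing or planned API of this library.

```cpp
#include <cstdint>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Encode one array-valued field. If its key is in bin_hints, try bin 8
// and throw when the assumption (all values in 0..255) does not hold.
Bytes encode_field(const std::string& key,
                   const std::vector<int>& values,
                   const std::set<std::string>& bin_hints)
{
    Bytes out;
    if (bin_hints.count(key) != 0) {
        if (values.size() > 0xff) throw std::length_error("sketch only supports bin 8");
        out.push_back(0xc4);
        out.push_back(static_cast<std::uint8_t>(values.size()));
        for (int v : values) {
            if (v < 0 || v > 255) {
                throw std::out_of_range("hinted array '" + key + "' holds a value outside 0..255");
            }
            out.push_back(static_cast<std::uint8_t>(v));
        }
        return out;
    }
    // No hint: plain array of integers (short arrays and 0..255 only in this sketch).
    if (values.size() > 15) throw std::length_error("sketch only supports fixarray");
    out.push_back(static_cast<std::uint8_t>(0x90 | values.size()));
    for (int v : values) {
        if (v < 0 || v > 255) throw std::out_of_range("sketch only supports values 0..255");
        if (v > 0x7f) out.push_back(0xcc);  // uint 8 marker
        out.push_back(static_cast<std::uint8_t>(v));
    }
    return out;
}

// Decode a bin 8 value into a plain array of numbers. The hint is not
// retained, so reserializing without a hint yields a normal array again.
std::vector<int> decode_bin8_as_array(const Bytes& in)
{
    if (in.size() < 2 || in[0] != 0xc4 || in.size() != 2u + in[1]) {
        throw std::invalid_argument("not a bin 8 value");
    }
    return std::vector<int>(in.begin() + 2, in.end());
}
```

With the hint set {"dist"}, the dist array from above encodes to the 10-byte bin 8 form; without the hint, the same call produces the 12-byte array form, so code that never passes hints behaves as before.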

@Type1J
Contributor Author

Type1J commented Jun 5, 2017

After trying it out, I'm thinking that root hints may not be totally necessary for binary formats, but having the hints stored in the json object as a std::set<std::string> of keys (but not represented in the output) would allow the hints to be applied in the to_json() function of a type, which would remove the need to think about the hints outside of the definition (or adl_serializer specialization) of a class. I'd like it to be as transparent to use as possible. I'll try to get a PR together for this feature as soon as I can.

@nlohmann
Owner

I am not sure how to implement binary types without changing a lot in the library: somewhere, the information that a certain value (like the numeric vector) should be encoded as binary needs to be passed to the library.

The proposal of

a set of JSON pointers or just a set of key strings could be given on serialization to make the serializer attempt to use bin, and if something prevents that assumption from holding, then throw.

may work, but this would mean a lot of work for a very specific scenario. If I missed a simple way, PRs are welcome.
