feat: Binary value type for optimized binary arrays #6

nebkat · 2024-05-11T01:00:11Z

The original UBJSON solution for binary data was an array of uint8 values. While this does sufficiently address the encoding of such data in the UBJSON format, it does not allow parsers to differentiate between a generic list of numbers and binary data.

When dealing with large quantities of binary data this can have a significant negative impact on performance, as many languages provide optimized storage for binary data that is much more efficient than a standard array.

In the nlohmann C++ JSON library for example, a standard array can require 16 bytes per byte of data, while an optimized binary format would require exactly one.
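
To make the size difference concrete, here is a rough sketch of why a dynamically typed element costs so much more than a raw byte. `TaggedValue` is a hypothetical stand-in (a type tag plus an 8-byte payload), not the actual definition used by nlohmann/json; on a typical 64-bit platform alignment padding brings it to 16 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a dynamically typed JSON value.
// This is NOT nlohmann/json's real layout, only an illustration.
struct TaggedValue {
    std::uint8_t type;  // which member of the payload is active
    union {
        std::int64_t integer;
        double number;
        void* pointer;  // e.g. heap-allocated string/array/object
    } payload;
};

// Bytes used to hold n elements in a generic array of tagged values.
constexpr std::size_t generic_array_bytes(std::size_t n) {
    return n * sizeof(TaggedValue);
}

// Bytes used to hold n elements in a contiguous byte buffer.
constexpr std::size_t binary_buffer_bytes(std::size_t n) {
    return n * sizeof(std::uint8_t);
}

// The tag plus an 8-byte payload can never fit in fewer than 9 bytes.
static_assert(sizeof(TaggedValue) >= 9, "tag + 8-byte payload");

void example_sizes() {
    // One element per byte of input: the generic form is strictly larger.
    assert(generic_array_bytes(1024) > binary_buffer_bytes(1024));
}
```

The exact ratio depends on the platform and on the library's real value type; the figure of 16 quoted above is the author's number for nlohmann/json.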

The introduction of the other unsigned data types in BJData furthers the need for a dedicated byte type. uint8 is no longer the lone unsigned data type, and for parsers to treat uint8 arrays differently as suggested in the UBJSON solution would lead to further confusion.

This proposal aims to address this issue with the introduction of a dedicated byte (B) type. This type would be identical to a uint8, but would be explicitly recommended for serializers/parsers to implement as an optimized data format type. Where such a type is not available, or parsers have not been upgraded to support the format, a standard integer array can be used instead.
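
As a minimal sketch of what this looks like on the wire, the helper below writes a strongly typed, counted array header followed by the raw payload. `encode_typed_bytes` is a hypothetical name, not part of any library. It assumes the optimized array header has the form `[` `$` type `#` count, with the count written as a `U` (uint8) value when it fits and otherwise as a little-endian `l` (int32) value. The only difference between the existing encoding and the proposed one is the single type marker.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: encodes `data` as an optimized array with the given
// type marker ('U' for the existing uint8 form, 'B' for the proposed bytes).
std::vector<std::uint8_t> encode_typed_bytes(const std::vector<std::uint8_t>& data,
                                             char type_marker) {
    std::vector<std::uint8_t> out;
    out.push_back('[');
    out.push_back('$');
    out.push_back(static_cast<std::uint8_t>(type_marker));
    out.push_back('#');
    if (data.size() <= 0xFF) {
        // Small counts: uint8 count.
        out.push_back('U');
        out.push_back(static_cast<std::uint8_t>(data.size()));
    } else {
        // Assumption: larger counts as int32 ('l'), little-endian.
        if (data.size() > 0x7FFFFFFF) {
            throw std::length_error("payload too large for this sketch");
        }
        const std::uint32_t n = static_cast<std::uint32_t>(data.size());
        out.push_back('l');
        for (int shift = 0; shift < 32; shift += 8) {
            out.push_back(static_cast<std::uint8_t>((n >> shift) & 0xFF));
        }
    }
    out.insert(out.end(), data.begin(), data.end());
    return out;
}

void example_encode() {
    const std::vector<std::uint8_t> payload{1, 2, 3};
    const std::vector<std::uint8_t> as_u = encode_typed_bytes(payload, 'U');
    const std::vector<std::uint8_t> as_b = encode_typed_bytes(payload, 'B');
    // Same length; only the type marker at index 2 differs.
    assert(as_u.size() == as_b.size());
    assert(as_u[2] == 'U' && as_b[2] == 'B');
}
```

A parser that does not know `B` can treat it exactly like `U` and still read the payload, which is the fallback described above.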

C++ provides std::vector<std::byte> or std::vector<uint8_t>, JavaScript provides Uint8Array, Dart provides Uint8List and Python provides bytearray.

UBJSON also states:

BSON, for example, defines types for binary data, regular expressions, JavaScript code blocks and other constructs that have no equivalent data type in JSON. BJSON defines a binary data type as well, again leaving the door wide open to interpretation that can potentially lead to incompatibilities between two implementations of the spec and Smile, while the closest, defines more complex data constructs and generation/parsing rules in the name of absolute space efficiency. These are not short-comings, just trade-offs the different specs made in order to service specific use-cases.

This solution does not fundamentally add any complexity, and without it many may be forced to use these other data formats along with all their baggage in order to achieve the desired efficiency.

nebkat · 2024-09-28T15:04:41Z

@fangq Have you perhaps had a chance to consider this proposal? Thanks in advance!

fangq · 2024-09-28T17:23:54Z

@nebkat, thanks for the patch and rationales above.

I understand the need for semantically separating byte from uint8/char. My main concern is that not all programming environments have this differentiation. For example, MATLAB has uint8 and char, but it does not have a native byte type; in other languages, byte often falls back to the uint8 type or is an alias of it. When a language does support the distinction, such as Python's bytearray vs np.uint8, it adds the additional burden of converting between the two.

other than refined semantics, can you provide some use cases where such distinction offers notably better data handling?

nebkat · 2024-09-30T01:56:28Z

@fangq My particular application is in C++. Currently nlohmann/json will encode its internal "binary" type as a uint8 array, but will decode it as a generic array containing numbers, which means that:

nlohmann::json test = nonstd::json::binary(std::vector<uint8_t> { 1 });
test != json::from_bjdata(json::to_bjdata(test, true, true)); // 🙁

If this was changed to perform binary decoding by default then similarly:

nlohmann::json test = nonstd::json::array({ 1 });
test != json::from_bjdata(json::to_bjdata(test, true, true)); // 🙁

While this allows the binary data to remain efficient (by a factor of 16), it introduces a nasty problem: as the array type in nlohmann/json is dynamically chosen based on the values contained within, an array that happens to hold only values < 256 would end up encoded/decoded as the special case U binary array, while arrays containing any value >= 256 would not.

nlohmann::json small = nonstd::json::array({ 1 });
nlohmann::json big = nonstd::json::array({ 1024 });
small != json::from_bjdata(json::to_bjdata(small, true, true)); // 🙁
big == json::from_bjdata(json::to_bjdata(big, true, true));
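
The ambiguity can also be shown without any library. The sketch below is a hypothetical decoder (not nlohmann/json code) that handles only headers of the assumed form `[ $ <type> # U <count>`. With a distinct `B` marker, the decoder can pick the storage type from the marker alone and no longer has to guess from the values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

enum class Kind { Binary, IntegerArray };

// Result of decoding: exactly one of the two vectors is filled.
struct Decoded {
    Kind kind;
    std::vector<std::uint8_t> bytes;  // stand-in for an optimized binary type
    std::vector<std::int64_t> ints;   // stand-in for a generic numeric array
};

// Hypothetical decoder; only 'U' and 'B' payloads with a uint8 count.
Decoded decode_typed_array(const std::vector<std::uint8_t>& in) {
    if (in.size() < 6 || in[0] != '[' || in[1] != '$' || in[3] != '#' || in[4] != 'U') {
        throw std::invalid_argument("unsupported header in this sketch");
    }
    const std::size_t count = in[5];
    if (in.size() != 6 + count) {
        throw std::invalid_argument("payload length does not match count");
    }
    Decoded result;
    if (in[2] == 'B') {
        result.kind = Kind::Binary;
        result.bytes.assign(in.begin() + 6, in.end());
    } else if (in[2] == 'U') {
        result.kind = Kind::IntegerArray;
        result.ints.assign(in.begin() + 6, in.end());
    } else {
        throw std::invalid_argument("only 'U' and 'B' are handled here");
    }
    return result;
}

void example_decode() {
    // Identical payload {1}; the marker alone decides the decoded type.
    const std::vector<std::uint8_t> as_u{'[', '$', 'U', '#', 'U', 1, 1};
    const std::vector<std::uint8_t> as_b{'[', '$', 'B', '#', 'U', 1, 1};
    assert(decode_typed_array(as_u).kind == Kind::IntegerArray);
    assert(decode_typed_array(as_b).kind == Kind::Binary);
}
```

Under this scheme a list of small integers and a binary blob each round-trip to the type they started as.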

I have not used the Python package as much, but I suspect the same occurs when encoding an np.uint8 array, which comes back as a bytearray after decoding instead.

As for individual byte values vs uint8/char, I am not all that concerned and would not be opposed to restricting a B data type to only be used as an array type. Among the common languages I only know of C++ having a dedicated std::byte type (which nlohmann/json doesn't support anyway).

That said, I also see no harm in permitting it as an additional semantic type beside char, considering it also can't be represented in Python or JavaScript. This may be easier than introducing a dedicated array-only type that is different from the rest.

If this was accepted I think the path forward would be to introduce flags in libraries to continue encoding/decoding uint8 arrays to maintain compatibility with languages that do not have the ability to distinguish between uint8[] and byte[].
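
A rough sketch of what such flags could look like follows. The struct and function names are hypothetical; no library is known to expose exactly these options. One flag controls which marker is written for binary data, the other whether `U` arrays are still read back as binary.

```cpp
#include <cassert>

// Hypothetical compatibility options for a BJData library.
struct BinaryCompat {
    bool write_byte_marker = true;      // false: keep emitting 'U' for binary data
    bool read_uint8_as_binary = false;  // true: legacy reading of 'U' arrays
};

// Marker written when the application encodes binary data.
char marker_for_binary(const BinaryCompat& opt) {
    return opt.write_byte_marker ? 'B' : 'U';
}

// Whether an optimized array with this type marker is decoded as binary.
bool decodes_as_binary(char type_marker, const BinaryCompat& opt) {
    if (type_marker == 'B') {
        return true;
    }
    return type_marker == 'U' && opt.read_uint8_as_binary;
}

// True when binary data survives encode -> decode as binary under `opt`.
bool binary_round_trips(const BinaryCompat& opt) {
    return decodes_as_binary(marker_for_binary(opt), opt);
}

void example_compat() {
    BinaryCompat modern;  // defaults: write 'B', read 'U' as integers
    assert(binary_round_trips(modern));
    assert(!decodes_as_binary('U', modern));  // integer lists stay integer lists

    BinaryCompat legacy;
    legacy.write_byte_marker = false;
    legacy.read_uint8_as_binary = true;
    assert(binary_round_trips(legacy));
    assert(decodes_as_binary('U', legacy));  // but uint8 lists become binary too
}
```

The legacy setting keeps binary data round-tripping with peers that cannot distinguish uint8[] from byte[], at the cost of the ambiguity described earlier.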

I have a patch ready for nebkat/nlohmann-json@37066db, can provide patches for the other libraries if needed, and will hopefully soon have a Dart library ready: nebkat/dart-bjdata.

nebkat · 2024-11-22T01:02:47Z

@fangq Sorry to ping again - we are approaching a production release on an embedded project where we make extensive use of binary arrays.

Currently we are using this proposal as an unofficial extension but we would love to avoid further fragmentation in case a different solution is eventually accepted (as we will have to support this indefinitely going forward).

As mentioned above the current implementation of nlohmann/json is essentially unusable with binary data, and there is no solution that does not further break encode/decode idempotence.

I have since also implemented the necessary changes in nebkat/pybj@afa0a23.

Would really appreciate if we could get this or an alternative solution approved before we are locked in with our release!

fangq · 2024-11-22T03:12:08Z

Thanks, I think this is a meaningful addition to bjdata, and I am happy to merge this to the bjdata spec.

when your pybj patch is ready, happy to merge it and make a new release. I will also work on my matlab/octave and javascript bjdata parsers.

Introduces a dedicated `B` marker for bytes. This is used as the strong type marker in optimized array format to encode binary data such that it can also be decoded back to binary data (instead of decoding as an integer array). See NeuroJSON/bjdata#6 for further information.

Signed-off-by: Nebojsa Cvetkovic <nebkat@gmail.com>

feat: Binary value type

f9cd4e3

nebkat force-pushed the patch-1 branch from ed59798 to f9cd4e3 on May 11, 2024 01:45

fangq merged commit df14c14 into NeuroJSON:master Nov 22, 2024

nebkat mentioned this pull request Nov 24, 2024

Clarify annotated array format aliases NeuroJSON/jdata#15 (Open)

nebkat mentioned this pull request Nov 25, 2024

BJData optimized binary array type nlohmann/json#4513 (Merged)

nebkat deleted the patch-1 branch December 5, 2024 23:05
