feat: Binary value type for optimized binary arrays #6

Merged
merged 1 commit into from
Nov 22, 2024


nebkat
Contributor

@nebkat nebkat commented May 11, 2024

The original UBJSON solution for binary data was an array of uint8 values. While this does sufficiently address the encoding of such data in the UBJSON format, it does not allow parsers to differentiate between a generic list of numbers and binary data.


When dealing with large quantities of binary data this can have a significant negative impact on performance, as many languages provide optimized storage for binary data that is much more efficient than a standard array.

In the nlohmann C++ JSON library for example, a standard array can require 16 bytes per byte of data, while an optimized binary format would require exactly one.
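To make the per-element overhead concrete, here is a minimal sketch. The `Value` struct is a hypothetical tagged union, loosely modelled on a dynamically typed JSON node; it is not the actual layout of `nlohmann::json`. On typical 64-bit platforms its size comes out at 16 bytes, but that depends on the ABI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical dynamically typed value: an 8-byte payload plus a type tag.
// Alignment padding typically rounds this up to 16 bytes on 64-bit targets.
struct Value {
    union {
        std::int64_t i;
        double d;
        void* p;
    } payload;
    std::uint8_t tag;
};

// Bytes of element storage needed to hold n data bytes as generic values.
constexpr std::size_t generic_storage(std::size_t n) { return n * sizeof(Value); }

// Bytes of element storage needed to hold n data bytes in a byte buffer.
constexpr std::size_t binary_storage(std::size_t n) { return n * sizeof(std::uint8_t); }
```

Under these assumptions, `generic_storage(n) / binary_storage(n)` is simply `sizeof(Value)`, which is where a factor on the order of 16 would come from.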

The introduction of the other unsigned data types in BJData furthers the need for a dedicated byte type. uint8 is no longer the lone unsigned data type, and for parsers to treat uint8 arrays differently as suggested in the UBJSON solution would lead to further confusion.


This proposal aims to address this issue with the introduction of a dedicated byte (B) type. This type would be identical to a uint8, but would be explicitly recommended for serializers/parsers to implement as an optimized data format type. Where such a type is not available, or parsers have not been upgraded to support the format, a standard integer array can be used instead.
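As a rough illustration of the idea, the sketch below writes a byte buffer using the optimized array header `[ $ B # U <count>` and classifies a header by its type marker. The helper names `encode_byte_array` and `classify` are hypothetical and do not come from any existing library; the sketch only handles counts that fit in a single uint8.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode up to 255 bytes as an optimized array with the
// proposed `B` type marker: [ $ B # U <count> <payload...>
std::vector<std::uint8_t> encode_byte_array(const std::vector<std::uint8_t>& data) {
    assert(data.size() <= 0xFF && "sketch only handles a uint8 count");
    std::vector<std::uint8_t> out = {'[', '$', 'B', '#', 'U',
                                     static_cast<std::uint8_t>(data.size())};
    out.insert(out.end(), data.begin(), data.end());
    return out;
}

enum class ArrayKind { Binary, Integers, Other };

// Classify an optimized-array header by its strong type marker.
ArrayKind classify(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < 3 || buf[0] != '[' || buf[1] != '$') return ArrayKind::Other;
    switch (buf[2]) {
        case 'B': return ArrayKind::Binary;    // dedicated byte type
        case 'U': return ArrayKind::Integers;  // plain uint8 numbers
        default:  return ArrayKind::Other;
    }
}
```

The payload bytes are identical for `B` and `U`; only the marker differs, which is what lets a parser pick an optimized binary container without guessing.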

For example, C++ provides std::vector<std::byte> or std::vector<uint8_t>, JavaScript provides Uint8Array, Dart provides Uint8List, and Python provides bytearray.


UBJSON also states:

BSON, for example, defines types for binary data, regular expressions, JavaScript code blocks and other constructs that have no equivalent data type in JSON. BJSON defines a binary data type as well, again leaving the door wide open to interpretation that can potentially lead to incompatibilities between two implementations of the spec and Smile, while the closest, defines more complex data constructs and generation/parsing rules in the name of absolute space efficiency. These are not short-comings, just trade-offs the different specs made in order to service specific use-cases.

This solution does not fundamentally add any complexity, and without it many may be forced to use these other data formats along with all their baggage in order to achieve the desired efficiency.

@nebkat
Contributor Author

nebkat commented Sep 28, 2024

@fangq Have you perhaps had a chance to consider this proposal? Thanks in advance!

@fangq
Member

fangq commented Sep 28, 2024

@nebkat, thanks for the patch and rationales above.

I understand the need to semantically separate byte from uint8/char. My main concern is that not all programming environments have this distinction. For example, MATLAB has uint8 and char, but it does not have a native byte type; in other languages, this often falls back to the uint8 type or is an alias. When a language does support it, such as Python's bytearray vs np.uint8, it adds the additional burden of converting between the two.

other than refined semantics, can you provide some use cases where such distinction offers notably better data handling?

@nebkat
Contributor Author

nebkat commented Sep 30, 2024

@fangq My particular application is in C++. Currently nlohmann/json will encode its internal "binary" type as a uint8 array, but will decode it as a generic array containing numbers, which means that:

nlohmann::json test = nlohmann::json::binary(std::vector<uint8_t> { 1 });
test != json::from_bjdata(json::to_bjdata(test, true, true)); // 🙁

If this was changed to perform binary decoding by default then similarly:

nlohmann::json test = nlohmann::json::array({ 1 });
test != json::from_bjdata(json::to_bjdata(test, true, true)); // 🙁

While this allows the binary data to remain efficient (by a factor of 16), it introduces a nasty problem: because the array type in nlohmann/json is chosen dynamically based on the values it contains, an array whose values all happen to be < 256 would be encoded/decoded as the special-case U binary array, while an array containing any value >= 256 would not.

nlohmann::json small = nlohmann::json::array({ 1 });
nlohmann::json big = nlohmann::json::array({ 1024 });
small != json::from_bjdata(json::to_bjdata(small, true, true)); // 🙁
big == json::from_bjdata(json::to_bjdata(big, true, true));
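The ambiguity can be reproduced without any library. The sketch below models a hypothetical writer that picks the narrowest unsigned marker for an array (`U` for uint8, `u` for uint16, `m` for uint32) and a hypothetical reader that treats every `U` array as binary. Both function names are illustrative assumptions, not nlohmann/json API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical writer policy: choose the narrowest unsigned type marker that
// fits every element ('U' = uint8, 'u' = uint16, 'm' = uint32).
char narrowest_marker(const std::vector<std::uint32_t>& values) {
    char marker = 'U';
    for (std::uint32_t v : values) {
        if (v > 0xFFFF) return 'm';
        if (v > 0xFF) marker = 'u';
    }
    return marker;
}

// Hypothetical reader policy under discussion: any uint8 array is binary.
bool decodes_as_binary(char marker) { return marker == 'U'; }
```

With these two policies, `{1}` comes back as binary data while `{1024}` comes back as an integer array, so the decoded type depends on the values rather than on what the caller originally stored. A distinct `B` marker removes that dependency.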

I have not used the Python package as much, but I suspect the same occurs when encoding an np.uint8 array, which comes back as a bytearray after decoding instead.


As for individual byte values vs uint8/char, I am not all that concerned and would not be opposed to restricting a B data type to only be used as an array type. Among the common languages I only know of C++ having a dedicated std::byte type (which nlohmann/json doesn't support anyway).

That said, I also see no harm in permitting it as an additional semantic type beside char, considering it also can't be represented in Python or JavaScript. This may be easier than introducing a dedicated array-only type that is different from the rest.


If this was accepted I think the path forward would be to introduce flags in libraries to continue encoding/decoding uint8 arrays to maintain compatibility with languages that do not have the ability to distinguish between uint8[] and byte[].
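One possible shape for such flags is sketched below. The option names `legacy_uint8_binary` and `uint8_array_as_binary` are purely hypothetical and are not taken from any of the libraries mentioned; they only illustrate how an encoder and decoder could stay compatible with peers that cannot distinguish uint8[] from byte[].

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoder option: emit the old 'U' marker for binary data when
// talking to peers that do not understand 'B' yet.
struct EncodeOptions {
    bool legacy_uint8_binary = false;
};

// Type marker the encoder would write for a binary buffer.
char binary_marker(const EncodeOptions& opts) {
    return opts.legacy_uint8_binary ? 'U' : 'B';
}

// Hypothetical decoder option: keep treating uint8 arrays as binary data.
struct DecodeOptions {
    bool uint8_array_as_binary = false;
};

// Whether an optimized array with this type marker decodes to a binary buffer.
bool is_binary(char marker, const DecodeOptions& opts) {
    return marker == 'B' || (marker == 'U' && opts.uint8_array_as_binary);
}
```

With both flags left at their defaults in this sketch, `B` always round-trips as binary and `U` always round-trips as a numeric array; the flags only matter when interoperating with older data or tools.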

I have a patch ready for nebkat/nlohmann-json@37066db, can provide patches for the other libraries if needed, and will hopefully soon have a Dart library ready: nebkat/dart-bjdata.

@nebkat
Contributor Author

nebkat commented Nov 22, 2024

@fangq Sorry to ping again - we are approaching a production release on an embedded project where we make extensive use of binary arrays.

Currently we are using this proposal as an unofficial extension but we would love to avoid further fragmentation in case a different solution is eventually accepted (as we will have to support this indefinitely going forward).

As mentioned above the current implementation of nlohmann/json is essentially unusable with binary data, and there is no solution that does not further break encode/decode idempotence.

I have since also implemented the necessary changes in nebkat/pybj@afa0a23.

Would really appreciate if we could get this or an alternative solution approved before we are locked in with our release!

@fangq fangq merged commit df14c14 into NeuroJSON:master Nov 22, 2024
@fangq
Member

fangq commented Nov 22, 2024

Thanks, I think this is a meaningful addition to bjdata, and I am happy to merge this into the bjdata spec.

when your pybj patch is ready, happy to merge it and make a new release. I will also work on my matlab/octave and javascript bjdata parsers.

nebkat added a commit to nebkat/nlohmann-json that referenced this pull request Nov 24, 2024
Introduces a dedicated `B` marker for bytes. This is used as the strong
type marker in optimized array format to encode binary data such that
it can also be decoded back to binary data (instead of decoding as an
integer array).

See NeuroJSON/bjdata#6 for further information.
@nebkat nebkat deleted the patch-1 branch December 5, 2024 23:05
nebkat added a commit to nebkat/nlohmann-json that referenced this pull request Jan 5, 2025
Introduces a dedicated `B` marker for bytes. This is used as the strong
type marker in optimized array format to encode binary data such that
it can also be decoded back to binary data (instead of decoding as an
integer array).

See NeuroJSON/bjdata#6 for further information.

Signed-off-by: Nebojsa Cvetkovic <nebkat@gmail.com>
2 participants