Rethink the `serde` `[u8; 16]` binary representation #557

KodrAus · 2021-11-02T22:15:19Z

Blocking #553
Following #329

The currently proposed stable uuid API changes the serde implementation from &[u8] to [u8; 16]. This is a breaking change to existing serialized data.

It's been pointed out on Reddit that [u8; 16] isn't really a universal improvement over &[u8] for binary formats, because some can treat byte slices specially.

Before stabilizing, we should try come to some consensus on this one way or another. If we can't, then I think we should fall back to the status-quo of &[u8]. So I think the options are:

&[u8]: the current approach. Can be optimized by some binary formats compared to tuples/sequences. Produces a redundant length field. May be possible to do bytemucking to re-interpret serialized bits directly as a UUID.
[u8; 16]: what's on main. Is better for some binary formats that don't have an overhead per field on sequences, but is less optimal for some binary formats than &[u8]. Will cause breakage for existing data.
u128: what I'm currently proposing. Can be formatted optimally by any binary format. Isn't necessarily guaranteed to be supported by all serializers/deserializers. The encoded UUID won't look like a UUID. Will cause breakage for existing data.
Some kind of Serializer::serialize_byte_array<const N: usize>(bytes: &[u8; N]) in serde that forwards to &[u8]?
Anything else?

cc @smarnach @Marwes @mitsuhiko (as having originally provided input on the serde format)

Also just to re-iterate, this only affects non-human-readable formats like bincode. Other formats like serde-json are still going to use hyphenated strings by default.

The text was updated successfully, but these errors were encountered:

KodrAus · 2021-11-02T22:16:43Z

From @mitsuhiko #553 (comment)

I think the serde change is not positive. It's untypical, generally serde hasn't been used like this and it's going to break a lot of formats. I'm also not convinced that u128 is a better solution here. Does the length information cause a problem in practice that it's worth considering going out of the way to do something untypical here?

KodrAus · 2021-11-02T22:18:34Z

From @Icarium-Lifestealer on Reddit:

I don't understand the serde change. Which type of the serde data model does a uuid map to now, if not byte array [u8]? A tuple with 16 members? (that sounds very inefficient for formats where fields have overhead)

rakshith-ravi · 2021-11-03T00:32:26Z

Just a random idea, but how about a feature flag for the different formats. For example, a serde-simple format can use the SimpleFormat in serde, and accordingly we can have different feature flags for different formats, with the default ones being the existing implementations, to avoid breaking existing crates. Additionally, people can enable their own feature flags for whichever format they specifically want to use, in case they want something different

kinggoesgaming · 2021-11-03T00:36:54Z

Just a random idea, but how about a feature flag for the different formats. For example, a serde-simple format can use the SimpleFormat in serde, and accordingly we can have different feature flags for different formats, with the default ones being the existing implementations, to avoid breaking existing crates. Additionally, people can enable their own feature flags for whichever format they specifically want to use, in case they want something different

The problem with this is conflicting feature flags downstream, when different libraries decide to use one or other format by default.

KodrAus · 2021-11-03T04:21:56Z

I think the serde change is not positive. It's untypical, generally serde hasn't been used like this and it's going to break a lot of formats.

Hmm, I wouldn't consider serializing as [u8; 16] as particularly untypical. If we'd just dropped a #[derive(Serialize)] on Uuid that's exactly what we'd get.

If we are going to change the format at all, I think u128 is probably a better bet than [u8; 16]. It gives any serializer a chance to do the most optimal thing with a 128bit value, rather than &[u8] where the length is redundant or [u8; 16] where you can't distinguish from an array of bytes or an array of something else. Not necessarily knowing those 128bits encoded in the format are actually a UUID doesn't seem like a major drawback. If you do want to do zero-copy deserialization then there are other tools besides serde (zerocopy and bytemuck) that facilitate that.

rakshith-ravi · 2021-11-03T08:25:11Z

The problem with this is conflicting feature flags downstream, when different libraries decide to use one or other format by default.

So if someone wanted to use a specific format for their UUIDs in their application, how would they go about doing it without manually implementing Serialize or Deserialize?

KodrAus · 2021-11-03T09:35:29Z

@rakshith-ravi They'd have to manually implement Serialize or Deserialize to tweak the way specific fields are serialized. There are attributes you can use with #[derive(Serialize, Deserialize)] to change things without needing to manually implement the traits, but Cargo features aren't really a tool for altering the public contract of a library in any way that isn't purely additive (and the serialization format is a part of the public contract).

Marwes · 2021-11-03T10:18:56Z

From @Icarium-Lifestealer on Reddit:

I don't understand the serde change. Which type of the serde data model does a uuid map to now, if not byte array [u8]? A tuple with 16 members? (that sounds very inefficient for formats where fields have overhead)

We could equally well state this problem as an issue with the format, rather than with serde/uuid. I suppose the tuple serializing does not give the information upfront that the fields all have the same time, so perhaps this indicates that serde should have a way to serialize/deserialize fixed length arrays (currently you get either fixed size in tuples for fixed element type in sequences, but not both).

KodrAus · 2021-11-03T10:48:12Z

perhaps this indicates that serde should have a way to serialize/deserialize fixed length arrays

If you peel the crossed out sticker off that fourth point I was wondering the same thing 🙂 Some kind of serialize_array<T: Serialize, const N: usize>(self, arr: &[T; N]) that forwards to serialize_seq (or possible to serialize_tuple for compatibility) would be a nice fit. It could be something to propose to serde.

KodrAus · 2021-11-03T11:14:20Z

Unless there’s some clear desire to figure out [T; N] support in serde in the immediate future though I don’t think we should let that block us.

mitsuhiko · 2021-11-03T13:01:50Z

I think the problem is not so much that serde does not have array serialization but that serde does not have a good strategy for bytes. In most formats there is a significant difference between bytes and arrays or vectors of other types. That's why serde_bytes is needed for instance to invoke the bytes interface. There is however no equivalent to "byte array of size N" or "byte tuple".

It's true that serde generates out tuple serialization code for any fixed size array and I think this is absolutely valid for when you're working with non u8 types. For u8 however this has been shown to be problematic.

mitsuhiko · 2021-11-03T13:04:26Z

Unless there’s some clear desire to figure out [T; N] support in serde in the immediate future though I don’t think we should let that block us.

There is no way to support this in serde because of an unfortunate decision to have different trait bounds for [T; 0]. So we're going to be stuck with this for a long time.

Marwes · 2021-11-03T13:46:41Z

Serde wouldn't need to explicitly support [T; N] though, a method like fn serialize_fixed_array(&mut self, array: &[T]) would work just as well.

sfackler · 2021-11-03T18:53:05Z

IMO going through serialize_bytes explicitly for non-human-readable serializers would be a better approach than going through arrays. It both more closely matches the behavior of serializers in other languages (e.g. Jackson), and should be more efficient since the serializer gets the whole buffer up-front rather than being fed each byte in sequence.

smarnach · 2021-11-03T20:23:03Z

I still think the most natural approach is to serialize a Uuid as a [u8; 16], because it is a [u8;16].

[…] should be more efficient since the serializer gets the whole buffer up-front rather than being fed each byte in sequence.

At least for the bincode serde format, it's less efficient to serialize it as a byte array, since that requires a redundant length tag, so it end up serializing to 24 bytes rather than 16. This was the motivation for changing the serialization format.

What binary serde formats would be adversely affected by serlializing as a [u8; 16]?

sfackler · 2021-11-03T20:31:26Z

What binary serde formats would be adversely affected by serlializing as a [u8; 16]?

Any binary JSON-style format: serde-cbor, serde-smile, serde-bson, etc.

KodrAus · 2021-11-03T20:50:25Z

It’s sounding like [u8; 16] is probably not the way forward then for various reasons. So our remaining options are:

Go back to serialize_bytes and continue to wear the redundant length.
See if there’s any interest in a serialize_array or even more scoped serialize_byte_array method in serde that would let us tell a format the length of the byte array is fixed. I could imagine this would need some coordinated support in various formats to be usable.
Use u128.

Are there any thoughts on u128?

sfackler · 2021-11-03T20:52:29Z

I'd be worried about the portability of u128 across formats - for example, serde-cbor can't encode them.

KodrAus · 2021-11-03T20:57:32Z

@sfackler Ah ok, is it that it simply can’t encode them? Or that it needs to use a fallback? I’ve seen some formats route 128bit numbers through serialize_bytes.

I’d kind of expected 128bit number support to be pretty good by now (at least not hotting the error path in the majority of formats), but maybe it’s just not yet.

KodrAus · 2021-11-03T23:38:20Z

Ok, I've had a look at a couple binary formats to see how size compares between &[u8] and [u8; 16]. These results are for a struct containing a single field, where that field is one of the proposed representations:

bincode (bytes): Ok(24)
bincode (tuple): Ok(16)
bincode (u128) : Ok(16)

msgpack (bytes): Ok(19)
msgpack (tuple): Ok(20)
msgpack (u128) : Ok(19)

flexbuf (bytes): Ok(33)
flexbuf (tuple): Ok(33)
flexbuf (u128) : Err(Serde("u128 is not supported"))

postcrd (bytes): Ok(17)
postcrd (tuple): Ok(16)
postcrd (u128) : Ok(16)

pickle  (bytes): Ok(34)
pickle  (tuple): Ok(50)
pickle  (u128) : Err(Syntax(Structure("u128 is not supported")))

cbor    (bytes): Ok(24)
cbor    (tuple): Ok(24)
cbor    (u128) : Ok(25)

It looks like 128bit numbers actually are rather spottily supported indeed, so I think that rules u128 out as an option.

But [u8; 16] doesn't actually look too bad on this sample:

pickle has a very bad time with it. It becomes ~1.5x larger. This is equivalent to the decrease in size in bincode between tuples and slices.
msgpack requires an extra byte.

I'd be keen to add other non-human-readable formats to this list and evaluate them, but on balance [u8; 16] actually looks ok for UUIDs.

KodrAus · 2021-11-04T04:20:16Z

I think @sfackler's original concern about the lazily written tuple performance is also justified though.

Writing a struct with a single field containing the various representations on my M1 MacBook with maximum optimizations results in something like:

running 18 tests
test bincode_bytes     ... bench:          25 ns/iter (+/- 0)
test bincode_tuple     ... bench:          37 ns/iter (+/- 1)
test bincode_u128      ... bench:          25 ns/iter (+/- 0)

test cbor_bytes        ... bench:          89 ns/iter (+/- 0)
test cbor_tuple        ... bench:         161 ns/iter (+/- 23)
test cbor_u128         ... bench:         122 ns/iter (+/- 1)

test flexbuffers_bytes ... bench:         376 ns/iter (+/- 18)
test flexbuffers_tuple ... bench:         747 ns/iter (+/- 24)
test flexbuffers_u128  ... bench:         283 ns/iter (+/- 1)

test msgpack_bytes     ... bench:          31 ns/iter (+/- 0)
test msgpack_tuple     ... bench:          84 ns/iter (+/- 0)
test msgpack_u128      ... bench:          31 ns/iter (+/- 0)

test pickle_bytes      ... bench:          43 ns/iter (+/- 0)
test pickle_tuple      ... bench:         112 ns/iter (+/- 10)
test pickle_u128       ... bench:         104 ns/iter (+/- 0)

test postcard_bytes    ... bench:          97 ns/iter (+/- 12)
test postcard_tuple    ... bench:          63 ns/iter (+/- 0)
test postcard_u128     ... bench:          29 ns/iter (+/- 0)

It's more expensive in all cases except for postcard.

This still seems to be the case even when we amortize the cost of re-allocating the Vec multiple times:

test bincode_bytes      ... bench:          25 ns/iter (+/- 0)
test bincode_bytes_into ... bench:           4 ns/iter (+/- 0)
test bincode_tuple      ... bench:          38 ns/iter (+/- 0)
test bincode_tuple_into ... bench:          33 ns/iter (+/- 0)
test bincode_u128       ... bench:          25 ns/iter (+/- 0)
test bincode_u128_into  ... bench:           0 ns/iter (+/- 0)

KodrAus · 2021-11-04T05:02:42Z

I started writing up a proposal for supporting fixed-size byte arrays in serde and hit a roadblock where I don't think it's possible to avoid the trade-offs of &[u8] vs [u8; N] without explicit Deserializer support. Edit: When it's all mapped out it looks a bit uncompelling, but I've opened serde-rs/serde#2120

Since serde has made the decision that [T; N] is the same as (T, ..N) I don't think any proposal for general-purpose array support beyond tuples is likely to get any traction.

I think that leaves us picking between the trade-offs of &[u8] and [u8; 16]. So now it's a question of:

Potential space saving for some formats and overheads for others with &[u8].
Potential space saving for some formats and overheads for others with [u8; 16]. Potential performance overheads with [u8; 16]. Need for ecosystem migration with [u8; 16].

KodrAus · 2021-11-04T05:06:27Z

Maybe we can do a bit of a reaction poll to gauge the current temperature. If it's not possible to have everything through explicit fixed-size-byte-array support should we:

🚀 to revert back to &[u8]
🎉 to stick to [u8; 16]

(I couldn't think of which option should get the 👍 vs the 👎 so we get rockets and party poppers)

sfackler · 2021-11-04T15:46:39Z

If we want to hit the serialize_bytes codepath, self.as_bytes().serialize(ser) won't work - all slices use serialize_seq instead afaik.

KodrAus · 2021-11-04T22:26:57Z

If we want to hit the serialize_bytes codepath, self.as_bytes().serialize(ser) won't work

Yeh, the old implementation would call serializer.serialize_bytes itself.

KodrAus · 2021-11-08T05:12:41Z

So, based on everything we've learned about the trade-offs of &[u8] vs [u8; 16] vs u128 I think we're stuck without a single ideal approach that's going to suit everybody's needs. Uuid needs to pick one of them though. It seems that when we weigh everything up with each option, the potential space savings, performance implications, compatibility, and potential for churn and breakage, the most appropriate choice is to revert back to serialize_bytes. I'm sorry to the folks who wanted [u8; 16]. I can appreciate that for your needs it's the best choice, and the 8 redundant bytes in bincode is definitely unfortunate, but we don't have the luxury of an option that appeals to even a majority of users, so in this case I think the status-quo wins.

It's been great to really try dig into this, I appreciate everyone's involvement and I'll still give serde-rs/serde#2120 some time to play out, but realistically I expect we'll be back at serialize_bytes.

KodrAus mentioned this issue Nov 2, 2021

Prepare for 1.0.0-alpha.1 release #553

Merged

KodrAus changed the title ~~Rethink the [u8; 16] binary representation~~ Rethink the serde [u8; 16] binary representation Nov 3, 2021

KodrAus mentioned this issue Nov 4, 2021

Consider a Serializer::serialize_byte_array method serde-rs/serde#2120

Closed

kinggoesgaming mentioned this issue Nov 6, 2021

Inconsistent inlining? #551

Closed

KodrAus mentioned this issue Nov 15, 2021

Revert default compact serialization #564

Merged

KodrAus closed this as completed in #564 Nov 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rethink the `serde` `[u8; 16]` binary representation #557

Rethink the `serde` `[u8; 16]` binary representation #557

KodrAus commented Nov 2, 2021 •

edited

Loading

KodrAus commented Nov 2, 2021 •

edited

Loading

KodrAus commented Nov 2, 2021

rakshith-ravi commented Nov 3, 2021

kinggoesgaming commented Nov 3, 2021

KodrAus commented Nov 3, 2021 •

edited

Loading

rakshith-ravi commented Nov 3, 2021 •

edited

Loading

KodrAus commented Nov 3, 2021

Marwes commented Nov 3, 2021

KodrAus commented Nov 3, 2021

KodrAus commented Nov 3, 2021

mitsuhiko commented Nov 3, 2021

mitsuhiko commented Nov 3, 2021

Marwes commented Nov 3, 2021

sfackler commented Nov 3, 2021

smarnach commented Nov 3, 2021

sfackler commented Nov 3, 2021 •

edited

Loading

KodrAus commented Nov 3, 2021

sfackler commented Nov 3, 2021

KodrAus commented Nov 3, 2021

KodrAus commented Nov 3, 2021 •

edited

Loading

KodrAus commented Nov 4, 2021 •

edited

Loading

KodrAus commented Nov 4, 2021 •

edited

Loading

KodrAus commented Nov 4, 2021 •

edited

Loading

sfackler commented Nov 4, 2021

KodrAus commented Nov 4, 2021

KodrAus commented Nov 8, 2021

Rethink the serde [u8; 16] binary representation #557

Rethink the serde [u8; 16] binary representation #557

Comments

KodrAus commented Nov 2, 2021 • edited Loading

KodrAus commented Nov 2, 2021 • edited Loading

KodrAus commented Nov 2, 2021

rakshith-ravi commented Nov 3, 2021

kinggoesgaming commented Nov 3, 2021

KodrAus commented Nov 3, 2021 • edited Loading

rakshith-ravi commented Nov 3, 2021 • edited Loading

KodrAus commented Nov 3, 2021

Marwes commented Nov 3, 2021

KodrAus commented Nov 3, 2021

KodrAus commented Nov 3, 2021

mitsuhiko commented Nov 3, 2021

mitsuhiko commented Nov 3, 2021

Marwes commented Nov 3, 2021

sfackler commented Nov 3, 2021

smarnach commented Nov 3, 2021

sfackler commented Nov 3, 2021 • edited Loading

KodrAus commented Nov 3, 2021

sfackler commented Nov 3, 2021

KodrAus commented Nov 3, 2021

KodrAus commented Nov 3, 2021 • edited Loading

KodrAus commented Nov 4, 2021 • edited Loading

KodrAus commented Nov 4, 2021 • edited Loading

KodrAus commented Nov 4, 2021 • edited Loading

sfackler commented Nov 4, 2021

KodrAus commented Nov 4, 2021

KodrAus commented Nov 8, 2021

Rethink the `serde` `[u8; 16]` binary representation #557

Rethink the `serde` `[u8; 16]` binary representation #557

KodrAus commented Nov 2, 2021 •

edited

Loading

KodrAus commented Nov 2, 2021 •

edited

Loading

KodrAus commented Nov 3, 2021 •

edited

Loading

rakshith-ravi commented Nov 3, 2021 •

edited

Loading

sfackler commented Nov 3, 2021 •

edited

Loading

KodrAus commented Nov 3, 2021 •

edited

Loading

KodrAus commented Nov 4, 2021 •

edited

Loading

KodrAus commented Nov 4, 2021 •

edited

Loading

KodrAus commented Nov 4, 2021 •

edited

Loading