Description
SwiftProtobuf 1.0.2, all platforms
(Disclaimer: I know that Protobuf is not meant for large data structures, but I figured it wouldn't hurt to ask about this anyway.)
I have something similar to the following protobuf hierarchy:
```proto
message Wrapper {
  Wrapped wrapped = 1;
}

message Wrapped {
  repeated string entry = 1; // the actual structure is a bit more complex, but that shouldn't really make a difference
}
```
In my case, calling `serializedData()` on a message of type `Wrapped` takes about 0.75 seconds (it contains a lot of data), of which about 0.25 seconds go towards computing the value of `wrapped.serializedDataSize()`.

Calling `serializedData()` on a message of type `Wrapper` wrapping the same `Wrapped` instance above takes about 1 second, of which about 0.25 seconds go towards computing the value of `wrapper.serializedDataSize()` on the highest (`Wrapper`) level and another 0.25 seconds towards `wrapped.serializedDataSize()` (the same as above). The serialized size of `wrapper`, on the other hand, is just 5 bytes more than that of `wrapped`. Each additional level would introduce another ~0.25 seconds of `serializedDataSize()`, essentially computing the same size over and over again.
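The duplicated work can be sketched with a toy model of a two-pass encoder (hypothetical simplified types and wire format, not SwiftProtobuf's actual internals): sizing the parent must recurse into the child to emit its length prefix, and the child then sizes itself again when its own bytes are written.

```swift
// Toy two-pass encoder. `sizeCalls` counts how often the inner
// message's size is computed.
var sizeCalls = 0

struct Wrapped {
    var entries: [String]
    func serializedDataSize() -> Int {
        sizeCalls += 1
        // tag byte + length byte + payload, assuming short strings
        return entries.reduce(0) { $0 + 2 + $1.utf8.count }
    }
    func serializedData() -> [UInt8] {
        _ = serializedDataSize()            // sizing pass for the output buffer
        var out: [UInt8] = []
        for e in entries {
            out.append(0x0A)                // field 1, wire type 2
            out.append(UInt8(e.utf8.count))
            out.append(contentsOf: Array(e.utf8))
        }
        return out
    }
}

struct Wrapper {
    var wrapped: Wrapped
    func serializedDataSize() -> Int {
        // Sizing the parent must size the child for its length prefix.
        return 2 + wrapped.serializedDataSize()
    }
    func serializedData() -> [UInt8] {
        _ = serializedDataSize()            // sizes the child once...
        var out: [UInt8] = [0x0A]
        let body = wrapped.serializedData() // ...and the child sizes itself again
        out.append(UInt8(body.count))
        out.append(contentsOf: body)
        return out
    }
}

let encoded = Wrapper(wrapped: Wrapped(entries: ["hello"])).serializedData()
// sizeCalls is now 2: the same inner size was computed once per hierarchy level.
```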
If I change `Wrapper` to

```proto
message Wrapper {
  bytes wrapped = 1;
}
```

and manually store a serialized representation of `wrapped` in there, encoding `Wrapped` still takes about 0.75 seconds (as before), plus a negligible amount of overhead for copying that serialized representation around (much less than ~0.25 seconds).
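For reference, here is a minimal stand-in for the `bytes`-based workaround (hypothetical simplified encoding, not the code protoc-gen-swift actually generates): once the child is pre-serialized, the wrapper's size is a trivial function of the byte count, so no recursive sizing pass is needed.

```swift
import Foundation

// Simplified varint helpers (only what a length prefix needs).
func varintSize(_ v: Int) -> Int { v < 0x80 ? 1 : 1 + varintSize(v >> 7) }
func varint(_ v: Int) -> [UInt8] {
    var v = v, out: [UInt8] = []
    repeat {
        var byte = UInt8(v & 0x7F)
        v >>= 7
        if v != 0 { byte |= 0x80 }
        out.append(byte)
    } while v != 0
    return out
}

// Stand-in for the generated `message Wrapper { bytes wrapped = 1; }`.
struct BytesWrapper {
    var wrapped = Data()   // pre-serialized child, e.g. the result of child.serializedData()
    // The size is known from the byte count alone -- O(1), no recursion.
    func serializedDataSize() -> Int { 1 + varintSize(wrapped.count) + wrapped.count }
    func serializedData() -> Data {
        var out = Data([0x0A])                    // field 1, wire type 2
        out.append(contentsOf: varint(wrapped.count))
        out.append(wrapped)
        return out
    }
}
```

The one-time cost of serializing the child is unchanged; what disappears is the repeated re-sizing at every additional wrapper level.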
This means that for cases in which `serializedDataSize()` is potentially costly, introducing extra hierarchy levels into my protobuf model can cause significant increases in computation time.

The C++ implementation avoids this problem by caching serialized sizes, as evidenced by the statement "Most of these are just simple wrappers around ByteSize() and SerializeWithCachedSizes()." on https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message_lite#serialization.

Given that manipulating messages that are in the process of being serialized sounds like a bad idea anyway, I wonder if it would be possible to introduce similar caching into SwiftProtobuf?
This could be done by storing an extra `cachedSize` field in each message (see https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message_lite#MessageLite.GetCachedSize.details). That would introduce an extra 8 bytes of overhead per message, though. (Edit: might be problematic because writing the cached size would mutate the struct. This might be addressed by applying the `nonmutating` keyword, though.)
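A sketch of how the `nonmutating` idea could look (my assumption, not SwiftProtobuf's actual design): the cached size lives in a reference-typed box, so the setter can be declared `nonmutating` on the value-typed message.

```swift
// Hypothetical sketch: the cached size lives on the heap so that writing
// it does not count as mutating the message struct itself.
final class SizeBox { var value: Int? }

struct CachingMessage {
    private let box = SizeBox()
    var entries: [String] {
        didSet { box.value = nil }       // any edit invalidates the cache
    }

    init(entries: [String]) { self.entries = entries }

    var cachedSize: Int? {
        get { box.value }
        nonmutating set { box.value = newValue }  // legal: `box` is a `let` reference
    }

    func serializedDataSize() -> Int {
        if let cached = cachedSize { return cached }
        let size = entries.reduce(0) { $0 + 2 + $1.utf8.count }  // simplified sizing
        cachedSize = size
        return size
    }
}
```

One cost this sketch glosses over: copies of the struct share the same box, so the 8 bytes of per-message overhead become a reference per message, and copy semantics would need care.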
Alternatively, it should be possible to create a temporary (dictionary?) cache in `serializedData()` that is then passed down to the recursive calls of `serializedData()` and `serializedDataSize()` on the children. As cache keys, it should be possible to use the pointers to the individual `Message` structs, as their memory layout is not going to change during serialization.

Option 2 (possibly even more efficient) for cache keys would be the pointers to the individual `_StorageClass` members, as caching the size is only relevant for protos that contain child protos (which seems to be the condition for the existence of the heap storage). (Edit: option 2 might be problematic because `useHeapStorage` is only true for messages containing single message fields, not repeated ones.)
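A sketch of the temporary-cache idea (hypothetical types; the real `_StorageClass` is internal to the generated code): the top-level call creates one dictionary keyed by the identity of each message's heap storage, and the recursive sizing calls consult it before recomputing.

```swift
// `Storage` stands in for a generated message's heap-allocated _StorageClass.
final class Storage {
    var entries: [String] = []
}

var sizeComputations = 0

struct Node {
    var storage = Storage()
    var children: [Node] = []

    func serializedDataSize(cache: inout [ObjectIdentifier: Int]) -> Int {
        let key = ObjectIdentifier(storage)   // stable while serializing
        if let hit = cache[key] { return hit }
        sizeComputations += 1
        var size = storage.entries.reduce(0) { $0 + 2 + $1.utf8.count }
        for child in children {
            size += 2 + child.serializedDataSize(cache: &cache)  // tag + simplified length
        }
        cache[key] = size
        return size
    }
}

// One cache per top-level serialization; later sizing passes hit it.
var cache: [ObjectIdentifier: Int] = [:]
let leafStorage = Storage()
leafStorage.entries = ["hi"]
let leaf = Node(storage: leafStorage)
let wrapper = Node(children: [leaf])
let first = wrapper.serializedDataSize(cache: &cache)   // computes both levels
let second = wrapper.serializedDataSize(cache: &cache)  // pure cache hit
```

Keying on `ObjectIdentifier` of the storage object sidesteps taking raw pointers to the structs themselves, which Swift does not guarantee to be stable.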
Either approach (caching the computed size in-line or in a temporary dictionary) would be an implementation detail and should not affect the library's consumers.
I hope this makes sense — would love to hear your thoughts on this.