
Current use of serializedDataSize() in serializedData() causes additional runtime linear in the depth of the proto hierarchy #713

Open
@MrMage

Description

SwiftProtobuf 1.0.2, all platforms

(Disclaimer: I know that Protobuf is not meant for large data structures, but I figured it wouldn't hurt to ask about this anyway.)

I have something similar to the following protobuf hierarchy:

message Wrapper {
    Wrapped wrapped = 1;
}

message Wrapped {
    repeated string entry = 1; // the actual structure is a bit more complex, but that shouldn't really make a difference
}

In my case, calling serializedData() on a message of type Wrapped takes about 0.75 seconds (it contains a lot of data), of which about 0.25 seconds go towards computing the value of wrapped.serializedDataSize().

Calling serializedData() on a message of type Wrapper wrapping the same Wrapped instance as above takes about 1 second, of which about 0.25 seconds go towards computing wrapper.serializedDataSize() at the top (Wrapper) level and another 0.25 seconds towards wrapped.serializedDataSize() (the same as above). The serialized size of wrapper, on the other hand, is just 5 bytes more than that of wrapped. Each additional hierarchy level would introduce another ~0.25 seconds of serializedDataSize() work, essentially computing the same size over and over again.
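
For concreteness, the timings above can be reproduced with a minimal harness along these lines (a sketch: Wrapper and Wrapped are the generated Swift types for the messages shown earlier, and wrapped is assumed to be a large, fully populated instance):

import Foundation
import SwiftProtobuf

// Hypothetical timing helper, not part of SwiftProtobuf.
func measure(_ label: String, _ body: () throws -> Void) rethrows {
    let start = Date()
    try body()
    print("\(label): \(Date().timeIntervalSince(start)) s")
}

var wrapper = Wrapper()
wrapper.wrapped = wrapped

try measure("Wrapped.serializedData") { _ = try wrapped.serializedData() } // ~0.75 s
try measure("Wrapper.serializedData") { _ = try wrapper.serializedData() } // ~1.0 s: one extra size pass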

If I change Wrapper to

message Wrapper {
    bytes wrapped = 1;
}

and manually store a serialized representation of wrapped in there, encoding Wrapped still takes about 0.75 seconds (as before), plus a negligible amount of overhead (much less than ~0.25 seconds) for copying the serialized representation around.
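
A minimal sketch of that workaround, assuming the bytes-based Wrapper above and the same large wrapped instance:

import Foundation
import SwiftProtobuf

// Serialize the inner message once and store the raw bytes, so the
// outer message never has to recompute the inner message's size.
let innerData = try wrapped.serializedData()   // ~0.75 s, paid exactly once

var wrapper = Wrapper()
wrapper.wrapped = innerData                    // just a Data copy
let outerData = try wrapper.serializedData()   // negligible extra cost

// The consumer then has to decode the inner message explicitly:
let roundTripped = try Wrapper(serializedData: outerData)
let inner = try Wrapped(serializedData: roundTripped.wrapped)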

This means that for cases in which serializedDataSize() is potentially costly, introducing extra hierarchy levels into my protobuf model can cause significant increases in computation time.

The C++ implementation avoids this problem by caching serialized sizes, as evidenced by the statement "Most of these are just simple wrappers around ByteSize() and SerializeWithCachedSizes()." on https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message_lite#serialization.

Given that manipulating messages that are in the process of being serialized sounds like a bad idea anyway, I wonder if it would be possible to introduce similar caching into SwiftProtobuf?

This could be done by storing an extra cachedSize field in each message (see https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message_lite#MessageLite.GetCachedSize.details), although that would introduce an extra 8 bytes of overhead per message. (Edit: this might be problematic because writing the cache would mutate the struct; it could perhaps be worked around by applying the nonmutating keyword, though.)
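
A minimal sketch of what such inline caching could look like; SizeCache, _cache and ExampleMessage are hypothetical names, not SwiftProtobuf API, and the reference-typed box shown here is just one way to get nonmutating caching behavior out of a struct:

// Sketch only: a per-message size cache under the assumptions above.
final class SizeCache {
    var size: Int?
}

struct ExampleMessage {
    // Reference-typed box: writing through it doesn't mutate the struct
    // itself, at the cost of one pointer (8 bytes) per message.
    private let _cache = SizeCache()

    func serializedDataSize() -> Int {
        if let cached = _cache.size { return cached }
        let size = computeSerializedSize()  // the expensive recursive walk
        _cache.size = size                  // legal: _cache is a class instance
        return size
    }

    private func computeSerializedSize() -> Int {
        // field-by-field size computation elided
        return 0
    }
}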

Alternatively, it should be possible to create a temporary (dictionary?) cache in serializedData() that is then passed down to the recursive calls of serializedData() and serializedDataSize() on the children. As cache keys, one option would be the pointers to the individual Message structs, since their memory layout is not going to change during serialization.
A second option (possibly even more efficient) would be the pointers to the individual _StorageClass members, as caching the size is only relevant for protos that contain child protos (which seems to be the condition for a _StorageClass to exist). (Edit: option 2 might be problematic because useHeapStorage is only true for messages containing single message fields, not repeated ones.)
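
A minimal sketch of the temporary-cache idea, keyed by storage identity as in option 2; SerializationSizeCache and its method are hypothetical names, not SwiftProtobuf API:

// Sketch only: a transient cache threaded through the recursive size
// computation for the duration of a single serializedData() call,
// during which the message graph is guaranteed not to change.
final class SerializationSizeCache {
    private var sizes: [ObjectIdentifier: Int] = [:]

    func size(of storage: AnyObject, compute: () -> Int) -> Int {
        let key = ObjectIdentifier(storage)
        if let cached = sizes[key] { return cached }
        let computed = compute()
        sizes[key] = computed
        return computed
    }
}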

Either approach (caching the computed size in-line or in a temporary dictionary) would be an implementation detail and should not affect the library's consumers.

I hope this makes sense — would love to hear your thoughts on this.
