
Current use of serializedDataSize() in serializedData() causes additional runtime linear in the depth of the proto hierarchy #713

Open
@MrMage

Description

SwiftProtobuf 1.0.2, all platforms

(Disclaimer: I know that Protobuf is not meant for large data structures, but I figured it wouldn't hurt to ask about this anyway.)

I have something similar to the following protobuf hierarchy:

message Wrapper {
    Wrapped wrapped = 1;
}

message Wrapped {
    repeated string entry = 1; // the actual structure is a bit more complex, but that shouldn't really make a difference
}

In my case, calling serializedData() on a message of type Wrapped takes about 0.75 seconds (it contains a lot of data), of which about 0.25 seconds go towards computing the value of wrapped.serializedDataSize().

Calling serializedData() on a message of type Wrapper wrapping the same Wrapped instance as above takes about 1 second, of which about 0.25 seconds go towards computing wrapper.serializedDataSize() at the top (Wrapper) level and another 0.25 seconds towards wrapped.serializedDataSize() (the same as above). The serialized size of wrapper, on the other hand, is just 5 bytes more than that of wrapped. Each additional hierarchy level would introduce another ~0.25 seconds of serializedDataSize() work, essentially computing the same size over and over again.
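
For concreteness, the timings above can be reproduced with a minimal harness along these lines (a sketch: Wrapper and Wrapped are the generated Swift types for the messages shown earlier, and wrapped is assumed to be a large, fully populated instance):

import Foundation
import SwiftProtobuf

// Hypothetical timing helper, not part of SwiftProtobuf.
func measure(_ label: String, _ body: () throws -> Void) rethrows {
    let start = Date()
    try body()
    print("\(label): \(Date().timeIntervalSince(start)) s")
}

var wrapper = Wrapper()
wrapper.wrapped = wrapped

try measure("Wrapped.serializedData") { _ = try wrapped.serializedData() } // ~0.75 s
try measure("Wrapper.serializedData") { _ = try wrapper.serializedData() } // ~1.0 s: one extra size pass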

If I change Wrapper to

message Wrapper {
    bytes wrapped = 1;
}

and manually store a serialized representation of wrapped in there, encoding Wrapped still takes about 0.75 seconds (as before), plus a negligible amount of overhead (much less than ~0.25 seconds) for copying the serialized representation around.
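
A minimal sketch of that workaround, assuming the bytes-based Wrapper above and the same large wrapped instance:

import Foundation
import SwiftProtobuf

// Serialize the inner message once and store the raw bytes, so the
// outer message never has to recompute the inner message's size.
let innerData = try wrapped.serializedData()   // ~0.75 s, paid exactly once

var wrapper = Wrapper()
wrapper.wrapped = innerData                    // just a Data copy
let outerData = try wrapper.serializedData()   // negligible extra cost

// The consumer then has to decode the inner message explicitly:
let roundTripped = try Wrapper(serializedData: outerData)
let inner = try Wrapped(serializedData: roundTripped.wrapped)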

This means that for cases in which serializedDataSize() is potentially costly, introducing extra hierarchy levels into my protobuf model can cause significant increases in computation time.

The C++ implementation avoids this problem by caching serialized sizes, as evidenced by the statement "Most of these are just simple wrappers around ByteSize() and SerializeWithCachedSizes()." on https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message_lite#serialization.

Given that manipulating messages that are in the process of being serialized sounds like a bad idea anyway, I wonder if it would be possible to introduce similar caching into SwiftProtobuf?

This could be done by storing an extra cachedSize field in each message (see https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message_lite#MessageLite.GetCachedSize.details), although that would introduce an extra 8 bytes of overhead per message. (Edit: this might be problematic because writing the cache would mutate the struct; it could perhaps be worked around by applying the nonmutating keyword, though.)
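
A minimal sketch of what such inline caching could look like; SizeCache, _cache and ExampleMessage are hypothetical names, not SwiftProtobuf API, and the reference-typed box shown here is just one way to get nonmutating caching behavior out of a struct:

// Sketch only: a per-message size cache under the assumptions above.
final class SizeCache {
    var size: Int?
}

struct ExampleMessage {
    // Reference-typed box: writing through it doesn't mutate the struct
    // itself, at the cost of one pointer (8 bytes) per message.
    private let _cache = SizeCache()

    func serializedDataSize() -> Int {
        if let cached = _cache.size { return cached }
        let size = computeSerializedSize()  // the expensive recursive walk
        _cache.size = size                  // legal: _cache is a class instance
        return size
    }

    private func computeSerializedSize() -> Int {
        // field-by-field size computation elided
        return 0
    }
}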

Alternatively, it should be possible to create a temporary (dictionary?) cache in serializedData() that is then passed down to the recursive calls of serializedData() and serializedDataSize() on the children. As cache keys, one option would be the pointers to the individual Message structs, since their memory layout is not going to change during serialization.
A second option (possibly even more efficient) would be the pointers to the individual _StorageClass members, as caching the size is only relevant for protos that contain child protos (which seems to be the condition for a _StorageClass to exist). (Edit: option 2 might be problematic because useHeapStorage is only true for messages containing single message fields, not repeated ones.)
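
A minimal sketch of the temporary-cache idea, keyed by storage identity as in option 2; SerializationSizeCache and its method are hypothetical names, not SwiftProtobuf API:

// Sketch only: a transient cache threaded through the recursive size
// computation for the duration of a single serializedData() call,
// during which the message graph is guaranteed not to change.
final class SerializationSizeCache {
    private var sizes: [ObjectIdentifier: Int] = [:]

    func size(of storage: AnyObject, compute: () -> Int) -> Int {
        let key = ObjectIdentifier(storage)
        if let cached = sizes[key] { return cached }
        let computed = compute()
        sizes[key] = computed
        return computed
    }
}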

Either approach (caching the computed size in-line or in a temporary dictionary) would be an implementation detail and should not affect the library's consumers.

I hope this makes sense — would love to hear your thoughts on this.
