Description
What happened?
While working on cleaning up PubsubIO.Write to support writes with ordering keys without explicit bucketing, I noticed that PreparePubsubWriteDoFn does not guarantee that a single message fits within the limits of a PublishRequest for either the gRPC or the REST client.
The validation routine in PreparePubsubWriteDoFn only validates the explicit message size and assumes 6 bytes of overhead for every attribute. This works often enough if maxPublishBatchSize is set below 9 MB for the gRPC client, but it is more likely to fail for REST clients, where there is additional overhead to consider: the data field is sent as a base64-encoded string, and string fields such as attribute keys/values and ordering keys may require escape sequences (minimally: \", \\, \b, \t, \n, \f, \r, and \uXXXX for control characters 0-31). For JSON clients the data field must be under 7.5 MB, attribute size depends on the escape-sequence overhead, and requests must not exceed 10 MiB.
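To make the JSON arithmetic concrete, here is a minimal sketch (not Beam API; class and method names are illustrative) of how the JSON-encoded size of a message's fields could be estimated. Base64 expands the data field by 4/3, which is exactly why a 7.5 MB payload hits the 10 MB mark, and each escaped character in a string field costs 2 or 6 bytes instead of 1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical estimator for the JSON-encoded size of Pub/Sub message fields.
// Not Beam API; names are illustrative only.
public class JsonSizeEstimator {

    // base64 output length: 4 bytes of output per 3 bytes of input, padded.
    static long base64Length(long dataBytes) {
        return 4 * ((dataBytes + 2) / 3);
    }

    // Size of a string as a JSON string literal body (without the surrounding
    // quotes), assuming UTF-8 and the minimal required escapes:
    // \" and \\ (2 bytes), \b \t \n \f \r (2 bytes), \uXXXX for the
    // remaining control characters 0-31 (6 bytes).
    static long jsonEscapedLength(String s) {
        long len = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c == '"' || c == '\\') {
                len += 2;                       // \" or \\
            } else if (c == '\b' || c == '\t' || c == '\n'
                    || c == '\f' || c == '\r') {
                len += 2;                       // two-character short escape
            } else if (c < 0x20) {
                len += 6;                       // \uXXXX
            } else {
                len += 1;
            }
        }
        return len;
    }

    public static void main(String[] args) {
        // 7.5 MB of payload becomes exactly 10 MB once base64 encoded.
        System.out.println(base64Length(7_500_000));      // 10000000
        // 'a' + escaped '"' + 'b' + escaped '\n' + 'c' = 1+2+1+2+1 bytes.
        System.out.println(jsonEscapedLength("a\"b\nc")); // 7
    }
}
```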
For gRPC clients, the static overhead of 6 bytes per attribute is valid only for map entries under 128 B in protobuf wire format. The TLV overhead may grow to 3 bytes each for the entry, key, and value: attribute field tag (1 B) + map entry length (1-2 B) + key field tag (1 B) + key length (1-2 B) + value field tag (1 B) + value length (1-2 B). The total serialized size of any single-message publish request must currently not exceed 10 MB. The API accepts requests up to 10 MiB for both JSON and protobuf, and may at some point omit the protobuf encoding overhead from request validation, in which case a single message will not exceed 10 MiB since its encoding overhead is less than 1 KiB.
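The per-attribute overhead above can be computed exactly rather than assumed constant. A sketch (hypothetical helper, not Beam API) using protobuf's varint length-prefix rules: each length prefix is 1 byte below 128 and 2 bytes up to 16383, so the overhead ranges from 6 to 9 bytes given the 256 B key / 1024 B value attribute limits.

```java
// Hypothetical sketch: exact protobuf wire-format overhead for one entry of
// the attributes map<string,string>, generalizing the fixed 6-byte estimate.
public class ProtoOverhead {

    // Number of bytes a varint length prefix occupies for value n.
    static int varintSize(int n) {
        int size = 1;
        while (n >= 0x80) {
            n >>>= 7;
            size++;
        }
        return size;
    }

    // Overhead of one map entry beyond the raw key/value bytes:
    // attributes field tag + entry length + key field tag + key length
    // + value field tag + value length.
    static int mapEntryOverhead(int keyLen, int valueLen) {
        int entryLen = 1 + varintSize(keyLen) + keyLen
                     + 1 + varintSize(valueLen) + valueLen;
        return 1 + varintSize(entryLen)
             + 1 + varintSize(keyLen)
             + 1 + varintSize(valueLen);
    }

    public static void main(String[] args) {
        // Small entry (< 128 B): the familiar 6-byte overhead.
        System.out.println(mapEntryOverhead(10, 20));     // 6
        // Maximum-size attribute (256 B key, 1024 B value): 9 bytes.
        System.out.println(mapEntryOverhead(256, 1024));  // 9
    }
}
```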
Batched requests should be validated again by PubsubClient to ensure a single PublishRequest does not exceed these limits. That is not as simple either: a JSON PublishRequest that stays under 10 MB or 10 MiB may still exceed those limits once the request is transcoded into a protobuf PublishRequest before it reaches Pub/Sub.
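One way to picture the needed refactor is batching against both encodings at once: a message starts a new PublishRequest whenever adding it would push either the JSON-encoded or the protobuf-encoded request size past the 10 MiB API limit. The sketch below is hypothetical (the size arrays stand in for per-encoding estimators; none of this is Beam API):

```java
// Hypothetical dual-limit batcher: counts how many PublishRequests a stream
// of messages needs when both the JSON and protobuf encodings of each
// request must stay within the 10 MiB API limit.
public class DualLimitBatcher {

    static final long LIMIT = 10L * 1024 * 1024; // 10 MiB

    // jsonSizes[i] / protoSizes[i]: encoded size of message i in each format.
    static int countBatches(long[] jsonSizes, long[] protoSizes) {
        int batches = 0;
        long json = 0, proto = 0;
        for (int i = 0; i < jsonSizes.length; i++) {
            // Start a new request if either encoding would exceed the limit.
            if (batches == 0
                    || json + jsonSizes[i] > LIMIT
                    || proto + protoSizes[i] > LIMIT) {
                batches++;
                json = 0;
                proto = 0;
            }
            json += jsonSizes[i];
            proto += protoSizes[i];
        }
        return batches;
    }

    public static void main(String[] args) {
        // Three messages, ~3 MB protobuf each but ~4 MB as JSON (base64):
        // protobuf alone would fit all three in one request (~9 MB), but the
        // JSON encoding forces a split after the second message (~12 MB).
        long[] json = {4_000_000, 4_000_000, 4_000_000};
        long[] proto = {3_000_000, 3_000_000, 3_000_000};
        System.out.println(countBatches(json, proto)); // 2
    }
}
```

Validating only one encoding, as today, is exactly how a request that looks legal in JSON ends up rejected after transcoding.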
Filing this for visibility, since I was already working on fixes while collaborating on #31608: one to correct the validation in PreparePubsubWriteDoFn and one to refactor batching in PubsubClient to account for this.
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner